From mkparam at gmail.com  Sat Sep  1 04:23:25 2012
From: mkparam at gmail.com (PARAM KRISH)
Date: Sat, 1 Sep 2012 09:53:25 +0530
Subject: [Linux-cluster] Services getting stuck on node
In-Reply-To: <F7D01032076B4C4BAFA5A43ECD5DC50C24C1FE@EDI2EXMBX04.ioinc.ioroot.tld>
References: <F7D01032076B4C4BAFA5A43ECD5DC50C24C1FE@EDI2EXMBX04.ioinc.ioroot.tld>
Message-ID: <CAA1zgjZ0TZ4QfTb857e9-xToOT8e0WobE4ReQuEiLkJEiDanJA@mail.gmail.com>

Hi

 I just started using Redhat Cluster two weeks ago so i don't claim myself
an expert.

 Looking at this error, i can recommend you to look at
/var/log/cluster/fenced.log and also try commands like  "fence_tool ls ,
fence_tool dump" and look at the output if it returns any error.
Alternately, if you have time to investigate, do "service stop rgmanager"
and make sure it does not run, and try starting in the foreground as
"rgmanager -f" and see what it reports when you can simulate the same
scenario.

 Other than that, your /var/log/messages and /var/log/cluster/*.log files
must tell you something going on.

Param

On Sat, Sep 1, 2012 at 4:03 AM, Colin Simpson <Colin.Simpson at iongeo.com>wrote:

> Hi
>
> I had a strange issue this afternoon. One of my cluster nodes died
> (possible hw fault or driver issue). But the other node failed to take a
> number of it's services (2 node cluster), when it was successfully fenced.
>
> The clustat indicated that the services were on still on the original node
> (started) but the top lines correctly stated that the node was "offline".
>  The rgmanager log says for this event:
>
> Aug 31 17:19:30 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:30 rgmanager [ip] Local ping to 10.10.1.45 succeeded
> Aug 31 17:19:37 rgmanager State change: bld1uxn1i DOWN
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.46, Level 10
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.45, Level 0
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.33, Level 0
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.46 present on bond0
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.43, Level 0
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.45 present on bond0
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.33 present on bond0
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.43 present on bond0
> Aug 31 17:19:49 rgmanager Taking over service service:nfsdprj from down
> member bld1uxn1i
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager #47: Failed changing service status
> Aug 31 17:19:49 rgmanager Taking over service service:httpd from down
> member bld1uxn1i
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager #47: Failed changing service status
> Aug 31 17:19:49 rgmanager [ip] Local ping to 10.10.1.46 succeeded
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop
> cleanly
> Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
> A couple of other services did successfully switch after this.
>
> I have seem this a few times (randomly) on various clusters since around
> the time of upgrading to 6.3 from 6.2 (services refusing to cleanly stop on
> a node). It's hard to reproduce and when down we usually just want a
> restart as fast as possible (thereby limiting time for debugging).
>
> How can I see what is causing the "#47: Failed changing service status" or
> any more debugging we can turn on in rgmanager to help with this?
>
> Or better still has anyone else seen anything like this?
>
> Thanks
>
> Colin
>
> ________________________________
>
>
> This email and any files transmitted with it are confidential and are
> intended solely for the use of the individual or entity to whom they are
> addressed. If you are not the original recipient or the person responsible
> for delivering the email to the intended recipient, be advised that you
> have received this email in error, and that any use, dissemination,
> forwarding, printing, or copying of this email is strictly prohibited. If
> you received this email in error, please immediately notify the sender and
> delete the original.
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120901/404aac80/attachment.htm>

From emi2fast at gmail.com  Sat Sep  1 10:04:39 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Sat, 1 Sep 2012 12:04:39 +0200
Subject: [Linux-cluster] Services getting stuck on node
In-Reply-To: <F7D01032076B4C4BAFA5A43ECD5DC50C24C1FE@EDI2EXMBX04.ioinc.ioroot.tld>
References: <F7D01032076B4C4BAFA5A43ECD5DC50C24C1FE@EDI2EXMBX04.ioinc.ioroot.tld>
Message-ID: <CAE7pJ3CchK-A3WkDn10Hz9kdYidqOrvMRoHt4n2XOdWbdLAD_g@mail.gmail.com>

Hello Colin

maybe your service doesn't switch because this happen
======================================================
Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop
cleanly
Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
======================================================

for debug your service stop, you can use rg_test test
/etc/cluster/cluster.conf stop service <NAME_OF_SERVICE>

for help you think is more easy if you show your cluster.conf

Thanks :-)

2012/9/1 Colin Simpson <Colin.Simpson at iongeo.com>

> Hi
>
> I had a strange issue this afternoon. One of my cluster nodes died
> (possible hw fault or driver issue). But the other node failed to take a
> number of it's services (2 node cluster), when it was successfully fenced.
>
> The clustat indicated that the services were on still on the original node
> (started) but the top lines correctly stated that the node was "offline".
>  The rgmanager log says for this event:
>
> Aug 31 17:19:30 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:30 rgmanager [ip] Local ping to 10.10.1.45 succeeded
> Aug 31 17:19:37 rgmanager State change: bld1uxn1i DOWN
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.46, Level 10
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.45, Level 0
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.33, Level 0
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.46 present on bond0
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.43, Level 0
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.45 present on bond0
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.33 present on bond0
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.43 present on bond0
> Aug 31 17:19:49 rgmanager Taking over service service:nfsdprj from down
> member bld1uxn1i
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager #47: Failed changing service status
> Aug 31 17:19:49 rgmanager Taking over service service:httpd from down
> member bld1uxn1i
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager #47: Failed changing service status
> Aug 31 17:19:49 rgmanager [ip] Local ping to 10.10.1.46 succeeded
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop
> cleanly
> Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
> A couple of other services did successfully switch after this.
>
> I have seem this a few times (randomly) on various clusters since around
> the time of upgrading to 6.3 from 6.2 (services refusing to cleanly stop on
> a node). It's hard to reproduce and when down we usually just want a
> restart as fast as possible (thereby limiting time for debugging).
>
> How can I see what is causing the "#47: Failed changing service status" or
> any more debugging we can turn on in rgmanager to help with this?
>
> Or better still has anyone else seen anything like this?
>
> Thanks
>
> Colin
>
> ________________________________
>
>
> This email and any files transmitted with it are confidential and are
> intended solely for the use of the individual or entity to whom they are
> addressed. If you are not the original recipient or the person responsible
> for delivering the email to the intended recipient, be advised that you
> have received this email in error, and that any use, dissemination,
> forwarding, printing, or copying of this email is strictly prohibited. If
> you received this email in error, please immediately notify the sender and
> delete the original.
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120901/2901759a/attachment.htm>

From Colin.Simpson at iongeo.com  Sat Sep  1 12:56:47 2012
From: Colin.Simpson at iongeo.com (Colin Simpson)
Date: Sat, 1 Sep 2012 12:56:47 +0000
Subject: [Linux-cluster] Services getting stuck on node
In-Reply-To: <CAE7pJ3CchK-A3WkDn10Hz9kdYidqOrvMRoHt4n2XOdWbdLAD_g@mail.gmail.com>
References: <F7D01032076B4C4BAFA5A43ECD5DC50C24C1FE@EDI2EXMBX04.ioinc.ioroot.tld>,
	<CAE7pJ3CchK-A3WkDn10Hz9kdYidqOrvMRoHt4n2XOdWbdLAD_g@mail.gmail.com>
Message-ID: <F7D01032076B4C4BAFA5A43ECD5DC50C24C666@EDI2EXMBX04.ioinc.ioroot.tld>

Thanks for getting back.

I'll try the debug shutdown with that command.

Though I think the "failed to stop cleanly" is far from clear what that means. The node it was running on has gone (was fenced) so there was nothing to stop before starting on this node.

Thanks

Colin
________________________________
From: linux-cluster-bounces at redhat.com [linux-cluster-bounces at redhat.com] on behalf of emmanuel segura [emi2fast at gmail.com]
Sent: 01 September 2012 11:04
To: linux clustering
Subject: Re: [Linux-cluster] Services getting stuck on node

Hello Colin

maybe your service doesn't switch because this happen
======================================================
Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop cleanly
Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
======================================================

for debug your service stop, you can use rg_test test /etc/cluster/cluster.conf stop service <NAME_OF_SERVICE>

for help you think is more easy if you show your cluster.conf

Thanks :-)

2012/9/1 Colin Simpson <Colin.Simpson at iongeo.com<mailto:Colin.Simpson at iongeo.com>>
Hi

I had a strange issue this afternoon. One of my cluster nodes died (possible hw fault or driver issue). But the other node failed to take a number of it's services (2 node cluster), when it was successfully fenced.

The clustat indicated that the services were on still on the original node (started) but the top lines correctly stated that the node was "offline".  The rgmanager log says for this event:

Aug 31 17:19:30 rgmanager [ip] Link detected on bond0
Aug 31 17:19:30 rgmanager [ip] Local ping to 10.10.1.45 succeeded
Aug 31 17:19:37 rgmanager State change: bld1uxn1i DOWN
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.46, Level 10
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.45, Level 0
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.33, Level 0
Aug 31 17:19:49 rgmanager [ip] 10.10.1.46 present on bond0
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.43, Level 0
Aug 31 17:19:49 rgmanager [ip] 10.10.1.45 present on bond0
Aug 31 17:19:49 rgmanager [ip] 10.10.1.33 present on bond0
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager [ip] 10.10.1.43 present on bond0
Aug 31 17:19:49 rgmanager Taking over service service:nfsdprj from down member bld1uxn1i
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager #47: Failed changing service status
Aug 31 17:19:49 rgmanager Taking over service service:httpd from down member bld1uxn1i
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager #47: Failed changing service status
Aug 31 17:19:49 rgmanager [ip] Local ping to 10.10.1.46 succeeded
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop cleanly
Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
A couple of other services did successfully switch after this.

I have seem this a few times (randomly) on various clusters since around the time of upgrading to 6.3 from 6.2 (services refusing to cleanly stop on a node). It's hard to reproduce and when down we usually just want a restart as fast as possible (thereby limiting time for debugging).

How can I see what is causing the "#47: Failed changing service status" or any more debugging we can turn on in rgmanager to help with this?

Or better still has anyone else seen anything like this?

Thanks

Colin

________________________________


This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original.


--
Linux-cluster mailing list
Linux-cluster at redhat.com<mailto:Linux-cluster at redhat.com>
https://www.redhat.com/mailman/listinfo/linux-cluster


--
esta es mi vida e me la vivo hasta que dios quiera

________________________________


This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120901/c4b8daf2/attachment.htm>

From d_joshi84 at yahoo.com  Sat Sep  1 15:57:38 2012
From: d_joshi84 at yahoo.com (joshi dhaval)
Date: Sat, 1 Sep 2012 23:57:38 +0800 (SGT)
Subject: [Linux-cluster] Understanding Fencing
In-Reply-To: <1346084054.38090.YahooMailClassic@web190405.mail.sg3.yahoo.com>
Message-ID: <1346515058.79506.YahooMailClassic@web190404.mail.sg3.yahoo.com>

Hello,

I tried to read some documents on fencing, still bit confused with technology. ( i dont want to buy any extra hardware just for fencing ).

we are using HP DL 380 G6, G7 servers at out environment, only way i can see fencing possible in my environment is HP ILO.

what is PDU ? do i need to purchase separate device to enable fencing using PDU ?

is that IPMI is same as HP ILO ?

for above hardware what you suggest are the most reliable fencing techniques i should use ?

is that cross cable connection is possible just to check hearbeats like VCS has gab and llt ?

i am panning to configure 2 nodes cluster first once i will have confidence i will move it to 4 or 5 node cluster.

Regards,
Dhaval
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120901/6d85630d/attachment.htm>

From lists at alteeve.ca  Sat Sep  1 16:28:18 2012
From: lists at alteeve.ca (Digimer)
Date: Sat, 01 Sep 2012 12:28:18 -0400
Subject: [Linux-cluster] Understanding Fencing
In-Reply-To: <1346515058.79506.YahooMailClassic@web190404.mail.sg3.yahoo.com>
References: <1346515058.79506.YahooMailClassic@web190404.mail.sg3.yahoo.com>
Message-ID: <504237A2.8050409@alteeve.ca>

Side note, then i will answer in-line. When possible, please start a new 
email to a mailing list, instead of hitting "reply" on an existing 
message and then deleting the content. A lot of people's email clients 
threading breaks when an email isn't new.

On 09/01/2012 11:57 AM, joshi dhaval wrote:
> Hello,
>
> I tried to read some documents on fencing, still bit confused with
> technology. ( i dont want to buy any extra hardware just for fencing ).

Was this one of the things you read?

https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing

> we are using HP DL 380 G6, G7 servers at out environment, only way i can
> see fencing possible in my environment is HP ILO.

Yes, you can use fence_ilo with that. I have done so myself and cover 
how to set it up here:

https://alteeve.ca/w/Configuring_HP_iLO_2_on_EL6

and how to use it as a fence device here:

https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Example_.3Cfencedevice....3E_Tag_For_HP_iLO

> what is PDU ? do i need to purchase separate device to enable fencing
> using PDU ?

A PDU (Power Distribution Unit) is, by itself, just another name for a 
power bar, though it generally refers to rack-mounted power bars. In 
fencing though, we use a version called a "Switched PDU". These are 
power bars with a network connection. They allow you to connect remotely 
and turn each outlet on and off independently of the other ports. They 
also offer power monitoring and so one, but that's outside fencing.

So in fencing, if for example, the power supply failed then the server 
would power down and take the IPMI or iLO interface with it (see below). 
Without any power at all, the IPMI will not reply as it will also have 
no power. We know in this case that the node is gone, but the other 
nodes don't. All they know is that they can't talk to the node or it's 
IPMI/iLO interfaces, which could just as well be network outage leaving 
the node alive.

In this case, the cluster can call the Switched PDUs and ask them to 
turn off the outlet(s) feeding the server. When the PDUs say "ok, 
they're off", *then* the cluster can safely say "ok, now I know it has 
to be off" and can begin recovery.

> is that IPMI is same as HP ILO ?

No, but they are similar. I have a short write-up of it here:

https://alteeve.ca/w/IPMI

IPMI is a generic way for a server to offer "Out of Band" management. 
That is just a fancy way of saying "You can check on the state of the 
server even when the server is powered off".

The piece of hardware inside your server that provides IPMI is called a 
"BMC" (Baseboard Management Controller). Think of it like a little, 
separate computer sitting on your server's motherboard. It draws it's 
power from the host, it can read the host's sensors (power state, fans, 
temperatures, etc) but it is still a totally separate device.

In fencing, if one node stops responding (say because the OS crashed), 
another node in the cluster will call the victim's IPMI interface and 
say "please power off the host". The BMC then, effectively, "pushes and 
holds the power button" until the host shuts down. Then the IPMI device 
tells the caller that the power off was successful. The cluster then 
knows the state of the victim (it is powered off now) so it can safely 
recover.

As for the difference between IPMI and iLO;

Most major hardware vendors took IPMI and added a bunch of features on 
top of it. Then they renamed it to something they wanted. So HP called 
theirs "iLO", IBM called theirs "RSA", Dell called theirs "DRAC" and so 
on. These are all very similar to IPMI (some are similar enough that 
stock IPMI tools work with them).

> for above hardware what you suggest are the most reliable fencing
> techniques i should use ?

I would use 'fence_ilo'.

> is that cross cable connection is possible just to check hearbeats like
> VCS has gab and llt ?

I don't know VCS or llt so I can't comment. In RHCS, we use "corosync" 
for cluster membership. By default, it uses a multicast group for 
passing messages around the cluster and for detecting a node's death. 
It's similar to what I think you mean by "heartbeat". It is advised that 
you use a proper switch, though I do not believe it is required.

> i am panning to configure 2 nodes cluster first once i will have
> confidence i will move it to 4 or 5 node cluster.

Then definitely use a proper switch, not back to back.

> Regards,
> Dhaval

A final comment;

In clustering, a failed fence action will leave the cluster in a state 
where it does not know the condition of a member. Given the dangers of 
making an assumption, the cluster would rather block (hang) than proceed 
in a way that could cause damage. This is why fencing is so critical; It 
restores the cluster to a known state after a fault.

If you use only iLO for fencing (and many people do only use IPMI, iLO, 
etc), then you will be fine most of the time. For me personally, this is 
not good enough. If for any reason the other node(s) can't reach the 
IPMI or iLO interface, the fence action will fail and the cluster will 
hang. With a switched PDU, you have a backup fence device that would 
protect you against this by providing an alternate method of confirming 
the node's state. Thus, adding a switched PDU to your cluster, you 
remove another single point of failure.

digimer

-- 
Digimer
Papers and Projects: https://alteeve.ca


From kveri at kveri.com  Sun Sep  2 00:11:31 2012
From: kveri at kveri.com (Kveri)
Date: Sun, 2 Sep 2012 02:11:31 +0200
Subject: [Linux-cluster] GFS2
Message-ID: <59266966-EEC3-48E0-9703-1F9A3B9FB595@kveri.com>

Hello,

we're using gfs2 on drbd, we created cluster in incomplete state (only 1 node). When doing dd if=/dev/zero of=/gfs_partition/file we get filesystem freezes every 1-2 minutes for 10-20 seconds, I mean every filesystem on that machine freezes, doing ls /etc hangs in D state for 10-20 seconds. Sometimes this hang last for more than 2 minutes and hung task message gets logged in dmesg.

iotop shows gfs2_logd and flush-XXX:X kernel process taking 99% io resources.

GFS is mounted with rw,noatime,nodiratime,hostdata=jid=0 options.

gettune options:
quota_warn_period = 10
quota_quantum = 60
max_readahead = 262144
complain_secs = 10
statfs_slow = 0
quota_simul_sync = 64
statfs_quantum = 30
quota_scale = 1.0000   (1, 1)
new_files_jdata = 0

Server is kernel 3.2.0-25 64bit.

What could be the problem?

Thank you.

Martin


From kveri at kveri.com  Sun Sep  2 15:02:01 2012
From: kveri at kveri.com (Kveri)
Date: Sun, 2 Sep 2012 17:02:01 +0200
Subject: [Linux-cluster] gfs2_logd eating 99% io, random filesystem freezes
Message-ID: <1B898DB9-53D1-4982-8954-0F7DB2C2387F@kveri.com>

Hello,

we're using gfs2 on drbd, we created cluster in incomplete state (only 1 node). When doing dd if=/dev/zero of=/gfs_partition/file we get filesystem freezes every 1-2 minutes for 10-20 seconds, I mean every filesystem on that machine freezes, doing ls /etc hangs in D state for 10-20 seconds. Sometimes this hang last for more than 2 minutes and hung task message gets logged in dmesg.

iotop shows gfs2_logd and flush-XXX:X kernel process taking 99% io resources.

GFS is mounted with rw,noatime,nodiratime,hostdata=jid=0 options.

gettune options:
quota_warn_period = 10
quota_quantum = 60
max_readahead = 262144
complain_secs = 10
statfs_slow = 0
quota_simul_sync = 64
statfs_quantum = 30
quota_scale = 1.0000   (1, 1)
new_files_jdata = 0

Server is kernel 3.2.0-25 64bit.

Dmesg error (we did echo 1 > /proc/sys/kernel/hung_task_timeout_secs, but we also tested it with 120 secs):
[  818.882147] INFO: task ls:3531 blocked for more than 1 seconds.
[  818.882479] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  818.882929] ls              D ffff8803639364e0     0  3531   3269 0x00000000
[  818.882932]  ffff88033c789c58 0000000000000082 ffff88033c789be8 ffff8801e9c33780
[  818.882936]  ffff88033c789fd8 ffff88033c789fd8 ffff88033c789fd8 0000000000013780
[  818.882940]  ffff8801e5a72e00 ffff8801e5b32e00 0000000000000286 ffff88033c789ce0
[  818.882943] Call Trace:
[  818.882950]  [<ffffffffa02d2300>] ? gfs2_glock_demote_wait+0x20/0x20 [gfs2]
[  818.882953]  [<ffffffff816579cf>] schedule+0x3f/0x60
[  818.882959]  [<ffffffffa02d230e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
[  818.882963]  [<ffffffff8165829f>] __wait_on_bit+0x5f/0x90
[  818.882965]  [<ffffffff816598de>] ? _raw_spin_lock+0xe/0x20
[  818.882972]  [<ffffffffa02d2300>] ? gfs2_glock_demote_wait+0x20/0x20 [gfs2]
[  818.882975]  [<ffffffff8165834c>] out_of_line_wait_on_bit+0x7c/0x90
[  818.882978]  [<ffffffff8108aa90>] ? autoremove_wake_function+0x40/0x40
[  818.882985]  [<ffffffffa02d4467>] gfs2_glock_wait+0x47/0x90 [gfs2]
[  818.882992]  [<ffffffffa02d5d48>] gfs2_glock_nq+0x318/0x440 [gfs2]
[  818.882998]  [<ffffffff81161cff>] ? kmem_cache_free+0x2f/0x110
[  818.883007]  [<ffffffffa02e3ccb>] gfs2_getattr+0xbb/0xf0 [gfs2]
[  818.883015]  [<ffffffffa02e3cc2>] ? gfs2_getattr+0xb2/0xf0 [gfs2]
[  818.883020]  [<ffffffff8117c79e>] vfs_getattr+0x4e/0x80
[  818.883023]  [<ffffffff8117c81e>] vfs_fstatat+0x4e/0x70
[  818.883026]  [<ffffffff8117c85e>] vfs_lstat+0x1e/0x20
[  818.883029]  [<ffffffff8117c9fa>] sys_newlstat+0x1a/0x40
[  818.883033]  [<ffffffff811971cf>] ? mntput+0x1f/0x30
[  818.883036]  [<ffffffff81182652>] ? path_put+0x22/0x30
[  818.883039]  [<ffffffff8119bc1b>] ? sys_lgetxattr+0x5b/0x70
[  818.883042]  [<ffffffff81661ec2>] system_call_fastpath+0x16/0x1b


What could be the problem?

Thank you.

Martin


From member at linkedin.com  Tue Sep  4 11:42:10 2012
From: member at linkedin.com (Jose Nuno Neto via LinkedIn)
Date: Tue, 4 Sep 2012 11:42:10 +0000 (UTC)
Subject: [Linux-cluster] Invitation to connect on LinkedIn
Message-ID: <1222223745.6147249.1346758930716.JavaMail.app@ela4-app2316.prod>

LinkedIn
------------


    Jose Nuno Neto requested to add you as a connection on LinkedIn:
  

------------------------------------------

Krishna,

I'd like to add you to my professional network on LinkedIn.

- Jose Nuno

Accept invitation from Jose Nuno Neto
http://www.linkedin.com/e/-odgn7o-h6oxhtld-67/ulDuieLaAX544oVCOYcgj_GaXIys4TuLMXGmOx/blk/I173266432_45/0UcDpKqiRzolZKqiRybmRSrCBvrmRLoORIrmkZt5YCpnlOt3RApnhMpmdzgmhxrSNBszYRd5YOcPgSdz8PdP59bSB4qjhPtQpSbPcOc3sMc34Vd3wLrCBxbOYWrSlI/eml-comm_invm-b-in_ac-inv28/?hs=false&tok=2gESaHQQGuvRo1

View profile of Jose Nuno Neto
http://www.linkedin.com/e/-odgn7o-h6oxhtld-67/rso/3659852/fdfW/name/46069589_I173266432_45/?hs=false&tok=1gQ-DtZUiuvRo1
------------------------------------------
You are receiving Invitation emails.


This email was intended for Krishna Kumar.
Learn why this is included: http://www.linkedin.com/e/-odgn7o-h6oxhtld-67/plh/http%3A%2F%2Fhelp%2Elinkedin%2Ecom%2Fapp%2Fanswers%2Fdetail%2Fa_id%2F4788/-GXI/?hs=false&tok=3Gfm08dEOuvRo1

(c) 2012, LinkedIn Corporation. 2029 Stierlin Ct, Mountain View, CA 94043, USA.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120904/9c4350b1/attachment.htm>

From td3201 at gmail.com  Tue Sep  4 15:01:25 2012
From: td3201 at gmail.com (Terry)
Date: Tue, 4 Sep 2012 10:01:25 -0500
Subject: [Linux-cluster] NFS locks and failing over services
Message-ID: <CAHSRzpAPYonHvcD3A83VCEd_xWts7=-CvcLFcLdectLtW=EUxQ@mail.gmail.com>

Hello,

I am running an NFS cluster with 3 exports distributed across 2 nodes.
 When I try to relocate an NFS export, it fails.  I then have to disable
and enable it on the other node.  Does anyone have any tricks to get around
this issue?  I am sure it is due to file locking.  Here's the config:

*<?xml version="1.0" ?>*

*<cluster config_version="17" name="omadvnfs01">*

*        <cman expected_votes="1" two_node="1"/>*

*        <clusternodes>*

*                <clusternode name="omadvnfs01a.sec.jel.lc" nodeid="1"
votes="1">*

*                        <fence>*

*                                <method name="drac">*

*                                        <device name="omadvnfs01a-drac"/>*

*                                </method>*

*                        </fence>*

*                </clusternode>*

*                <clusternode name="omadvnfs01b.sec.jel.lc" nodeid="2"
votes="1">*

*                        <fence>*

*                                <method name="drac">*

*                                        <device name="omadvnfs01b-drac"/>*

*                                </method>*

*                        </fence>*

*                </clusternode>*

*        </clusternodes>*

*        <rm>*

*                <failoverdomains>*

*                        <failoverdomain name="fd_omadvnfs01a" ordered="1"
restricted="0" nofailback="1">*

*                                <failoverdomainnode name="
omadvnfs01a.sec.jel.lc" priority="1"/>*

*                                <failoverdomainnode name="
omadvnfs01b.sec.jel.lc" priority="2"/>*

*                        </failoverdomain>*

*                        <failoverdomain name="fd_omadvnfs01b" ordered="1"
restricted="0" nofailback="1">*

*                                <failoverdomainnode name="
omadvnfs01b.sec.jel.lc" priority="1"/>*

*                                <failoverdomainnode name="
omadvnfs01a.sec.jel.lc" priority="2"/>*

*                        </failoverdomain>*

*                </failoverdomains>*

*                <resources>*

*                        <ip address="10.198.1.112" monitor_link="1"/>*

*                        <ip address="10.198.1.113" monitor_link="1"/>*

*                        <ip address="10.198.1.114" monitor_link="1"/>*

*                        <ip address="10.198.1.115" monitor_link="1"/>*

*                        <fs device="/dev/vg_data01a/lv_data01a"
quick_status="0" force_fsck="0" force_unmount="1" fsid="27014"
fstype="ext4" mountpoint="/data01a" name="omadvnfs01-data01a"
options="rw,noatime,nodiratime,data=writeback,commit=30,_netdev"
self_fence="0" nfslock="1"/>*

*                        <fs device="/dev/vg_data01b/lv_data01b"
quick_status="0" force_fsck="0" force_unmount="1" fsid="39436"
fstype="ext4" mountpoint="/data01b" name="omadvnfs01-data01b"
options="rw,noatime,nodiratime,data=writeback,commit=30,_netdev"
self_fence="0" nfslock="1"/>*

*                        <fs device="/dev/vg_data01c/lv_data01c"
quick_status="0" force_fsck="0" force_unmount="1" fsid="99243"
fstype="ext4" mountpoint="/data01c" name="omadvnfs01-data01c"
options="rw,noatime,nodiratime,data=writeback,commit=30,_netdev"
self_fence="0" nfslock="1"/>*

*                        <fs device="/dev/vg_data04/lv_data04"
quick_status="0" force_fsck="0" force_unmount="1" fsid="59016"
fstype="ext4" mountpoint="/data04" name="omadvnfs01-data04"
options="defaults,_netdev" self_fence="0"/>*

*                        <script file="/etc/init.d/postgresql"
name="postgresql"/>*

*                        <nfsexport name="data01a"/>*

*                        <nfsexport name="data01b"/>*

*                        <nfsexport name="data01c"/>*

*                        <nfsclient allow_recover="1" name="omadvdss01a"
options="async,no_root_squash,rw,no_subtree_check" target="omadvdss01a"/>*

*                        <nfsclient allow_recover="1" name="omadvdss01b"
options="async,no_root_squash,rw,no_subtree_check" target="omadvdss01b"/>*

*                        <nfsclient allow_recover="1" name="omadvdss01c"
options="async,no_root_squash,rw,no_subtree_check" target="omadvdss01c"/>*

*                </resources>*

*                <service autostart="1" domain="fd_omadvnfs01b"
name="postgresql">*

*                        <ip ref="10.198.1.112"/>*

*                        <fs ref="omadvnfs01-data04"/>*

*                        <script ref="postgresql"/>*

*                </service>*

*                <service domain="fd_omadvnfs01a"
name="omadvnfs01-nfs-data01a" nfslock="1" recovery="relocate">*

*                        <ip ref="10.198.1.113"/>*

*                        <fs ref="omadvnfs01-data01a">*

*                                <nfsexport ref="data01a">*

*                                        <nfsclient ref="omadvdss01a"/>*

*                                        <nfsclient ref="omadvdss01b"/>*

*                                        <nfsclient ref="omadvdss01c"/>*

*                                </nfsexport>*

*                        </fs>*

*                </service>*

*                <service domain="fd_omadvnfs01a"
name="omadvnfs01-nfs-data01b" nfslock="1" recovery="relocate">*

*                        <ip ref="10.198.1.114"/>*

*                        <fs ref="omadvnfs01-data01b">*

*                                <nfsexport ref="data01b">*

*                                        <nfsclient ref="omadvdss01a"/>*

*                                        <nfsclient ref="omadvdss01b"/>*

*                                        <nfsclient ref="omadvdss01c"/>*

*                                </nfsexport>*

*                        </fs>*

*                </service>*

*                <service domain="fd_omadvnfs01b"
name="omadvnfs01-nfs-data01c" nfslock="1" recovery="relocate">*

*                        <ip ref="10.198.1.115"/>*

*                        <fs ref="omadvnfs01-data01c">*

*                                <nfsexport ref="data01c">*

*                                        <nfsclient ref="omadvdss01a"/>*

*                                        <nfsclient ref="omadvdss01b"/>*

*                                        <nfsclient ref="omadvdss01c"/>*

*                                </nfsexport>*

*                        </fs>*

*                </service>*

*        </rm>*

*</cluster>*
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120904/c355125d/attachment.htm>

From fdinitto at redhat.com  Tue Sep  4 18:04:49 2012
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 04 Sep 2012 20:04:49 +0200
Subject: [Linux-cluster] NFS locks and failing over services
In-Reply-To: <CAHSRzpAPYonHvcD3A83VCEd_xWts7=-CvcLFcLdectLtW=EUxQ@mail.gmail.com>
References: <CAHSRzpAPYonHvcD3A83VCEd_xWts7=-CvcLFcLdectLtW=EUxQ@mail.gmail.com>
Message-ID: <504642C1.4060703@redhat.com>

On 09/04/2012 05:01 PM, Terry wrote:
> Hello,
> 
> I am running an NFS cluster with 3 exports distributed across 2 nodes.
>  When I try to relocate an NFS export, it fails.  I then have to disable
> and enable it on the other node.  Does anyone have any tricks to get
> around this issue?  I am sure it is due to file locking.  Here's the config:

This is a difficult config to support because of some design limitations
in nfsd and what cluster users expect (exactly your config).

My best guess is that the service fails to relocate because the fs
cannot be unmounted.

If so, you need to add force_unmount="1" to the fs resources.

If that still doesn't fix the problem, you need to upgrade to a version
of the resource-agents that support nfsrestart="1" for fs resource.
This depends on what distro/release you have. nfsrestart does a much
harder (invasive) action to drop the locks holding the fs. Still
requires force_unmount to be set. It can be used together with nfslock
(if nfslock fails, then nfsrestart will kick in).

Fabio

> 
>         /<?xml version="1.0" ?>/
> 
>         /<cluster config_version="17" name="omadvnfs01">/
> 
>         /        <cman expected_votes="1" two_node="1"/>/
> 
>         /        <clusternodes>/
> 
>         /                <clusternode name="omadvnfs01a.sec.jel.lc
>         <http://omadvnfs01a.sec.jel.lc>" nodeid="1" votes="1">/
> 
>         /                        <fence>/
> 
>         /                                <method name="drac">/
> 
>         /                                        <device
>         name="omadvnfs01a-drac"/>/
> 
>         /                                </method>/
> 
>         /                        </fence>/
> 
>         /                </clusternode>/
> 
>         /                <clusternode name="omadvnfs01b.sec.jel.lc
>         <http://omadvnfs01b.sec.jel.lc>" nodeid="2" votes="1">/
> 
>         /                        <fence>/
> 
>         /                                <method name="drac">/
> 
>         /                                        <device
>         name="omadvnfs01b-drac"/>/
> 
>         /                                </method>/
> 
>         /                        </fence>/
> 
>         /                </clusternode>/
> 
>         /        </clusternodes>/
> 
>         /        <rm>/
> 
>         /                <failoverdomains>/
> 
>         /                        <failoverdomain name="fd_omadvnfs01a"
>         ordered="1" restricted="0" nofailback="1">/
> 
>         /                                <failoverdomainnode
>         name="omadvnfs01a.sec.jel.lc <http://omadvnfs01a.sec.jel.lc>"
>         priority="1"/>/
> 
>         /                                <failoverdomainnode
>         name="omadvnfs01b.sec.jel.lc <http://omadvnfs01b.sec.jel.lc>"
>         priority="2"/>/
> 
>         /                        </failoverdomain>/
> 
>         /                        <failoverdomain name="fd_omadvnfs01b"
>         ordered="1" restricted="0" nofailback="1">/
> 
>         /                                <failoverdomainnode
>         name="omadvnfs01b.sec.jel.lc <http://omadvnfs01b.sec.jel.lc>"
>         priority="1"/>/
> 
>         /                                <failoverdomainnode
>         name="omadvnfs01a.sec.jel.lc <http://omadvnfs01a.sec.jel.lc>"
>         priority="2"/>/
> 
>         /                        </failoverdomain>/
> 
>         /                </failoverdomains>/
> 
>         /                <resources>/
> 
>         /                        <ip address="10.198.1.112"
>         monitor_link="1"/>/
> 
>         /                        <ip address="10.198.1.113"
>         monitor_link="1"/>/
> 
>         /                        <ip address="10.198.1.114"
>         monitor_link="1"/>/
> 
>         /                        <ip address="10.198.1.115"
>         monitor_link="1"/>/
> 
>         /                        <fs device="/dev/vg_data01a/lv_data01a"
>         quick_status="0" force_fsck="0" force_unmount="1" fsid="27014"
>         fstype="ext4" mountpoint="/data01a" name="omadvnfs01-data01a"
>         options="rw,noatime,nodiratime,data=writeback,commit=30,_netdev"
>         self_fence="0" nfslock="1"/>/
> 
>         /                        <fs device="/dev/vg_data01b/lv_data01b"
>         quick_status="0" force_fsck="0" force_unmount="1" fsid="39436"
>         fstype="ext4" mountpoint="/data01b" name="omadvnfs01-data01b"
>         options="rw,noatime,nodiratime,data=writeback,commit=30,_netdev"
>         self_fence="0" nfslock="1"/>/
> 
>         /                        <fs device="/dev/vg_data01c/lv_data01c"
>         quick_status="0" force_fsck="0" force_unmount="1" fsid="99243"
>         fstype="ext4" mountpoint="/data01c" name="omadvnfs01-data01c"
>         options="rw,noatime,nodiratime,data=writeback,commit=30,_netdev"
>         self_fence="0" nfslock="1"/>/
> 
>         /                        <fs device="/dev/vg_data04/lv_data04"
>         quick_status="0" force_fsck="0" force_unmount="1" fsid="59016"
>         fstype="ext4" mountpoint="/data04" name="omadvnfs01-data04"
>         options="defaults,_netdev" self_fence="0"/>/
> 
>         /                        <script file="/etc/init.d/postgresql"
>         name="postgresql"/>/
> 
>         /                        <nfsexport name="data01a"/>/
> 
>         /                        <nfsexport name="data01b"/>/
> 
>         /                        <nfsexport name="data01c"/>/
> 
>         /                        <nfsclient allow_recover="1"
>         name="omadvdss01a"
>         options="async,no_root_squash,rw,no_subtree_check"
>         target="omadvdss01a"/>/
> 
>         /                        <nfsclient allow_recover="1"
>         name="omadvdss01b"
>         options="async,no_root_squash,rw,no_subtree_check"
>         target="omadvdss01b"/>/
> 
>         /                        <nfsclient allow_recover="1"
>         name="omadvdss01c"
>         options="async,no_root_squash,rw,no_subtree_check"
>         target="omadvdss01c"/>/
> 
>         /                </resources>/
> 
>         /                <service autostart="1" domain="fd_omadvnfs01b"
>         name="postgresql">/
> 
>         /                        <ip ref="10.198.1.112"/>/
> 
>         /                        <fs ref="omadvnfs01-data04"/>/
> 
>         /                        <script ref="postgresql"/>/
> 
>         /                </service>/
> 
>         /                <service domain="fd_omadvnfs01a"
>         name="omadvnfs01-nfs-data01a" nfslock="1" recovery="relocate">/
> 
>         /                        <ip ref="10.198.1.113"/>/
> 
>         /                        <fs ref="omadvnfs01-data01a">/
> 
>         /                                <nfsexport ref="data01a">/
> 
>         /                                        <nfsclient
>         ref="omadvdss01a"/>/
> 
>         /                                        <nfsclient
>         ref="omadvdss01b"/>/
> 
>         /                                        <nfsclient
>         ref="omadvdss01c"/>/
> 
>         /                                </nfsexport>/
> 
>         /                        </fs>/
> 
>         /                </service>/
> 
>         /                <service domain="fd_omadvnfs01a"
>         name="omadvnfs01-nfs-data01b" nfslock="1" recovery="relocate">/
> 
>         /                        <ip ref="10.198.1.114"/>/
> 
>         /                        <fs ref="omadvnfs01-data01b">/
> 
>         /                                <nfsexport ref="data01b">/
> 
>         /                                        <nfsclient
>         ref="omadvdss01a"/>/
> 
>         /                                        <nfsclient
>         ref="omadvdss01b"/>/
> 
>         /                                        <nfsclient
>         ref="omadvdss01c"/>/
> 
>         /                                </nfsexport>/
> 
>         /                        </fs>/
> 
>         /                </service>/
> 
>         /                <service domain="fd_omadvnfs01b"
>         name="omadvnfs01-nfs-data01c" nfslock="1" recovery="relocate">/
> 
>         /                        <ip ref="10.198.1.115"/>/
> 
>         /                        <fs ref="omadvnfs01-data01c">/
> 
>         /                                <nfsexport ref="data01c">/
> 
>         /                                        <nfsclient
>         ref="omadvdss01a"/>/
> 
>         /                                        <nfsclient
>         ref="omadvdss01b"/>/
> 
>         /                                        <nfsclient
>         ref="omadvdss01c"/>/
> 
>         /                                </nfsexport>/
> 
>         /                        </fs>/
> 
>         /                </service>/
> 
>         /        </rm>/
> 
>         /</cluster>/
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 


From td3201 at gmail.com  Tue Sep  4 18:20:48 2012
From: td3201 at gmail.com (Terry)
Date: Tue, 4 Sep 2012 13:20:48 -0500
Subject: [Linux-cluster] NFS locks and failing over services
In-Reply-To: <504642C1.4060703@redhat.com>
References: <CAHSRzpAPYonHvcD3A83VCEd_xWts7=-CvcLFcLdectLtW=EUxQ@mail.gmail.com>
	<504642C1.4060703@redhat.com>
Message-ID: <CAHSRzpAHOnN+=VbXDTF5Js1290O8DJqkgPSAf3TH6hfFsg6Hpw@mail.gmail.com>

On Tue, Sep 4, 2012 at 1:04 PM, Fabio M. Di Nitto <fdinitto at redhat.com>wrote:

> On 09/04/2012 05:01 PM, Terry wrote:
> > Hello,
> >
> > I am running an NFS cluster with 3 exports distributed across 2 nodes.
> >  When I try to relocate an NFS export, it fails.  I then have to disable
> > and enable it on the other node.  Does anyone have any tricks to get
> > around this issue?  I am sure it is due to file locking.  Here's the
> config:
>
> This is a difficult config to support because of some design limitations
> in nfsd and what cluster users expect (exactly your config).
>
> My best guess is that the service fails to relocate because the fs
> cannot be unmounted.
>
> If so, you need to add force_unmount="1" to the fs resources.
>
> If that still doesn't fix the problem, you need to upgrade to a version
> of the resource-agents that support nfsrestart="1" for fs resource.
> This depends on what distro/release you have. nfsrestart does a much
> harder (invasive) action to drop the locks holding the fs. Still
> requires force_unmount to be set. It can be used together with nfslock
> (if nfslock fails, then nfsrestart will kick in).
>
> Fabio
>
> >
> >         /<?xml version="1.0" ?>/
> >
> >         /<cluster config_version="17" name="omadvnfs01">/
> >
> >         /        <cman expected_votes="1" two_node="1"/>/
> >
> >         /        <clusternodes>/
> >
> >         /                <clusternode name="omadvnfs01a.sec.jel.lc
> >         <http://omadvnfs01a.sec.jel.lc>" nodeid="1" votes="1">/
> >
> >         /                        <fence>/
> >
> >         /                                <method name="drac">/
> >
> >         /                                        <device
> >         name="omadvnfs01a-drac"/>/
> >
> >         /                                </method>/
> >
> >         /                        </fence>/
> >
> >         /                </clusternode>/
> >
> >         /                <clusternode name="omadvnfs01b.sec.jel.lc
> >         <http://omadvnfs01b.sec.jel.lc>" nodeid="2" votes="1">/
> >
> >         /                        <fence>/
> >
> >         /                                <method name="drac">/
> >
> >         /                                        <device
> >         name="omadvnfs01b-drac"/>/
> >
> >         /                                </method>/
> >
> >         /                        </fence>/
> >
> >         /                </clusternode>/
> >
> >         /        </clusternodes>/
> >
> >         /        <rm>/
> >
> >         /                <failoverdomains>/
> >
> >         /                        <failoverdomain name="fd_omadvnfs01a"
> >         ordered="1" restricted="0" nofailback="1">/
> >
> >         /                                <failoverdomainnode
> >         name="omadvnfs01a.sec.jel.lc <http://omadvnfs01a.sec.jel.lc>"
> >         priority="1"/>/
> >
> >         /                                <failoverdomainnode
> >         name="omadvnfs01b.sec.jel.lc <http://omadvnfs01b.sec.jel.lc>"
> >         priority="2"/>/
> >
> >         /                        </failoverdomain>/
> >
> >         /                        <failoverdomain name="fd_omadvnfs01b"
> >         ordered="1" restricted="0" nofailback="1">/
> >
> >         /                                <failoverdomainnode
> >         name="omadvnfs01b.sec.jel.lc <http://omadvnfs01b.sec.jel.lc>"
> >         priority="1"/>/
> >
> >         /                                <failoverdomainnode
> >         name="omadvnfs01a.sec.jel.lc <http://omadvnfs01a.sec.jel.lc>"
> >         priority="2"/>/
> >
> >         /                        </failoverdomain>/
> >
> >         /                </failoverdomains>/
> >
> >         /                <resources>/
> >
> >         /                        <ip address="10.198.1.112"
> >         monitor_link="1"/>/
> >
> >         /                        <ip address="10.198.1.113"
> >         monitor_link="1"/>/
> >
> >         /                        <ip address="10.198.1.114"
> >         monitor_link="1"/>/
> >
> >         /                        <ip address="10.198.1.115"
> >         monitor_link="1"/>/
> >
> >         /                        <fs device="/dev/vg_data01a/lv_data01a"
> >         quick_status="0" force_fsck="0" force_unmount="1" fsid="27014"
> >         fstype="ext4" mountpoint="/data01a" name="omadvnfs01-data01a"
> >         options="rw,noatime,nodiratime,data=writeback,commit=30,_netdev"
> >         self_fence="0" nfslock="1"/>/
> >
> >         /                        <fs device="/dev/vg_data01b/lv_data01b"
> >         quick_status="0" force_fsck="0" force_unmount="1" fsid="39436"
> >         fstype="ext4" mountpoint="/data01b" name="omadvnfs01-data01b"
> >         options="rw,noatime,nodiratime,data=writeback,commit=30,_netdev"
> >         self_fence="0" nfslock="1"/>/
> >
> >         /                        <fs device="/dev/vg_data01c/lv_data01c"
> >         quick_status="0" force_fsck="0" force_unmount="1" fsid="99243"
> >         fstype="ext4" mountpoint="/data01c" name="omadvnfs01-data01c"
> >         options="rw,noatime,nodiratime,data=writeback,commit=30,_netdev"
> >         self_fence="0" nfslock="1"/>/
> >
> >         /                        <fs device="/dev/vg_data04/lv_data04"
> >         quick_status="0" force_fsck="0" force_unmount="1" fsid="59016"
> >         fstype="ext4" mountpoint="/data04" name="omadvnfs01-data04"
> >         options="defaults,_netdev" self_fence="0"/>/
> >
> >         /                        <script file="/etc/init.d/postgresql"
> >         name="postgresql"/>/
> >
> >         /                        <nfsexport name="data01a"/>/
> >
> >         /                        <nfsexport name="data01b"/>/
> >
> >         /                        <nfsexport name="data01c"/>/
> >
> >         /                        <nfsclient allow_recover="1"
> >         name="omadvdss01a"
> >         options="async,no_root_squash,rw,no_subtree_check"
> >         target="omadvdss01a"/>/
> >
> >         /                        <nfsclient allow_recover="1"
> >         name="omadvdss01b"
> >         options="async,no_root_squash,rw,no_subtree_check"
> >         target="omadvdss01b"/>/
> >
> >         /                        <nfsclient allow_recover="1"
> >         name="omadvdss01c"
> >         options="async,no_root_squash,rw,no_subtree_check"
> >         target="omadvdss01c"/>/
> >
> >         /                </resources>/
> >
> >         /                <service autostart="1" domain="fd_omadvnfs01b"
> >         name="postgresql">/
> >
> >         /                        <ip ref="10.198.1.112"/>/
> >
> >         /                        <fs ref="omadvnfs01-data04"/>/
> >
> >         /                        <script ref="postgresql"/>/
> >
> >         /                </service>/
> >
> >         /                <service domain="fd_omadvnfs01a"
> >         name="omadvnfs01-nfs-data01a" nfslock="1" recovery="relocate">/
> >
> >         /                        <ip ref="10.198.1.113"/>/
> >
> >         /                        <fs ref="omadvnfs01-data01a">/
> >
> >         /                                <nfsexport ref="data01a">/
> >
> >         /                                        <nfsclient
> >         ref="omadvdss01a"/>/
> >
> >         /                                        <nfsclient
> >         ref="omadvdss01b"/>/
> >
> >         /                                        <nfsclient
> >         ref="omadvdss01c"/>/
> >
> >         /                                </nfsexport>/
> >
> >         /                        </fs>/
> >
> >         /                </service>/
> >
> >         /                <service domain="fd_omadvnfs01a"
> >         name="omadvnfs01-nfs-data01b" nfslock="1" recovery="relocate">/
> >
> >         /                        <ip ref="10.198.1.114"/>/
> >
> >         /                        <fs ref="omadvnfs01-data01b">/
> >
> >         /                                <nfsexport ref="data01b">/
> >
> >         /                                        <nfsclient
> >         ref="omadvdss01a"/>/
> >
> >         /                                        <nfsclient
> >         ref="omadvdss01b"/>/
> >
> >         /                                        <nfsclient
> >         ref="omadvdss01c"/>/
> >
> >         /                                </nfsexport>/
> >
> >         /                        </fs>/
> >
> >         /                </service>/
> >
> >         /                <service domain="fd_omadvnfs01b"
> >         name="omadvnfs01-nfs-data01c" nfslock="1" recovery="relocate">/
> >
> >         /                        <ip ref="10.198.1.115"/>/
> >
> >         /                        <fs ref="omadvnfs01-data01c">/
> >
> >         /                                <nfsexport ref="data01c">/
> >
> >         /                                        <nfsclient
> >         ref="omadvdss01a"/>/
> >
> >         /                                        <nfsclient
> >         ref="omadvdss01b"/>/
> >
> >         /                                        <nfsclient
> >         ref="omadvdss01c"/>/
> >
> >         /                                </nfsexport>/
> >
> >         /                        </fs>/
> >
> >         /                </service>/
> >
> >         /        </rm>/
> >
> >         /</cluster>/
> >
> >
> >
>
> I am on ubuntu 12.04.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120904/4933f85d/attachment.htm>

From fdinitto at redhat.com  Wed Sep  5 04:43:32 2012
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Wed, 05 Sep 2012 06:43:32 +0200
Subject: [Linux-cluster] NFS locks and failing over services
In-Reply-To: <CAHSRzpAHOnN+=VbXDTF5Js1290O8DJqkgPSAf3TH6hfFsg6Hpw@mail.gmail.com>
References: <CAHSRzpAPYonHvcD3A83VCEd_xWts7=-CvcLFcLdectLtW=EUxQ@mail.gmail.com>
	<504642C1.4060703@redhat.com>
	<CAHSRzpAHOnN+=VbXDTF5Js1290O8DJqkgPSAf3TH6hfFsg6Hpw@mail.gmail.com>
Message-ID: <5046D874.8060103@redhat.com>

On 09/04/2012 08:20 PM, Terry wrote:
> On Tue, Sep 4, 2012 at 1:04 PM, Fabio M. Di Nitto <fdinitto at redhat.com
> <mailto:fdinitto at redhat.com>> wrote:
> 
>     On 09/04/2012 05:01 PM, Terry wrote:
>     > Hello,
>     >
>     > I am running an NFS cluster with 3 exports distributed across 2 nodes.
>     >  When I try to relocate an NFS export, it fails.  I then have to
>     disable
>     > and enable it on the other node.  Does anyone have any tricks to get
>     > around this issue?  I am sure it is due to file locking.  Here's
>     the config:
> 
>     This is a difficult config to support because of some design limitations
>     in nfsd and what cluster users expect (exactly your config).
> 
>     My best guess is that the service fails to relocate because the fs
>     cannot be unmounted.
> 
>     If so, you need to add force_unmount="1" to the fs resources.
> 
>     If that still doesn't fix the problem, you need to upgrade to a version
>     of the resource-agents that support nfsrestart="1" for fs resource.
>     This depends on what distro/release you have. nfsrestart does a much
>     harder (invasive) action to drop the locks holding the fs. Still
>     requires force_unmount to be set. It can be used together with nfslock
>     (if nfslock fails, then nfsrestart will kick in).
> 
>     Fabio
> 

> 
> I am on ubuntu 12.04. 
> 


Start by setting up force_unmount, if that's not enough, then you will
need to contact ubuntu-ha team to pull in the fixes from upstream or
build your own resource-agents package to include nfsrestart option.

Fabio


From swhiteho at redhat.com  Wed Sep  5 09:56:17 2012
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Wed, 05 Sep 2012 10:56:17 +0100
Subject: [Linux-cluster] GFS2
In-Reply-To: <59266966-EEC3-48E0-9703-1F9A3B9FB595@kveri.com>
References: <59266966-EEC3-48E0-9703-1F9A3B9FB595@kveri.com>
Message-ID: <1346838977.2799.33.camel@menhir>

Hi,

On Sun, 2012-09-02 at 02:11 +0200, Kveri wrote:
> Hello,
> 
> we're using gfs2 on drbd, we created cluster in incomplete state (only 1 node). When doing dd if=/dev/zero of=/gfs_partition/file we get filesystem freezes every 1-2 minutes for 10-20 seconds, I mean every filesystem on that machine freezes, doing ls /etc hangs in D state for 10-20 seconds. Sometimes this hang last for more than 2 minutes and hung task message gets logged in dmesg.
> 
> iotop shows gfs2_logd and flush-XXX:X kernel process taking 99% io resources.
> 
> GFS is mounted with rw,noatime,nodiratime,hostdata=jid=0 options.
> 

It sounds like the issue is related to the network and sending i/o to
the drbd device. If you've got a large backlog of cached and dirty data
(and that includes flushing the log) then it may take a while to send
that over the network. What is the speed of your network and what speed
doe the drbd device work at (i.e. can it accept data at line speed?)

It sounds to me as if the issue is that you are simply creating dirty
data at a rate far in excess of what the underlying hardware can cope
with, so it pauses now and then to catch up,

Steve.


> gettune options:
> quota_warn_period = 10
> quota_quantum = 60
> max_readahead = 262144
> complain_secs = 10
> statfs_slow = 0
> quota_simul_sync = 64
> statfs_quantum = 30
> quota_scale = 1.0000   (1, 1)
> new_files_jdata = 0
> 
> Server is kernel 3.2.0-25 64bit.
> 
> What could be the problem?
> 
> Thank you.
> 
> Martin
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From jpokorny at redhat.com  Wed Sep  5 13:34:32 2012
From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=)
Date: Wed, 5 Sep 2012 15:34:32 +0200
Subject: [Linux-cluster] Ricci doesn't work
In-Reply-To: <F6CC886A516DF049A33F32C149ED21C823C2FE2D@alexandria.innova.local>
References: <20120820142841.GD537@redhat.com>
	<F6CC886A516DF049A33F32C149ED21C823C2FE2D@alexandria.innova.local>
Message-ID: <20120905133432.GC5074@redhat.com>

On 20/08/12 14:51 +0000, Chip Burke wrote:
> Thanks for sticking with me on this.

Sorry for delay.

The traceback from your initial email, which ended with:
> TypeError: No object (name: translator) has been registered for this
> thread

I overlooked this one, but independently hit it too [1].

But this is orthogonal to the main issue you are stating
and which I am having troubles to figure out.

Could you please try following as a possible workaround?
At the particular node (or all within the cluster), try running
rm -f /var/lib/ricci/certs/clients/* and restarting ricci.
As a consequence, you may be prompted for the password to authenticate
against ricci instance(s) and/or re-add the cluster in luci even if
this was already done before.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=853151

Thanks,
Jan


From zagar at arlut.utexas.edu  Wed Sep  5 14:59:11 2012
From: zagar at arlut.utexas.edu (Randy Zagar)
Date: Wed, 05 Sep 2012 09:59:11 -0500
Subject: [Linux-cluster] RHEL/CentOS-6 HA NFS Configuration Question
In-Reply-To: <mailman.39.1346342403.18903.linux-cluster@redhat.com>
References: <mailman.39.1346342403.18903.linux-cluster@redhat.com>
Message-ID: <504768BF.8050500@arlut.utexas.edu>

What I don't understand is what changed between RHEL-5 and RHEL-6 that 
has made HA NFS failover so difficult?

I have been running a 3-node CentOS-5 cluster serving home directories 
(via NFS) to 200+ users for several years now and have been able to fail 
over home directories without significant issues.

With CentOS-5, most of the time I am able to avoid stale filehandle 
issues.  So far, I've been unwilling to use CentOS-6 for HA NFS as I 
can't get failover to work properly.

What I'm hearing so far on this list is that it's impossible to use 
RHEL/CentOS-6 clusters for seamless NFS services.  My question is this, 
does the problem go away if you disable NFSv4 functionality?  Because my 
old servers are exporting as NFSv3 and I don't have the issues you guys 
are complaining about...

Here is a sanitized version of my configuration:

    <?xml version="1.0"?>
    <cluster alias="ha_nfs" config_version="371" name="ha_nfs">
    	<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
    	<clusternodes>
    		<clusternode name="node01.arlut.utexas.edu" nodeid="1" votes="1">
    			<fence>
    				<method name="1">
    					<device name="node01-ilo"/>
    				</method>
    				<method name="2">
    					<device name="fc-switch03" port="4"/>
    				</method>
    			</fence>
    		</clusternode>
    		<clusternode name="node02.arlut.utexas.edu" nodeid="2" votes="1">
    			<fence>
    				<method name="1">
    					<device name="node02-ilo"/>
    				</method>
    				<method name="2">
    					<device name="fc-switch03" port="0"/>
    				</method>
    			</fence>
    		</clusternode>
    		<clusternode name="node03.arlut.utexas.edu" nodeid="3" votes="1">
    			<fence>
    				<method name="1">
    					<device name="node03-ilo"/>
    				</method>
    				<method name="2">
    					<device name="fc-switch04" port="0"/>
    				</method>
    			</fence>
    		</clusternode>
    	</clusternodes>
    	<cman/>
    	<fencedevices>
    		<fencedevice agent="fence_fc-switch01" ipaddr="fc-switch02"   login="dummy" name="fc-switch02" passwd="password"/>
    		<fencedevice agent="fence_fc-switch01" ipaddr="fc-switch03"   login="dummy" name="fc-switch03" passwd="password"/>
    		<fencedevice agent="fence_fc-switch01" ipaddr="fc-switch04"   login="dummy" name="fc-switch04" passwd="password"/>
    		<fencedevice agent="fence_fc-switch01" ipaddr="fc-switch05"   login="dummy" name="fc-switch05" passwd="password"/>
    		<fencedevice agent="fence_ilo"         hostname="node01-ilo"  login="dummy" name="node01-ilo"  passwd="password"/>
    		<fencedevice agent="fence_ilo"         hostname="node02-ilo"  login="dummy" name="node02-ilo"  passwd="password"/>
    		<fencedevice agent="fence_ilo"         hostname="node03-ilo"  login="dummy" name="node03-ilo"  passwd="password"/>
    	</fencedevices>
    	<rm>
    		<failoverdomains>
    			<failoverdomain name="nfs1-domain" nofailback="1" ordered="1" restricted="1">
    				<failoverdomainnode name="node01.arlut.utexas.edu" priority="1"/>
    				<failoverdomainnode name="node02.arlut.utexas.edu" priority="2"/>
    				<failoverdomainnode name="node03.arlut.utexas.edu" priority="3"/>
    			</failoverdomain>
    			<failoverdomain name="nfs2-domain" nofailback="1" ordered="1" restricted="1">
    				<failoverdomainnode name="node02.arlut.utexas.edu" priority="1"/>
    				<failoverdomainnode name="node03.arlut.utexas.edu" priority="2"/>
    				<failoverdomainnode name="node01.arlut.utexas.edu" priority="3"/>
    			</failoverdomain>
    			<failoverdomain name="nfs3-domain" nofailback="1" ordered="1" restricted="1">
    				<failoverdomainnode name="node03.arlut.utexas.edu" priority="1"/>
    				<failoverdomainnode name="node01.arlut.utexas.edu" priority="2"/>
    				<failoverdomainnode name="node02.arlut.utexas.edu" priority="3"/>
    			</failoverdomain>
    		</failoverdomains>
    		<resources>
    			<ip address="10.8.3.39" monitor_link="1"/>
    			<ip address="10.8.3.40" monitor_link="1"/>
    			<ip address="10.8.3.41" monitor_link="1"/>
    			<fs device="/dev/cvg/data01" force_fsck="0" force_unmount="1" fsid="34791" fstype="ext3" mountpoint="/lvm/data-1" name="cvg-data01" self_fence="0"/>
    			<fs device="/dev/cvg/data02" force_fsck="0" force_unmount="1" fsid="64936" fstype="ext3" mountpoint="/lvm/data-2" name="cvg-data02" self_fence="0"/>
    			<fs device="/dev/cvg/data03" force_fsck="0" force_unmount="1" fsid="22685" fstype="ext3" mountpoint="/lvm/data-3" name="cvg-data03" self_fence="0"/>
    			<fs device="/dev/cvg/data04" force_fsck="0" force_unmount="1" fsid="4676"  fstype="ext3" mountpoint="/lvm/data-4" name="cvg-data04" self_fence="0"/>
    			<fs device="/dev/cvg/home01" force_fsck="0" force_unmount="1" fsid="38301" fstype="ext3" mountpoint="/lvm/home-1" name="cvg-home01" self_fence="0"/>
    			<fs device="/dev/cvg/home02" force_fsck="0" force_unmount="1" fsid="50361" fstype="ext3" mountpoint="/lvm/home-2" name="cvg-home02" self_fence="0"/>
    			<fs device="/dev/cvg/home03" force_fsck="0" force_unmount="1" fsid="62641" fstype="ext3" mountpoint="/lvm/home-3" name="cvg-home03" self_fence="0"/>
    			<fs device="/dev/cvg/home04" force_fsck="0" force_unmount="1" fsid="24850" fstype="ext3" mountpoint="/lvm/home-4" name="cvg-home04" self_fence="0"/>
    			<nfsclient allow_recover="1" name="global-nfs-clients" options="rw,insecure" target="10.0.0.0/8"/>
    			<nfsclient allow_recover="1" name="local-nfs-clients"  options="rw,insecure" target="10.8.0.0/16"/>
    		</resources>
    		<service autostart="1" domain="nfs1-domain" exclusive="0" name="nfs1-svc" nfslock="1" recovery="relocate">
    			<ip ref="10.8.3.39">
    				<fs __independent_subtree="1" ref="cvg-data01">
    					<nfsexport name="nfs-data01">
    						<nfsclient ref="local-nfs-clients"/>
    					</nfsexport>
    				</fs>
    				<fs __independent_subtree="1" ref="cvg-home01">
    					<nfsexport name="nfs-home01">
    						<nfsclient ref="global-nfs-clients"/>
    					</nfsexport>
    				</fs>
    				<fs __independent_subtree="1" ref="cvg-home02">
    					<nfsexport name="nfs-home02">
    						<nfsclient ref="global-nfs-clients"/>
    					</nfsexport>
    				</fs>
    			</ip>
    		</service>
    		<service autostart="1" domain="nfs2-domain" exclusive="0" name="nfs2-svc" nfslock="1" recovery="relocate">
    			<ip ref="10.8.3.40">
    				<fs __independent_subtree="1" ref="cvg-data02">
    					<nfsexport name="nfs-data02">
    						<nfsclient name=" " ref="local-nfs-clients"/>
    					</nfsexport>
    				</fs>
    				<fs __independent_subtree="1" ref="cvg-data03">
    					<nfsexport name="nfs-data03">
    						<nfsclient name=" " ref="local-nfs-clients"/>
    					</nfsexport>
    				</fs>
    				<fs __independent_subtree="1" ref="cvg-home03">
    					<nfsexport name="nfs-home03">
    						<nfsclient name=" " ref="global-nfs-clients"/>
    					</nfsexport>
    				</fs>
    			</ip>
    		</service>
    		<service autostart="1" domain="nfs3-domain" exclusive="0" name="nfs3-svc" nfslock="1" recovery="relocate">
    			<ip ref="10.8.3.41">
    				<fs __independent_subtree="1" ref="cvg-data04">
    					<nfsexport name="nfs-data04">
    						<nfsclient name=" " ref="local-nfs-clients"/>
    					</nfsexport>
    				</fs>
    				<fs __independent_subtree="1" ref="cvg-home04">
    					<nfsexport name="nfs-home04">
    						<nfsclient name=" " ref="global-nfs-clients"/>
    					</nfsexport>
    				</fs>
    			</ip>
    		</service>
    	</rm>
    </cluster>

On 08/30/2012 11:00 AM,Colin Simpson <Colin.Simpson at iongeo.com> wrote:
> Send Linux-cluster mailing list submissions to
> 	linux-cluster at redhat.com
>
> To subscribe or unsubscribe via the World Wide Web, visit
> 	https://www.redhat.com/mailman/listinfo/linux-cluster
> or, via email, send a message with subject or body 'help' to
> 	linux-cluster-request at redhat.com
>
> You can reach the person managing the list at
> 	linux-cluster-owner at redhat.com
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Linux-cluster digest..."
>
>
> Today's Topics:
>
>     1. Re: RHEL/CentOS-6 HA NFS Configuration Question (Colin Simpson)
>     2. Re: RHEL/CentOS-6 HA NFS Configuration Question
>        (Fabio M. Di Nitto)
>     3. Re: Problems with relocation of apache and	fence_vmware
>        (PARAM KRISH)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 30 Aug 2012 10:39:03 +0000
> From: Colin Simpson<Colin.Simpson at iongeo.com>
> To: "Fabio M. Di Nitto"<fdinitto at redhat.com>
> Cc: "linux-cluster at redhat.com"<linux-cluster at redhat.com>
> Subject: Re: [Linux-cluster] RHEL/CentOS-6 HA NFS Configuration
> 	Question
> Message-ID:<1346323144.10055.12.camel at bhac.iouk.ioroot.tld>
> Content-Type: text/plain; charset="utf-8"
>
> Did this fix make it as yet?
>
> Thanks
>
> Colin
>
> On Thu, 2012-05-17 at 11:57 +0200, Fabio M. Di Nitto wrote:
>> Hi Colin,
>>
>> On 5/17/2012 11:47 AM, Colin Simpson wrote:
>>> Thanks for all the useful information on this.
>>>
>>> I realise the bz is not for this issue, I just included it as it has the
>>> suggestion that nfsd should actually live in user space (which seems
>>> sensible).
>> Understood. I can?t really say if userland or kernel would make any
>> difference in this specific unmount issue, but for "safety reasons" I
>> need to assume their design is the same and behave the same way. when/if
>> there will be a switch, we will need to look more deeply into it. With
>> current kernel implementation we (cluster guys) need to use this approach.
>>
>>> Out of interest is there a bz # for this issue?
>> Yes one for rhel5 and one for rhel6, but they are both private at the
>> moment because they have customer data in it.
>>
>> I expect that the workaround/fix (whatever you want to label it) will be
>> available via RHN in 2/3 weeks.
>>
>> Fabio
>>
>>> Colin
>>>
>>>
>>> On Thu, 2012-05-17 at 10:26 +0200, Fabio M. Di Nitto wrote:
>>>> On 05/16/2012 08:19 PM, Colin Simpson wrote:
>>>>> This is interesting.
>>>>>
>>>>> We very often see the filesystems fail to umount on busy clustered NFS
>>>>> servers.
>>>> Yes, I am aware the issue since I have been investigating it in details
>>>> for the past couple of weeks.
>>>>
>>>>> What is the nature of the "real fix"?
>>>> First, the bz you mention below is unrelated to the unmount problem we
>>>> are discussing. clustered nfsd locks are a slightly different story.
>>>>
>>>> There are two issues here:
>>>>
>>>> 1) cluster users expectations
>>>> 2) nfsd internal design
>>>>
>>>> (and note I am not blaming either cluster or nfsd here)
>>>>
>>>> Generally cluster users expect to be able to do things like (fake meta
>>>> config):
>>>>
>>>> <service1..
>>>>   <fs1..
>>>>    <nfsexport1..
>>>>     <nfsclient1..
>>>>      <ip1..
>>>> ....
>>>> <service2
>>>>   <fs2..
>>>>    <nfsexport2..
>>>>     <nfsclient2..
>>>>      <ip2..
>>>>
>>>> and be able to move services around cluster nodes without problem. Note
>>>> that it is irrelevant of the fs used. It can be clustered or not.
>>>>
>>>> This setup does unfortunately clash with nfsd design.
>>>>
>>>> When shutdown of a service happens (due to stop or relocation is
>>>> indifferent):
>>>>
>>>> ip is removed
>>>> exportfs -u .....
>>>> (and that's where we hit the nfsd design limitation)
>>>> umount fs..
>>>>
>>>> By design (tho I can't say exactly why it is done this way without
>>>> speculating), nfsd will continue to serve open sessions via rpc.
>>>> exportfs -u will only stop new incoming requests.
>>>>
>>>> If nfsd is serving a client, it will continue to hold a lock on the
>>>> filesystem (in kernel) that would prevent the fs to be unmounted.
>>>>
>>>> The only way to effectively close the sessions are:
>>>>
>>>> - drop the VIP and wait for connections timeout (nfsd would effectively
>>>>    also drop the lock on the fs) but it is slow and not always consistent
>>>>    on how long it would take
>>>>
>>>> - restart nfsd.
>>>>
>>>>
>>>> The "real fix" here would be to wait for nfsd containers that do support
>>>> exactly this scenario. Allowing unexport of single fs and lock drops
>>>> etc. etc. This work is still in very early stages upstream, that doesn't
>>>> make it suitable yet for production.
>>>>
>>>> The patch I am working on, is basically a way to handle the clash in the
>>>> best way as possible.
>>>>
>>>> A new nfsrestart="" option will be added to both fs and clusterfs, that,
>>>> if the filesystem cannot be unmounted, if force_unmount is set, it will
>>>> perform an extremely fast restart of nfslock and nfsd.
>>>>
>>>> We can argue that it is not the final solution, i think we can agree
>>>> that it is more of a workaround, but:
>>>>
>>>> 1) it will allow service migration instead of service failure
>>>> 2) it will match cluster users expectations (allowing different exports
>>>> and live peacefully together).
>>>>
>>>> The only negative impact that we have been able to evaluate so far (the
>>>> patch is still under heavy testing phase), beside having to add a config
>>>> option to enable it, is that there will be a small window in which all
>>>> clients connect to a certain node for all nfs services, will not be
>>>> served because nfsd is restarting.
>>>>
>>>> So if you are migrating export1 and there are clients using export2,
>>>> export2 will also be affected for those few ms required to restart nfsd.
>>>> (assuming export1 and 2 are running on the same node of course).
>>>>
>>>> Placing things in perspective for a cluster, I think that it is a lot
>>>> better to be able to unmount a fs and relocate services as necessary vs
>>>> a service failing completely and maybe node being fenced.
>>>>
>>>>
>>>>
>>>>
>>>>> I like the idea of NFSD fully being in user space, so killing it would
>>>>> definitely free the fs.
>>>>>
>>>>> Alan Brown (who's on this list) recently posted to a RH BZ that he was
>>>>> one of the people who moved it into kernel space for performance reasons
>>>>> in the past (that are no longer relevant):
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=580863#c9
>>>>>
>>>>> , but I doubt this is the fix you have in mind.
>>>> No that's a totally different issue.
>>>>
>>>>> Colin
>>>>>
>>>>> On Tue, 2012-05-15 at 20:21 +0200, Fabio M. Di Nitto wrote:
>>>>>> This solves different issues at startup, relocation and recovery
>>>>>>
>>>>>> Also note that there is known limitation in nfsd (both rhel5/6) that
>>>>>> could cause some problems in some conditions in your current
>>>>>> configuration. A permanent fix is being worked on atm.
>>>>>>
>>>>>> Without extreme details, you might have 2 of those services running on
>>>>>> the same node and attempting to relocate one of them can fail because
>>>>>> the fs cannot be unmounted. This is due to nfsd holding a lock (at
>>>>>> kernel level) to the FS. Changing config to the suggested one, mask the
>>>>>> problem pretty well, but more testing for a real fix is in progress.
>>>>>>
>>>>>> Fabio
>>>>>>
>>>>>> --
>>>>>> Linux-cluster mailing list
>>>>>> Linux-cluster at redhat.com
>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>>
>>>>> ________________________________
>>>>>
>>>>>
>>>>> This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original.
>>>>>
>>>>>
>>>>> --
>>>>> Linux-cluster mailing list
>>>>> Linux-cluster at redhat.com
>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>> ________________________________
>>>
>>>
>>> This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original.
>>>
>
> ________________________________
>
>
> This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original.
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 30 Aug 2012 12:54:53 +0200
> From: "Fabio M. Di Nitto"<fdinitto at redhat.com>
> To: Colin Simpson<Colin.Simpson at iongeo.com>
> Cc: "linux-cluster at redhat.com"<linux-cluster at redhat.com>
> Subject: Re: [Linux-cluster] RHEL/CentOS-6 HA NFS Configuration
> 	Question
> Message-ID:<503F467D.7020108 at redhat.com>
> Content-Type: text/plain; charset=UTF-8
>
> Hi Colin,
>
> the fix is out for rhel5.8.z in rgmanager-2.0.52-28.el5_8.2 and/or higher.
>
> rhel6.4 fix has been built but not verified by our QA team yet.
>
> Fabio
>
> On 8/30/2012 12:39 PM, Colin Simpson wrote:
>> Did this fix make it as yet?
>>
>> Thanks
>>
>> Colin
>>
>> On Thu, 2012-05-17 at 11:57 +0200, Fabio M. Di Nitto wrote:
>>> Hi Colin,
>>>
>>> On 5/17/2012 11:47 AM, Colin Simpson wrote:
>>>> Thanks for all the useful information on this.
>>>>
>>>> I realise the bz is not for this issue, I just included it as it has the
>>>> suggestion that nfsd should actually live in user space (which seems
>>>> sensible).
>>> Understood. I can?t really say if userland or kernel would make any
>>> difference in this specific unmount issue, but for "safety reasons" I
>>> need to assume their design is the same and behave the same way. when/if
>>> there will be a switch, we will need to look more deeply into it. With
>>> current kernel implementation we (cluster guys) need to use this approach.
>>>
>>>> Out of interest is there a bz # for this issue?
>>> Yes one for rhel5 and one for rhel6, but they are both private at the
>>> moment because they have customer data in it.
>>>
>>> I expect that the workaround/fix (whatever you want to label it) will be
>>> available via RHN in 2/3 weeks.
>>>
>>> Fabio
>>>
>>>> Colin
>>>>
>>>>
>>>> On Thu, 2012-05-17 at 10:26 +0200, Fabio M. Di Nitto wrote:
>>>>> On 05/16/2012 08:19 PM, Colin Simpson wrote:
>>>>>> This is interesting.
>>>>>>
>>>>>> We very often see the filesystems fail to umount on busy clustered NFS
>>>>>> servers.
>>>>> Yes, I am aware the issue since I have been investigating it in details
>>>>> for the past couple of weeks.
>>>>>
>>>>>> What is the nature of the "real fix"?
>>>>> First, the bz you mention below is unrelated to the unmount problem we
>>>>> are discussing. clustered nfsd locks are a slightly different story.
>>>>>
>>>>> There are two issues here:
>>>>>
>>>>> 1) cluster users expectations
>>>>> 2) nfsd internal design
>>>>>
>>>>> (and note I am not blaming either cluster or nfsd here)
>>>>>
>>>>> Generally cluster users expect to be able to do things like (fake meta
>>>>> config):
>>>>>
>>>>> <service1..
>>>>>   <fs1..
>>>>>    <nfsexport1..
>>>>>     <nfsclient1..
>>>>>      <ip1..
>>>>> ....
>>>>> <service2
>>>>>   <fs2..
>>>>>    <nfsexport2..
>>>>>     <nfsclient2..
>>>>>      <ip2..
>>>>>
>>>>> and be able to move services around cluster nodes without problem. Note
>>>>> that it is irrelevant of the fs used. It can be clustered or not.
>>>>>
>>>>> This setup does unfortunately clash with nfsd design.
>>>>>
>>>>> When shutdown of a service happens (due to stop or relocation is
>>>>> indifferent):
>>>>>
>>>>> ip is removed
>>>>> exportfs -u .....
>>>>> (and that's where we hit the nfsd design limitation)
>>>>> umount fs..
>>>>>
>>>>> By design (tho I can't say exactly why it is done this way without
>>>>> speculating), nfsd will continue to serve open sessions via rpc.
>>>>> exportfs -u will only stop new incoming requests.
>>>>>
>>>>> If nfsd is serving a client, it will continue to hold a lock on the
>>>>> filesystem (in kernel) that would prevent the fs to be unmounted.
>>>>>
>>>>> The only way to effectively close the sessions are:
>>>>>
>>>>> - drop the VIP and wait for connections timeout (nfsd would effectively
>>>>>    also drop the lock on the fs) but it is slow and not always consistent
>>>>>    on how long it would take
>>>>>
>>>>> - restart nfsd.
>>>>>
>>>>>
>>>>> The "real fix" here would be to wait for nfsd containers that do support
>>>>> exactly this scenario. Allowing unexport of single fs and lock drops
>>>>> etc. etc. This work is still in very early stages upstream, that doesn't
>>>>> make it suitable yet for production.
>>>>>
>>>>> The patch I am working on, is basically a way to handle the clash in the
>>>>> best way as possible.
>>>>>
>>>>> A new nfsrestart="" option will be added to both fs and clusterfs, that,
>>>>> if the filesystem cannot be unmounted, if force_unmount is set, it will
>>>>> perform an extremely fast restart of nfslock and nfsd.
>>>>>
>>>>> We can argue that it is not the final solution, i think we can agree
>>>>> that it is more of a workaround, but:
>>>>>
>>>>> 1) it will allow service migration instead of service failure
>>>>> 2) it will match cluster users expectations (allowing different exports
>>>>> and live peacefully together).
>>>>>
>>>>> The only negative impact that we have been able to evaluate so far (the
>>>>> patch is still under heavy testing phase), beside having to add a config
>>>>> option to enable it, is that there will be a small window in which all
>>>>> clients connect to a certain node for all nfs services, will not be
>>>>> served because nfsd is restarting.
>>>>>
>>>>> So if you are migrating export1 and there are clients using export2,
>>>>> export2 will also be affected for those few ms required to restart nfsd.
>>>>> (assuming export1 and 2 are running on the same node of course).
>>>>>
>>>>> Placing things in perspective for a cluster, I think that it is a lot
>>>>> better to be able to unmount a fs and relocate services as necessary vs
>>>>> a service failing completely and maybe node being fenced.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I like the idea of NFSD fully being in user space, so killing it would
>>>>>> definitely free the fs.
>>>>>>
>>>>>> Alan Brown (who's on this list) recently posted to a RH BZ that he was
>>>>>> one of the people who moved it into kernel space for performance reasons
>>>>>> in the past (that are no longer relevant):
>>>>>>
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=580863#c9
>>>>>>
>>>>>> , but I doubt this is the fix you have in mind.
>>>>> No that's a totally different issue.
>>>>>
>>>>>> Colin
>>>>>>
>>>>>> On Tue, 2012-05-15 at 20:21 +0200, Fabio M. Di Nitto wrote:
>>>>>>> This solves different issues at startup, relocation and recovery
>>>>>>>
>>>>>>> Also note that there is known limitation in nfsd (both rhel5/6) that
>>>>>>> could cause some problems in some conditions in your current
>>>>>>> configuration. A permanent fix is being worked on atm.
>>>>>>>
>>>>>>> Without extreme details, you might have 2 of those services running on
>>>>>>> the same node and attempting to relocate one of them can fail because
>>>>>>> the fs cannot be unmounted. This is due to nfsd holding a lock (at
>>>>>>> kernel level) to the FS. Changing config to the suggested one, mask the
>>>>>>> problem pretty well, but more testing for a real fix is in progress.
>>>>>>>
>>>>>>> Fabio
>>>>>>>
>>>>>>> --
>>>>>>> Linux-cluster mailing list
>>>>>>> Linux-cluster at redhat.com
>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>>>
>>>>>> ________________________________
>>>>>>
>>>>>>
>>>>>> This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Linux-cluster mailing list
>>>>>> Linux-cluster at redhat.com
>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>
>>>> ________________________________
>>>>
>>>>
>>>> This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original.
>>>>
>>
>> ________________________________
>>
>>
>> This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original.
>>
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 30 Aug 2012 20:58:17 +0530
> From: PARAM KRISH<mkparam at gmail.com>
> To: Digimer<lists at alteeve.ca>, linux-cluster at redhat.com
> Subject: Re: [Linux-cluster] Problems with relocation of apache and
> 	fence_vmware
> Message-ID:
> 	<CAA1zgja0QfxzTuTJS_6-5tvjkmxAvb=qCp_Poj+cGGG4a+_UuQ at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Never mind. I am good now. I have figured out the syntax for fence_vmware
> and it works beautifully now.
>
> Here it is, just in case someone breaks his head to get this done in future
> ..
>
> ...
>          <clusternodes>
>                  <clusternode name="node1.localdomain" nodeid="1" votes="1">
>                          <fence>
>                                  <method name="fence_vmware">
>                                          <device name="vmware"
> port="node1.localdomain"/>
>                                  </method>
>                          </fence>
>                  </clusternode>
>                  <clusternode name="node2.localdomain" nodeid="2" votes="1">
>                          <fence>
>                                  <method name="fence_vmware">
>                                          <device name="vmware"
> port="node2.localdomain"/>
>                                  </method>
>                          </fence>
>                  </clusternode>
> ..
>          <fencedevices>
>                  <fencedevice agent="fence_vmware" ipaddr="a.b.c.d"
> login="xxx" name="vmware" passwd="xxx"/>
>          </fencedevices>
> ...
>
> I will be doing some series of fail-over scenarios ( node and service
> failures have worked very well so far) and will get back with the results
> if there are any concerns. Thanks for helping me thus far. I really
> appreciate.
>
> Param
>
>
> On Thu, Aug 30, 2012 at 1:37 PM, PARAM KRISH<mkparam at gmail.com>  wrote:
>
>> *Background : *
>> I am using two VM's hosted in my internal lab that has two interfaces one
>> configured with a valid IP and other being down. I have kept the VIP also
>> in the same network. My intention is to have a Apache configured as cluster
>> service in these two nodes and do a fail-over when the node or the
>> interface goes down. I try to use fence_vmware as fencing device. These two
>> VM's are now part of a ESX 4.1 host and the GuestOS in my VM's are RHEL6.0
>> 32-bit.
>>
>>
>> I am seeing the following problems in my setup now ...
>>
>> 1. When starting a apache service from LUCI, it starts fine in a node.
>> But, if i kill httpd process from that node manually, it does not detect
>> the service is down to restart or to relocate
>> 2. -same- case if i do "ip adds del<VIP>" ; it just detects the node is
>> down but does not do a restart or relocate of the service
>> 3. Whenever i reboot the nodes, it comes online and the service properly
>> starts fine in either of the node and both nodes perfectly in Quorum but
>> the fail-over never happens if i stop that active node.
>> 4. I am not sure what format of fence that i must put in the cluster.conf,
>> since there is no way i can test that out if at all it works fine.
>>
>> Manual tests :
>> 1. I manually run something like this
>> "fence_vmware --action=status --ip=10.72.145.145 --username=<login>
>> --password=<password>  --plug=<vm-name>" which works fine on both the nodes.
>> 2. Apache starts/stops just particularly fine from both nodes when i do
>> "rg_test test /etc/cluster/cluster.conf start service WEB"
>>
>> Cluster.conf is attached herewith.
>> rgmanager.log is attached herewith.
>>
>> Please let me know any specific debug commands that i can run manually to
>> find out the issues going on here, more particularly the "relocation" of
>> service and the "fencing"; both consistently fails.
>>
>> Please help. I have been spending more than 10 days now to set this up in
>> my internal lab to show it as Proof of Concept to my business heads to buy
>> RHEL cluster indeed works for our production requirement.
>>
>> -Param
>>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:<https://www.redhat.com/archives/linux-cluster/attachments/20120830/2a00c23c/attachment.html>
>
> ------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> End of Linux-cluster Digest, Vol 100, Issue 42
> **********************************************


-- 
Randy Zagar                               Sr. Unix Systems Administrator
E-mail: zagar at arlut.utexas.edu            Applied Research Laboratories
Phone: 512 835-3131                       Univ. of Texas at Austin

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120905/c374eb91/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 9129 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120905/c374eb91/attachment.p7s>

From CBurke at innova-partners.com  Wed Sep  5 16:21:57 2012
From: CBurke at innova-partners.com (Chip Burke)
Date: Wed, 5 Sep 2012 16:21:57 +0000
Subject: [Linux-cluster] Ricci doesn't work
In-Reply-To: <20120905133432.GC5074@redhat.com>
Message-ID: <F6CC886A516DF049A33F32C149ED21C823C3541A@alexandria.innova.local>

This gives me the same behavior. On each node I ran:


#rm -f /var/lib/ricci/certs/clients/*

#service ricci restart

Then from one node I simply incremented the version in cluster.conf and ran

# cman_tool version -r
You have not authenticated to the ricci daemon on xanadunode1
Password: 
You have not authenticated to the ricci daemon on xanadunode2
Password: 
You have not authenticated to the ricci daemon on xanadunode3
Password: 
The connection to xanadunode1 died unexpectedly
The connection to xanadunode2 died unexpectedly
The connection to xanadunode3 died unexpectedly
cman_tool: ccs_sync failed.
If you have distributed the config file yourself, try re-running with -S


Let me know if there is any other info I can give you.

Thanks again!

________________________________________
Chip Burke


On 9/5/12 9:34 AM, "Jan Pokorn?" <jpokorny at redhat.com> wrote:

>On 20/08/12 14:51 +0000, Chip Burke wrote:
>> Thanks for sticking with me on this.
>
>Sorry for delay.
>
>The traceback from your initial email, which ended with:
>> TypeError: No object (name: translator) has been registered for this
>> thread
>
>I overlooked this one, but independently hit it too [1].
>
>But this is orthogonal to the main issue you are stating
>and which I am having troubles to figure out.
>
>Could you please try following as a possible workaround?
>At the particular node (or all within the cluster), try running
>rm -f /var/lib/ricci/certs/clients/* and restarting ricci.
>As a consequence, you may be prompted for the password to authenticate
>against ricci instance(s) and/or re-add the cluster in luci even if
>this was already done before.
>
>[1] https://bugzilla.redhat.com/show_bug.cgi?id=853151
>
>Thanks,
>Jan
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster


From ajb2 at mssl.ucl.ac.uk  Wed Sep  5 16:48:27 2012
From: ajb2 at mssl.ucl.ac.uk (Alan Brown)
Date: Wed, 05 Sep 2012 17:48:27 +0100
Subject: [Linux-cluster] RHEL/CentOS-6 HA NFS Configuration Question
In-Reply-To: <504768BF.8050500@arlut.utexas.edu>
References: <mailman.39.1346342403.18903.linux-cluster@redhat.com>
	<504768BF.8050500@arlut.utexas.edu>
Message-ID: <5047825B.8060302@mssl.ucl.ac.uk>

On 05/09/12 15:59, Randy Zagar wrote:

> What I don't understand is what changed between RHEL-5 and RHEL-6 that
> has made HA NFS failover so difficult?

HA NFS failover has always been difficult for a number of reasons mostly 
related to how abysmal the Linux NFS implementation is.

> I have been running a 3-node CentOS-5 cluster serving home directories
> (via NFS) to 200+ users for several years now and have been able to fail
> over home directories without significant issues.
>
> With CentOS-5, most of the time I am able to avoid stale filehandle
> issues.

You can avoid those almost completely if you modify
/usr/share/cluster/nfsclient.sh to include flocks on all exportfs 
statements.

The alternative is to replace /usr/sbin/exportfs with a flock wrapper 
(which may well be cleaner/safer)

The problem is that exportfs is not multi-instance aware and if there 
are multiple copies of it running you end up with a race condition which 
can result in export tables being clobbered if you have multiple NFS 
exports.

Quite simply: Linux NFS code is as crufty as hell and nothing short of a 
proper cleanroom rewrite will fix it. Even NFSv4 code copies in 
mountains of 20-30 year old code which has never been properly vetted.

I can post a diff for the script if you'd like it. We notified Redhat 
about this issue years ago (and the fix) but they still haven't gotten 
around to including it in the official packages.

> So far, I've been unwilling to use CentOS-6 for HA NFS as I
> can't get failover to work properly.

RHEL6 NFS failover seems to work on NFSv3 for me, but I haven't tested 
it under production loads yet - when we moved from EL4 to EL5 everything 
broke when production loads were applied (The EL4-5 changeover was 
performed overnight by a Redhat engineer at a cost of several thousand 
dollars as we wished to avoid trouble - Everything crashed and burned 
the morning after.)


From jpokorny at redhat.com  Thu Sep  6 10:11:02 2012
From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=)
Date: Thu, 6 Sep 2012 12:11:02 +0200
Subject: [Linux-cluster] Ricci doesn't work
In-Reply-To: <F6CC886A516DF049A33F32C149ED21C823C3541A@alexandria.innova.local>
References: <20120905133432.GC5074@redhat.com>
	<F6CC886A516DF049A33F32C149ED21C823C3541A@alexandria.innova.local>
Message-ID: <20120906101102.GA19777@redhat.com>

On 05/09/12 16:21 +0000, Chip Burke wrote:
> This gives me the same behavior.

Sorry, I can now see it was a bad workaround guess from the beginning.

I think the strace logs you provided contain good-enough information
about the issue and still scratching my head.

Part of it is that there are two sub-issues and I am not sure if
they are isolated or there is a causality relationship.


The first one is hidden and very probably innocent -- the one present
in ricci.strace.5573 -- EPIPE/SIGPIPE.  First, there is an extra empty
read because (I think) the client of ricci has shutdown the connection
first (seems like ungraceful way of disconnection, but still tolerable),
but ricci side, despite this fact, is trying to send closure notify
message so as to achieve expected graceful disconnection.
Apparently, this fails in this case, accompanied by EPIPE/SIGPIPE.


However, the second is one -- unability to proceed the request, failing
upon timeout as can be seen in ricci.strace.5575 -- is severe.
> read(5, "\27\3\1\0p", 5)                = 5
> read(5, "\373\303\16\20>\202%\34\211\214b\\l\260\354\3662\312\272\21<\t\r\235S\31o\361\21\265\266p"..., 112) = 112
Here the two trailing bytes out of first five (0x00 0x70) indicates the
whole size of the message that is indeed read as expected (112).
Ricci should *not* keep trying to read pass this point as the whole XML
message should have been received at this moment.  But for some
reason it does (see subsequent poll with POLLIN flag).

The easiest explanation is that this XML is not well-formed, which
would boil down to your obfuscated password (not offending it,
it's highly reasonable).  Did you password contain any XML-nonfriendly
character, such as one of '<>"&'?  If so, could you please try digits,
ASCII letters and surely-safe characters only (dot, dash, etc.)?


As outlined, these two issues can be even interconnected (having
OpenSSL error queue at the main thread, which does not get cleared
explictly as it probably should, in mind).  I am going to look more
into it (perhaps put together simple client for you to try) after
knowing your situation with the password.

If there is nothing suspicious about your password to authenticate
against ricci, the inverse of previously suggested workaround could
be tried (manually pre-authenticating ccs against ricci);
from the host ccs is being run at, something along the lines:

$ ccs ...  # if ~/.ccs/cacert.pem does not exist yet
$ RICCIHOST=machina
$ RICCICERT=$(mktemp -u /var/lib/ricci/certs/clients/client_cert_XXXXXX)
$ scp ~/.ccs/cacert.pem root@$RICCIHOST:$RICCICERT
$ ssh root@$RICCIHOST chown ricci:root $RICCICERT
$ ssh root@$RICCIHOST restorecon $RICCICERT  # if using SELinux
$ ssh root@$RICCIHOST service ricci restart
$ ccs ...

Thanks,
Jan


From CBurke at innova-partners.com  Thu Sep  6 15:11:17 2012
From: CBurke at innova-partners.com (Chip Burke)
Date: Thu, 6 Sep 2012 15:11:17 +0000
Subject: [Linux-cluster] Ricci doesn't work
In-Reply-To: <20120906101102.GA19777@redhat.com>
Message-ID: <F6CC886A516DF049A33F32C149ED21C823C357E7@alexandria.innova.local>

Well that was an easy enough fix finally. I thought perhaps the password
for the VMWare fence account was the issue and updated cluster.conf with a
place holder password of 'password'. Ricci would not work. So I updated
the actual ricci user account to use a password of 'password' and
restarted Ricci on all of the nodes. Ricci now works. So indeed, it
certainly did not like a character in the password I was using which was
65peC&E$taFRE&U. In all likelihood the & was the problem character. To
confirm that hypothesis, I changed the Ricci password to 65peC&E$taFREU
and everything still worked as expected. So there is our answer. From your
stand point I don't know if that needs to be coded around or what, but at
least we know how to reproduce the issue.

Thanks again for sticking with me on this even if the cause was somewhat
silly.


________________________________________
Chip Burke


On 9/6/12 6:11 AM, "Jan Pokorn?" <jpokorny at redhat.com> wrote:

>On 05/09/12 16:21 +0000, Chip Burke wrote:
>> This gives me the same behavior.
>
>Sorry, I can now see it was a bad workaround guess from the beginning.
>
>I think the strace logs you provided contain good-enough information
>about the issue and still scratching my head.
>
>Part of it is that there are two sub-issues and I am not sure if
>they are isolated or there is a causality relationship.
>
>
>The first one is hidden and very probably innocent -- the one present
>in ricci.strace.5573 -- EPIPE/SIGPIPE.  First, there is an extra empty
>read because (I think) the client of ricci has shutdown the connection
>first (seems like ungraceful way of disconnection, but still tolerable),
>but ricci side, despite this fact, is trying to send closure notify
>message so as to achieve expected graceful disconnection.
>Apparently, this fails in this case, accompanied by EPIPE/SIGPIPE.
>
>
>However, the second is one -- unability to proceed the request, failing
>upon timeout as can be seen in ricci.strace.5575 -- is severe.
>> read(5, "\27\3\1\0p", 5)                = 5
>> read(5, 
>>"\373\303\16\20>\202%\34\211\214b\\l\260\354\3662\312\272\21<\t\r\235S\31
>>o\361\21\265\266p"..., 112) = 112
>Here the two trailing bytes out of first five (0x00 0x70) indicates the
>whole size of the message that is indeed read as expected (112).
>Ricci should *not* keep trying to read pass this point as the whole XML
>message should have been received at this moment.  But for some
>reason it does (see subsequent poll with POLLIN flag).
>
>The easiest explanation is that this XML is not well-formed, which
>would boil down to your obfuscated password (not offending it,
>it's highly reasonable).  Did you password contain any XML-nonfriendly
>character, such as one of '<>"&'?  If so, could you please try digits,
>ASCII letters and surely-safe characters only (dot, dash, etc.)?
>
>
>As outlined, these two issues can be even interconnected (having
>OpenSSL error queue at the main thread, which does not get cleared
>explictly as it probably should, in mind).  I am going to look more
>into it (perhaps put together simple client for you to try) after
>knowing your situation with the password.
>
>If there is nothing suspicious about your password to authenticate
>against ricci, the inverse of previously suggested workaround could
>be tried (manually pre-authenticating ccs against ricci);
>from the host ccs is being run at, something along the lines:
>
>$ ccs ...  # if ~/.ccs/cacert.pem does not exist yet
>$ RICCIHOST=machina
>$ RICCICERT=$(mktemp -u /var/lib/ricci/certs/clients/client_cert_XXXXXX)
>$ scp ~/.ccs/cacert.pem root@$RICCIHOST:$RICCICERT
>$ ssh root@$RICCIHOST chown ricci:root $RICCICERT
>$ ssh root@$RICCIHOST restorecon $RICCICERT  # if using SELinux
>$ ssh root@$RICCIHOST service ricci restart
>$ ccs ...
>
>Thanks,
>Jan
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster


From jpokorny at redhat.com  Thu Sep  6 18:32:31 2012
From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=)
Date: Thu, 6 Sep 2012 20:32:31 +0200
Subject: [Linux-cluster] Ricci doesn't work
In-Reply-To: <F6CC886A516DF049A33F32C149ED21C823C357E7@alexandria.innova.local>
References: <20120906101102.GA19777@redhat.com>
	<F6CC886A516DF049A33F32C149ED21C823C357E7@alexandria.innova.local>
Message-ID: <20120906183231.GA3521@redhat.com>

On 06/09/12 15:11 +0000, Chip Burke wrote:
> Well that was an easy enough fix finally. I thought perhaps the password
> for the VMWare fence account was the issue and updated cluster.conf with a
> place holder password of 'password'. Ricci would not work. So I updated
> the actual ricci user account to use a password of 'password' and
> restarted Ricci on all of the nodes. Ricci now works. So indeed, it
> certainly did not like a character in the password I was using which was
> 65peC&E$taFRE&U. In all likelihood the & was the problem character.

Bull's eye.

> On 9/6/12 6:11 AM, "Jan Pokorn?" <jpokorny at redhat.com> wrote:
>> The easiest explanation is that this XML is not well-formed, which
>> would boil down to your obfuscated password (not offending it,
>> it's highly reasonable).  Did you password contain any XML-nonfriendly
>> character, such as one of '<>"&'?  If so, could you please try digits,
>> ASCII letters and surely-safe characters only (dot, dash, etc.)?

Admittedly, this obstacle should be easier to track down, if allowed
to exist at all (see bellow).

> To confirm that hypothesis, I changed the Ricci password to 65peC&E$taFREU
> and everything still worked as expected.

Once at it, it should have been "65peCE$taFREU" (no & char), shouldn't
it?

> From your stand point I don't know if that needs to be coded around or what,
> but at least we know how to reproduce the issue.

Thanks for bringing up this part we should be more careful about.
As a starter, I filed these bugs:

- ricci (needs to understand the XML entities properly)
  https://bugzilla.redhat.com/show_bug.cgi?id=855121

(clients need to do a proper encoding into XML entities)
- luci:     https://bugzilla.redhat.com/show_bug.cgi?id=855112
- ccs:      https://bugzilla.redhat.com/show_bug.cgi?id=855117
- ccs_sync: https://bugzilla.redhat.com/show_bug.cgi?id=855120

Also based on studying some relevant parts of the ricci's code,
I've added a few private suggestions under the umbrella of bug 849233.

> Thanks again for sticking with me on this even if the cause was somewhat
> silly.

To be fair enough, no matter how unprobable the reason of not working
correctly is (let's keep complex configuration errors aside), one
can expect such things self-evident (via the messages, logs, etc.),
not as an exercise left to the user and indirectly back to the
maintainer :-)

Thanks,
Jan


From CBurke at innova-partners.com  Thu Sep  6 18:44:57 2012
From: CBurke at innova-partners.com (Chip Burke)
Date: Thu, 6 Sep 2012 18:44:57 +0000
Subject: [Linux-cluster] Ricci doesn't work
In-Reply-To: <20120906183231.GA3521@redhat.com>
Message-ID: <F6CC886A516DF049A33F32C149ED21C823C358CE@alexandria.innova.local>

I only eliminated the second & and that was all it took. My guess is this
is some edge case where the two &s in the string made it break where as a
single & did not cause things to escape or what have you.

All that said, thanks again!
________________________________________
Chip Burke


On 9/6/12 2:32 PM, "Jan Pokorn?" <jpokorny at redhat.com> wrote:

>On 06/09/12 15:11 +0000, Chip Burke wrote:
>> Well that was an easy enough fix finally. I thought perhaps the password
>> for the VMWare fence account was the issue and updated cluster.conf
>>with a
>> place holder password of 'password'. Ricci would not work. So I updated
>> the actual ricci user account to use a password of 'password' and
>> restarted Ricci on all of the nodes. Ricci now works. So indeed, it
>> certainly did not like a character in the password I was using which was
>> 65peC&E$taFRE&U. In all likelihood the & was the problem character.
>
>Bull's eye.
>
>> On 9/6/12 6:11 AM, "Jan Pokorn?" <jpokorny at redhat.com> wrote:
>>> The easiest explanation is that this XML is not well-formed, which
>>> would boil down to your obfuscated password (not offending it,
>>> it's highly reasonable).  Did you password contain any XML-nonfriendly
>>> character, such as one of '<>"&'?  If so, could you please try digits,
>>> ASCII letters and surely-safe characters only (dot, dash, etc.)?
>
>Admittedly, this obstacle should be easier to track down, if allowed
>to exist at all (see bellow).
>
>> To confirm that hypothesis, I changed the Ricci password to
>>65peC&E$taFREU
>> and everything still worked as expected.
>
>Once at it, it should have been "65peCE$taFREU" (no & char), shouldn't
>it?
>
>> From your stand point I don't know if that needs to be coded around or
>>what,
>> but at least we know how to reproduce the issue.
>
>Thanks for bringing up this part we should be more careful about.
>As a starter, I filed these bugs:
>
>- ricci (needs to understand the XML entities properly)
>  https://bugzilla.redhat.com/show_bug.cgi?id=855121
>
>(clients need to do a proper encoding into XML entities)
>- luci:     https://bugzilla.redhat.com/show_bug.cgi?id=855112
>- ccs:      https://bugzilla.redhat.com/show_bug.cgi?id=855117
>- ccs_sync: https://bugzilla.redhat.com/show_bug.cgi?id=855120
>
>Also based on studying some relevant parts of the ricci's code,
>I've added a few private suggestions under the umbrella of bug 849233.
>
>> Thanks again for sticking with me on this even if the cause was somewhat
>> silly.
>
>To be fair enough, no matter how unprobable the reason of not working
>correctly is (let's keep complex configuration errors aside), one
>can expect such things self-evident (via the messages, logs, etc.),
>not as an exercise left to the user and indirectly back to the
>maintainer :-)
>
>Thanks,
>Jan
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster


From CBurke at innova-partners.com  Thu Sep  6 20:45:33 2012
From: CBurke at innova-partners.com (Chip Burke)
Date: Thu, 6 Sep 2012 20:45:33 +0000
Subject: [Linux-cluster] Fence methods
Message-ID: <F6CC886A516DF049A33F32C149ED21C823C35939@alexandria.innova.local>

Now that ricci is figured out, I am having some issues with fencing.

It seems VMWare Fence works very well, but our GFS2 volume is not available until it receives a "success" status.  This gives us maybe 30-60 seconds of time where we cannot access the GFS2 volumes which equates to downtime. SCSI Fencing seems faster, but very unreliable. If I try to fence a node, it will return "fence somenode success". Great. But the node can still access the GFS2 volume.

Then I am also seeing conflicting information on using Qdisk with fence_scsi as it seems to be a no-no. I could swear I saw a note somewhere that Qdisk and fence_scsi worked together in newer versions of RHEL.

So what is my best bet in making sure GFS2 is as available as possible in the case of a node failure? or simply rebooting a node to apply say a software patch which is an even bigger concern?


Cluster.conf as it stands now:

<?xml version="1.0"?>
<cluster config_version="34" name="Xanadu">
<clusternodes>
<clusternode name="xanadunode1" nodeid="1">
<fence>
<method name="Method2">
<device name="SCSI_Fence"/>
</method>
</fence>
<unfence>
<device action="on" name="SCSI_Fence"/>
</unfence>
</clusternode>
<clusternode name="xanadunode2" nodeid="2">
<fence>
<method name="Method2">
<device name="SCSI_Fence"/>
</method>
</fence>
<unfence>
<device action="on" name="SCSI_Fence"/>
</unfence>
</clusternode>
<clusternode name="xanadunode3" nodeid="3">
<fence>
<method name="Method2">
<device name="SCSI_Fence"/>
</method>
</fence>
<unfence>
<device action="on" name="SCSI_Fence"/>
</unfence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_vmware_soap" ipaddr="vsphere.innova.local" login="vmwarefence" name="VMWare_Fence" passwd="XXXXXXXX"/>
<fencedevice agent="fence_scsi" name="SCSI_Fence"/>
</fencedevices>
<cman expected_votes="5"/>
<quorumd label="quorum"/>
<rm>
<failoverdomains>
<failoverdomain name="Cluster Management">
<failoverdomainnode name="xanadunode1"/>
<failoverdomainnode name="xanadunode2"/>
<failoverdomainnode name="xanadunode3"/>
</failoverdomain>
</failoverdomains>
<resources>
<ip address="192.168.30.78" sleeptime="2"/>
</resources>
<service domain="Cluster Management" name="Cluster Management" recovery="relocate">
<ip ref="192.168.30.78"/>
</service>
</rm>
</cluster>

________________________________________
Chip Burke


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120906/1cba5546/attachment.htm>

From rohara at redhat.com  Thu Sep  6 21:37:57 2012
From: rohara at redhat.com (Ryan O'Hara)
Date: Thu, 06 Sep 2012 16:37:57 -0500
Subject: [Linux-cluster] Fence methods
In-Reply-To: <F6CC886A516DF049A33F32C149ED21C823C35939@alexandria.innova.local>
References: <F6CC886A516DF049A33F32C149ED21C823C35939@alexandria.innova.local>
Message-ID: <504917B5.80609@redhat.com>

On 09/06/2012 03:45 PM, Chip Burke wrote:
> Now that ricci is figured out, I am having some issues with fencing.
>
> It seems VMWare Fence works very well, but our GFS2 volume is not
> available until it receives a "success" status.  This gives us maybe
> 30-60 seconds of time where we cannot access the GFS2 volumes which
> equates to downtime. SCSI Fencing seems faster, but very unreliable.
> If I try to fence a node, it will return "fence somenode success".
> Great. But the node can still access the GFS2 volume.

Are you absolutely sure that your array supports SCSI-3 persistent 
reservations? When you start cman, you should see that unfencing occurs. 
If successful, the devices that comprise your GFS2 volume should have 
one WERO (type 5) reservation and one or more registrations.
Can you use sg_persist to verify this? Better yet, use the logfile 
option for fence_scsi:

<fencedevice agent="fence_scsi" name="SCSI_Fence" \
  logfile="/tmp/fence_scsi.log"/>

This logfile should show you what is happening when either unfencing or 
fencing occur.

Also, when you say you can "access" a GFS2 volume after fencing so you 
mean you can write to this volume? If fence_scsi is working correctly, 
that should not be possible. How exactly are you accessing the volume 
after fencing?

> Then I am also seeing conflicting information on using Qdisk with
> fence_scsi as it seems to be a no-no. I could swear I saw a note
> somewhere that Qdisk and fence_scsi worked together in newer versions
> of RHEL.

Can you direct me to the conflicting information? As long as your quorum 
device is not subject to SCSI-3 persistent reservations it should work. 
In your case, this means your quorum device must not belong to a cluster 
volume.

Ryan


From CBurke at innova-partners.com  Fri Sep  7 16:17:52 2012
From: CBurke at innova-partners.com (Chip Burke)
Date: Fri, 7 Sep 2012 16:17:52 +0000
Subject: [Linux-cluster] Fence methods
In-Reply-To: <504917B5.80609@redhat.com>
Message-ID: <F6CC886A516DF049A33F32C149ED21C823C35BA1@alexandria.innova.local>

Ok, there was the hole in my testing. I expected fence_scsi to prevent
writes AND reads. So reads were continuing to work which threw me off. But
as to the rest of your explanation, that is indeed how I have things
configured and you were most helpful.

Thanks!
________________________________________
Chip Burke


On 9/6/12 5:37 PM, "Ryan O'Hara" <rohara at redhat.com> wrote:

>On 09/06/2012 03:45 PM, Chip Burke wrote:
>> Now that ricci is figured out, I am having some issues with fencing.
>>
>> It seems VMWare Fence works very well, but our GFS2 volume is not
>> available until it receives a "success" status.  This gives us maybe
>> 30-60 seconds of time where we cannot access the GFS2 volumes which
>> equates to downtime. SCSI Fencing seems faster, but very unreliable.
>> If I try to fence a node, it will return "fence somenode success".
>> Great. But the node can still access the GFS2 volume.
>
>Are you absolutely sure that your array supports SCSI-3 persistent
>reservations? When you start cman, you should see that unfencing occurs.
>If successful, the devices that comprise your GFS2 volume should have
>one WERO (type 5) reservation and one or more registrations.
>Can you use sg_persist to verify this? Better yet, use the logfile
>option for fence_scsi:
>
><fencedevice agent="fence_scsi" name="SCSI_Fence" \
>  logfile="/tmp/fence_scsi.log"/>
>
>This logfile should show you what is happening when either unfencing or
>fencing occur.
>
>Also, when you say you can "access" a GFS2 volume after fencing so you
>mean you can write to this volume? If fence_scsi is working correctly,
>that should not be possible. How exactly are you accessing the volume
>after fencing?
>
>> Then I am also seeing conflicting information on using Qdisk with
>> fence_scsi as it seems to be a no-no. I could swear I saw a note
>> somewhere that Qdisk and fence_scsi worked together in newer versions
>> of RHEL.
>
>Can you direct me to the conflicting information? As long as your quorum
>device is not subject to SCSI-3 persistent reservations it should work.
>In your case, this means your quorum device must not belong to a cluster
>volume.
>
>Ryan
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster


From CBurke at innova-partners.com  Fri Sep  7 18:38:40 2012
From: CBurke at innova-partners.com (Chip Burke)
Date: Fri, 7 Sep 2012 18:38:40 +0000
Subject: [Linux-cluster] GFS2 blocking on one node
Message-ID: <F6CC886A516DF049A33F32C149ED21C823C35C1B@alexandria.innova.local>

My problem is that on a single node of the cluster I can mount a GFS2 volume, however as soon as I try to write to the volume, access to GFS2 freezes on all nodes (Simple ls hangs even). The hang finally clears up with the original two nodes regaining access, but the third node vomits errors all over (See below from syslog). The other two nodes can read and write to GFS2 just fine until this node joins the cluster and tries to write. In my digging I found this thread:

https://www.redhat.com/archives/linux-cluster/2012-August/msg00142.html

But that sounds different enough that I don't think it is my issue.


Syslog bits from one of the "good" nodes:


________________________________________
Chip Burke

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120907/e52593d8/attachment.htm>

From sonredhen at gmail.com  Fri Sep  7 20:20:11 2012
From: sonredhen at gmail.com (Jason Henderson)
Date: Fri, 7 Sep 2012 16:20:11 -0400
Subject: [Linux-cluster] GFS2/DLM deadlock
Message-ID: <CAO1Tb_SimOHGV-R4Z3FHyXbRDVioihBp0XjaEZBO4GYPftYgNA@mail.gmail.com>

I have a 2 node cluster ( two HP DL360G7 servers) with a shared gfs2
file system located on an HP Modular Smart Array.
Node1 is the 'active' server and performs almost all gfs2 access.
Node2 is a 'passive' backup and rarely accesses the shared file system.

Both nodes are currently running kernel-PAE-2.6.18-274.17.1.el5.i686.

I am aware of the kernel updates available in the RedHat 5.8 release
and have reviewed the change logs and associated bug reports, that I
have access to, to determine if the handful of gfs2 changes might
apply to this situation. They do not seem to apply but we plan on
upgrading our production servers when we can to rule out that
possibility.

Intermittently (3-4 times a month) the gfs2 file system appears to
lock up and any processes attempting to access it enter D state.
Networking continues to function and openais is happy so no fencing
occurs. Power cycling the passive node breaks the deadlock and
processing on the active node will continue.

During the last hang we ran the gfs2_hangalyzer tool, suggested in
some older threads on the deadlock subject, to capture the dlm and
glock info.

I can't find explanations on what some of the fields mean so I'm
hoping someone can help me interpret the results and confirm if my
understanding of the output is correct or offer suggestions on how to
proceed debugging further when it happens again. So far we can't come
up with a reproduction scenario.

I have attached the gfs2_hangalyzer summary output as hangalyzer.txt.
I have the raw lock data as well if required.

The tool reports that there are two glocks on which processes are
waiting but no other process holds them. So it looks like a deadlock,
since if no process owns them, they should have been released.

The tool also reports that the two glocks were granted to two process IDs.

This is an excerpt from the hangalyzer output:

--------------------------------------------
There are 2 glocks with waiters.
node1, pid 5380 is waiting for glock 2/85187, but no holder was found.
         The dlm has granted lkb "       2           85187" to pid 5021


                      lkb_id N RemoteID  pid exflg lkbflgs stat gr rq
  waiting n ln             resource name
node1  : FS1:  3e00003 2  10c0002 5021     0   10000 grnt  5 -1
  0 0 24 "       2           85187"
node1  : FS1:  1501c6a 0        0 5380     0       0 wait -1  3
  0 0 24 "       2           85187"
node2  : FS1: G:  s:EX n:2/85187 f:dyq t:EX d:SH/0 l:0 a:0 r:4 m:150
node2  :                         (pending demote, dirty, holder queued)
node2  : FS1:  I: n:1711/545159 t:8 f:0x10 d:0x00000000 s:957/957

                        lkb_id N RemoteID  pid exflg lkbflgs stat gr
rq    waiting n ln             resource name
node2  : FS1:  10c0002 1  3e00003 5021     0       0 grnt  5 -1
  0 1 24 "       2           85187"
--------------------------------------------

As I understand this, on node1 the resource name "2 85187" is granted
(grnt) to process 5021 on node2 while process 5380 is in wait mode on
it.
At the same time, node2 sees that resource name "2 85187" is granted
(grnt) to process 5021 on node1.
On node1, process ID 5021 is [glock_workqueue].
>From 'ps axl':
1     0  5021    67  10  -5      0     0 worker S<   ?          0:07
[glock_workqueue]

A similar thing occurs for resource name "2 81523".

--------------------------------------------
                        lkb_id N RemoteID  pid exflg lkbflgs stat gr
rq    waiting n ln             resource name
node1  : FS1:  2f20002 2  2970001 5021    44   10000 grnt  3 -1
  0 0 24 "       2           81523"
node1  : FS1:  3961d2b 0        0 5022     0       0 wait -1  5
  0 0 24 "       2           81523"
node2  : FS1: G:  s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100
node2  :                         (pending demote, holder queued)
node2  : FS1:  I: n:126/529699 t:4 f:0x10 d:0x00000001 s:3864/3864

                        lkb_id N RemoteID  pid exflg lkbflgs stat gr
rq    waiting n ln             resource name
node2  : FS1:  2970001 1  2f20002 5029    44       0 grnt  3 -1
  0 1 24 "       2           81523"
--------------------------------------------

On node1 the resource "2 81523" is granted to process 5021 on node2,
while local process 5022 waits on it.
On node2, the lock appears to be granted to process 5029 from node1.
On node1, process ID 5029 is [delete_workqueu].
>From 'ps axl':
1     0  5029    67  10  -5      0     0 worker S<   ?          0:00
[delete_workqueu]

Is my understanding of this output correct?
Is there more info I need to try and gather to diagnose the issue when
it happens again?
-------------- next part --------------
node1  : FS1: G:  s:UN n:2/85187 f:lq t:SH d:EX/0 l:0 a:0 r:4 m:200
node1  :                         (locked, holder queued)
node1  : FS1:  H: s:SH f:aW e:0 p:5380 [chown] gfs2_lookup+0x42/0x8e [gfs2]

                        lkb_id N RemoteID  pid exflg lkbflgs stat gr rq    waiting n ln             resource name
node1  : FS1:  3e00003 2  10c0002 5021     0   10000 grnt  5 -1          0 0 24 "       2           85187"
node1  : FS1:  1501c6a 0        0 5380     0       0 wait -1  3          0 0 24 "       2           85187"
node2  : FS1: G:  s:EX n:2/85187 f:dyq t:EX d:SH/0 l:0 a:0 r:4 m:150
node2  :                         (pending demote, dirty, holder queued)
node2  : FS1:  I: n:1711/545159 t:8 f:0x10 d:0x00000000 s:957/957

                        lkb_id N RemoteID  pid exflg lkbflgs stat gr rq    waiting n ln             resource name
node2  : FS1:  10c0002 1  3e00003 5021     0       0 grnt  5 -1          0 1 24 "       2           85187"


node1  : FS1: G:  s:UN n:2/81523 f:lqO t:EX d:EX/0 l:0 a:0 r:40 m:10
node1  :                         (locked, holder queued, callback owed)
node1  : FS1:  H: s:EX f:W e:0 p:16386 [generic_templat] gfs2_createi+0x58/0xe90 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:8036 [host_status] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:16614 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:16809 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:16834 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:17536 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:17541 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:18165 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:18235 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:18240 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:18731 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:18733 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:18741 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:18760 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:18773 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:18893 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:18920 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:19583 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:19630 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:19692 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:19695 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:19714 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:19718 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:20070 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:20072 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:20421 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:20428 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:20435 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:8071 [alarm_manager] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:30239 [generic_templat] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:30373 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:30836 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:30860 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:30964 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:8624 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:20485 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1  : FS1:  H: s:SH f:aW e:0 p:22548 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]

                        lkb_id N RemoteID  pid exflg lkbflgs stat gr rq    waiting n ln             resource name
node1  : FS1:  2f20002 2  2970001 5021    44   10000 grnt  3 -1          0 0 24 "       2           81523"
node1  : FS1:  3961d2b 0        0 5022     0       0 wait -1  5          0 0 24 "       2           81523"
node2  : FS1: G:  s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100
node2  :                         (pending demote, holder queued)
node2  : FS1:  I: n:126/529699 t:4 f:0x10 d:0x00000001 s:3864/3864

                        lkb_id N RemoteID  pid exflg lkbflgs stat gr rq    waiting n ln             resource name
node2  : FS1:  2970001 1  2f20002 5029    44       0 grnt  3 -1          0 1 24 "       2           81523"


There are 2 glocks with waiters.
node1, pid 5380 is waiting for glock 2/85187, but no holder was found.
         The dlm has granted lkb "       2           85187" to pid 5021

node1, pid 16386 is waiting for glock 2/81523, but no holder was found.
node1, pid 8036 is waiting for glock 2/81523, but no holder was found.
node1, pid 16614 is waiting for glock 2/81523, but no holder was found.
node1, pid 16809 is waiting for glock 2/81523, but no holder was found.
node1, pid 16834 is waiting for glock 2/81523, but no holder was found.
node1, pid 17536 is waiting for glock 2/81523, but no holder was found.
node1, pid 17541 is waiting for glock 2/81523, but no holder was found.
node1, pid 18165 is waiting for glock 2/81523, but no holder was found.
node1, pid 18235 is waiting for glock 2/81523, but no holder was found.
node1, pid 18240 is waiting for glock 2/81523, but no holder was found.
node1, pid 18731 is waiting for glock 2/81523, but no holder was found.
node1, pid 18733 is waiting for glock 2/81523, but no holder was found.
node1, pid 18741 is waiting for glock 2/81523, but no holder was found.
node1, pid 18760 is waiting for glock 2/81523, but no holder was found.
node1, pid 18773 is waiting for glock 2/81523, but no holder was found.
node1, pid 18893 is waiting for glock 2/81523, but no holder was found.
node1, pid 18920 is waiting for glock 2/81523, but no holder was found.
node1, pid 19583 is waiting for glock 2/81523, but no holder was found.
node1, pid 19630 is waiting for glock 2/81523, but no holder was found.
node1, pid 19692 is waiting for glock 2/81523, but no holder was found.
node1, pid 19695 is waiting for glock 2/81523, but no holder was found.
node1, pid 19714 is waiting for glock 2/81523, but no holder was found.
node1, pid 19718 is waiting for glock 2/81523, but no holder was found.
node1, pid 20070 is waiting for glock 2/81523, but no holder was found.
node1, pid 20072 is waiting for glock 2/81523, but no holder was found.
node1, pid 20421 is waiting for glock 2/81523, but no holder was found.
node1, pid 20428 is waiting for glock 2/81523, but no holder was found.
node1, pid 20435 is waiting for glock 2/81523, but no holder was found.
node1, pid 8071 is waiting for glock 2/81523, but no holder was found.
node1, pid 30239 is waiting for glock 2/81523, but no holder was found.
node1, pid 30373 is waiting for glock 2/81523, but no holder was found.
node1, pid 30836 is waiting for glock 2/81523, but no holder was found.
node1, pid 30860 is waiting for glock 2/81523, but no holder was found.
node1, pid 30964 is waiting for glock 2/81523, but no holder was found.
node1, pid 8624 is waiting for glock 2/81523, but no holder was found.
node1, pid 20485 is waiting for glock 2/81523, but no holder was found.
node1, pid 22548 is waiting for glock 2/81523, but no holder was found.
         The dlm has granted lkb "       2           81523" to pid 5029


From rpeterso at redhat.com  Fri Sep  7 20:57:51 2012
From: rpeterso at redhat.com (Bob Peterson)
Date: Fri, 7 Sep 2012 16:57:51 -0400 (EDT)
Subject: [Linux-cluster] GFS2/DLM deadlock
In-Reply-To: <CAO1Tb_SimOHGV-R4Z3FHyXbRDVioihBp0XjaEZBO4GYPftYgNA@mail.gmail.com>
Message-ID: <206862890.11069476.1347051471398.JavaMail.root@redhat.com>

----- Original Message -----
| I have a 2 node cluster ( two HP DL360G7 servers) with a shared gfs2
| file system located on an HP Modular Smart Array.
| Node1 is the 'active' server and performs almost all gfs2 access.
| Node2 is a 'passive' backup and rarely accesses the shared file
| system.
| 
| Both nodes are currently running kernel-PAE-2.6.18-274.17.1.el5.i686.
| 
| I am aware of the kernel updates available in the RedHat 5.8 release
| and have reviewed the change logs and associated bug reports, that I
| have access to, to determine if the handful of gfs2 changes might
| apply to this situation. They do not seem to apply but we plan on
| upgrading our production servers when we can to rule out that
| possibility.
| 
| Intermittently (3-4 times a month) the gfs2 file system appears to
| lock up and any processes attempting to access it enter D state.
| Networking continues to function and openais is happy so no fencing
| occurs. Power cycling the passive node breaks the deadlock and
| processing on the active node will continue.
| 
| During the last hang we ran the gfs2_hangalyzer tool, suggested in
| some older threads on the deadlock subject, to capture the dlm and
| glock info.
| 
| I can't find explanations on what some of the fields mean so I'm
| hoping someone can help me interpret the results and confirm if my
| understanding of the output is correct or offer suggestions on how to
| proceed debugging further when it happens again. So far we can't come
| up with a reproduction scenario.
| 
| I have attached the gfs2_hangalyzer summary output as hangalyzer.txt.
| I have the raw lock data as well if required.
| 
| The tool reports that there are two glocks on which processes are
| waiting but no other process holds them. So it looks like a deadlock,
| since if no process owns them, they should have been released.
| 
| The tool also reports that the two glocks were granted to two process
| IDs.
| 
| This is an excerpt from the hangalyzer output:
| 
| --------------------------------------------
| There are 2 glocks with waiters.
| node1, pid 5380 is waiting for glock 2/85187, but no holder was
| found.
|          The dlm has granted lkb "       2           85187" to pid
|          5021
| 
| 
|                       lkb_id N RemoteID  pid exflg lkbflgs stat gr rq
|   waiting n ln             resource name
| node1  : FS1:  3e00003 2  10c0002 5021     0   10000 grnt  5 -1
|   0 0 24 "       2           85187"
| node1  : FS1:  1501c6a 0        0 5380     0       0 wait -1  3
|   0 0 24 "       2           85187"
| node2  : FS1: G:  s:EX n:2/85187 f:dyq t:EX d:SH/0 l:0 a:0 r:4 m:150
| node2  :                         (pending demote, dirty, holder
| queued)
| node2  : FS1:  I: n:1711/545159 t:8 f:0x10 d:0x00000000 s:957/957
| 
|                         lkb_id N RemoteID  pid exflg lkbflgs stat gr
| rq    waiting n ln             resource name
| node2  : FS1:  10c0002 1  3e00003 5021     0       0 grnt  5 -1
|   0 1 24 "       2           85187"
| --------------------------------------------
| 
| As I understand this, on node1 the resource name "2 85187" is granted
| (grnt) to process 5021 on node2 while process 5380 is in wait mode on
| it.
| At the same time, node2 sees that resource name "2 85187" is granted
| (grnt) to process 5021 on node1.
| On node1, process ID 5021 is [glock_workqueue].
| >From 'ps axl':
| 1     0  5021    67  10  -5      0     0 worker S<   ?          0:07
| [glock_workqueue]
| 
| A similar thing occurs for resource name "2 81523".
| 
| --------------------------------------------
|                         lkb_id N RemoteID  pid exflg lkbflgs stat gr
| rq    waiting n ln             resource name
| node1  : FS1:  2f20002 2  2970001 5021    44   10000 grnt  3 -1
|   0 0 24 "       2           81523"
| node1  : FS1:  3961d2b 0        0 5022     0       0 wait -1  5
|   0 0 24 "       2           81523"
| node2  : FS1: G:  s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100
| node2  :                         (pending demote, holder queued)
| node2  : FS1:  I: n:126/529699 t:4 f:0x10 d:0x00000001 s:3864/3864
| 
|                         lkb_id N RemoteID  pid exflg lkbflgs stat gr
| rq    waiting n ln             resource name
| node2  : FS1:  2970001 1  2f20002 5029    44       0 grnt  3 -1
|   0 1 24 "       2           81523"
| --------------------------------------------
| 
| On node1 the resource "2 81523" is granted to process 5021 on node2,
| while local process 5022 waits on it.
| On node2, the lock appears to be granted to process 5029 from node1.
| On node1, process ID 5029 is [delete_workqueu].
| >From 'ps axl':
| 1     0  5029    67  10  -5      0     0 worker S<   ?          0:00
| [delete_workqueu]
| 
| Is my understanding of this output correct?
| Is there more info I need to try and gather to diagnose the issue
| when
| it happens again?
| 
| --
| Linux-cluster mailing list
| Linux-cluster at redhat.com
| https://www.redhat.com/mailman/listinfo/linux-cluster

Yes, it sounds like you have the basics right.

The question is: what happened to process 5021 and how did it
dequeue the glock without granting it to one of the waiters?
Did process 5021 show up in ps? If so, I'd dump its call trace
to see what it's doing. In RHEL6 that's a bit easier, for example,
cat /proc/5021/stack or some such. In RHEL/Centos 5 you can always
echo t to /proc/sysrq-trigger and check the console, although if you
don't have your post_fail_delay set high enough, it can cause your node
to get fenced during the output.

With a quick glance, I can't really see any critical patches missing
from that kernel, although there are a few possibilities;
a lot of work has been done since that version. Any chance of moving
to RHEL or Centos 6.3? Debugging these kinds of issues is easier with
RHEL6 because we have gfs2 kernel-level tracing and such, which doesn't
exist in the 2.6.18 kernels.

Regards,

Bob Peterson
Red Hat File Systems


From jpokorny at redhat.com  Fri Sep  7 21:17:56 2012
From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=)
Date: Fri, 7 Sep 2012 23:17:56 +0200
Subject: [Linux-cluster] Ricci doesn't work
In-Reply-To: <F6CC886A516DF049A33F32C149ED21C823C358CE@alexandria.innova.local>
References: <20120906183231.GA3521@redhat.com>
	<F6CC886A516DF049A33F32C149ED21C823C358CE@alexandria.innova.local>
Message-ID: <20120907211756.GA22108@redhat.com>

On 06/09/12 18:44 +0000, Chip Burke wrote:
> I only eliminated the second & and that was all it took. My guess is this
> is some edge case where the two &s in the string made it break where as a
> single & did not cause things to escape or what have you.

Oh, I now see what is going on.  Once you were successful with alpha-only
password, the authentication is now based merely on the certificates (no
longer a password is involved, no matter if it gets changed or not).
Hence, the XML-unsafe character in the password has no chance to puzzle
ricci as in the discussed case where the initial password
authentication was yet to be done.

Anyway, the mentioned bugs are here to allow using even these
problematic characters safely.

> All that said, thanks again!

You're welcome.

-- 
Jan


From sonredhen at gmail.com  Sat Sep  8 13:08:19 2012
From: sonredhen at gmail.com (Jason Henderson)
Date: Sat, 8 Sep 2012 09:08:19 -0400
Subject: [Linux-cluster] GFS2/DLM deadlock
In-Reply-To: <206862890.11069476.1347051471398.JavaMail.root@redhat.com>
References: <CAO1Tb_SimOHGV-R4Z3FHyXbRDVioihBp0XjaEZBO4GYPftYgNA@mail.gmail.com>
	<206862890.11069476.1347051471398.JavaMail.root@redhat.com>
Message-ID: <CAO1Tb_TNG13AXe_VMgdJAGc3q2Bu2dnA5HrWNtO5Z=wEx5AiAA@mail.gmail.com>

On Fri, Sep 7, 2012 at 4:57 PM, Bob Peterson <rpeterso at redhat.com> wrote:

> The question is: what happened to process 5021 and how did it
> dequeue the glock without granting it to one of the waiters?
> Did process 5021 show up in ps?

The clusters are in a production environment so we have to minimize
the downtime in these hung states. Right now the task of gathering
this debug info is automated and at the end of the capture we reset
the passive node to bring the cluster back into service so it is no
longer in that state.

> If so, I'd dump its call trace
> to see what it's doing. In RHEL6 that's a bit easier, for example,
> cat /proc/5021/stack or some such. In RHEL/Centos 5 you can always
> echo t to /proc/sysrq-trigger and check the console, although if you
> don't have your post_fail_delay set high enough, it can cause your node
> to get fenced during the output.

We will attempt to capture the call trace of the deadlocked processes
during the next hang. Thanks for the suggestion. We will also try and
trace the inode number back to the actual files on the system
involved.

A question on the inode numbers in the hangalyzer output.

In the glock dump for node2 you have these lines:
G:  s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100
    I: n:126/529699 t:4 f:0x10 d:0x00000001 s:3864/3864

>From docs I've read I understand that the glock field 'n:2/81523'
tells me that 81523 is the inode number in hex (if the type is 2 or
5).
What are the fields in the inode line following the glock mean (at
least the n: field)?

>
> With a quick glance, I can't really see any critical patches missing
> from that kernel, although there are a few possibilities;
> a lot of work has been done since that version.

Yes, I'm pushing to have the clusters upgraded to the latest 5.8
kernel to rule out the possibility that there is a fix included in
there.

> Any chance of moving
> to RHEL or Centos 6.3? Debugging these kinds of issues is easier with
> RHEL6 because we have gfs2 kernel-level tracing and such, which doesn't
> exist in the 2.6.18 kernels.

We can't move the production clusters to 6.3 because other product
integration issues prevent that.
Would I need more than the updated kernel in 6.3 to get the extra
tracing? Perhaps we could compile the updated 6.3 kernel for the 5.8
release? There have been a lot of kernel build changes so I don't know
if that is even possible at this point.

Thank you for the input. We want to be able to gather enough info to
submit a bug report, if it turns out to be that, so the suggestions on
what else to capture are very valuable. FYI, we only have self support
licenses from RedHat at this point which is why we have not engaged
RedHat support directly yet, but we are highly motivated to find the
problem.

Jason

>
> Regards,
>
> Bob Peterson
> Red Hat File Systems


From rpeterso at redhat.com  Sat Sep  8 13:38:42 2012
From: rpeterso at redhat.com (Bob Peterson)
Date: Sat, 8 Sep 2012 09:38:42 -0400 (EDT)
Subject: [Linux-cluster] GFS2/DLM deadlock
In-Reply-To: <CAO1Tb_TNG13AXe_VMgdJAGc3q2Bu2dnA5HrWNtO5Z=wEx5AiAA@mail.gmail.com>
Message-ID: <1193164196.11265878.1347111522560.JavaMail.root@redhat.com>

----- Original Message -----
| A question on the inode numbers in the hangalyzer output.
| 
| In the glock dump for node2 you have these lines:
| G:  s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100
|     I: n:126/529699 t:4 f:0x10 d:0x00000001 s:3864/3864
| 
| >From docs I've read I understand that the glock field 'n:2/81523'
| tells me that 81523 is the inode number in hex (if the type is 2 or
| 5).
| What are the fields in the inode line following the glock mean (at
| least the n: field)?

The numbers after the n: are the glock identifier. It consists of
the glock type (2 for inode, 3 for rgrp, 5 for i_open, and a bunch
of special ones) followed by "/" followed by the inode number
(disk inode's block address) in hex.

After f: are the glock flags. For example, "q" means the glock is
queued. There are a bunch of flags with a bunch of meanings.

t: is the target glock state; the lock state it's trying to achieve.
SH is for a shared lock, EX is exclusive, UN is unlocked, etc.
d: is the demote glock state; the lock state it needs to transition
to when the lock is demoted. In this case, demote to UNlocked.
The number after the slash is the demote time.
a: is active items count, or the number of "live" buffers to be written.
r: is the revoke count, or the number of journal items needing to be
   revoked due to delete, etc.
m: is the minimum hold time for the glock, in milliseconds.

On the next line, I: indicates this glock is for an inode.
n:126 is a formal inode number (can be ignored). The number after the
slash, 529699, is the inode disk address in decimal.
t: is the mode, f: are the inode flags, d: are the disk flags, and
s: is the inode's size in decimal. Before the slash is the size stored 
in one of our internal structures. After the slash is the size
according to the vfs inode. In almost all cases they should be the same.

Note that the format of these fields, the flags, and everything
differs from release to release. For example, newer versions of GFS2
don't have two different numbers for inode size.

| We can't move the production clusters to 6.3 because other product
| integration issues prevent that.
| Would I need more than the updated kernel in 6.3 to get the extra
| tracing? Perhaps we could compile the updated 6.3 kernel for the 5.8
| release? There have been a lot of kernel build changes so I don't
| know
| if that is even possible at this point.

I don't think it's possible. There are too many interdependencies.

| Thank you for the input. We want to be able to gather enough info to
| submit a bug report, if it turns out to be that, so the suggestions
| on
| what else to capture are very valuable. FYI, we only have self
| support
| licenses from RedHat at this point which is why we have not engaged
| RedHat support directly yet, but we are highly motivated to find the
| problem.
| 
| Jason

Regards,

Bob Peterson
Red Hat File Systems


From sonredhen at gmail.com  Sun Sep  9 21:31:58 2012
From: sonredhen at gmail.com (Jason Henderson)
Date: Sun, 9 Sep 2012 17:31:58 -0400
Subject: [Linux-cluster] GFS2/DLM deadlock
In-Reply-To: <1193164196.11265878.1347111522560.JavaMail.root@redhat.com>
References: <CAO1Tb_TNG13AXe_VMgdJAGc3q2Bu2dnA5HrWNtO5Z=wEx5AiAA@mail.gmail.com>
	<1193164196.11265878.1347111522560.JavaMail.root@redhat.com>
Message-ID: <CAO1Tb_Syv2CSZ3Jp8Ouo0Hv4HZ=jDEwNKhDGGBEH7bKBAVXzUw@mail.gmail.com>

On Sep 8, 2012 9:44 AM, "Bob Peterson" <rpeterso at redhat.com> wrote:
>
> ----- Original Message -----
> | A question on the inode numbers in the hangalyzer output.
> |
> | In the glock dump for node2 you have these lines:
> | G:  s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100
> |     I: n:126/529699 t:4 f:0x10 d:0x00000001 s:3864/3864
> |
> | >From docs I've read I understand that the glock field 'n:2/81523'
> | tells me that 81523 is the inode number in hex (if then type is 2 or
> | 5).
> | What are the fields in the inode line following the glock mean (at
> | least the n: field)?
>
> The numbers after the n: are the glock identifier. It consists of
> the glock type (2 for inode, 3 for rgrp, 5 for i_open, and a bunch
> of special ones) followed by "/" followed by the inode number
> (disk inode's block address) in hex.
>
> After f: are the glock flags. For example, "q" means the glock is
> queued. There are a bunch of flags with a bunch of meanings.
>
> t: is the target glock state; the lock state it's trying to achieve.
> SH is for a shared lock, EX is exclusive, UN is unlocked, etc.
> d: is the demote glock state; the lock state it needs to transition
> to when the lock is demoted. In this case, demote to UNlocked.
> The number after the slash is the demote time.
> a: is active items count, or the number of "live" buffers to be written.
> r: is the revoke count, or the number of journal items needing to be
>    revoked due to delete, etc.
> m: is the minimum hold time for the glock, in milliseconds.
>
> On the next line, I: indicates this glock is for an inode.
> n:126 is a formal inode number (can be ignored). The number after the
> slash, 529699, is the inode disk address in decimal.
> t: is the mode, f: are the inode flags, d: are the disk flags, and
> s: is the inode's size in decimal. Before the slash is the size stored
> in one of our internal structures. After the slash is the size
> according to the vfs inode. In almost all cases they should be the same.
>
> Note that the format of these fields, the flags, and everything
> differs from release to release. For example, newer versions of GFS2
> don't have two different numbers for inode size.
>

I'm not clear on the two different inode numbers in the two lines above:
Which n: number do I use to locate the file the lock is for? The one in the
glock line or the one in the I: line? From RedHat docs I have read, I
should convert the 81523 (from the '2/ 81523') to decimal, which is  533795,
and then use 'find -inum 533795' to locate the file after the filesystem
has been unfrozen.

I guess my confusion is the definition of a "disk inode's block address"
versus an "inode disk address". Could you clarify the difference for me?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120909/d011fafd/attachment.htm>

From swhiteho at redhat.com  Mon Sep 10 10:50:13 2012
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 10 Sep 2012 11:50:13 +0100
Subject: [Linux-cluster] GFS2/DLM deadlock
In-Reply-To: <CAO1Tb_Syv2CSZ3Jp8Ouo0Hv4HZ=jDEwNKhDGGBEH7bKBAVXzUw@mail.gmail.com>
References: <CAO1Tb_TNG13AXe_VMgdJAGc3q2Bu2dnA5HrWNtO5Z=wEx5AiAA@mail.gmail.com>
	<1193164196.11265878.1347111522560.JavaMail.root@redhat.com>
	<CAO1Tb_Syv2CSZ3Jp8Ouo0Hv4HZ=jDEwNKhDGGBEH7bKBAVXzUw@mail.gmail.com>
Message-ID: <1347274213.2709.9.camel@menhir>

Hi,

On Sun, 2012-09-09 at 17:31 -0400, Jason Henderson wrote:
> 
> On Sep 8, 2012 9:44 AM, "Bob Peterson" <rpeterso at redhat.com> wrote:
> >
> > ----- Original Message -----
> > | A question on the inode numbers in the hangalyzer output.
> > |
> > | In the glock dump for node2 you have these lines:
> > | G:  s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100
> > |     I: n:126/529699 t:4 f:0x10 d:0x00000001 s:3864/3864
> > |
> > | >From docs I've read I understand that the glock field 'n:2/81523'
> > | tells me that 81523 is the inode number in hex (if then type is 2
> or
> > | 5).
> > | What are the fields in the inode line following the glock mean (at
> > | least the n: field)?
> >
> > The numbers after the n: are the glock identifier. It consists of
> > the glock type (2 for inode, 3 for rgrp, 5 for i_open, and a bunch
> > of special ones) followed by "/" followed by the inode number
> > (disk inode's block address) in hex.
> >
> > After f: are the glock flags. For example, "q" means the glock is
> > queued. There are a bunch of flags with a bunch of meanings.
> >
> > t: is the target glock state; the lock state it's trying to achieve.
> > SH is for a shared lock, EX is exclusive, UN is unlocked, etc.
> > d: is the demote glock state; the lock state it needs to transition
> > to when the lock is demoted. In this case, demote to UNlocked.
> > The number after the slash is the demote time.
> > a: is active items count, or the number of "live" buffers to be
> written.
> > r: is the revoke count, or the number of journal items needing to be
> >    revoked due to delete, etc.
> > m: is the minimum hold time for the glock, in milliseconds.
> >
> > On the next line, I: indicates this glock is for an inode.
> > n:126 is a formal inode number (can be ignored). The number after
> the
> > slash, 529699, is the inode disk address in decimal.
> > t: is the mode, f: are the inode flags, d: are the disk flags, and
> > s: is the inode's size in decimal. Before the slash is the size
> stored
> > in one of our internal structures. After the slash is the size
> > according to the vfs inode. In almost all cases they should be the
> same.
> >
> > Note that the format of these fields, the flags, and everything
> > differs from release to release. For example, newer versions of GFS2
> > don't have two different numbers for inode size.
> >
> 
> I'm not clear on the two different inode numbers in the two lines
> above: Which n: number do I use to locate the file the lock is for?
> The one in the glock line or the one in the I: line? From RedHat docs
> I have read, I should convert the 81523 (from the '2/ 81523') to
> decimal, which is  533795, and then use 'find -inum 533795' to locate
> the file after the filesystem has been unfrozen. 
> 
> 
> I guess my confusion is the definition of a "disk inode's block
> address" versus an "inode disk address". Could you clarify the
> difference for me?
> 
In GFS2 the inode number is the same as the disk block address of the
inode since we have exactly one inode per block, so these things are
both the same,

Steve.


From swhiteho at redhat.com  Mon Sep 10 10:51:06 2012
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 10 Sep 2012 11:51:06 +0100
Subject: [Linux-cluster] GFS2 blocking on one node
In-Reply-To: <F6CC886A516DF049A33F32C149ED21C823C35C1B@alexandria.innova.local>
References: <F6CC886A516DF049A33F32C149ED21C823C35C1B@alexandria.innova.local>
Message-ID: <1347274266.2709.10.camel@menhir>

Hi,

On Fri, 2012-09-07 at 18:38 +0000, Chip Burke wrote:
> My problem is that on a single node of the cluster I can mount a GFS2
> volume, however as soon as I try to write to the volume, access to
> GFS2 freezes on all nodes (Simple ls hangs even). The hang finally
> clears up with the original two nodes regaining access, but the third
> node vomits errors all over (See below from syslog). The other two
> nodes can read and write to GFS2 just fine until this node joins the
> cluster and tries to write. In my digging I found this thread:
> 
> 
> https://www.redhat.com/archives/linux-cluster/2012-August/msg00142.html
> 
> 
> But that sounds different enough that I don't think it is my issue.
> 
> 
> 
> 
> Syslog bits from one of the "good" nodes:
> 
> 
> 
This seems to be missing the syslog info,

Steve.


From rpeterso at redhat.com  Mon Sep 10 12:21:04 2012
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 10 Sep 2012 08:21:04 -0400 (EDT)
Subject: [Linux-cluster] GFS2/DLM deadlock
In-Reply-To: <CAO1Tb_Syv2CSZ3Jp8Ouo0Hv4HZ=jDEwNKhDGGBEH7bKBAVXzUw@mail.gmail.com>
Message-ID: <991832595.11722255.1347279664979.JavaMail.root@redhat.com>

----- Original Message -----
| I'm not clear on the two different inode numbers in the two lines
| above:
| Which n: number do I use to locate the file the lock is for? The one
| in the
| glock line or the one in the I: line? From RedHat docs I have read, I
| should convert the 81523 (from the '2/ 81523') to decimal, which is
|  533795,
| and then use 'find -inum 533795' to locate the file after the
| filesystem
| has been unfrozen.
| 
| I guess my confusion is the definition of a "disk inode's block
| address"
| versus an "inode disk address". Could you clarify the difference for
| me?

Hi,

Technically, there are two inode numbers in GFS2: The "formal inode number"
and the "block number" or "block address". You need only ever concern
yourself with the latter. The first number is not used for anything
(except possibly in NFS situations). In your example, the device block
address is 0x81523, which is 529699 (not 533795 which is 0x82523) decimal.
This is the value that "stat <file>" will return as "Inode:", and it's the
same value that is used by "find -inum <X>". Because we don't typically
use or care about the first number, we often use the terms "block address"
and "inode number" interchangeably, since really they're the same thing
in GFS2.

Regards,

Bob Peterson
Red Hat File Systems


From jason_henderson at mitel.com  Mon Sep 10 13:04:01 2012
From: jason_henderson at mitel.com (Henderson, Jason)
Date: Mon, 10 Sep 2012 09:04:01 -0400
Subject: [Linux-cluster] GFS2/DLM deadlock
In-Reply-To: <991832595.11722255.1347279664979.JavaMail.root@redhat.com>
References: <CAO1Tb_Syv2CSZ3Jp8Ouo0Hv4HZ=jDEwNKhDGGBEH7bKBAVXzUw@mail.gmail.com>
	<991832595.11722255.1347279664979.JavaMail.root@redhat.com>
Message-ID: <CAMROhoZRZUFw-3tQz9zKW7y6+auVG94my272G8yr7J=J_SP6EQ@mail.gmail.com>

On Mon, Sep 10, 2012 at 8:21 AM, Bob Peterson <rpeterso at redhat.com> wrote:
>
> ----- Original Message -----
> | I'm not clear on the two different inode numbers in the two lines
> | above:
> | Which n: number do I use to locate the file the lock is for? The one
> | in the
> | glock line or the one in the I: line? From RedHat docs I have read, I
> | should convert the 81523 (from the '2/ 81523') to decimal, which is
> |  533795,
> | and then use 'find -inum 533795' to locate the file after the
> | filesystem
> | has been unfrozen.
> |
> | I guess my confusion is the definition of a "disk inode's block
> | address"
> | versus an "inode disk address". Could you clarify the difference for
> | me?
>
> Hi,
>
> Technically, there are two inode numbers in GFS2: The "formal inode
> number"
> and the "block number" or "block address". You need only ever concern
> yourself with the latter. The first number is not used for anything
> (except possibly in NFS situations). In your example, the device block
> address is 0x81523, which is 529699 (not 533795 which is 0x82523) decimal.
> This is the value that "stat <file>" will return as "Inode:", and it's the
> same value that is used by "find -inum <X>". Because we don't typically
> use or care about the first number, we often use the terms "block address"
> and "inode number" interchangeably, since really they're the same thing
> in GFS2.
>

Thanks for the explanation and apologies for my bonehead math error.
Not sure why I was converting the wrong hex number. I guess I
shouldn't work on the weekends.  :-)

I will report back the call trace of the hung process when we
reproduce the issue.

-- 
This e-mail (including any attachments) is for the sole use of the intended 
recipient(s) and may contain information that is confidential and/or 
protected by legal privilege. Any unauthorized review, use, copy, 
disclosure or distribution of this e-mail is strictly prohibited. If you 
are not the intended recipient, please notify Mitel immediately and destroy 
all copies of this e-mail.  Mitel does not accept any liability for breach 
of security, error or virus that may result from the transmission of this 
message.


From dr at nevernet.com  Mon Sep 10 15:47:32 2012
From: dr at nevernet.com (david robertson)
Date: Mon, 10 Sep 2012 11:47:32 -0400
Subject: [Linux-cluster] GFS2 - disable plock limit on a live cluster
Message-ID: <CAELz04tn=Zn7VFyPVuO_NhG7r2F1yK-XP_Wj69F3LnDKcV7K=g@mail.gmail.com>

Hello, I have a cluster of 11 hosts all sharing the same volume.
Currently, the default plock limit of 100 is enabled.  The catch is,
this is a live cluster, and it can not be taken offline.

My question is, can I modify the cluster.conf file with the usual
lines (dlm plock_ownership=1, plock_rate_limit=0, etc) and just do
"ccs_tool update" and be done with it, or does each node have to be
rebooted?


From CBurke at innova-partners.com  Mon Sep 10 21:54:53 2012
From: CBurke at innova-partners.com (Chip Burke)
Date: Mon, 10 Sep 2012 21:54:53 +0000
Subject: [Linux-cluster] Ricci doesn't work
In-Reply-To: <20120907211756.GA22108@redhat.com>
Message-ID: <F6CC886A516DF049A33F32C149ED21C823C3667E@alexandria.innova.local>

Well, after a few days of testing it seems luci/ricci is still flakey. If
I update things via luci, the cluster.conf on the local machine with luci
updates, but nothing pushes out to other nodes. If I then manually run
ccs_sync on the node with the new configuration, things push out to the
other nodes fine. While this is wonky, at least it is consistent and
repeatable so I can do things in this manner until fixes are in. Though,
certainly let me know if you want any more logs or whatever from me.
________________________________________
Chip Burke


From td3201 at gmail.com  Tue Sep 11 01:27:06 2012
From: td3201 at gmail.com (Terry)
Date: Mon, 10 Sep 2012 20:27:06 -0500
Subject: [Linux-cluster] fencing for no reason that I can see
Message-ID: <CAHSRzpB5HqVkpAhodMK-sow=v390mmySGOZL3gZxLDC=kFJNYg@mail.gmail.com>

Hello,

I have seen this a few times where one node stops seeing the other
node for some unknown reason and fences it.  Any idea how I can debug
this?  Here's from the node doing the fencing:


Sep 10 19:01:23 omadvnfs01a corosync[10371]:   [TOTEM ] A processor
failed, forming new configuration.
Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [QUORUM] Members[1]: 1
Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [TOTEM ] A processor
joined or left the membership and a new membership was formed.
Sep 10 19:01:25 omadvnfs01a rgmanager[10692]: State change:
omadvnfs01b.sec.jel.lc DOWN
Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [CPG   ] chosen
downlist: sender r(0) ip(10.198.1.110) ; members(old:2 left:1)
Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [MAIN  ] Completed
service synchronization, ready to provide service.
Sep 10 19:01:25 omadvnfs01a fenced[10427]: fencing node omadvnfs01b.sec.jel.lc


And here is from the fenced node:

Sep 10 17:09:27 omadvnfs01b rpc.idmapd[6126]: nfsdcb:
read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of
File)
Sep 10 17:14:47 omadvnfs01b rpc.idmapd[6125]: nfsdcb:
read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of
File)
Sep 10 19:04:44 omadvnfs01b kernel: imklog 5.8.10, log source =
/proc/kmsg started.
Sep 10 19:04:44 omadvnfs01b rsyslogd: [origin software="rsyslogd"
swVersion="5.8.10" x-pid="2379" x-info="http://www.rsyslog.com"] start


I did notice that they were about 40 seconds off in time.  I just
fixed that but what else can I look for here.  Our monitoring started
noticing things at 19:02:30 that the fenced node was off the grid
which is a little after it was fenced.  What test is performed to see
if the other node is up?  How many times does it try?

Thanks!


From td3201 at gmail.com  Tue Sep 11 02:08:37 2012
From: td3201 at gmail.com (Terry)
Date: Mon, 10 Sep 2012 21:08:37 -0500
Subject: [Linux-cluster] fencing for no reason that I can see
In-Reply-To: <CAHSRzpB5HqVkpAhodMK-sow=v390mmySGOZL3gZxLDC=kFJNYg@mail.gmail.com>
References: <CAHSRzpB5HqVkpAhodMK-sow=v390mmySGOZL3gZxLDC=kFJNYg@mail.gmail.com>
Message-ID: <CAHSRzpCU7aVyHKMURhEGwZswjK+j1QwGDhHiw0wTR3OJgQgXhQ@mail.gmail.com>

On Mon, Sep 10, 2012 at 8:27 PM, Terry <td3201 at gmail.com> wrote:
> Hello,
>
> I have seen this a few times where one node stops seeing the other
> node for some unknown reason and fences it.  Any idea how I can debug
> this?  Here's from the node doing the fencing:
>
>
> Sep 10 19:01:23 omadvnfs01a corosync[10371]:   [TOTEM ] A processor
> failed, forming new configuration.
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [QUORUM] Members[1]: 1
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [TOTEM ] A processor
> joined or left the membership and a new membership was formed.
> Sep 10 19:01:25 omadvnfs01a rgmanager[10692]: State change:
> omadvnfs01b.sec.jel.lc DOWN
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [CPG   ] chosen
> downlist: sender r(0) ip(10.198.1.110) ; members(old:2 left:1)
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [MAIN  ] Completed
> service synchronization, ready to provide service.
> Sep 10 19:01:25 omadvnfs01a fenced[10427]: fencing node omadvnfs01b.sec.jel.lc
>
>
> And here is from the fenced node:
>
> Sep 10 17:09:27 omadvnfs01b rpc.idmapd[6126]: nfsdcb:
> read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of
> File)
> Sep 10 17:14:47 omadvnfs01b rpc.idmapd[6125]: nfsdcb:
> read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of
> File)
> Sep 10 19:04:44 omadvnfs01b kernel: imklog 5.8.10, log source =
> /proc/kmsg started.
> Sep 10 19:04:44 omadvnfs01b rsyslogd: [origin software="rsyslogd"
> swVersion="5.8.10" x-pid="2379" x-info="http://www.rsyslog.com"] start
>
>
> I did notice that they were about 40 seconds off in time.  I just
> fixed that but what else can I look for here.  Our monitoring started
> noticing things at 19:02:30 that the fenced node was off the grid
> which is a little after it was fenced.  What test is performed to see
> if the other node is up?  How many times does it try?
>
> Thanks!

I guess I should have read the docs more thoroughly.  Right from RHEL
6 cluster guide:
Ensure that exotic bond modes and VLAN tagging are not in use on
interfaces that the cluster uses for inter-node communication.

I am using a 3 interface 802.3ad link aggregate on the production
network.  I could either use an iscsi interface or split one of the
three bond slave interfaces out and dedicate it to inter-node traffic.
 I was also looking into a potential multicast issue but I believe my
switches support it fine (Foundry FLS).  I wouldnt think it would be
intermittent like this.  Anyone have any other thoughts?


From jeff.sturm at eprize.com  Tue Sep 11 14:40:06 2012
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Tue, 11 Sep 2012 14:40:06 +0000
Subject: [Linux-cluster] fencing for no reason that I can see
In-Reply-To: <CAHSRzpCU7aVyHKMURhEGwZswjK+j1QwGDhHiw0wTR3OJgQgXhQ@mail.gmail.com>
References: <CAHSRzpB5HqVkpAhodMK-sow=v390mmySGOZL3gZxLDC=kFJNYg@mail.gmail.com>
	<CAHSRzpCU7aVyHKMURhEGwZswjK+j1QwGDhHiw0wTR3OJgQgXhQ@mail.gmail.com>
Message-ID: <B1B9801C5CBC954680D0374CC4EEABA536868DC3@MailNode2.eprize.local>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Terry
> Sent: Monday, September 10, 2012 10:09 PM
> 
> I am using a 3 interface 802.3ad link aggregate on the production network.  I could
> either use an iscsi interface or split one of the three bond slave interfaces out and
> dedicate it to inter-node traffic.
>  I was also looking into a potential multicast issue but I believe my switches support it
> fine (Foundry FLS).  I wouldnt think it would be intermittent like this.  Anyone have any
> other thoughts?

Could be many things, truthfully.  In our experiences VLAN tagging hasn't caused any problem, but I'd certainly heed the warnings about "exotic bond modes".  We've found that simpler is better (i.e. more reliable) when it comes to networks.  Bonding mode 1 (active-passive) usually recovers fast enough to prevent loss of a cluster node.  When splitting active-passive interfaces over multiple independent switches, we've been able to down a switch administratively (for updates, etc.) without losing the cluster.

Make sure spanning-tree is either off or using RSTP everywhere.  The default STP forwarding delays are long enough to crash a cluster.  (We learned this the hard way.)

The other big problem we had turned out to be a firmware defect on the switch, so you can't rule that out either.  If there is any weakness in your network, RHCS is good at finding it!  I won't name the guilty vendor here, other than to say we've found Juniper gear works very well.  (Never tried Foundry.)

-Jeff


From heiko.nardmann at itechnical.de  Tue Sep 11 15:17:37 2012
From: heiko.nardmann at itechnical.de (Heiko Nardmann)
Date: Tue, 11 Sep 2012 17:17:37 +0200
Subject: [Linux-cluster] fencing for no reason that I can see
In-Reply-To: <CAHSRzpB5HqVkpAhodMK-sow=v390mmySGOZL3gZxLDC=kFJNYg@mail.gmail.com>
References: <CAHSRzpB5HqVkpAhodMK-sow=v390mmySGOZL3gZxLDC=kFJNYg@mail.gmail.com>
Message-ID: <504F5611.6080809@itechnical.de>

Hi,

I had similar problems. The problem turned out to be that the firmware 
for the Broadcom NICs inside of our Dell R610 has been obsolete resp. 
buggy. So depending on your hardware please have the vendor check your 
firmware/BIOS/... versions - might help ...


Kind regards,

     Heiko

Am 11.09.2012 03:27, schrieb Terry:
> Hello,
>
> I have seen this a few times where one node stops seeing the other
> node for some unknown reason and fences it.  Any idea how I can debug
> this?  Here's from the node doing the fencing:
>
>
> Sep 10 19:01:23 omadvnfs01a corosync[10371]:   [TOTEM ] A processor
> failed, forming new configuration.
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [QUORUM] Members[1]: 1
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [TOTEM ] A processor
> joined or left the membership and a new membership was formed.
> Sep 10 19:01:25 omadvnfs01a rgmanager[10692]: State change:
> omadvnfs01b.sec.jel.lc DOWN
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [CPG   ] chosen
> downlist: sender r(0) ip(10.198.1.110) ; members(old:2 left:1)
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [MAIN  ] Completed
> service synchronization, ready to provide service.
> Sep 10 19:01:25 omadvnfs01a fenced[10427]: fencing node omadvnfs01b.sec.jel.lc
>
>
> And here is from the fenced node:
>
> Sep 10 17:09:27 omadvnfs01b rpc.idmapd[6126]: nfsdcb:
> read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of
> File)
> Sep 10 17:14:47 omadvnfs01b rpc.idmapd[6125]: nfsdcb:
> read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of
> File)
> Sep 10 19:04:44 omadvnfs01b kernel: imklog 5.8.10, log source =
> /proc/kmsg started.
> Sep 10 19:04:44 omadvnfs01b rsyslogd: [origin software="rsyslogd"
> swVersion="5.8.10" x-pid="2379" x-info="http://www.rsyslog.com"] start
>
>
> I did notice that they were about 40 seconds off in time.  I just
> fixed that but what else can I look for here.  Our monitoring started
> noticing things at 19:02:30 that the fenced node was off the grid
> which is a little after it was fenced.  What test is performed to see
> if the other node is up?  How many times does it try?
>
> Thanks!
>
>


From bentech4you at gmail.com  Wed Sep 12 11:20:45 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Wed, 12 Sep 2012 14:20:45 +0300
Subject: [Linux-cluster] NFS Cluster with multiple filesystem
Message-ID: <CA+C_GOXbGC84L2uhn3icHn5zUqZkMCVJ3HiX9mMZ6HQEJMRY_A@mail.gmail.com>

HI

i have 2 RHEL 6 machine and i installed HA packages..My current stage is i
joined these 2 as cluster by using LUCI

i need to create four NFS cluster file system.

please help me configure NFS HA

thanks & Regards,
Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120912/ac2e242d/attachment.htm>

From jpokorny at redhat.com  Wed Sep 12 12:00:09 2012
From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=)
Date: Wed, 12 Sep 2012 14:00:09 +0200
Subject: [Linux-cluster] Ricci doesn't work
In-Reply-To: <F6CC886A516DF049A33F32C149ED21C823C3667E@alexandria.innova.local>
References: <20120907211756.GA22108@redhat.com>
	<F6CC886A516DF049A33F32C149ED21C823C3667E@alexandria.innova.local>
Message-ID: <20120912115703.GF22108@redhat.com>

On 10/09/12 21:54 +0000, Chip Burke wrote:
> Well, after a few days of testing it seems luci/ricci is still flakey.

At first, there seems to be more than "XML entities" problem involved.
This is because luci, unlike ccs and ccs_sync ("cman_tool version" uses
ccs_sync under the hood) should not suffer from that one.
So probably yet another separate issue...

> If I update things via luci, the cluster.conf on the local machine with
> luci updates, but nothing pushes out to other nodes. If I then manually run
> ccs_sync on the node with the new configuration, things push out to the
> other nodes fine. While this is wonky, at least it is consistent and
> repeatable so I can do things in this manner until fixes are in. Though,
> certainly let me know if you want any more logs or whatever from me.

To get a better error message in luci.log, something to start with, could
you apply a workaround for that "no translator" issue, please?

The recipe, based on the real patch to fix the mentioned bug, is as
follows (as root on the host with luci installed):

---

pushd $(rpm --eval %python_sitearch)
cat <<EOF | patch --fuzz=3 -b .std
diff --git a/luci/lib/ricci_helpers.py b/luci/lib/ricci_helpers.py
--- a/luci/lib/ricci_helpers.py
+++ b/luci/lib/ricci_helpers.py
@@ -7,6 +7,9 @@
 
 import threading
 
+import pylons
+from pylons.i18n.translation import _get_translator
+
 from luci.lib.helpers import ugettext as _
 
 from luci.model import DBSession
@@ -29,6 +33,14 @@ class PWorker(threading.Thread):
         self.cluster_members_only = cluster_members_only
 
     def run(self):
+        # see http://comments.gmane.org/gmane.comp.web.turbogears/46896
+        # this is stolen from the pylons test setup;
+        # it will make sure the gettext-stuff is working, that is
+        # we inject translator object to this private thread similarly
+        # as it is done by the framework in per-request threads
+        translator = _get_translator(None)
+        pylons.translator._push_object(translator)
+
         while True:
             self.mutex.acquire()
             if len(self.triples) == 0:
@@ -67,6 +79,8 @@ class PWorker(threading.Thread):
             self.ret[triple[0][0]] = r
             self.mutex.release()
 
+        pylons.translator._pop_object()
+
 def send_batch_parallel(triples, max_threads, cluster_members_only=False):
     mutex = threading.RLock()
     threads = list()
EOF
popd

---

If luci.log still does not provide any better information, please run this
sed one-liner (once the luci was started at least once so the respective
file is assuredly in place; luci restart needed afterwards):

---
sed -i.std "/logger_luci/bone;b;:one s|INFO|DEBUG|;ttwo;n;bone;:two n;btwo" \
  /var/lib/luci/etc/luci.ini
---


Thanks,
Jan


From jpokorny at redhat.com  Wed Sep 12 12:33:15 2012
From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=)
Date: Wed, 12 Sep 2012 14:33:15 +0200
Subject: [Linux-cluster] Ricci doesn't work
In-Reply-To: <20120912115703.GF22108@redhat.com>
References: <20120907211756.GA22108@redhat.com>
	<F6CC886A516DF049A33F32C149ED21C823C3667E@alexandria.innova.local>
	<20120912115703.GF22108@redhat.com>
Message-ID: <20120912123315.GG22108@redhat.com>

On 12/09/12 14:00 +0200, Jan Pokorn? wrote:
> cat <<EOF | patch --fuzz=3 -b .std

Sorry, this line should read:

cat <<EOF | patch --fuzz=3 -b -z.std -p1

-- 
Jan


From bentech4you at gmail.com  Wed Sep 12 12:56:45 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Wed, 12 Sep 2012 15:56:45 +0300
Subject: [Linux-cluster] cluster failed to start
Message-ID: <CA+C_GOVRZ3ByyyJX2j=+P7bjF1MkJ3VGHr9287EP3RPYA1kfVg@mail.gmail.com>

Hi

i created a RHEL 6 cluster and i joined both nodes..i created one IP as
resource and select that ip under services..

when i try to start that cluster, it's failing without any error

please help me

regards,
Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120912/febfa92d/attachment.htm>

From jpokorny at redhat.com  Wed Sep 12 13:27:14 2012
From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=)
Date: Wed, 12 Sep 2012 15:27:14 +0200
Subject: [Linux-cluster] cluster failed to start
In-Reply-To: <CA+C_GOVRZ3ByyyJX2j=+P7bjF1MkJ3VGHr9287EP3RPYA1kfVg@mail.gmail.com>
References: <CA+C_GOVRZ3ByyyJX2j=+P7bjF1MkJ3VGHr9287EP3RPYA1kfVg@mail.gmail.com>
Message-ID: <20120912132714.GI22108@redhat.com>

On 12/09/12 15:56 +0300, Ben .T.George wrote:
> i created a RHEL 6 cluster and i joined both nodes..i created one IP as
> resource and select that ip under services..

It looks like you are using luci to manage the cluster, right?

> when i try to start that cluster, it's failing without any error

What exactly does "start that cluster" mean here?
If you use "service start cman", at least the failing stage is
indicated.

If you could be more specific about problem and perhaps about
the version of packages you use, your chances of getting the issue
solved here will raise significantly ;-)

-- 
Jan


From dr at nevernet.com  Wed Sep 12 13:34:32 2012
From: dr at nevernet.com (david robertson)
Date: Wed, 12 Sep 2012 09:34:32 -0400
Subject: [Linux-cluster] GFS2 - disable plock limit on a live cluster
In-Reply-To: <CAELz04tn=Zn7VFyPVuO_NhG7r2F1yK-XP_Wj69F3LnDKcV7K=g@mail.gmail.com>
References: <CAELz04tn=Zn7VFyPVuO_NhG7r2F1yK-XP_Wj69F3LnDKcV7K=g@mail.gmail.com>
Message-ID: <CAELz04sUnrxcyoahgRjPYCZCLDEypnf5g7PkMfRjm7t6Hrk5CA@mail.gmail.com>

Anyone have any ideas?  I basically just need to know if a
plock_rate_limit change will take place immediately, or if a restart
is needed.


On Mon, Sep 10, 2012 at 11:47 AM, david robertson <dr at nevernet.com> wrote:
> Hello, I have a cluster of 11 hosts all sharing the same volume.
> Currently, the default plock limit of 100 is enabled.  The catch is,
> this is a live cluster, and it can not be taken offline.
>
> My question is, can I modify the cluster.conf file with the usual
> lines (dlm plock_ownership=1, plock_rate_limit=0, etc) and just do
> "ccs_tool update" and be done with it, or does each node have to be
> rebooted?


From bentech4you at gmail.com  Wed Sep 12 13:39:55 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Wed, 12 Sep 2012 16:39:55 +0300
Subject: [Linux-cluster] cluster failed to start
In-Reply-To: <20120912132714.GI22108@redhat.com>
References: <CA+C_GOVRZ3ByyyJX2j=+P7bjF1MkJ3VGHr9287EP3RPYA1kfVg@mail.gmail.com>
	<20120912132714.GI22108@redhat.com>
Message-ID: <CA+C_GOWtOwRSsWF4AqMoQawvS3a+aDqZKpGscLn4SkzkV3pqOw@mail.gmail.com>

HI

Thanks for your reply.

i created 2 node cluster with RHEL6 by using redhat cluster suite.

i joined cluster nodes by using LUCI

i created one IP as resource and source as that IP i created.

i started cluster , but on luci status showing like disabled.but that IP is
pining and it is added on node2

ip addr is showing that ip.

#clustat  is showing both nodes online.

actually my whole idea is to make an NFS cluster.

please help me resolve this.I tried to start this IP on Node1, that also
failing


Regads,
Ben

On Wed, Sep 12, 2012 at 4:27 PM, Jan Pokorn? <jpokorny at redhat.com> wrote:

> On 12/09/12 15:56 +0300, Ben .T.George wrote:
> > i created a RHEL 6 cluster and i joined both nodes..i created one IP as
> > resource and select that ip under services..
>
> It looks like you are using luci to manage the cluster, right?
>
> > when i try to start that cluster, it's failing without any error
>
> What exactly does "start that cluster" mean here?
> If you use "service start cman", at least the failing stage is
> indicated.
>
> If you could be more specific about problem and perhaps about
> the version of packages you use, your chances of getting the issue
> solved here will raise significantly ;-)
>
> --
> Jan
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
Yours Sincerely

*#!/usr/bin/env python
#Mysignature.py :)*

Signature = " " " Ben.T.George \n
                  Linux System Administrator \n
                  Diyar United Company \n
                  kuwait \n
                  Phone : +965 - 50629829 \n " " "

Print Signature
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120912/7a201e07/attachment.htm>

From teigland at redhat.com  Wed Sep 12 18:42:07 2012
From: teigland at redhat.com (David Teigland)
Date: Wed, 12 Sep 2012 14:42:07 -0400
Subject: [Linux-cluster] GFS2 - disable plock limit on a live cluster
In-Reply-To: <CAELz04tn=Zn7VFyPVuO_NhG7r2F1yK-XP_Wj69F3LnDKcV7K=g@mail.gmail.com>
References: <CAELz04tn=Zn7VFyPVuO_NhG7r2F1yK-XP_Wj69F3LnDKcV7K=g@mail.gmail.com>
Message-ID: <20120912184207.GF29715@redhat.com>

On Mon, Sep 10, 2012 at 11:47:32AM -0400, david robertson wrote:
> Hello, I have a cluster of 11 hosts all sharing the same volume.
> Currently, the default plock limit of 100 is enabled.  The catch is,
> this is a live cluster, and it can not be taken offline.
> 
> My question is, can I modify the cluster.conf file with the usual
> lines (dlm plock_ownership=1, plock_rate_limit=0, etc) and just do
> "ccs_tool update" and be done with it, or does each node have to be
> rebooted?

plock_rate_limit=0: You can push out the new config and restart cman on
each node separately.

plock_ownership=1: First make sure you need it; it will usually make
things worse.  Second, you need to stop the entire cluster, push out the
new setting, then start the cluster again.


From dr at nevernet.com  Wed Sep 12 18:45:07 2012
From: dr at nevernet.com (david robertson)
Date: Wed, 12 Sep 2012 14:45:07 -0400
Subject: [Linux-cluster] GFS2 - disable plock limit on a live cluster
In-Reply-To: <20120912184207.GF29715@redhat.com>
References: <CAELz04tn=Zn7VFyPVuO_NhG7r2F1yK-XP_Wj69F3LnDKcV7K=g@mail.gmail.com>
	<20120912184207.GF29715@redhat.com>
Message-ID: <CAELz04tXjOYHuoJYiU12cPSzWqFTYSDz-nsOeqQM2uXipR+jYg@mail.gmail.com>

> plock_rate_limit=0: You can push out the new config and restart cman on
> each node separately.

Are you sure about this?  The last time I modified the init script
with "-l 0" to disable the limit, the node I did that on refused to
join the cluster, citing a "version mismatch" or some such.


From teigland at redhat.com  Wed Sep 12 18:53:23 2012
From: teigland at redhat.com (David Teigland)
Date: Wed, 12 Sep 2012 14:53:23 -0400
Subject: [Linux-cluster] GFS2 - disable plock limit on a live cluster
In-Reply-To: <CAELz04tXjOYHuoJYiU12cPSzWqFTYSDz-nsOeqQM2uXipR+jYg@mail.gmail.com>
References: <CAELz04tn=Zn7VFyPVuO_NhG7r2F1yK-XP_Wj69F3LnDKcV7K=g@mail.gmail.com>
	<20120912184207.GF29715@redhat.com>
	<CAELz04tXjOYHuoJYiU12cPSzWqFTYSDz-nsOeqQM2uXipR+jYg@mail.gmail.com>
Message-ID: <20120912185323.GH29715@redhat.com>

On Wed, Sep 12, 2012 at 02:45:07PM -0400, david robertson wrote:
> > plock_rate_limit=0: You can push out the new config and restart cman on
> > each node separately.
> 
> Are you sure about this?  The last time I modified the init script
> with "-l 0" to disable the limit, the node I did that on refused to
> join the cluster, citing a "version mismatch" or some such.

I'm not sure why you would do both: adding -l0 to cman init and adding
plock_rate_limit=0 to cluster.conf are redundant.  version mismatch sounds
related to syncing a new config file version, which I can't really help
with.


From dr at nevernet.com  Wed Sep 12 19:15:43 2012
From: dr at nevernet.com (david robertson)
Date: Wed, 12 Sep 2012 15:15:43 -0400
Subject: [Linux-cluster] GFS2 - disable plock limit on a live cluster
In-Reply-To: <20120912185323.GH29715@redhat.com>
References: <CAELz04tn=Zn7VFyPVuO_NhG7r2F1yK-XP_Wj69F3LnDKcV7K=g@mail.gmail.com>
	<20120912184207.GF29715@redhat.com>
	<CAELz04tXjOYHuoJYiU12cPSzWqFTYSDz-nsOeqQM2uXipR+jYg@mail.gmail.com>
	<20120912185323.GH29715@redhat.com>
Message-ID: <CAELz04v0gXjXzybvJBUvRuGs++uoKpgXY5dLK2DZConNPkci0w@mail.gmail.com>

> I'm not sure why you would do both: adding -l0 to cman init and adding
> plock_rate_limit=0 to cluster.conf are redundant.  version mismatch sounds
> related to syncing a new config file version, which I can't really help
> with.

At the time, I didn't do both - only adding "-l 0" to the init script.
 My fear is, I push out the new config (with plock_rate_limit=0), and
restart one of the nodes, the restarted node will refuse to join the
cluster, because it differs from what's currently running in the live
cluster.


From jpokorny at redhat.com  Wed Sep 12 19:35:22 2012
From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=)
Date: Wed, 12 Sep 2012 21:35:22 +0200
Subject: [Linux-cluster] cluster failed to start
In-Reply-To: <CA+C_GOWtOwRSsWF4AqMoQawvS3a+aDqZKpGscLn4SkzkV3pqOw@mail.gmail.com>
References: <CA+C_GOVRZ3ByyyJX2j=+P7bjF1MkJ3VGHr9287EP3RPYA1kfVg@mail.gmail.com>
	<20120912132714.GI22108@redhat.com>
	<CA+C_GOWtOwRSsWF4AqMoQawvS3a+aDqZKpGscLn4SkzkV3pqOw@mail.gmail.com>
Message-ID: <20120912193522.GN22108@redhat.com>

Hello Ben,

On 12/09/12 16:39 +0300, Ben .T.George wrote:
> i created 2 node cluster with RHEL6 by using redhat cluster suite.
> 
> i joined cluster nodes by using LUCI
> 
> i created one IP as resource and source as that IP i created.
> 
> i started cluster , but on luci status showing like disabled.but that IP is
> pining and it is added on node2
> 
> ip addr is showing that ip.
> 
> #clustat  is showing both nodes online.

actually thanks for bringing up what showed up to be a real issue [1].
Could you please try "service modclusterd start" across the nodes (and perhaps
making the service persistent with chkconfig) to see if it helps you?

In the mean time, this should be a workaround in such cases;
more decent solution for this bug is underway.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=856785

Thanks,
Jan


From bentech4you at gmail.com  Thu Sep 13 06:44:11 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Thu, 13 Sep 2012 09:44:11 +0300
Subject: [Linux-cluster] cluster failed to start
In-Reply-To: <20120912193522.GN22108@redhat.com>
References: <CA+C_GOVRZ3ByyyJX2j=+P7bjF1MkJ3VGHr9287EP3RPYA1kfVg@mail.gmail.com>
	<20120912132714.GI22108@redhat.com>
	<CA+C_GOWtOwRSsWF4AqMoQawvS3a+aDqZKpGscLn4SkzkV3pqOw@mail.gmail.com>
	<20120912193522.GN22108@redhat.com>
Message-ID: <CA+C_GOUH_EZMUhL3Os9ahs_bD2zRMxAHSWdzeuxtBDTi94TURg@mail.gmail.com>

Hi

i manually created a cluster.conf file and copied to my 2 nodes..now it's
working fine with one one NFS HA sare.i need to add three more shares..

please check this http://pastebin.com/eM08vrC5  this is my cluster.conf

how can i add three more shares to this cluster.conf file.?

please help..i got stacked with project.after testing this setup i need
to implement on production


Regards,
Ben


On Wed, Sep 12, 2012 at 10:35 PM, Jan Pokorn? <jpokorny at redhat.com> wrote:

> Hello Ben,
>
> On 12/09/12 16:39 +0300, Ben .T.George wrote:
> > i created 2 node cluster with RHEL6 by using redhat cluster suite.
> >
> > i joined cluster nodes by using LUCI
> >
> > i created one IP as resource and source as that IP i created.
> >
> > i started cluster , but on luci status showing like disabled.but that IP
> is
> > pining and it is added on node2
> >
> > ip addr is showing that ip.
> >
> > #clustat  is showing both nodes online.
>
> actually thanks for bringing up what showed up to be a real issue [1].
> Could you please try "service modclusterd start" across the nodes (and
> perhaps
> making the service persistent with chkconfig) to see if it helps you?
>
> In the mean time, this should be a workaround in such cases;
> more decent solution for this bug is underway.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=856785
>
> Thanks,
> Jan
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
Yours Sincerely

*#!/usr/bin/env python
#Mysignature.py :)*

Signature = " " " Ben.T.George \n
                  Linux System Administrator \n
                  Diyar United Company \n
                  kuwait \n
                  Phone : +965 - 50629829 \n " " "

Print Signature
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120913/3d274b85/attachment.htm>

From lists at alteeve.ca  Thu Sep 13 06:53:16 2012
From: lists at alteeve.ca (digimer)
Date: Thu, 13 Sep 2012 02:53:16 -0400
Subject: [Linux-cluster] cluster failed to start
In-Reply-To: <CA+C_GOUH_EZMUhL3Os9ahs_bD2zRMxAHSWdzeuxtBDTi94TURg@mail.gmail.com>
References: <CA+C_GOVRZ3ByyyJX2j=+P7bjF1MkJ3VGHr9287EP3RPYA1kfVg@mail.gmail.com>
	<20120912132714.GI22108@redhat.com>
	<CA+C_GOWtOwRSsWF4AqMoQawvS3a+aDqZKpGscLn4SkzkV3pqOw@mail.gmail.com>
	<20120912193522.GN22108@redhat.com>
	<CA+C_GOUH_EZMUhL3Os9ahs_bD2zRMxAHSWdzeuxtBDTi94TURg@mail.gmail.com>
Message-ID: <505182DC.1010303@alteeve.ca>

Please add fencing. Without it, the first time a node fails, your 
cluster will hang (by design). Most servers have IPMI (or similar), so 
you can probably use fence_ipmilan or one of the brand-specific agents 
like fence_ilo for HP's iLO.

On 09/13/2012 02:44 AM, Ben .T.George wrote:
> Hi
>
> i manually created a cluster.conf file and copied to my 2 nodes..now
> it's working fine with one one NFS HA sare.i need to add three more shares..
>
> please check this http://pastebin.com/eM08vrC5  this is my cluster.conf
>
> how can i add three more shares to this cluster.conf file.?
>
> please help..i got stacked with project.after testing this setup i need
> to implement on production
>
>
> Regards,
> Ben
>
>
>
> On Wed, Sep 12, 2012 at 10:35 PM, Jan Pokorn? <jpokorny at redhat.com
> <mailto:jpokorny at redhat.com>> wrote:
>
>     Hello Ben,
>
>     On 12/09/12 16:39 +0300, Ben .T.George wrote:
>      > i created 2 node cluster with RHEL6 by using redhat cluster suite.
>      >
>      > i joined cluster nodes by using LUCI
>      >
>      > i created one IP as resource and source as that IP i created.
>      >
>      > i started cluster , but on luci status showing like disabled.but
>     that IP is
>      > pining and it is added on node2
>      >
>      > ip addr is showing that ip.
>      >
>      > #clustat  is showing both nodes online.
>
>     actually thanks for bringing up what showed up to be a real issue [1].
>     Could you please try "service modclusterd start" across the nodes
>     (and perhaps
>     making the service persistent with chkconfig) to see if it helps you?
>
>     In the mean time, this should be a workaround in such cases;
>     more decent solution for this bug is underway.
>
>     [1] https://bugzilla.redhat.com/show_bug.cgi?id=856785
>
>     Thanks,
>     Jan
>
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>     https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
>
> --
> Yours Sincerely
>
> *#!/usr/bin/env python
> #Mysignature.py :)*
>
> Signature = " " " Ben.T.George \n
>                    Linux System Administrator \n
>                    Diyar United Company \n
>                    kuwait \n
>                    Phone : +965 - 50629829 \n " ""
>
> Print Signature
>
>
>


From bentech4you at gmail.com  Thu Sep 13 08:11:53 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Thu, 13 Sep 2012 11:11:53 +0300
Subject: [Linux-cluster] cluster failed to start
In-Reply-To: <505182DC.1010303@alteeve.ca>
References: <CA+C_GOVRZ3ByyyJX2j=+P7bjF1MkJ3VGHr9287EP3RPYA1kfVg@mail.gmail.com>
	<20120912132714.GI22108@redhat.com>
	<CA+C_GOWtOwRSsWF4AqMoQawvS3a+aDqZKpGscLn4SkzkV3pqOw@mail.gmail.com>
	<20120912193522.GN22108@redhat.com>
	<CA+C_GOUH_EZMUhL3Os9ahs_bD2zRMxAHSWdzeuxtBDTi94TURg@mail.gmail.com>
	<505182DC.1010303@alteeve.ca>
Message-ID: <CA+C_GOX8dKgb9hd-oFjoht5OCj=q8Yjqz0Lcbjpcgbct=YfFoA@mail.gmail.com>

HI

thanks for your reply..But on this test setup how can i configure
fencing..i am running 2 Sun X4200 machine.and the freenas is used to create
Iscsi.

my current NFS setup is working perfectly..and in the production setup , i
need to implement this on cisco UCS. i saw cisco UCS is there under fencing
options

please help me add three more nfs shares to my existing configuration..


regards,
ben

On Thu, Sep 13, 2012 at 9:53 AM, digimer <lists at alteeve.ca> wrote:

> Please add fencing. Without it, the first time a node fails, your cluster
> will hang (by design). Most servers have IPMI (or similar), so you can
> probably use fence_ipmilan or one of the brand-specific agents like
> fence_ilo for HP's iLO.
>
>
> On 09/13/2012 02:44 AM, Ben .T.George wrote:
>
>> Hi
>>
>> i manually created a cluster.conf file and copied to my 2 nodes..now
>> it's working fine with one one NFS HA sare.i need to add three more
>> shares..
>>
>> please check this http://pastebin.com/eM08vrC5  this is my cluster.conf
>>
>> how can i add three more shares to this cluster.conf file.?
>>
>> please help..i got stacked with project.after testing this setup i need
>> to implement on production
>>
>>
>> Regards,
>> Ben
>>
>>
>>
>> On Wed, Sep 12, 2012 at 10:35 PM, Jan Pokorn? <jpokorny at redhat.com
>> <mailto:jpokorny at redhat.com>> wrote:
>>
>>     Hello Ben,
>>
>>     On 12/09/12 16:39 +0300, Ben .T.George wrote:
>>      > i created 2 node cluster with RHEL6 by using redhat cluster suite.
>>      >
>>      > i joined cluster nodes by using LUCI
>>      >
>>      > i created one IP as resource and source as that IP i created.
>>      >
>>      > i started cluster , but on luci status showing like disabled.but
>>     that IP is
>>      > pining and it is added on node2
>>      >
>>      > ip addr is showing that ip.
>>      >
>>      > #clustat  is showing both nodes online.
>>
>>     actually thanks for bringing up what showed up to be a real issue [1].
>>     Could you please try "service modclusterd start" across the nodes
>>     (and perhaps
>>     making the service persistent with chkconfig) to see if it helps you?
>>
>>     In the mean time, this should be a workaround in such cases;
>>     more decent solution for this bug is underway.
>>
>>     [1] https://bugzilla.redhat.com/**show_bug.cgi?id=856785<https://bugzilla.redhat.com/show_bug.cgi?id=856785>
>>
>>     Thanks,
>>     Jan
>>
>>     --
>>     Linux-cluster mailing list
>>     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.**com<Linux-cluster at redhat.com>
>> >
>>
>>     https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>>
>>
>>
>>
>> --
>> Yours Sincerely
>>
>> *#!/usr/bin/env python
>> #Mysignature.py :)*
>>
>>
>> Signature = " " " Ben.T.George \n
>>                    Linux System Administrator \n
>>                    Diyar United Company \n
>>                    kuwait \n
>>                    Phone : +965 - 50629829 \n " ""
>>
>> Print Signature
>>
>>
>>
>>
>
>


-- 
Yours Sincerely

*#!/usr/bin/env python
#Mysignature.py :)*

Signature = " " " Ben.T.George \n
                  Linux System Administrator \n
                  Diyar United Company \n
                  kuwait \n
                  Phone : +965 - 50629829 \n " " "

Print Signature
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120913/ecd48eb3/attachment.htm>

From heiko.nardmann at itechnical.de  Thu Sep 13 09:28:26 2012
From: heiko.nardmann at itechnical.de (Heiko Nardmann)
Date: Thu, 13 Sep 2012 11:28:26 +0200
Subject: [Linux-cluster] cluster failed to start
In-Reply-To: <CA+C_GOX8dKgb9hd-oFjoht5OCj=q8Yjqz0Lcbjpcgbct=YfFoA@mail.gmail.com>
References: <CA+C_GOVRZ3ByyyJX2j=+P7bjF1MkJ3VGHr9287EP3RPYA1kfVg@mail.gmail.com>
	<20120912132714.GI22108@redhat.com>
	<CA+C_GOWtOwRSsWF4AqMoQawvS3a+aDqZKpGscLn4SkzkV3pqOw@mail.gmail.com>
	<20120912193522.GN22108@redhat.com>
	<CA+C_GOUH_EZMUhL3Os9ahs_bD2zRMxAHSWdzeuxtBDTi94TURg@mail.gmail.com>
	<505182DC.1010303@alteeve.ca>
	<CA+C_GOX8dKgb9hd-oFjoht5OCj=q8Yjqz0Lcbjpcgbct=YfFoA@mail.gmail.com>
Message-ID: <5051A73A.10200@itechnical.de>

Hi!

Those machines should have some ALOM/ILOM/whatever mechanism which you 
could use for power fencing e.g. ...

You should check whether a corresponding fence agent exists then.


Kind regards,

     Heiko

Am 13.09.2012 10:11, schrieb Ben .T.George:
> HI
>
> thanks for your reply..But on this test setup how can i configure 
> fencing..i am running 2 Sun X4200 machine.and the freenas is used to 
> create Iscsi.
>
> my current NFS setup is working perfectly..and in the production setup 
> , i need to implement this on cisco UCS. i saw cisco UCS is there 
> under fencing options
>
> please help me add three more nfs shares to my existing configuration..
>
>
> regards,
> ben
>
> On Thu, Sep 13, 2012 at 9:53 AM, digimer <lists at alteeve.ca 
> <mailto:lists at alteeve.ca>> wrote:
>
>     Please add fencing. Without it, the first time a node fails, your
>     cluster will hang (by design). Most servers have IPMI (or
>     similar), so you can probably use fence_ipmilan or one of the
>     brand-specific agents like fence_ilo for HP's iLO.
>
>
>     On 09/13/2012 02:44 AM, Ben .T.George wrote:
>
>         Hi
>
>         i manually created a cluster.conf file and copied to my 2
>         nodes..now
>         it's working fine with one one NFS HA sare.i need to add three
>         more shares..
>
>         please check this http://pastebin.com/eM08vrC5  this is my
>         cluster.conf
>
>         how can i add three more shares to this cluster.conf file.?
>
>         please help..i got stacked with project.after testing this
>         setup i need
>         to implement on production
>
>
>         Regards,
>         Ben
>
>
>
>         On Wed, Sep 12, 2012 at 10:35 PM, Jan Pokorn?
>         <jpokorny at redhat.com <mailto:jpokorny at redhat.com>
>         <mailto:jpokorny at redhat.com <mailto:jpokorny at redhat.com>>> wrote:
>
>             Hello Ben,
>
>             On 12/09/12 16:39 +0300, Ben .T.George wrote:
>              > i created 2 node cluster with RHEL6 by using redhat
>         cluster suite.
>              >
>              > i joined cluster nodes by using LUCI
>              >
>              > i created one IP as resource and source as that IP i
>         created.
>              >
>              > i started cluster , but on luci status showing like
>         disabled.but
>             that IP is
>              > pining and it is added on node2
>              >
>              > ip addr is showing that ip.
>              >
>              > #clustat  is showing both nodes online.
>
>             actually thanks for bringing up what showed up to be a
>         real issue [1].
>             Could you please try "service modclusterd start" across
>         the nodes
>             (and perhaps
>             making the service persistent with chkconfig) to see if it
>         helps you?
>
>             In the mean time, this should be a workaround in such cases;
>             more decent solution for this bug is underway.
>
>             [1] https://bugzilla.redhat.com/show_bug.cgi?id=856785
>
>             Thanks,
>             Jan
>
>             --
>             Linux-cluster mailing list
>         Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>         <mailto:Linux-cluster at redhat.com
>         <mailto:Linux-cluster at redhat.com>>
>
>         https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
>
>         --
>         Yours Sincerely
>
>         *#!/usr/bin/env python
>         #Mysignature.py :)*
>
>
>         Signature = " " " Ben.T.George \n
>                            Linux System Administrator \n
>                            Diyar United Company \n
>                            kuwait \n
>                            Phone : +965 - 50629829
>         <tel:%2B965%20-%2050629829> \n " ""
>
>         Print Signature
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120913/fa2c2060/attachment.htm>

From bentech4you at gmail.com  Thu Sep 13 09:40:52 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Thu, 13 Sep 2012 12:40:52 +0300
Subject: [Linux-cluster] cluster failed to start
In-Reply-To: <5051A73A.10200@itechnical.de>
References: <CA+C_GOVRZ3ByyyJX2j=+P7bjF1MkJ3VGHr9287EP3RPYA1kfVg@mail.gmail.com>
	<20120912132714.GI22108@redhat.com>
	<CA+C_GOWtOwRSsWF4AqMoQawvS3a+aDqZKpGscLn4SkzkV3pqOw@mail.gmail.com>
	<20120912193522.GN22108@redhat.com>
	<CA+C_GOUH_EZMUhL3Os9ahs_bD2zRMxAHSWdzeuxtBDTi94TURg@mail.gmail.com>
	<505182DC.1010303@alteeve.ca>
	<CA+C_GOX8dKgb9hd-oFjoht5OCj=q8Yjqz0Lcbjpcgbct=YfFoA@mail.gmail.com>
	<5051A73A.10200@itechnical.de>
Message-ID: <CA+C_GOVnZ7V61Wa1Wr_EAkOo2xua8S+-dQgqGJZq+_80D3-New@mail.gmail.com>

HI

yes these machine have ILOM, but there is no ILOM fencing listed under LUCI
page.

without fencing i can do testing noe.?

plese negligent fencing part and give me instructions.because in my
production, we are using cisco UCS and the cisco UCS fencing is listed on
LUCI page

regards,
Ben

On Thu, Sep 13, 2012 at 12:28 PM, Heiko Nardmann <
heiko.nardmann at itechnical.de> wrote:

>  Hi!
>
> Those machines should have some ALOM/ILOM/whatever mechanism which you
> could use for power fencing e.g. ...
>
> You should check whether a corresponding fence agent exists then.
>
>
> Kind regards,
>
>     Heiko
>
> Am 13.09.2012 10:11, schrieb Ben .T.George:
>
> HI
>
>  thanks for your reply..But on this test setup how can i configure
> fencing..i am running 2 Sun X4200 machine.and the freenas is used to create
> Iscsi.
>
>  my current NFS setup is working perfectly..and in the production setup ,
> i need to implement this on cisco UCS. i saw cisco UCS is there under
> fencing options
>
>  please help me add three more nfs shares to my existing configuration..
>
>
>  regards,
> ben
>
> On Thu, Sep 13, 2012 at 9:53 AM, digimer <lists at alteeve.ca> wrote:
>
>> Please add fencing. Without it, the first time a node fails, your cluster
>> will hang (by design). Most servers have IPMI (or similar), so you can
>> probably use fence_ipmilan or one of the brand-specific agents like
>> fence_ilo for HP's iLO.
>>
>>
>> On 09/13/2012 02:44 AM, Ben .T.George wrote:
>>
>>>  Hi
>>>
>>> i manually created a cluster.conf file and copied to my 2 nodes..now
>>> it's working fine with one one NFS HA sare.i need to add three more
>>> shares..
>>>
>>> please check this http://pastebin.com/eM08vrC5  this is my cluster.conf
>>>
>>> how can i add three more shares to this cluster.conf file.?
>>>
>>> please help..i got stacked with project.after testing this setup i need
>>> to implement on production
>>>
>>>
>>> Regards,
>>> Ben
>>>
>>>
>>>
>>> On Wed, Sep 12, 2012 at 10:35 PM, Jan Pokorn? <jpokorny at redhat.com
>>>  <mailto:jpokorny at redhat.com>> wrote:
>>>
>>>     Hello Ben,
>>>
>>>     On 12/09/12 16:39 +0300, Ben .T.George wrote:
>>>      > i created 2 node cluster with RHEL6 by using redhat cluster suite.
>>>      >
>>>      > i joined cluster nodes by using LUCI
>>>      >
>>>      > i created one IP as resource and source as that IP i created.
>>>      >
>>>      > i started cluster , but on luci status showing like disabled.but
>>>     that IP is
>>>      > pining and it is added on node2
>>>      >
>>>      > ip addr is showing that ip.
>>>      >
>>>      > #clustat  is showing both nodes online.
>>>
>>>     actually thanks for bringing up what showed up to be a real issue
>>> [1].
>>>     Could you please try "service modclusterd start" across the nodes
>>>     (and perhaps
>>>     making the service persistent with chkconfig) to see if it helps you?
>>>
>>>     In the mean time, this should be a workaround in such cases;
>>>     more decent solution for this bug is underway.
>>>
>>>     [1] https://bugzilla.redhat.com/show_bug.cgi?id=856785
>>>
>>>     Thanks,
>>>     Jan
>>>
>>>     --
>>>     Linux-cluster mailing list
>>>      Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>>>
>>>     https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>>
>>>
>>>
>>> --
>>> Yours Sincerely
>>>
>>>  *#!/usr/bin/env python
>>> #Mysignature.py :)*
>>>
>>>
>>> Signature = " " " Ben.T.George \n
>>>                    Linux System Administrator \n
>>>                    Diyar United Company \n
>>>                    kuwait \n
>>>                    Phone : +965 - 50629829 \n " ""
>>>
>>> Print Signature
>>>
>>>
>


-- 
Yours Sincerely

*#!/usr/bin/env python
#Mysignature.py :)*

Signature = " " " Ben.T.George \n
                  Linux System Administrator \n
                  Diyar United Company \n
                  kuwait \n
                  Phone : +965 - 50629829 \n " " "

Print Signature
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120913/3fafcba8/attachment.htm>

From heiko.nardmann at itechnical.de  Thu Sep 13 09:55:31 2012
From: heiko.nardmann at itechnical.de (Heiko Nardmann)
Date: Thu, 13 Sep 2012 11:55:31 +0200
Subject: [Linux-cluster] cluster failed to start
In-Reply-To: <CA+C_GOVnZ7V61Wa1Wr_EAkOo2xua8S+-dQgqGJZq+_80D3-New@mail.gmail.com>
References: <CA+C_GOVRZ3ByyyJX2j=+P7bjF1MkJ3VGHr9287EP3RPYA1kfVg@mail.gmail.com>
	<20120912132714.GI22108@redhat.com>
	<CA+C_GOWtOwRSsWF4AqMoQawvS3a+aDqZKpGscLn4SkzkV3pqOw@mail.gmail.com>
	<20120912193522.GN22108@redhat.com>
	<CA+C_GOUH_EZMUhL3Os9ahs_bD2zRMxAHSWdzeuxtBDTi94TURg@mail.gmail.com>
	<505182DC.1010303@alteeve.ca>
	<CA+C_GOX8dKgb9hd-oFjoht5OCj=q8Yjqz0Lcbjpcgbct=YfFoA@mail.gmail.com>
	<5051A73A.10200@itechnical.de>
	<CA+C_GOVnZ7V61Wa1Wr_EAkOo2xua8S+-dQgqGJZq+_80D3-New@mail.gmail.com>
Message-ID: <5051AD93.3020405@itechnical.de>

Else you could have googled ...

And maybe have found 
http://permalink.gmane.org/gmane.linux.highavailability.user/14360 ... 
which might have been helpful ...


Kind regards,

     Heiko

Am 13.09.2012 11:40, schrieb Ben .T.George:
> HI
>
> yes these machine have ILOM, but there is no ILOM fencing listed under 
> LUCI page.
>
> without fencing i can do testing noe.?
>
> plese negligent fencing part and give me instructions.because in my 
> production, we are using cisco UCS and the cisco UCS fencing is listed 
> on LUCI page
>
> regards,
> Ben
>
> On Thu, Sep 13, 2012 at 12:28 PM, Heiko Nardmann 
> <heiko.nardmann at itechnical.de <mailto:heiko.nardmann at itechnical.de>> 
> wrote:
>
>     Hi!
>
>     Those machines should have some ALOM/ILOM/whatever mechanism which
>     you could use for power fencing e.g. ...
>
>     You should check whether a corresponding fence agent exists then.
>
>
>     Kind regards,
>
>         Heiko
>
>     Am 13.09.2012 10:11, schrieb Ben .T.George:
>>     HI
>>
>>     thanks for your reply..But on this test setup how can i configure
>>     fencing..i am running 2 Sun X4200 machine.and the freenas is used
>>     to create Iscsi.
>>
>>     my current NFS setup is working perfectly..and in the production
>>     setup , i need to implement this on cisco UCS. i saw cisco UCS is
>>     there under fencing options
>>
>>     please help me add three more nfs shares to my existing
>>     configuration..
>>
>>
>>     regards,
>>     ben
>>
>>     On Thu, Sep 13, 2012 at 9:53 AM, digimer <lists at alteeve.ca
>>     <mailto:lists at alteeve.ca>> wrote:
>>
>>         Please add fencing. Without it, the first time a node fails,
>>         your cluster will hang (by design). Most servers have IPMI
>>         (or similar), so you can probably use fence_ipmilan or one of
>>         the brand-specific agents like fence_ilo for HP's iLO.
>>
>>
>>         On 09/13/2012 02:44 AM, Ben .T.George wrote:
>>
>>             Hi
>>
>>             i manually created a cluster.conf file and copied to my 2
>>             nodes..now
>>             it's working fine with one one NFS HA sare.i need to add
>>             three more shares..
>>
>>             please check this http://pastebin.com/eM08vrC5  this is
>>             my cluster.conf
>>
>>             how can i add three more shares to this cluster.conf file.?
>>
>>             please help..i got stacked with project.after testing
>>             this setup i need
>>             to implement on production
>>
>>
>>             Regards,
>>             Ben
>>
>>
>>
>>             On Wed, Sep 12, 2012 at 10:35 PM, Jan Pokorn?
>>             <jpokorny at redhat.com <mailto:jpokorny at redhat.com>
>>             <mailto:jpokorny at redhat.com
>>             <mailto:jpokorny at redhat.com>>> wrote:
>>
>>                 Hello Ben,
>>
>>                 On 12/09/12 16:39 +0300, Ben .T.George wrote:
>>                  > i created 2 node cluster with RHEL6 by using
>>             redhat cluster suite.
>>                  >
>>                  > i joined cluster nodes by using LUCI
>>                  >
>>                  > i created one IP as resource and source as that IP
>>             i created.
>>                  >
>>                  > i started cluster , but on luci status showing
>>             like disabled.but
>>                 that IP is
>>                  > pining and it is added on node2
>>                  >
>>                  > ip addr is showing that ip.
>>                  >
>>                  > #clustat  is showing both nodes online.
>>
>>                 actually thanks for bringing up what showed up to be
>>             a real issue [1].
>>                 Could you please try "service modclusterd start"
>>             across the nodes
>>                 (and perhaps
>>                 making the service persistent with chkconfig) to see
>>             if it helps you?
>>
>>                 In the mean time, this should be a workaround in such
>>             cases;
>>                 more decent solution for this bug is underway.
>>
>>                 [1] https://bugzilla.redhat.com/show_bug.cgi?id=856785
>>
>>                 Thanks,
>>                 Jan
>>
>>                 --
>>                 Linux-cluster mailing list
>>             Linux-cluster at redhat.com
>>             <mailto:Linux-cluster at redhat.com>
>>             <mailto:Linux-cluster at redhat.com
>>             <mailto:Linux-cluster at redhat.com>>
>>
>>             https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>>
>>
>>             --
>>             Yours Sincerely
>>
>>             *#!/usr/bin/env python
>>             #Mysignature.py :)*
>>
>>
>>             Signature = " " " Ben.T.George \n
>>                                Linux System Administrator \n
>>                                Diyar United Company \n
>>                                kuwait \n
>>                                Phone : +965 - 50629829
>>             <tel:%2B965%20-%2050629829> \n " ""
>>
>>             Print Signature
>>
>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120913/024a0a72/attachment.htm>

From bentech4you at gmail.com  Thu Sep 13 10:09:53 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Thu, 13 Sep 2012 13:09:53 +0300
Subject: [Linux-cluster] cluster failed to start
In-Reply-To: <5051AD93.3020405@itechnical.de>
References: <CA+C_GOVRZ3ByyyJX2j=+P7bjF1MkJ3VGHr9287EP3RPYA1kfVg@mail.gmail.com>
	<20120912132714.GI22108@redhat.com>
	<CA+C_GOWtOwRSsWF4AqMoQawvS3a+aDqZKpGscLn4SkzkV3pqOw@mail.gmail.com>
	<20120912193522.GN22108@redhat.com>
	<CA+C_GOUH_EZMUhL3Os9ahs_bD2zRMxAHSWdzeuxtBDTi94TURg@mail.gmail.com>
	<505182DC.1010303@alteeve.ca>
	<CA+C_GOX8dKgb9hd-oFjoht5OCj=q8Yjqz0Lcbjpcgbct=YfFoA@mail.gmail.com>
	<5051A73A.10200@itechnical.de>
	<CA+C_GOVnZ7V61Wa1Wr_EAkOo2xua8S+-dQgqGJZq+_80D3-New@mail.gmail.com>
	<5051AD93.3020405@itechnical.de>
Message-ID: <CA+C_GOU3rLEe_sUz4irmMJ570BvsXQC13D0LvQw5Qy1WH_bwkQ@mail.gmail.com>

HI finally it got worked with 3 shares.

i manually edited cluster.conf file and added parameters

one more doubt

<nfsclient name="nfsClient" options="rw,async,no_root_squash"
target="192.168.2.138"/>

instead of one ip, how can i add IP range or a complete network.?


Regards,
Ben


On Thu, Sep 13, 2012 at 12:55 PM, Heiko Nardmann <
heiko.nardmann at itechnical.de> wrote:

>  Else you could have googled ...
>
> And maybe have found
> http://permalink.gmane.org/gmane.linux.highavailability.user/14360 ...
> which might have been helpful ...
>
>
> Kind regards,
>
>     Heiko
>
> Am 13.09.2012 11:40, schrieb Ben .T.George:
>
> HI
>
>  yes these machine have ILOM, but there is no ILOM fencing listed under
> LUCI page.
>
>  without fencing i can do testing noe.?
>
>  plese negligent fencing part and give me instructions.because in my
> production, we are using cisco UCS and the cisco UCS fencing is listed on
> LUCI page
>
>  regards,
> Ben
>
> On Thu, Sep 13, 2012 at 12:28 PM, Heiko Nardmann <
> heiko.nardmann at itechnical.de> wrote:
>
>>  Hi!
>>
>> Those machines should have some ALOM/ILOM/whatever mechanism which you
>> could use for power fencing e.g. ...
>>
>> You should check whether a corresponding fence agent exists then.
>>
>>
>> Kind regards,
>>
>>     Heiko
>>
>> Am 13.09.2012 10:11, schrieb Ben .T.George:
>>
>> HI
>>
>>  thanks for your reply..But on this test setup how can i configure
>> fencing..i am running 2 Sun X4200 machine.and the freenas is used to create
>> Iscsi.
>>
>>  my current NFS setup is working perfectly..and in the production setup
>> , i need to implement this on cisco UCS. i saw cisco UCS is there under
>> fencing options
>>
>>  please help me add three more nfs shares to my existing configuration..
>>
>>
>>  regards,
>> ben
>>
>> On Thu, Sep 13, 2012 at 9:53 AM, digimer <lists at alteeve.ca> wrote:
>>
>>> Please add fencing. Without it, the first time a node fails, your
>>> cluster will hang (by design). Most servers have IPMI (or similar), so you
>>> can probably use fence_ipmilan or one of the brand-specific agents like
>>> fence_ilo for HP's iLO.
>>>
>>>
>>> On 09/13/2012 02:44 AM, Ben .T.George wrote:
>>>
>>>>  Hi
>>>>
>>>> i manually created a cluster.conf file and copied to my 2 nodes..now
>>>> it's working fine with one one NFS HA sare.i need to add three more
>>>> shares..
>>>>
>>>> please check this http://pastebin.com/eM08vrC5  this is my cluster.conf
>>>>
>>>> how can i add three more shares to this cluster.conf file.?
>>>>
>>>> please help..i got stacked with project.after testing this setup i need
>>>> to implement on production
>>>>
>>>>
>>>> Regards,
>>>> Ben
>>>>
>>>>
>>>>
>>>> On Wed, Sep 12, 2012 at 10:35 PM, Jan Pokorn? <jpokorny at redhat.com
>>>>  <mailto:jpokorny at redhat.com>> wrote:
>>>>
>>>>     Hello Ben,
>>>>
>>>>     On 12/09/12 16:39 +0300, Ben .T.George wrote:
>>>>      > i created 2 node cluster with RHEL6 by using redhat cluster
>>>> suite.
>>>>      >
>>>>      > i joined cluster nodes by using LUCI
>>>>      >
>>>>      > i created one IP as resource and source as that IP i created.
>>>>      >
>>>>      > i started cluster , but on luci status showing like disabled.but
>>>>     that IP is
>>>>      > pining and it is added on node2
>>>>      >
>>>>      > ip addr is showing that ip.
>>>>      >
>>>>      > #clustat  is showing both nodes online.
>>>>
>>>>     actually thanks for bringing up what showed up to be a real issue
>>>> [1].
>>>>     Could you please try "service modclusterd start" across the nodes
>>>>     (and perhaps
>>>>     making the service persistent with chkconfig) to see if it helps
>>>> you?
>>>>
>>>>     In the mean time, this should be a workaround in such cases;
>>>>     more decent solution for this bug is underway.
>>>>
>>>>     [1] https://bugzilla.redhat.com/show_bug.cgi?id=856785
>>>>
>>>>     Thanks,
>>>>     Jan
>>>>
>>>>     --
>>>>     Linux-cluster mailing list
>>>>      Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>>>>
>>>>     https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Yours Sincerely
>>>>
>>>>  *#!/usr/bin/env python
>>>> #Mysignature.py :)*
>>>>
>>>>
>>>> Signature = " " " Ben.T.George \n
>>>>                    Linux System Administrator \n
>>>>                    Diyar United Company \n
>>>>                    kuwait \n
>>>>                    Phone : +965 - 50629829 <%2B965%20-%2050629829> \n
>>>> " ""
>>>>
>>>> Print Signature
>>>>
>>>>
>>
>
>
>


-- 
Yours Sincerely

*#!/usr/bin/env python
#Mysignature.py :)*

Signature = " " " Ben.T.George \n
                  Linux System Administrator \n
                  Diyar United Company \n
                  kuwait \n
                  Phone : +965 - 50629829 \n " " "

Print Signature
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120913/8baceba9/attachment.htm>

From bentech4you at gmail.com  Sat Sep 15 14:50:15 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Sat, 15 Sep 2012 17:50:15 +0300
Subject: [Linux-cluster] cluster with NFS export and mount on other node
Message-ID: <CA+C_GOUzdiK8utmwuUXpCTRQeeFKg+XoyoZ0WrD=OEDi5vHUFQ@mail.gmail.com>

HI

i am working on a different type of cluster request.

my requirement is :

mount three file-system on node1 at that same time these filesystem to be
NFS mounted on node2. vice versa

for this requirement which type of cluster approach i need to choose.?

which type of file system i need to choose (gfs, ext4..)

my OS is RHEL-6.3

please help me.

Regards,
Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120915/1f9f9d61/attachment.htm>

From heiko.nardmann at itechnical.de  Sat Sep 15 15:20:14 2012
From: heiko.nardmann at itechnical.de (Heiko Nardmann)
Date: Sat, 15 Sep 2012 17:20:14 +0200
Subject: [Linux-cluster] cluster with NFS export and mount on other node
In-Reply-To: <CA+C_GOUzdiK8utmwuUXpCTRQeeFKg+XoyoZ0WrD=OEDi5vHUFQ@mail.gmail.com>
References: <CA+C_GOUzdiK8utmwuUXpCTRQeeFKg+XoyoZ0WrD=OEDi5vHUFQ@mail.gmail.com>
Message-ID: <50549CAE.5060801@itechnical.de>

I would recommend to ask RedHat for support concerning your intended setup.

First they of course have much more experience; second then you know 
whether your system is a supported setup.

They also provide support before(!) a system is implemented.

Where does your NFS data come from? iSCSI SAN?


Kind regards,

     Heiko

Am 15.09.2012 16:50, schrieb Ben .T.George:
> HI
>
> i am working on a different type of cluster request.
>
> my requirement is :
>
> mount three file-system on node1 at that same time these filesystem to 
> be NFS mounted on node2. vice versa
>
> for this requirement which type of cluster approach i need to choose.?
>
> which type of file system i need to choose (gfs, ext4..)
>
> my OS is RHEL-6.3
>
> please help me.
>
> Regards,
> Ben
>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120915/e6719fcc/attachment.htm>

From bentech4you at gmail.com  Sat Sep 15 15:27:56 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Sat, 15 Sep 2012 18:27:56 +0300
Subject: [Linux-cluster] cluster with NFS export and mount on other node
In-Reply-To: <50549CAE.5060801@itechnical.de>
References: <CA+C_GOUzdiK8utmwuUXpCTRQeeFKg+XoyoZ0WrD=OEDi5vHUFQ@mail.gmail.com>
	<50549CAE.5060801@itechnical.de>
Message-ID: <CA+C_GOWB8hjd_BDLQZJ6ya2076RJ+165QB+bPywiz1Vn+nA36w@mail.gmail.com>

HI

thanks for your reply.

actually we don't have cluster support now..i requested management to
purchase that..but it will take bit time..before that i need to finish this
setup

but without Cluster support redhat people wouldn't help.

these machines are connected to EMC Storage(VNX 5300).


On Sat, Sep 15, 2012 at 6:20 PM, Heiko Nardmann <
heiko.nardmann at itechnical.de> wrote:

>  I would recommend to ask RedHat for support concerning your intended
> setup.
>
> First they of course have much more experience; second then you know
> whether your system is a supported setup.
>
> They also provide support before(!) a system is implemented.
>
> Where does your NFS data come from? iSCSI SAN?
>
>
> Kind regards,
>
>     Heiko
>
> Am 15.09.2012 16:50, schrieb Ben .T.George:
>
> HI
>
>  i am working on a different type of cluster request.
>
>  my requirement is :
>
>  mount three file-system on node1 at that same time these filesystem to
> be NFS mounted on node2. vice versa
>
>  for this requirement which type of cluster approach i need to choose.?
>
>  which type of file system i need to choose (gfs, ext4..)
>
>  my OS is RHEL-6.3
>
>  please help me.
>
>  Regards,
> Ben
>
>
>
>
>


-- 
Yours Sincerely

*#!/usr/bin/env python
#Mysignature.py :)*

Signature = " " " Ben.T.George \n
                  Linux System Administrator \n
                  Diyar United Company \n
                  kuwait \n
                  Phone : +965 - 50629829 \n " " "

Print Signature
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120915/b01b5ed5/attachment.htm>

From bentech4you at gmail.com  Sun Sep 16 10:46:48 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Sun, 16 Sep 2012 13:46:48 +0300
Subject: [Linux-cluster] How to add shell script to cluster.conf
Message-ID: <CA+C_GOXNwVLy-U0WZ=nKbdpSMF4-RKSpNjBaw2d28zZFia3EBg@mail.gmail.com>

Hi


I have an NFS HA setup. how can i add a custom shell script to that
resource group

 NFS HA services are working well..i am working with cluster.conf file
directly..please help me to add this.i want to touch some information after
exporting this filesystem

Regards,
Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120916/67cb3510/attachment.htm>

From parvez.h.shaikh at gmail.com  Sun Sep 16 12:13:27 2012
From: parvez.h.shaikh at gmail.com (Parvez Shaikh)
Date: Sun, 16 Sep 2012 17:43:27 +0530
Subject: [Linux-cluster] How to add shell script to cluster.conf
In-Reply-To: <CA+C_GOXNwVLy-U0WZ=nKbdpSMF4-RKSpNjBaw2d28zZFia3EBg@mail.gmail.com>
References: <CA+C_GOXNwVLy-U0WZ=nKbdpSMF4-RKSpNjBaw2d28zZFia3EBg@mail.gmail.com>
Message-ID: <CAKrd5320JRt5SjBxuuyctx9CNQ_2kg0H8WQEdCX_wQrVjNsA+w@mail.gmail.com>

>From this link -

https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/4/html/Cluster_Administration/s1-config-service-dev-CA.html

Script
*Name* ? Enter a name for the custom user script.
 *File (with path)* ? Enter the path where this custom script is located
(for example, /etc/init.d/*userscript*)


On Sun, Sep 16, 2012 at 4:16 PM, Ben .T.George <bentech4you at gmail.com>wrote:

> Hi
>
>
> I have an NFS HA setup. how can i add a custom shell script to that
> resource group
>
>  NFS HA services are working well..i am working with cluster.conf file
> directly..please help me to add this.i want to touch some information after
> exporting this filesystem
>
> Regards,
> Ben
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120916/10534d21/attachment.htm>

From bentech4you at gmail.com  Mon Sep 17 05:42:28 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Mon, 17 Sep 2012 08:42:28 +0300
Subject: [Linux-cluster] how to add NFS mount shell script
Message-ID: <CA+C_GOXJXV_TFiVyh=fxSAVqUCf+rodDY1uWGcBQr_BUP0_odQ@mail.gmail.com>

HI

I created 2 node NFS cluster with RHEL6. my current status is file System
mounting and NFS services are working well

If the cluster service is started on NODE1, i need to mount one of the NFS
directory on NODE2

i created a script and added last on my Service group. I did password less
ssh between these nodes and on script i put below entry

ssh node2 mount 192.168.1.1:/disk  /local

where 192.168.1.1 is my cluster resource IP.

The problem is when i start the service, cluster is failing.even i tried to
mount on NODE1 instead of NODE2. ath that time cluster started, after
sometime it's failing

please help me solve this.Is any custom script will do this thing for me.?


Thanks & Regards,
Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120917/9c4dcc14/attachment.htm>

From bentech4you at gmail.com  Mon Sep 17 15:52:48 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Mon, 17 Sep 2012 18:52:48 +0300
Subject: [Linux-cluster] 2 node cluster showing strange behaviour
Message-ID: <CA+C_GOW+trr4A6v2+Kgg+tbQpwNRFKTSXKEKMR_etu-S9T=YTw@mail.gmail.com>

HI

i am just started building 2 node cluster.i installed all packages of red
hat cluster suite by mounting RHEL 6 dvd.

i joined cluster by using LUCI.after that my clustat showing like this:


on node1:

Cluster Status for eccprd @ Mon Sep 17 18:43:31 2012
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 cgceccprd1.combinedgroup.net                1 Online, Local
 cgceccprd2.combinedgroup.net                2 Offline


on node2:

Cluster Status for eccprd @ Mon Sep 17 18:43:31 2012
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 cgceccprd1.combinedgroup.net                1 Offline
 cgceccprd2.combinedgroup.net                2 Online, Local

both nodes showing different status.
i restarted many times, i deleted and created many times..then also
same..please help me solve this

Regards,
Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120917/be36955e/attachment.htm>

From parvez.h.shaikh at gmail.com  Mon Sep 17 16:06:44 2012
From: parvez.h.shaikh at gmail.com (Parvez Shaikh)
Date: Mon, 17 Sep 2012 21:36:44 +0530
Subject: [Linux-cluster] 2 node cluster showing strange behaviour
In-Reply-To: <CA+C_GOW+trr4A6v2+Kgg+tbQpwNRFKTSXKEKMR_etu-S9T=YTw@mail.gmail.com>
References: <CA+C_GOW+trr4A6v2+Kgg+tbQpwNRFKTSXKEKMR_etu-S9T=YTw@mail.gmail.com>
Message-ID: <CAKrd531YAEOGfbAXZcy+aWNnbqp+=NQDi2Yyth7G13AQ8oCMCw@mail.gmail.com>

Had similar issues however I was using RHEL 5.5

Please refer - https://access.redhat.com/knowledge/solutions/18542


On Mon, Sep 17, 2012 at 9:22 PM, Ben .T.George <bentech4you at gmail.com>wrote:

>
>
> HI
>
> i am just started building 2 node cluster.i installed all packages of red
> hat cluster suite by mounting RHEL 6 dvd.
>
> i joined cluster by using LUCI.after that my clustat showing like this:
>
>
> on node1:
>
> Cluster Status for eccprd @ Mon Sep 17 18:43:31 2012
> Member Status: Quorate
>
>  Member Name                             ID   Status
>  ------ ----                             ---- ------
>  cgceccprd1.combinedgroup.net                1 Online, Local
>  cgceccprd2.combinedgroup.net                2 Offline
>
>
> on node2:
>
> Cluster Status for eccprd @ Mon Sep 17 18:43:31 2012
> Member Status: Quorate
>
>  Member Name                             ID   Status
>  ------ ----                             ---- ------
>  cgceccprd1.combinedgroup.net                1 Offline
>  cgceccprd2.combinedgroup.net                2 Online, Local
>
> both nodes showing different status.
> i restarted many times, i deleted and created many times..then also
> same..please help me solve this
>
> Regards,
> Ben
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120917/740b3b7c/attachment.htm>

From bentech4you at gmail.com  Mon Sep 17 16:23:46 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Mon, 17 Sep 2012 19:23:46 +0300
Subject: [Linux-cluster] 2 node cluster showing strange behaviour
In-Reply-To: <CAKrd531YAEOGfbAXZcy+aWNnbqp+=NQDi2Yyth7G13AQ8oCMCw@mail.gmail.com>
References: <CA+C_GOW+trr4A6v2+Kgg+tbQpwNRFKTSXKEKMR_etu-S9T=YTw@mail.gmail.com>
	<CAKrd531YAEOGfbAXZcy+aWNnbqp+=NQDi2Yyth7G13AQ8oCMCw@mail.gmail.com>
Message-ID: <CA+C_GOVTbeo77pWm7iD7hyJWPNERou6MAFE=6QbR3HDLPdsRUA@mail.gmail.com>

hi

thanks for your reply

actually mine is Cisco UCS.

my cluster.conf is very basic.

can anyone please share cluster.conf with me..so that i can check with
mine..

please


regards,
Ben

On Mon, Sep 17, 2012 at 7:06 PM, Parvez Shaikh <parvez.h.shaikh at gmail.com>wrote:

> Had similar issues however I was using RHEL 5.5
>
> Please refer - https://access.redhat.com/knowledge/solutions/18542
>
>
> On Mon, Sep 17, 2012 at 9:22 PM, Ben .T.George <bentech4you at gmail.com>wrote:
>
>>
>>
>> HI
>>
>> i am just started building 2 node cluster.i installed all packages of red
>> hat cluster suite by mounting RHEL 6 dvd.
>>
>> i joined cluster by using LUCI.after that my clustat showing like this:
>>
>>
>> on node1:
>>
>> Cluster Status for eccprd @ Mon Sep 17 18:43:31 2012
>> Member Status: Quorate
>>
>>  Member Name                             ID   Status
>>  ------ ----                             ---- ------
>>  cgceccprd1.combinedgroup.net                1 Online, Local
>>  cgceccprd2.combinedgroup.net                2 Offline
>>
>>
>> on node2:
>>
>> Cluster Status for eccprd @ Mon Sep 17 18:43:31 2012
>> Member Status: Quorate
>>
>>  Member Name                             ID   Status
>>  ------ ----                             ---- ------
>>  cgceccprd1.combinedgroup.net                1 Offline
>>  cgceccprd2.combinedgroup.net                2 Online, Local
>>
>> both nodes showing different status.
>> i restarted many times, i deleted and created many times..then also
>> same..please help me solve this
>>
>> Regards,
>> Ben
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>


-- 
Yours Sincerely

*#!/usr/bin/env python
#Mysignature.py :)*

Signature = " " " Ben.T.George \n
                  Linux System Administrator \n
                  Diyar United Company \n
                  kuwait \n
                  Phone : +965 - 50629829 \n " " "

Print Signature
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120917/e782a3e9/attachment.htm>

From emi2fast at gmail.com  Mon Sep 17 20:18:44 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Mon, 17 Sep 2012 22:18:44 +0200
Subject: [Linux-cluster] 2 node cluster showing strange behaviour
In-Reply-To: <CA+C_GOVTbeo77pWm7iD7hyJWPNERou6MAFE=6QbR3HDLPdsRUA@mail.gmail.com>
References: <CA+C_GOW+trr4A6v2+Kgg+tbQpwNRFKTSXKEKMR_etu-S9T=YTw@mail.gmail.com>
	<CAKrd531YAEOGfbAXZcy+aWNnbqp+=NQDi2Yyth7G13AQ8oCMCw@mail.gmail.com>
	<CA+C_GOVTbeo77pWm7iD7hyJWPNERou6MAFE=6QbR3HDLPdsRUA@mail.gmail.com>
Message-ID: <CAE7pJ3BPgU=-3NG82SpkwpP9BptW-sCfCYNyn3y-y-MJ59rjOw@mail.gmail.com>

When you found the cluster in that state, the first thing it's check your
multicast traffic

2012/9/17 Ben .T.George <bentech4you at gmail.com>

> hi
>
> thanks for your reply
>
> actually mine is Cisco UCS.
>
> my cluster.conf is very basic.
>
> can anyone please share cluster.conf with me..so that i can check with
> mine..
>
> please
>
>
> regards,
> Ben
>
>
> On Mon, Sep 17, 2012 at 7:06 PM, Parvez Shaikh <parvez.h.shaikh at gmail.com>wrote:
>
>> Had similar issues however I was using RHEL 5.5
>>
>> Please refer - https://access.redhat.com/knowledge/solutions/18542
>>
>>
>> On Mon, Sep 17, 2012 at 9:22 PM, Ben .T.George <bentech4you at gmail.com>wrote:
>>
>>>
>>>
>>> HI
>>>
>>> i am just started building 2 node cluster.i installed all packages of
>>> red hat cluster suite by mounting RHEL 6 dvd.
>>>
>>> i joined cluster by using LUCI.after that my clustat showing like this:
>>>
>>>
>>> on node1:
>>>
>>> Cluster Status for eccprd @ Mon Sep 17 18:43:31 2012
>>> Member Status: Quorate
>>>
>>>  Member Name                             ID   Status
>>>  ------ ----                             ---- ------
>>>  cgceccprd1.combinedgroup.net                1 Online, Local
>>>  cgceccprd2.combinedgroup.net                2 Offline
>>>
>>>
>>> on node2:
>>>
>>> Cluster Status for eccprd @ Mon Sep 17 18:43:31 2012
>>> Member Status: Quorate
>>>
>>>  Member Name                             ID   Status
>>>  ------ ----                             ---- ------
>>>  cgceccprd1.combinedgroup.net                1 Offline
>>>  cgceccprd2.combinedgroup.net                2 Online, Local
>>>
>>> both nodes showing different status.
>>> i restarted many times, i deleted and created many times..then also
>>> same..please help me solve this
>>>
>>> Regards,
>>> Ben
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>
>>
>
>
> --
> Yours Sincerely
>
> *#!/usr/bin/env python
> #Mysignature.py :)*
>
> Signature = " " " Ben.T.George \n
>                   Linux System Administrator \n
>                   Diyar United Company \n
>                   kuwait \n
>                   Phone : +965 - 50629829 \n " " "
>
> Print Signature
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120917/a6694040/attachment.htm>

From bentech4you at gmail.com  Mon Sep 17 20:56:27 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Mon, 17 Sep 2012 23:56:27 +0300
Subject: [Linux-cluster] 2 node cluster showing strange behaviour
In-Reply-To: <CAE7pJ3BPgU=-3NG82SpkwpP9BptW-sCfCYNyn3y-y-MJ59rjOw@mail.gmail.com>
References: <CA+C_GOW+trr4A6v2+Kgg+tbQpwNRFKTSXKEKMR_etu-S9T=YTw@mail.gmail.com>
	<CAKrd531YAEOGfbAXZcy+aWNnbqp+=NQDi2Yyth7G13AQ8oCMCw@mail.gmail.com>
	<CA+C_GOVTbeo77pWm7iD7hyJWPNERou6MAFE=6QbR3HDLPdsRUA@mail.gmail.com>
	<CAE7pJ3BPgU=-3NG82SpkwpP9BptW-sCfCYNyn3y-y-MJ59rjOw@mail.gmail.com>
Message-ID: <CA+C_GOWdWZqxe_QBw+u8X04tX+XGd0i=A-fZ1bzUk-GQyXv8eQ@mail.gmail.com>

HI

my cman_tool status is showing multicast IP address like below:

Multicast addresses : 239.192.140.34

i tried to ping on this IP.but it's not pining..i don't know more about
multicast configuration..

please help me to check multicast more.


regards,
Ben

On Mon, Sep 17, 2012 at 11:18 PM, emmanuel segura <emi2fast at gmail.com>wrote:

> When you found the cluster in that state, the first thing it's check your
> multicast traffic
>
> 2012/9/17 Ben .T.George <bentech4you at gmail.com>
>
> hi
>>
>> thanks for your reply
>>
>> actually mine is Cisco UCS.
>>
>> my cluster.conf is very basic.
>>
>> can anyone please share cluster.conf with me..so that i can check with
>> mine..
>>
>> please
>>
>>
>> regards,
>> Ben
>>
>>
>> On Mon, Sep 17, 2012 at 7:06 PM, Parvez Shaikh <parvez.h.shaikh at gmail.com
>> > wrote:
>>
>>> Had similar issues however I was using RHEL 5.5
>>>
>>> Please refer - https://access.redhat.com/knowledge/solutions/18542
>>>
>>>
>>> On Mon, Sep 17, 2012 at 9:22 PM, Ben .T.George <bentech4you at gmail.com>wrote:
>>>
>>>>
>>>>
>>>> HI
>>>>
>>>> i am just started building 2 node cluster.i installed all packages of
>>>> red hat cluster suite by mounting RHEL 6 dvd.
>>>>
>>>> i joined cluster by using LUCI.after that my clustat showing like this:
>>>>
>>>>
>>>> on node1:
>>>>
>>>> Cluster Status for eccprd @ Mon Sep 17 18:43:31 2012
>>>> Member Status: Quorate
>>>>
>>>>  Member Name                             ID   Status
>>>>  ------ ----                             ---- ------
>>>>  cgceccprd1.combinedgroup.net                1 Online, Local
>>>>  cgceccprd2.combinedgroup.net                2 Offline
>>>>
>>>>
>>>> on node2:
>>>>
>>>> Cluster Status for eccprd @ Mon Sep 17 18:43:31 2012
>>>> Member Status: Quorate
>>>>
>>>>  Member Name                             ID   Status
>>>>  ------ ----                             ---- ------
>>>>  cgceccprd1.combinedgroup.net                1 Offline
>>>>  cgceccprd2.combinedgroup.net                2 Online, Local
>>>>
>>>> both nodes showing different status.
>>>> i restarted many times, i deleted and created many times..then also
>>>> same..please help me solve this
>>>>
>>>> Regards,
>>>> Ben
>>>>
>>>> --
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>
>>>
>>>
>>
>>
>> --
>> Yours Sincerely
>>
>> *#!/usr/bin/env python
>> #Mysignature.py :)*
>>
>> Signature = " " " Ben.T.George \n
>>                   Linux System Administrator \n
>>                   Diyar United Company \n
>>                   kuwait \n
>>                   Phone : +965 - 50629829 \n " " "
>>
>> Print Signature
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>
> --
> esta es mi vida e me la vivo hasta que dios quiera
>


-- 
Yours Sincerely

*#!/usr/bin/env python
#Mysignature.py :)*

Signature = " " " Ben.T.George \n
                  Linux System Administrator \n
                  Diyar United Company \n
                  kuwait \n
                  Phone : +965 - 50629829 \n " " "

Print Signature
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120917/f1068213/attachment.htm>

From bentech4you at gmail.com  Mon Sep 17 22:07:07 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Tue, 18 Sep 2012 01:07:07 +0300
Subject: [Linux-cluster] cluster fenced error
Message-ID: <CA+C_GOV08RN0bv4v--xVyY1hkU54r4ntVyXvHk4UgUtWjt3sSg@mail.gmail.com>

Hi

My cluster is failing to start.

if i check clustat on node1, status is showing node1 online and node2
offline. If the check clustat on node2, node2 is showing online and node1
is offline

i checked logs.fanced is throwing errors.how can i rectify this

Sep 17 23:24:54 fenced fencing node cgceccprd1.combinedgroup.net still
retrying
Sep 17 23:55:06 fenced fencing node cgceccprd1.combinedgroup.net still
retrying
Sep 18 00:25:19 fenced fencing node cgceccprd1.combinedgroup.net still
retrying
Sep 18 00:55:03 fenced fenced 3.0.12.1 started
Sep 18 00:55:03 fenced failed to get dbus connection
Sep 18 00:55:55 fenced fencing node cgceccprd1.combinedgroup.net
Sep 18 00:55:55 fenced fence cgceccprd1.combinedgroup.net dev 0.0 agent
none result: error no method
Sep 18 00:55:55 fenced fence cgceccprd1.combinedgroup.net failed
Sep 18 00:55:58 fenced fencing node cgceccprd1.combinedgroup.net
Sep 18 00:55:58 fenced fence cgceccprd1.combinedgroup.net dev 0.0 agent
none result: error no method
Sep 18 00:55:58 fenced fence cgceccprd1.combinedgroup.net failed
Sep 18 00:56:01 fenced fencing node cgceccprd1.combinedgroup.net
Sep 18 00:56:01 fenced fence cgceccprd1.combinedgroup.net dev 0.0 agent
none result: error no method
Sep 18 00:56:01 fenced fence cgceccprd1.combinedgroup.net failed


please help me solve this issue

Regards,
Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120918/813e7c40/attachment.htm>

From lists at alteeve.ca  Tue Sep 18 01:42:21 2012
From: lists at alteeve.ca (Digimer)
Date: Mon, 17 Sep 2012 21:42:21 -0400
Subject: [Linux-cluster] cluster fenced error
In-Reply-To: <CA+C_GOV08RN0bv4v--xVyY1hkU54r4ntVyXvHk4UgUtWjt3sSg@mail.gmail.com>
References: <CA+C_GOV08RN0bv4v--xVyY1hkU54r4ntVyXvHk4UgUtWjt3sSg@mail.gmail.com>
Message-ID: <5057D17D.9060108@alteeve.ca>

On 09/17/2012 06:07 PM, Ben .T.George wrote:
> Hi
>
> My cluster is failing to start.
>
> if i check clustat on node1, status is showing node1 online and node2
> offline. If the check clustat on node2, node2 is showing online and
> node1 is offline
>
> i checked logs.fanced is throwing errors.how can i rectify this
>
> Sep 17 23:24:54 fenced fencing node cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net> still retrying
> Sep 17 23:55:06 fenced fencing node cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net> still retrying
> Sep 18 00:25:19 fenced fencing node cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net> still retrying
> Sep 18 00:55:03 fenced fenced 3.0.12.1 started
> Sep 18 00:55:03 fenced failed to get dbus connection
> Sep 18 00:55:55 fenced fencing node cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net>
> Sep 18 00:55:55 fenced fence cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net> dev 0.0 agent none result: error
> no method
> Sep 18 00:55:55 fenced fence cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net> failed
> Sep 18 00:55:58 fenced fencing node cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net>
> Sep 18 00:55:58 fenced fence cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net> dev 0.0 agent none result: error
> no method
> Sep 18 00:55:58 fenced fence cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net> failed
> Sep 18 00:56:01 fenced fencing node cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net>
> Sep 18 00:56:01 fenced fence cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net> dev 0.0 agent none result: error
> no method
> Sep 18 00:56:01 fenced fence cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net> failed
>
>
> please help me solve this issue
>
> Regards,
> Ben

What is your cluster.conf?

likely you either have no fencing configured, or your fencing is not 
working. Either way, failing to fence is a critical problem and the 
cluster will hang, just as you're seeing here. This is by design. Better 
to hang a cluster than to corrupt it.

digimer

-- 
Digimer
Papers and Projects: https://alteeve.ca


From lists at alteeve.ca  Tue Sep 18 02:03:29 2012
From: lists at alteeve.ca (Digimer)
Date: Mon, 17 Sep 2012 22:03:29 -0400
Subject: [Linux-cluster] 2 node cluster showing strange behaviour
In-Reply-To: <CA+C_GOWdWZqxe_QBw+u8X04tX+XGd0i=A-fZ1bzUk-GQyXv8eQ@mail.gmail.com>
References: <CA+C_GOW+trr4A6v2+Kgg+tbQpwNRFKTSXKEKMR_etu-S9T=YTw@mail.gmail.com>
	<CAKrd531YAEOGfbAXZcy+aWNnbqp+=NQDi2Yyth7G13AQ8oCMCw@mail.gmail.com>
	<CA+C_GOVTbeo77pWm7iD7hyJWPNERou6MAFE=6QbR3HDLPdsRUA@mail.gmail.com>
	<CAE7pJ3BPgU=-3NG82SpkwpP9BptW-sCfCYNyn3y-y-MJ59rjOw@mail.gmail.com>
	<CA+C_GOWdWZqxe_QBw+u8X04tX+XGd0i=A-fZ1bzUk-GQyXv8eQ@mail.gmail.com>
Message-ID: <5057D671.4020503@alteeve.ca>

On 09/17/2012 04:56 PM, Ben .T.George wrote:
> HI
>
> my cman_tool status is showing multicast IP address like below:
>
> Multicast addresses : 239.192.140.34
>
> i tried to ping on this IP.but it's not pining..i don't know more about
> multicast configuration..
>
> please help me to check multicast more.

Multicast IPs do not represent a specific machine, but a group. Think of 
multicast as a sort of "mailing list"; A machine "subscribes" to the 
multicast group and the switch then says "right, when a packet comes in 
addressed to the multicast group, forward a copy to all subscribed 
machines". With Cisco, you need to create persistent multicast groups in 
the switch, if I recall correctly (I don't use Cisco myself).

In your case though, you're showing "cgceccprd1.combinedgroup.net" as 
"Offline" from both nodes, so the cluster is simply not starting.

Can you paste your cluster.conf please? Also, can you run 'tail -f -n 0 
/var/log/messages' in a terminal on cgceccprd1, try to start the 
cluster, wait for it to fail and then paste the output along with your 
configuration?

This might help (the "Overview" section at the start, if nothing else). 
There is a sample cluster.conf there.

https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial

hope that helps

-- 
Digimer
Papers and Projects: https://alteeve.ca


From bentech4you at gmail.com  Tue Sep 18 03:17:24 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Tue, 18 Sep 2012 06:17:24 +0300
Subject: [Linux-cluster] 2 node cluster showing strange behaviour
In-Reply-To: <5057D671.4020503@alteeve.ca>
References: <CA+C_GOW+trr4A6v2+Kgg+tbQpwNRFKTSXKEKMR_etu-S9T=YTw@mail.gmail.com>
	<CAKrd531YAEOGfbAXZcy+aWNnbqp+=NQDi2Yyth7G13AQ8oCMCw@mail.gmail.com>
	<CA+C_GOVTbeo77pWm7iD7hyJWPNERou6MAFE=6QbR3HDLPdsRUA@mail.gmail.com>
	<CAE7pJ3BPgU=-3NG82SpkwpP9BptW-sCfCYNyn3y-y-MJ59rjOw@mail.gmail.com>
	<CA+C_GOWdWZqxe_QBw+u8X04tX+XGd0i=A-fZ1bzUk-GQyXv8eQ@mail.gmail.com>
	<5057D671.4020503@alteeve.ca>
Message-ID: <CA+C_GOUnA+UuHg8w4zYy2omy=1+14qr=DuQoVWwXcpMqHkeW4g@mail.gmail.com>

Hi thanks for your reply

Beloe is my cluster.conffile

<?xml version="1.0"?>
<cluster config_version="7" name="eccprd">
        <clusternodes>
                <clusternode name="cgceccprd1.combinedgroup.net" nodeid="1">
                        <fence>
                                <method name="ucs-node1"/>
                        </fence>
                </clusternode>
                <clusternode name="cgceccprd2.combinedgroup.net" nodeid="2">
                        <fence>
                                <method name="ucs-node2"/>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <rm>
                <resources>
                        <ip address="172.22.10.230" sleeptime="10"/>
                </resources>
                <service exclusive="1" name="eccsapmnt" recovery="relocate">
                        <ip ref="172.22.10.230"/>
                </service>
        </rm>
        <fencedevices>
                <fencedevice agent="fence_cisco_ucs" ipaddr="172.22.90.61"
login="admin" name="ucs-node1" passwd="duc2Cisco"/>
                <fencedevice agent="fence_cisco_ucs" ipaddr="172.22.90.59"
login="admin" name="ucs-node2" passwd="duc2Cisco"/>
        </fencedevices>
</cluster>

when i try to start cluster on node1, i am geeting this message on mesages:

 tail -f -n 0 /var/log/messages
Sep 18 06:06:02 cgceccprd1 modcluster: Starting service: eccsapmnt on node
Sep 18 06:06:08 cgceccprd1 modcluster: Starting service: eccsapmnt on node
cgceccprd1.combinedgroup.net


but the service is not starting.on luci , it's showing both nodes are
online.but on clustat different

main error getting on messages is

Sep 18 03:35:48 cgceccprd1 fenced[8424]: fencing node
cgceccprd2.combinedgroup.net still retrying
Sep 18 04:06:16 cgceccprd1 fenced[8424]: fencing node
cgceccprd2.combinedgroup.net still retrying
Sep 18 04:36:45 cgceccprd1 fenced[8424]: fencing node
cgceccprd2.combinedgroup.net still retrying
Sep 18 05:07:14 cgceccprd1 fenced[8424]: fencing node
cgceccprd2.combinedgroup.net still retrying
Sep 18 05:37:42 cgceccprd1 fenced[8424]: fencing node
cgceccprd2.combinedgroup.net still retrying

These messages from node1.i am geeting same message on node saying that

cgceccprd2 fenced[8424]: fencing node cgceccprd1.combinedgroup.net still
retrying

i don't know what is problem here.

please help me solve
Regards,
Ben


On Tue, Sep 18, 2012 at 5:03 AM, Digimer <lists at alteeve.ca> wrote:

> On 09/17/2012 04:56 PM, Ben .T.George wrote:
>
>> HI
>>
>> my cman_tool status is showing multicast IP address like below:
>>
>> Multicast addresses : 239.192.140.34
>>
>> i tried to ping on this IP.but it's not pining..i don't know more about
>> multicast configuration..
>>
>> please help me to check multicast more.
>>
>
> Multicast IPs do not represent a specific machine, but a group. Think of
> multicast as a sort of "mailing list"; A machine "subscribes" to the
> multicast group and the switch then says "right, when a packet comes in
> addressed to the multicast group, forward a copy to all subscribed
> machines". With Cisco, you need to create persistent multicast groups in
> the switch, if I recall correctly (I don't use Cisco myself).
>
> In your case though, you're showing "cgceccprd1.combinedgroup.net" as
> "Offline" from both nodes, so the cluster is simply not starting.
>
> Can you paste your cluster.conf please? Also, can you run 'tail -f -n 0
> /var/log/messages' in a terminal on cgceccprd1, try to start the cluster,
> wait for it to fail and then paste the output along with your configuration?
>
> This might help (the "Overview" section at the start, if nothing else).
> There is a sample cluster.conf there.
>
> https://alteeve.ca/w/2-Node_**Red_Hat_KVM_Cluster_Tutorial<https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial>
>
> hope that helps
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120918/f95922d8/attachment.htm>

From lists at alteeve.ca  Tue Sep 18 03:25:57 2012
From: lists at alteeve.ca (Digimer)
Date: Mon, 17 Sep 2012 23:25:57 -0400
Subject: [Linux-cluster] cluster fenced error
In-Reply-To: <CA+C_GOXhjW8aeEb586Oum81Qn+u2ryKNwx1pDHA_a7v6WBrcgw@mail.gmail.com>
References: <CA+C_GOV08RN0bv4v--xVyY1hkU54r4ntVyXvHk4UgUtWjt3sSg@mail.gmail.com>
	<5057D17D.9060108@alteeve.ca>
	<CA+C_GOXhjW8aeEb586Oum81Qn+u2ryKNwx1pDHA_a7v6WBrcgw@mail.gmail.com>
Message-ID: <5057E9C5.60506@alteeve.ca>

You have two problems;

1. The nodes can't talk to each other (via multicast) *or* you are 
taking too long to start each node. Given that you are using luci, I am 
guessing the former. Log into your switch and see if the multicast group 
shown in 'cman_tool status' exists.

2. Your fencing isn't working. Read the man page for fence_cisco_ucs to 
try and debug it.

digimer

PS - Please don't reply directly to me. Keep the conversation public.
PPS - Filter out your passwords. ;)

On 09/17/2012 11:17 PM, Ben .T.George wrote:
> Hi thanks for your reply
>
> Beloe is my cluster.conffile
>
> <?xml version="1.0"?>
> <cluster config_version="7" name="eccprd">
>          <clusternodes>
>                  <clusternode name="cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net>" nodeid="1">
>                          <fence>
>                                  <method name="ucs-node1"/>
>                          </fence>
>                  </clusternode>
>                  <clusternode name="cgceccprd2.combinedgroup.net
> <http://cgceccprd2.combinedgroup.net>" nodeid="2">
>                          <fence>
>                                  <method name="ucs-node2"/>
>                          </fence>
>                  </clusternode>
>          </clusternodes>
>          <cman expected_votes="1" two_node="1"/>
>          <rm>
>                  <resources>
>                          <ip address="172.22.10.230" sleeptime="10"/>
>                  </resources>
>                  <service exclusive="1" name="eccsapmnt"
> recovery="relocate">
>                          <ip ref="172.22.10.230"/>
>                  </service>
>          </rm>
>          <fencedevices>
>                  <fencedevice agent="fence_cisco_ucs"
> ipaddr="172.22.90.61" login="admin" name="ucs-node1" passwd="..."/>
>                  <fencedevice agent="fence_cisco_ucs"
> ipaddr="172.22.90.59" login="admin" name="ucs-node2" passwd="..."/>
>          </fencedevices>
> </cluster>
>
> when i try to start cluster on node1, i am geeting this message on mesages:
>
>   tail -f -n 0 /var/log/messages
> Sep 18 06:06:02 cgceccprd1 modcluster: Starting service: eccsapmnt on node
> Sep 18 06:06:08 cgceccprd1 modcluster: Starting service: eccsapmnt on
> node cgceccprd1.combinedgroup.net <http://cgceccprd1.combinedgroup.net>
>
>
> but the service is not starting.on luci , it's showing both nodes are
> online.but on clustat different
>
> main error getting on messages is
>
> Sep 18 03:35:48 cgceccprd1 fenced[8424]: fencing node
> cgceccprd2.combinedgroup.net <http://cgceccprd2.combinedgroup.net> still
> retrying
> Sep 18 04:06:16 cgceccprd1 fenced[8424]: fencing node
> cgceccprd2.combinedgroup.net <http://cgceccprd2.combinedgroup.net> still
> retrying
> Sep 18 04:36:45 cgceccprd1 fenced[8424]: fencing node
> cgceccprd2.combinedgroup.net <http://cgceccprd2.combinedgroup.net> still
> retrying
> Sep 18 05:07:14 cgceccprd1 fenced[8424]: fencing node
> cgceccprd2.combinedgroup.net <http://cgceccprd2.combinedgroup.net> still
> retrying
> Sep 18 05:37:42 cgceccprd1 fenced[8424]: fencing node
> cgceccprd2.combinedgroup.net <http://cgceccprd2.combinedgroup.net> still
> retrying
>
> These messages from node1.i am geeting same message on node saying that
>
> cgceccprd2 fenced[8424]: fencing node cgceccprd1.combinedgroup.net
> <http://cgceccprd1.combinedgroup.net> still retrying
>
> i don't know what is problem here.
>
> please help me solve
> Regards,
> Ben
>
> On Tue, Sep 18, 2012 at 4:42 AM, Digimer <lists at alteeve.ca
> <mailto:lists at alteeve.ca>> wrote:
>
>     On 09/17/2012 06:07 PM, Ben .T.George wrote:
>
>         Hi
>
>         My cluster is failing to start.
>
>         if i check clustat on node1, status is showing node1 online and
>         node2
>         offline. If the check clustat on node2, node2 is showing online and
>         node1 is offline
>
>         i checked logs.fanced is throwing errors.how can i rectify this
>
>         Sep 17 23:24:54 fenced fencing node cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>         <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>> still retrying
>
>         Sep 17 23:55:06 fenced fencing node cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>         <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>> still retrying
>
>         Sep 18 00:25:19 fenced fencing node cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>         <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>> still retrying
>
>         Sep 18 00:55:03 fenced fenced 3.0.12.1 started
>         Sep 18 00:55:03 fenced failed to get dbus connection
>         Sep 18 00:55:55 fenced fencing node cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>         <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>>
>
>         Sep 18 00:55:55 fenced fence cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>         <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>> dev 0.0 agent none
>         result: error
>
>         no method
>         Sep 18 00:55:55 fenced fence cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>         <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>> failed
>
>         Sep 18 00:55:58 fenced fencing node cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>         <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>>
>
>         Sep 18 00:55:58 fenced fence cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>         <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>> dev 0.0 agent none
>         result: error
>
>         no method
>         Sep 18 00:55:58 fenced fence cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>         <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>> failed
>
>         Sep 18 00:56:01 fenced fencing node cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>         <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>>
>
>         Sep 18 00:56:01 fenced fence cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>         <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>> dev 0.0 agent none
>         result: error
>
>         no method
>         Sep 18 00:56:01 fenced fence cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>         <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>> failed
>
>
>
>         please help me solve this issue
>
>         Regards,
>         Ben
>
>
>     What is your cluster.conf?
>
>     likely you either have no fencing configured, or your fencing is not
>     working. Either way, failing to fence is a critical problem and the
>     cluster will hang, just as you're seeing here. This is by design.
>     Better to hang a cluster than to corrupt it.
>
>     digimer
>
>     --
>     Digimer
>     Papers and Projects: https://alteeve.ca
>
>
>


-- 
Digimer
Papers and Projects: https://alteeve.ca


From lists at alteeve.ca  Tue Sep 18 03:26:53 2012
From: lists at alteeve.ca (Digimer)
Date: Mon, 17 Sep 2012 23:26:53 -0400
Subject: [Linux-cluster] 2 node cluster showing strange behaviour
In-Reply-To: <CA+C_GOUnA+UuHg8w4zYy2omy=1+14qr=DuQoVWwXcpMqHkeW4g@mail.gmail.com>
References: <CA+C_GOW+trr4A6v2+Kgg+tbQpwNRFKTSXKEKMR_etu-S9T=YTw@mail.gmail.com>
	<CAKrd531YAEOGfbAXZcy+aWNnbqp+=NQDi2Yyth7G13AQ8oCMCw@mail.gmail.com>
	<CA+C_GOVTbeo77pWm7iD7hyJWPNERou6MAFE=6QbR3HDLPdsRUA@mail.gmail.com>
	<CAE7pJ3BPgU=-3NG82SpkwpP9BptW-sCfCYNyn3y-y-MJ59rjOw@mail.gmail.com>
	<CA+C_GOWdWZqxe_QBw+u8X04tX+XGd0i=A-fZ1bzUk-GQyXv8eQ@mail.gmail.com>
	<5057D671.4020503@alteeve.ca>
	<CA+C_GOUnA+UuHg8w4zYy2omy=1+14qr=DuQoVWwXcpMqHkeW4g@mail.gmail.com>
Message-ID: <5057E9FD.6020408@alteeve.ca>

On 09/17/2012 11:17 PM, Ben .T.George wrote:
>   tail -f -n 0 /var/log/messages
> Sep 18 06:06:02 cgceccprd1 modcluster: Starting service: eccsapmnt on node
> Sep 18 06:06:08 cgceccprd1 modcluster: Starting service: eccsapmnt on
> node cgceccprd1.combinedgroup.net <http://cgceccprd1.combinedgroup.net>

In addition to my other reply;

There should be a *lot* more output than this. That alone is useless.

-- 
Digimer
Papers and Projects: https://alteeve.ca


From bentech4you at gmail.com  Tue Sep 18 03:37:23 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Tue, 18 Sep 2012 06:37:23 +0300
Subject: [Linux-cluster] cluster fenced error
In-Reply-To: <5057E9C5.60506@alteeve.ca>
References: <CA+C_GOV08RN0bv4v--xVyY1hkU54r4ntVyXvHk4UgUtWjt3sSg@mail.gmail.com>
	<5057D17D.9060108@alteeve.ca>
	<CA+C_GOXhjW8aeEb586Oum81Qn+u2ryKNwx1pDHA_a7v6WBrcgw@mail.gmail.com>
	<5057E9C5.60506@alteeve.ca>
Message-ID: <CA+C_GOW=epBq8N-XFhzQ=Fij+Ahz6J-9B+gASyb5PGi31v=8RQ@mail.gmail.com>

Hi thanks for your reply

This is Cisco UCS machine. yesterday cisco guys created a separate vswitch
for this heartbeat.

regards,
Ben

On Tue, Sep 18, 2012 at 6:25 AM, Digimer <lists at alteeve.ca> wrote:

> You have two problems;
>
> 1. The nodes can't talk to each other (via multicast) *or* you are taking
> too long to start each node. Given that you are using luci, I am guessing
> the former. Log into your switch and see if the multicast group shown in
> 'cman_tool status' exists.
>
> 2. Your fencing isn't working. Read the man page for fence_cisco_ucs to
> try and debug it.
>
> digimer
>
> PS - Please don't reply directly to me. Keep the conversation public.
> PPS - Filter out your passwords. ;)
>
>
> On 09/17/2012 11:17 PM, Ben .T.George wrote:
>
>> Hi thanks for your reply
>>
>> Beloe is my cluster.conffile
>>
>> <?xml version="1.0"?>
>> <cluster config_version="7" name="eccprd">
>>          <clusternodes>
>>                  <clusternode name="cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>>"
>> nodeid="1">
>>
>>                          <fence>
>>                                  <method name="ucs-node1"/>
>>                          </fence>
>>                  </clusternode>
>>                  <clusternode name="cgceccprd2.**combinedgroup.net<http://cgceccprd2.combinedgroup.net>
>> <http://cgceccprd2.**combinedgroup.net<http://cgceccprd2.combinedgroup.net>>"
>> nodeid="2">
>>
>>                          <fence>
>>                                  <method name="ucs-node2"/>
>>                          </fence>
>>                  </clusternode>
>>          </clusternodes>
>>          <cman expected_votes="1" two_node="1"/>
>>          <rm>
>>                  <resources>
>>                          <ip address="172.22.10.230" sleeptime="10"/>
>>                  </resources>
>>                  <service exclusive="1" name="eccsapmnt"
>> recovery="relocate">
>>                          <ip ref="172.22.10.230"/>
>>                  </service>
>>          </rm>
>>          <fencedevices>
>>                  <fencedevice agent="fence_cisco_ucs"
>> ipaddr="172.22.90.61" login="admin" name="ucs-node1" passwd="..."/>
>>                  <fencedevice agent="fence_cisco_ucs"
>> ipaddr="172.22.90.59" login="admin" name="ucs-node2" passwd="..."/>
>>
>>          </fencedevices>
>> </cluster>
>>
>> when i try to start cluster on node1, i am geeting this message on
>> mesages:
>>
>>   tail -f -n 0 /var/log/messages
>> Sep 18 06:06:02 cgceccprd1 modcluster: Starting service: eccsapmnt on node
>> Sep 18 06:06:08 cgceccprd1 modcluster: Starting service: eccsapmnt on
>> node cgceccprd1.combinedgroup.net <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>
>>
>>
>> but the service is not starting.on luci , it's showing both nodes are
>> online.but on clustat different
>>
>> main error getting on messages is
>>
>> Sep 18 03:35:48 cgceccprd1 fenced[8424]: fencing node
>> cgceccprd2.combinedgroup.net <http://cgceccprd2.**combinedgroup.net<http://cgceccprd2.combinedgroup.net>>
>> still
>>
>> retrying
>> Sep 18 04:06:16 cgceccprd1 fenced[8424]: fencing node
>> cgceccprd2.combinedgroup.net <http://cgceccprd2.**combinedgroup.net<http://cgceccprd2.combinedgroup.net>>
>> still
>>
>> retrying
>> Sep 18 04:36:45 cgceccprd1 fenced[8424]: fencing node
>> cgceccprd2.combinedgroup.net <http://cgceccprd2.**combinedgroup.net<http://cgceccprd2.combinedgroup.net>>
>> still
>>
>> retrying
>> Sep 18 05:07:14 cgceccprd1 fenced[8424]: fencing node
>> cgceccprd2.combinedgroup.net <http://cgceccprd2.**combinedgroup.net<http://cgceccprd2.combinedgroup.net>>
>> still
>>
>> retrying
>> Sep 18 05:37:42 cgceccprd1 fenced[8424]: fencing node
>> cgceccprd2.combinedgroup.net <http://cgceccprd2.**combinedgroup.net<http://cgceccprd2.combinedgroup.net>>
>> still
>>
>> retrying
>>
>> These messages from node1.i am geeting same message on node saying that
>>
>> cgceccprd2 fenced[8424]: fencing node cgceccprd1.combinedgroup.net
>> <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>>
>> still retrying
>>
>>
>> i don't know what is problem here.
>>
>> please help me solve
>> Regards,
>> Ben
>>
>> On Tue, Sep 18, 2012 at 4:42 AM, Digimer <lists at alteeve.ca
>> <mailto:lists at alteeve.ca>> wrote:
>>
>>     On 09/17/2012 06:07 PM, Ben .T.George wrote:
>>
>>         Hi
>>
>>         My cluster is failing to start.
>>
>>         if i check clustat on node1, status is showing node1 online and
>>         node2
>>         offline. If the check clustat on node2, node2 is showing online
>> and
>>         node1 is offline
>>
>>         i checked logs.fanced is throwing errors.how can i rectify this
>>
>>         Sep 17 23:24:54 fenced fencing node cgceccprd1.combinedgroup.net
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>         <http://cgceccprd1.__combinedg**roup.net<http://combinedgroup.net>
>>
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>>>
>> still retrying
>>
>>         Sep 17 23:55:06 fenced fencing node cgceccprd1.combinedgroup.net
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>         <http://cgceccprd1.__combinedg**roup.net<http://combinedgroup.net>
>>
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>>>
>> still retrying
>>
>>         Sep 18 00:25:19 fenced fencing node cgceccprd1.combinedgroup.net
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>         <http://cgceccprd1.__combinedg**roup.net<http://combinedgroup.net>
>>
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>>>
>> still retrying
>>
>>         Sep 18 00:55:03 fenced fenced 3.0.12.1 started
>>         Sep 18 00:55:03 fenced failed to get dbus connection
>>         Sep 18 00:55:55 fenced fencing node cgceccprd1.combinedgroup.net
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>         <http://cgceccprd1.__combinedg**roup.net<http://combinedgroup.net>
>>
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >>
>>
>>         Sep 18 00:55:55 fenced fence cgceccprd1.combinedgroup.net
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>         <http://cgceccprd1.__combinedg**roup.net<http://combinedgroup.net>
>>
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>>>
>> dev 0.0 agent none
>>         result: error
>>
>>         no method
>>         Sep 18 00:55:55 fenced fence cgceccprd1.combinedgroup.net
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>         <http://cgceccprd1.__combinedg**roup.net<http://combinedgroup.net>
>>
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>>>
>> failed
>>
>>         Sep 18 00:55:58 fenced fencing node cgceccprd1.combinedgroup.net
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>         <http://cgceccprd1.__combinedg**roup.net<http://combinedgroup.net>
>>
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >>
>>
>>         Sep 18 00:55:58 fenced fence cgceccprd1.combinedgroup.net
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>         <http://cgceccprd1.__combinedg**roup.net<http://combinedgroup.net>
>>
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>>>
>> dev 0.0 agent none
>>         result: error
>>
>>         no method
>>         Sep 18 00:55:58 fenced fence cgceccprd1.combinedgroup.net
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>         <http://cgceccprd1.__combinedg**roup.net<http://combinedgroup.net>
>>
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>>>
>> failed
>>
>>         Sep 18 00:56:01 fenced fencing node cgceccprd1.combinedgroup.net
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>         <http://cgceccprd1.__combinedg**roup.net<http://combinedgroup.net>
>>
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >>
>>
>>         Sep 18 00:56:01 fenced fence cgceccprd1.combinedgroup.net
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>         <http://cgceccprd1.__combinedg**roup.net<http://combinedgroup.net>
>>
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>>>
>> dev 0.0 agent none
>>         result: error
>>
>>         no method
>>         Sep 18 00:56:01 fenced fence cgceccprd1.combinedgroup.net
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>         <http://cgceccprd1.__combinedg**roup.net<http://combinedgroup.net>
>>
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>>>
>> failed
>>
>>
>>
>>         please help me solve this issue
>>
>>         Regards,
>>         Ben
>>
>>
>>     What is your cluster.conf?
>>
>>     likely you either have no fencing configured, or your fencing is not
>>     working. Either way, failing to fence is a critical problem and the
>>     cluster will hang, just as you're seeing here. This is by design.
>>     Better to hang a cluster than to corrupt it.
>>
>>     digimer
>>
>>     --
>>     Digimer
>>     Papers and Projects: https://alteeve.ca
>>
>>
>>
>>
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca
>


-- 
Yours Sincerely

*#!/usr/bin/env python
#Mysignature.py :)*

Signature = " " " Ben.T.George \n
                  Linux System Administrator \n
                  Diyar United Company \n
                  kuwait \n
                  Phone : +965 - 50629829 \n " " "

Print Signature
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120918/24667f87/attachment.htm>

From lists at alteeve.ca  Tue Sep 18 03:37:56 2012
From: lists at alteeve.ca (Digimer)
Date: Mon, 17 Sep 2012 23:37:56 -0400
Subject: [Linux-cluster] 2 node cluster showing strange behaviour
In-Reply-To: <CA+C_GOVPF=N26=3Tv6u6Qv2YVtGNuSBRu1htUROi-U5WSzqnEg@mail.gmail.com>
References: <CA+C_GOW+trr4A6v2+Kgg+tbQpwNRFKTSXKEKMR_etu-S9T=YTw@mail.gmail.com>
	<CAKrd531YAEOGfbAXZcy+aWNnbqp+=NQDi2Yyth7G13AQ8oCMCw@mail.gmail.com>
	<CA+C_GOVTbeo77pWm7iD7hyJWPNERou6MAFE=6QbR3HDLPdsRUA@mail.gmail.com>
	<CAE7pJ3BPgU=-3NG82SpkwpP9BptW-sCfCYNyn3y-y-MJ59rjOw@mail.gmail.com>
	<CA+C_GOWdWZqxe_QBw+u8X04tX+XGd0i=A-fZ1bzUk-GQyXv8eQ@mail.gmail.com>
	<5057D671.4020503@alteeve.ca>
	<CA+C_GOUnA+UuHg8w4zYy2omy=1+14qr=DuQoVWwXcpMqHkeW4g@mail.gmail.com>
	<5057E9FD.6020408@alteeve.ca>
	<CA+C_GOVPF=N26=3Tv6u6Qv2YVtGNuSBRu1htUROi-U5WSzqnEg@mail.gmail.com>
Message-ID: <5057EC94.9030904@alteeve.ca>

Reply to the list, not to me directly.

If that is all you saw then you didn't start the cluster. You should see 
output like;

https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Starting_the_Cluster_for_the_First_Time

digimer

On 09/17/2012 11:32 PM, Ben .T.George wrote:
> Hi
>
> no sir, i am getting only these 2 lines. After 2-3 seconds that same
> fencd message came
>
> cgceccprd1 fenced[8424]: fencing node cgceccprd2.combinedgroup.net
> <http://cgceccprd2.combinedgroup.net> still retrying
>
> i am not getting any message in /var/log/cluster/rgmanager.log
>
> on that log directory only updating file fencd.log
>
>
> regards,
> Ben
>
> On Tue, Sep 18, 2012 at 6:26 AM, Digimer <lists at alteeve.ca
> <mailto:lists at alteeve.ca>> wrote:
>
>     On 09/17/2012 11:17 PM, Ben .T.George wrote:
>
>            tail -f -n 0 /var/log/messages
>         Sep 18 06:06:02 cgceccprd1 modcluster: Starting service:
>         eccsapmnt on node
>         Sep 18 06:06:08 cgceccprd1 modcluster: Starting service:
>         eccsapmnt on
>         node cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>         <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>>
>
>
>     In addition to my other reply;
>
>     There should be a *lot* more output than this. That alone is useless.
>
>
>     --
>     Digimer
>     Papers and Projects: https://alteeve.ca
>
>
>
>
> --
> Yours Sincerely
>
> *#!/usr/bin/env python
> #Mysignature.py :)*
>
> Signature = " " " Ben.T.George \n
>                    Linux System Administrator \n
>                    Diyar United Company \n
>                    kuwait \n
>                    Phone : +965 - 50629829 \n " ""
>
> Print Signature
>


-- 
Digimer
Papers and Projects: https://alteeve.ca


From lists at alteeve.ca  Tue Sep 18 03:40:04 2012
From: lists at alteeve.ca (Digimer)
Date: Mon, 17 Sep 2012 23:40:04 -0400
Subject: [Linux-cluster] cluster fenced error
In-Reply-To: <CA+C_GOW=epBq8N-XFhzQ=Fij+Ahz6J-9B+gASyb5PGi31v=8RQ@mail.gmail.com>
References: <CA+C_GOV08RN0bv4v--xVyY1hkU54r4ntVyXvHk4UgUtWjt3sSg@mail.gmail.com>
	<5057D17D.9060108@alteeve.ca>
	<CA+C_GOXhjW8aeEb586Oum81Qn+u2ryKNwx1pDHA_a7v6WBrcgw@mail.gmail.com>
	<5057E9C5.60506@alteeve.ca>
	<CA+C_GOW=epBq8N-XFhzQ=Fij+Ahz6J-9B+gASyb5PGi31v=8RQ@mail.gmail.com>
Message-ID: <5057ED14.4030601@alteeve.ca>

On 09/17/2012 11:37 PM, Ben .T.George wrote:
> Hi thanks for your reply
>
> This is Cisco UCS machine. yesterday cisco guys created a separate
> vswitch for this heartbeat.
>
> regards,
> Ben
I don't use Cisco, I don't know what any of this means. Ask your network 
people to ensure that a static multicast group exists for the multicast 
IP reported by 'cman_tool status'.

Then fix your fencing. If you don't, your cluster will hang the first 
time anything goes wrong.

-- 
Digimer
Papers and Projects: https://alteeve.ca


From bentech4you at gmail.com  Tue Sep 18 03:50:44 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Tue, 18 Sep 2012 06:50:44 +0300
Subject: [Linux-cluster] 2 node cluster showing strange behaviour
In-Reply-To: <5057EC94.9030904@alteeve.ca>
References: <CA+C_GOW+trr4A6v2+Kgg+tbQpwNRFKTSXKEKMR_etu-S9T=YTw@mail.gmail.com>
	<CAKrd531YAEOGfbAXZcy+aWNnbqp+=NQDi2Yyth7G13AQ8oCMCw@mail.gmail.com>
	<CA+C_GOVTbeo77pWm7iD7hyJWPNERou6MAFE=6QbR3HDLPdsRUA@mail.gmail.com>
	<CAE7pJ3BPgU=-3NG82SpkwpP9BptW-sCfCYNyn3y-y-MJ59rjOw@mail.gmail.com>
	<CA+C_GOWdWZqxe_QBw+u8X04tX+XGd0i=A-fZ1bzUk-GQyXv8eQ@mail.gmail.com>
	<5057D671.4020503@alteeve.ca>
	<CA+C_GOUnA+UuHg8w4zYy2omy=1+14qr=DuQoVWwXcpMqHkeW4g@mail.gmail.com>
	<5057E9FD.6020408@alteeve.ca>
	<CA+C_GOVPF=N26=3Tv6u6Qv2YVtGNuSBRu1htUROi-U5WSzqnEg@mail.gmail.com>
	<5057EC94.9030904@alteeve.ca>
Message-ID: <CA+C_GOUPfwE5_oBGd-AdZ3Qk3vwcKxnpCQkq8QswNnYHaLnw5A@mail.gmail.com>

Hi

yes on my test machine i created this same cluster, at that time isaw these
cluster starting messages..but on this production machine, it;s only
showing that 2 lines, on node1 and on node2 , nothing

regards,
Ben

On Tue, Sep 18, 2012 at 6:37 AM, Digimer <lists at alteeve.ca> wrote:

> Reply to the list, not to me directly.
>
> If that is all you saw then you didn't start the cluster. You should see
> output like;
>
> https://alteeve.ca/w/2-Node_**Red_Hat_KVM_Cluster_Tutorial#**
> Starting_the_Cluster_for_the_**First_Time<https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Starting_the_Cluster_for_the_First_Time>
>
> digimer
>
>
> On 09/17/2012 11:32 PM, Ben .T.George wrote:
>
>> Hi
>>
>> no sir, i am getting only these 2 lines. After 2-3 seconds that same
>> fencd message came
>>
>> cgceccprd1 fenced[8424]: fencing node cgceccprd2.combinedgroup.net
>> <http://cgceccprd2.**combinedgroup.net<http://cgceccprd2.combinedgroup.net>>
>> still retrying
>>
>>
>> i am not getting any message in /var/log/cluster/rgmanager.log
>>
>> on that log directory only updating file fencd.log
>>
>>
>> regards,
>> Ben
>>
>> On Tue, Sep 18, 2012 at 6:26 AM, Digimer <lists at alteeve.ca
>> <mailto:lists at alteeve.ca>> wrote:
>>
>>     On 09/17/2012 11:17 PM, Ben .T.George wrote:
>>
>>            tail -f -n 0 /var/log/messages
>>         Sep 18 06:06:02 cgceccprd1 modcluster: Starting service:
>>         eccsapmnt on node
>>         Sep 18 06:06:08 cgceccprd1 modcluster: Starting service:
>>         eccsapmnt on
>>         node cgceccprd1.combinedgroup.net
>>         <http://cgceccprd1.**combinedgroup.net<http://cgceccprd1.combinedgroup.net>
>> >
>>         <http://cgceccprd1.__combinedgroup.net
>>
>>         <http://cgceccprd1.combinedgroup.net>>
>>
>>
>>     In addition to my other reply;
>>
>>     There should be a *lot* more output than this. That alone is useless.
>>
>>
>>     --
>>     Digimer
>>     Papers and Projects: https://alteeve.ca
>>
>>
>>
>>
>>
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120918/6490e0be/attachment.htm>

From lists at alteeve.ca  Tue Sep 18 04:04:35 2012
From: lists at alteeve.ca (Digimer)
Date: Tue, 18 Sep 2012 00:04:35 -0400
Subject: [Linux-cluster] 2 node cluster showing strange behaviour
In-Reply-To: <CA+C_GOUPfwE5_oBGd-AdZ3Qk3vwcKxnpCQkq8QswNnYHaLnw5A@mail.gmail.com>
References: <CA+C_GOW+trr4A6v2+Kgg+tbQpwNRFKTSXKEKMR_etu-S9T=YTw@mail.gmail.com>
	<CAKrd531YAEOGfbAXZcy+aWNnbqp+=NQDi2Yyth7G13AQ8oCMCw@mail.gmail.com>
	<CA+C_GOVTbeo77pWm7iD7hyJWPNERou6MAFE=6QbR3HDLPdsRUA@mail.gmail.com>
	<CAE7pJ3BPgU=-3NG82SpkwpP9BptW-sCfCYNyn3y-y-MJ59rjOw@mail.gmail.com>
	<CA+C_GOWdWZqxe_QBw+u8X04tX+XGd0i=A-fZ1bzUk-GQyXv8eQ@mail.gmail.com>
	<5057D671.4020503@alteeve.ca>
	<CA+C_GOUnA+UuHg8w4zYy2omy=1+14qr=DuQoVWwXcpMqHkeW4g@mail.gmail.com>
	<5057E9FD.6020408@alteeve.ca>
	<CA+C_GOVPF=N26=3Tv6u6Qv2YVtGNuSBRu1htUROi-U5WSzqnEg@mail.gmail.com>
	<5057EC94.9030904@alteeve.ca>
	<CA+C_GOUPfwE5_oBGd-AdZ3Qk3vwcKxnpCQkq8QswNnYHaLnw5A@mail.gmail.com>
Message-ID: <5057F2D3.8040401@alteeve.ca>

Probably because the cluster is hung and still running. Disable cman and 
rgmanager so they don't start at boot, reboot both nodes, start tail'ing 
syslog and then start the cluster back up.

On 09/17/2012 11:50 PM, Ben .T.George wrote:
> Hi
>
> yes on my test machine i created this same cluster, at that time isaw
> these cluster starting messages..but on this production machine, it;s
> only showing that 2 lines, on node1 and on node2 , nothing
>
> regards,
> Ben
>
> On Tue, Sep 18, 2012 at 6:37 AM, Digimer <lists at alteeve.ca
> <mailto:lists at alteeve.ca>> wrote:
>
>     Reply to the list, not to me directly.
>
>     If that is all you saw then you didn't start the cluster. You should
>     see output like;
>
>     https://alteeve.ca/w/2-Node___Red_Hat_KVM_Cluster_Tutorial#__Starting_the_Cluster_for_the___First_Time
>     <https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Starting_the_Cluster_for_the_First_Time>
>
>     digimer
>
>
>     On 09/17/2012 11:32 PM, Ben .T.George wrote:
>
>         Hi
>
>         no sir, i am getting only these 2 lines. After 2-3 seconds that same
>         fencd message came
>
>         cgceccprd1 fenced[8424]: fencing node
>         cgceccprd2.combinedgroup.net <http://cgceccprd2.combinedgroup.net>
>         <http://cgceccprd2.__combinedgroup.net
>         <http://cgceccprd2.combinedgroup.net>> still retrying
>
>
>         i am not getting any message in /var/log/cluster/rgmanager.log
>
>         on that log directory only updating file fencd.log
>
>
>         regards,
>         Ben
>
>         On Tue, Sep 18, 2012 at 6:26 AM, Digimer <lists at alteeve.ca
>         <mailto:lists at alteeve.ca>
>         <mailto:lists at alteeve.ca <mailto:lists at alteeve.ca>>> wrote:
>
>              On 09/17/2012 11:17 PM, Ben .T.George wrote:
>
>                     tail -f -n 0 /var/log/messages
>                  Sep 18 06:06:02 cgceccprd1 modcluster: Starting service:
>                  eccsapmnt on node
>                  Sep 18 06:06:08 cgceccprd1 modcluster: Starting service:
>                  eccsapmnt on
>                  node cgceccprd1.combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>
>                  <http://cgceccprd1.__combinedgroup.net
>         <http://cgceccprd1.combinedgroup.net>>
>                  <http://cgceccprd1.__combinedgroup.net
>         <http://combinedgroup.net>
>
>                  <http://cgceccprd1.combinedgroup.net>>
>
>
>              In addition to my other reply;
>
>              There should be a *lot* more output than this. That alone
>         is useless.
>
>
>              --
>              Digimer
>              Papers and Projects: https://alteeve.ca
>
>
>
>
>
>
>     --
>     Digimer
>     Papers and Projects: https://alteeve.ca
>
>
>
>
>
>


-- 
Digimer
Papers and Projects: https://alteeve.ca


From urgrue at bulbous.org  Tue Sep 18 13:26:43 2012
From: urgrue at bulbous.org (urgrue)
Date: Tue, 18 Sep 2012 16:26:43 +0300
Subject: [Linux-cluster] HA LVM won't strip tags
Message-ID: <1347974803.23203.140661129451117.58F9AF17@webmail.messagingengine.com>

Hi,
Using tag-based HA LVM. If NodeA loses access to the disks, it's
naturally unable to strip the tags out.
NodeB will not recover the service because the tags of NodeA are in the
VG. Even though NodeB can fully communicate with the NodeA, quorum is
established, etc.

Question 1) couldn't I somehow set up NodeB to fence NodeA and strip the
tags? This is what it does if it thinks NodeA is offline, anyway.

Question 2) How can I recover most gracefully from this? I'm stuck in a
situation where NodeB has access to disks but refuses to activate them
due to the tags of NodeA being there.
I can manually remove the tag, or I can fence NodeA myself and then
NodeB will happily declare "I can claim this volume group" and survive.
The former is a little scary in terms of potential human error, while
the latter is awful since other services might be successfully still
running.
Any better options out there?

Thanks in advance.


From mauro.parente at innovationconsulting.it  Tue Sep 18 14:18:00 2012
From: mauro.parente at innovationconsulting.it (Mauro Parente IC)
Date: Tue, 18 Sep 2012 16:18:00 +0200
Subject: [Linux-cluster] GFS2 filesystem freeze issue
Message-ID: <50588298.8030306@innovationconsulting.it>

Hi to all,

I have setup a 2 node's RHEL cluster that have access to an iscsi volume
formatted in GFS2.
All work perfectly, except when i try to freeze/unfreeze
the filesystem using gfs2_tool.

1st case (all work's fine):
  root at node1# gfs2_tool freeze /mnt/gfs2-data
  root at node1# gfs2_tool unfreeze /mnt/gfs2-data

2nd case (error):
  root at node1# gfs2_tool freeze /mnt/gfs2-data
  root at node1# echo 1 > /mnt/gfs2-data/testfile

at this point the echo command is (correctly) waiting for write. In
another session:
  root at node1# gfs2_tool unfreeze /mnt/gfs2-data

even this and all subsequent command involving this mountpoint remain
"freezed"


Regards

-- 
*Mauro Parente*
INNOVATION CONSULTING
Via F.lli Biancheri, 1 - 18012 Bordighera - ITALY
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120918/90372c0f/attachment.htm>

From rpeterso at redhat.com  Tue Sep 18 14:29:54 2012
From: rpeterso at redhat.com (Bob Peterson)
Date: Tue, 18 Sep 2012 10:29:54 -0400 (EDT)
Subject: [Linux-cluster] GFS2 filesystem freeze issue
In-Reply-To: <50588298.8030306@innovationconsulting.it>
Message-ID: <2091891225.1477638.1347978594384.JavaMail.root@redhat.com>

----- Original Message -----
| Hi to all,
| 
| I have setup a 2 node's RHEL cluster that have access to an iscsi
| volume
| formatted in GFS2.
| All work perfectly, except when i try to freeze/unfreeze
| the filesystem using gfs2_tool.
| 
| 1st case (all work's fine):
|   root at node1# gfs2_tool freeze /mnt/gfs2-data
|   root at node1# gfs2_tool unfreeze /mnt/gfs2-data
| 
| 2nd case (error):
|   root at node1# gfs2_tool freeze /mnt/gfs2-data
|   root at node1# echo 1 > /mnt/gfs2-data/testfile
| 
| at this point the echo command is (correctly) waiting for write. In
| another session:
|   root at node1# gfs2_tool unfreeze /mnt/gfs2-data
| 
| even this and all subsequent command involving this mountpoint remain
| "freezed"
| 
| 
| Regards
| 
| --
| *Mauro Parente*
| INNOVATION CONSULTING
| Via F.lli Biancheri, 1 - 18012 Bordighera - ITALY
| 
| --
| Linux-cluster mailing list
| Linux-cluster at redhat.com
| https://www.redhat.com/mailman/listinfo/linux-cluster

Hi Mauro,

I'm not sure if this is the same issue, but Eric Sandeen posted a GFS2
patch to cluster-devel that may solve the problem:
https://www.redhat.com/archives/cluster-devel/2012-September/msg00014.html
As far as I know, it hasn't been pushed upstream yet, but will be soon.

What kernel are you running when you get this?

Regards,

Bob Peterson
Red Hat File Systems


From gladiator0520 at gmail.com  Tue Sep 18 15:18:08 2012
From: gladiator0520 at gmail.com (H T)
Date: Tue, 18 Sep 2012 11:18:08 -0400
Subject: [Linux-cluster] FATAL: Module lock_dlm not found
Message-ID: <CAG0AA8B1jwgKW81r2kaXGRo5e4BEhBUWw19D5=odZQZ=ZHQ3Bw@mail.gmail.com>

Hi,

I'm trying to install a HA cluster + CONGA using 2 nodes with RH Linux 5.7.
I installed luci and ricci on both nodes with no issues but when starting
cman, I get the following error "FATAL: Module lock_dlm not found".  I
looked in /etc/init.d/cman script and the line that seems to be causing the
error to be fired is in the function load_modules.  Here is the line:
errmsg=$( /sbin/modprobe lock_dlm 2>&1 ) || return 1.  Any help will be
greatly appreciated.

best regards,
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120918/64fd8761/attachment.htm>

From volker at ixolution.de  Tue Sep 18 21:03:27 2012
From: volker at ixolution.de (Volker Dormeyer)
Date: Tue, 18 Sep 2012 23:03:27 +0200
Subject: [Linux-cluster] HA LVM won't strip tags
In-Reply-To: <1347974803.23203.140661129451117.58F9AF17@webmail.messagingengine.com>
References: <1347974803.23203.140661129451117.58F9AF17@webmail.messagingengine.com>
Message-ID: <20120918210327.GA29100@dijkstra>

Hi,

On Tue, Sep 18, 2012 at 04:26:43PM +0300,
urgrue <urgrue at bulbous.org> wrote:
> Using tag-based HA LVM. If NodeA loses access to the disks, it's
> naturally unable to strip the tags out.

> NodeB will not recover the service because the tags of NodeA are in the
> VG. Even though NodeB can fully communicate with the NodeA, quorum is
> established, etc.
> 
> Question 1) couldn't I somehow set up NodeB to fence NodeA and strip the
> tags? This is what it does if it thinks NodeA is offline, anyway.

You can use the self_fence option of the LVM resource agent.
If the node which loses disk-access is not able to clean-up the tags,
it tries to reboot itself by issuing "reboot -fn".

Best Regards,
Volker


From urgrue at bulbous.org  Wed Sep 19 08:09:24 2012
From: urgrue at bulbous.org (urgrue)
Date: Wed, 19 Sep 2012 11:09:24 +0300
Subject: [Linux-cluster] HA LVM won't strip tags
In-Reply-To: <20120918210327.GA29100@dijkstra>
References: <1347974803.23203.140661129451117.58F9AF17@webmail.messagingengine.com>
	<20120918210327.GA29100@dijkstra>
Message-ID: <1348042164.20733.140661129833833.16591A23@webmail.messagingengine.com>

On Wed, Sep 19, 2012, at 00:03, Volker Dormeyer wrote:
> You can use the self_fence option of the LVM resource agent.
> If the node which loses disk-access is not able to clean-up the tags,
> it tries to reboot itself by issuing "reboot -fn".

I have that option on and was wondering why it didn't quite seem to work
as I expected.
Looking back through my logs, I can see the reason: sometimes the
unmount succeeds, therefore self_fence doesn't take effect at that
point. It should then try to strip the lvm tags and self_fence if THAT
fails, but it doesn't do this part at all?

What I do is put the LUN in a 'not ready' state, so it becomes
unreadable (and unwriteable). Here's an example of where it failed:

On node 2:
Sep 18 14:56:43 rgmanager [fs] fs:fs_sanlv: is_alive: failed write test
on [/var/lib/mysql]. Return code: 1
Sep 18 14:56:43 rgmanager [fs] fs:fs_sanlv: Mount point is not
accessible!
Sep 18 14:56:43 rgmanager status on fs "fs_sanlv" returned 1 (generic
error)
Sep 18 14:56:43 rgmanager Stopping service service:srv_mysql
Sep 18 14:56:43 rgmanager [mysql] Verifying Configuration Of
mysql:res_mysql
Sep 18 14:56:43 rgmanager [mysql] Verifying Configuration Of
mysql:res_mysql > Succeed
Sep 18 14:56:44 rgmanager [mysql] Stopping Service mysql:res_mysql
Sep 18 14:56:48 rgmanager [mysql] Stopping Service mysql:res_mysql >
Succeed
Sep 18 14:56:48 rgmanager [ip] Removing IPv4 address 10.1.0.7/22 from
bond1
Sep 18 14:56:58 rgmanager [fs] unmounting /var/lib/mysql
Sep 18 14:57:00 rgmanager Service service:srv_mysql is recovering
Sep 18 14:57:00 rgmanager Sent remote-start request to 1

Now from node 1:
Sep 18 14:57:00 rgmanager Recovering failed service service:srv_mysql
Sep 18 14:57:02 rgmanager [lvm] Starting volume group, sanvg
Sep 18 14:57:02 rgmanager [lvm] Someone else owns this volume group
Sep 18 14:57:02 rgmanager start on lvm "res_sanvg" returned 1 (generic
error)
Sep 18 14:57:02 rgmanager #68: Failed to start service:srv_mysql; return
value: 1

Here's the relevant part of cluster.conf:
                <resources>
                        <lvm name="res_sanvg" self_fence="on"
                        vg_name="sanvg"/>
                        <ip address="10.1.0.7/22" sleeptime="10"/>
                        <fs device="/dev/sanvg/sanlv" fsid="29088"
                        mountpoint="/var/lib/mysql" name="fs_sanlv"
                        options="noatime" self_fence="on"/>
                        <mysql config_file="/etc/my.cnf"
                        listen_address="10.1.0.7" name="res_mysql"
                        shutdown_wait="10" startup_wait="5"/>
                </resources>
                <service domain="DC0" name="srv_mysql"
                recovery="relocate">
                        <lvm ref="res_sanvg">
                                <fs ref="fs_sanlv">
                                        <ip ref="10.1.0.7/22">
                                                <mysql ref="res_mysql"/>
                                        </ip>
                                </fs>
                        </lvm>
                </service>


Shouldn't there be an "[lvm] stripping tags from xxxx" after the umount,
which should fail and result in self_fence?

Thanks.


From a.holway at syseleven.de  Thu Sep 20 14:08:47 2012
From: a.holway at syseleven.de (Andrew Holway)
Date: Thu, 20 Sep 2012 16:08:47 +0200
Subject: [Linux-cluster] GFS fail with iozone
Message-ID: <6BC17920-7E5C-44A8-B434-C710CBD7F864@syseleven.de>

Hello,

I have set up a 4 node cluster. They are interconnected with an IPoIB (connected mode)

Whist running a benchmark with IOzone I got the following errors:

IO seems to have halted.

Thanks,

Andrew

Sep 20 16:01:57 node001 kernel: INFO: task iozone:15816 blocked for more than 120 seconds.
Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15816  15374 0x00000080
Sep 20 16:01:57 node001 kernel: ffff880fd5ebbac8 0000000000000086 ffff880fd5ebba38 ffffffff81276b66
Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ebba58 ffffffff81091f97
Sep 20 16:01:57 node001 kernel: ffff880fe238c638 ffff880fd5ebbfd8 000000000000fb88 ffff880fe238c638
Sep 20 16:01:57 node001 kernel: Call Trace:
Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 20 16:01:57 node001 kernel: INFO: task iozone:15818 blocked for more than 120 seconds.
Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15818  15374 0x00000080
Sep 20 16:01:57 node001 kernel: ffff880fbe5b3ac8 0000000000000082 0000000000000000 ffff881ff95587a0
Sep 20 16:01:57 node001 kernel: ffff881000000002 ffff88100ee13048 00000000be5b3a58 00000040ffffffff
Sep 20 16:01:57 node001 kernel: ffff88100eed1ab8 ffff880fbe5b3fd8 000000000000fb88 ffff88100eed1ab8
Sep 20 16:01:57 node001 kernel: Call Trace:
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 20 16:01:57 node001 kernel: INFO: task iozone:15820 blocked for more than 120 seconds.
Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000008     0 15820  15374 0x00000080
Sep 20 16:01:57 node001 kernel: ffff880ffed7bac8 0000000000000086 0000000000000000 ffffffff81276b66
Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000fed7ba58 00000040ffffffff
Sep 20 16:01:57 node001 kernel: ffff880fbdd51ab8 ffff880ffed7bfd8 000000000000fb88 ffff880fbdd51ab8
Sep 20 16:01:57 node001 kernel: Call Trace:
Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 20 16:01:57 node001 kernel: INFO: task iozone:15822 blocked for more than 120 seconds.
Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15822  15374 0x00000080
Sep 20 16:01:57 node001 kernel: ffff880fd5dddac8 0000000000000086 0000000000000000 ffffffff81276b66
Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ddda58 ffffffff81091f97
Sep 20 16:01:57 node001 kernel: ffff880fe238d098 ffff880fd5dddfd8 000000000000fb88 ffff880fe238d098
Sep 20 16:01:57 node001 kernel: Call Trace:
Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 20 16:01:57 node001 kernel: INFO: task iozone:15824 blocked for more than 120 seconds.
Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15824  15374 0x00000080
Sep 20 16:01:57 node001 kernel: ffff880fbe5edac8 0000000000000086 0000000000000000 ffffffff81276b66
Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000be5eda58 00000040ffffffff
Sep 20 16:01:57 node001 kernel: ffff880ff69085f8 ffff880fbe5edfd8 000000000000fb88 ffff880ff69085f8
Sep 20 16:01:57 node001 kernel: Call Trace:
Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 20 16:01:57 node001 kernel: INFO: task iozone:15826 blocked for more than 120 seconds.
Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15826  15374 0x00000080
Sep 20 16:01:57 node001 kernel: ffff880fbe7cfac8 0000000000000086 0000000000000000 ffffffff81276b66
Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fbe7cfa58 ffffffff81091f97
Sep 20 16:01:57 node001 kernel: ffff88100ddcbab8 ffff880fbe7cffd8 000000000000fb88 ffff88100ddcbab8
Sep 20 16:01:57 node001 kernel: Call Trace:
Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 20 16:01:57 node001 kernel: INFO: task iozone:15828 blocked for more than 120 seconds.
Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15828  15374 0x00000080
Sep 20 16:01:57 node001 kernel: ffff88100684bac8 0000000000000086 0000000000000000 ffffffff81276b66
Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 000000000684ba58 00000040ffffffff
Sep 20 16:01:57 node001 kernel: ffff88100edc7af8 ffff88100684bfd8 000000000000fb88 ffff88100edc7af8
Sep 20 16:01:57 node001 kernel: Call Trace:
Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 20 16:01:57 node001 kernel: INFO: task iozone:15830 blocked for more than 120 seconds.
Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000000     0 15830  15374 0x00000080
Sep 20 16:01:57 node001 kernel: ffff880fbdd0fac8 0000000000000082 0000000000000000 ffffffff81276b66
Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000bdd0fa58 00000040ffffffff
Sep 20 16:01:57 node001 kernel: ffff88100de93098 ffff880fbdd0ffd8 000000000000fb88 ffff88100de93098
Sep 20 16:01:57 node001 kernel: Call Trace:
Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b


From a.holway at syseleven.de  Thu Sep 20 14:14:27 2012
From: a.holway at syseleven.de (Andrew Holway)
Date: Thu, 20 Sep 2012 16:14:27 +0200
Subject: [Linux-cluster] GFS fail with iozone
In-Reply-To: <6BC17920-7E5C-44A8-B434-C710CBD7F864@syseleven.de>
References: <6BC17920-7E5C-44A8-B434-C710CBD7F864@syseleven.de>
Message-ID: <F1DE4351-2CF4-4D96-93DD-F128DB6F4252@syseleven.de>

Aslo,

IOzone gave this error: Error writing block 29813, fd= 3

GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Trying to acquire journal lock...
GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Looking at journal...
GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Acquiring the transaction lock...
GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Replaying journal...


GFS seemed to repair itself and things carried on working.

thanks,

Andrew

On Sep 20, 2012, at 4:08 PM, Andrew Holway wrote:

> Hello,
> 
> I have set up a 4 node cluster. They are interconnected with an IPoIB (connected mode)
> 
> Whist running a benchmark with IOzone I got the following errors:
> 
> IO seems to have halted.
> 
> Thanks,
> 
> Andrew
> 
> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15816 blocked for more than 120 seconds.
> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15816  15374 0x00000080
> Sep 20 16:01:57 node001 kernel: ffff880fd5ebbac8 0000000000000086 ffff880fd5ebba38 ffffffff81276b66
> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ebba58 ffffffff81091f97
> Sep 20 16:01:57 node001 kernel: ffff880fe238c638 ffff880fd5ebbfd8 000000000000fb88 ffff880fe238c638
> Sep 20 16:01:57 node001 kernel: Call Trace:
> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15818 blocked for more than 120 seconds.
> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15818  15374 0x00000080
> Sep 20 16:01:57 node001 kernel: ffff880fbe5b3ac8 0000000000000082 0000000000000000 ffff881ff95587a0
> Sep 20 16:01:57 node001 kernel: ffff881000000002 ffff88100ee13048 00000000be5b3a58 00000040ffffffff
> Sep 20 16:01:57 node001 kernel: ffff88100eed1ab8 ffff880fbe5b3fd8 000000000000fb88 ffff88100eed1ab8
> Sep 20 16:01:57 node001 kernel: Call Trace:
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15820 blocked for more than 120 seconds.
> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000008     0 15820  15374 0x00000080
> Sep 20 16:01:57 node001 kernel: ffff880ffed7bac8 0000000000000086 0000000000000000 ffffffff81276b66
> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000fed7ba58 00000040ffffffff
> Sep 20 16:01:57 node001 kernel: ffff880fbdd51ab8 ffff880ffed7bfd8 000000000000fb88 ffff880fbdd51ab8
> Sep 20 16:01:57 node001 kernel: Call Trace:
> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15822 blocked for more than 120 seconds.
> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15822  15374 0x00000080
> Sep 20 16:01:57 node001 kernel: ffff880fd5dddac8 0000000000000086 0000000000000000 ffffffff81276b66
> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ddda58 ffffffff81091f97
> Sep 20 16:01:57 node001 kernel: ffff880fe238d098 ffff880fd5dddfd8 000000000000fb88 ffff880fe238d098
> Sep 20 16:01:57 node001 kernel: Call Trace:
> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15824 blocked for more than 120 seconds.
> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15824  15374 0x00000080
> Sep 20 16:01:57 node001 kernel: ffff880fbe5edac8 0000000000000086 0000000000000000 ffffffff81276b66
> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000be5eda58 00000040ffffffff
> Sep 20 16:01:57 node001 kernel: ffff880ff69085f8 ffff880fbe5edfd8 000000000000fb88 ffff880ff69085f8
> Sep 20 16:01:57 node001 kernel: Call Trace:
> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15826 blocked for more than 120 seconds.
> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15826  15374 0x00000080
> Sep 20 16:01:57 node001 kernel: ffff880fbe7cfac8 0000000000000086 0000000000000000 ffffffff81276b66
> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fbe7cfa58 ffffffff81091f97
> Sep 20 16:01:57 node001 kernel: ffff88100ddcbab8 ffff880fbe7cffd8 000000000000fb88 ffff88100ddcbab8
> Sep 20 16:01:57 node001 kernel: Call Trace:
> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15828 blocked for more than 120 seconds.
> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15828  15374 0x00000080
> Sep 20 16:01:57 node001 kernel: ffff88100684bac8 0000000000000086 0000000000000000 ffffffff81276b66
> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 000000000684ba58 00000040ffffffff
> Sep 20 16:01:57 node001 kernel: ffff88100edc7af8 ffff88100684bfd8 000000000000fb88 ffff88100edc7af8
> Sep 20 16:01:57 node001 kernel: Call Trace:
> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15830 blocked for more than 120 seconds.
> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000000     0 15830  15374 0x00000080
> Sep 20 16:01:57 node001 kernel: ffff880fbdd0fac8 0000000000000082 0000000000000000 ffffffff81276b66
> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000bdd0fa58 00000040ffffffff
> Sep 20 16:01:57 node001 kernel: ffff88100de93098 ffff880fbdd0ffd8 000000000000fb88 ffff88100de93098
> Sep 20 16:01:57 node001 kernel: Call Trace:
> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> 
> 


From a.holway at syseleven.de  Thu Sep 20 14:25:49 2012
From: a.holway at syseleven.de (Andrew Holway)
Date: Thu, 20 Sep 2012 16:25:49 +0200
Subject: [Linux-cluster] GFS fail with iozone
In-Reply-To: <F1DE4351-2CF4-4D96-93DD-F128DB6F4252@syseleven.de>
References: <6BC17920-7E5C-44A8-B434-C710CBD7F864@syseleven.de>
	<F1DE4351-2CF4-4D96-93DD-F128DB6F4252@syseleven.de>
Message-ID: <AA5A5B0B-8CCB-40D1-8A64-873A55111750@syseleven.de>

It seems that my node004 is the problem.

I cannot kill the iozone processes and I find this in the logs.

Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 01 94 88 a0 00 00 20 00
Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 51 ff 90 00 00 20 00
Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 46 d5 c0 00 00 20 00
Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 03 da c7 78 00 00 20 00
Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 f5 8f 60 00 00 20 00
Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 01 30 7c 90 00 00 20 00
Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 05 79 8b e0 00 00 20 00
Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 04 37 13 08 00 00 20 00
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 20 16:02:15 node004 kernel: INFO: task glock_workqueue:9820 blocked for more than 120 seconds.
Sep 20 16:02:15 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 20 16:02:15 node004 kernel: glock_workque D 000000000000001b     0  9820      2 0x00000080
Sep 20 16:02:15 node004 kernel: ffff8820150a9c70 0000000000000046 0000000000000004 00000000aa8f20cf
Sep 20 16:02:15 node004 kernel: ffff881fffd050c8 0000000000000441 ffff8820150a9c10 ffffffff811acd5e
Sep 20 16:02:15 node004 kernel: ffff882015b39ab8 ffff8820150a9fd8 000000000000fb88 ffff882015b39ab8
Sep 20 16:02:15 node004 kernel: Call Trace:
Sep 20 16:02:15 node004 kernel: [<ffffffff811acd5e>] ? submit_bh+0x10e/0x150
Sep 20 16:02:15 node004 kernel: [<ffffffff8109cd39>] ? ktime_get_ts+0xa9/0xe0
Sep 20 16:02:15 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
Sep 20 16:02:15 node004 kernel: [<ffffffffa094aaca>] gfs2_log_flush+0x47a/0x6f0 [gfs2]
Sep 20 16:02:15 node004 kernel: [<ffffffff810629d3>] ? dequeue_entity+0x113/0x2e0
Sep 20 16:02:15 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 20 16:02:15 node004 kernel: [<ffffffffa09477d0>] inode_go_sync+0x80/0x160 [gfs2]
Sep 20 16:02:15 node004 kernel: [<ffffffffa0946336>] do_xmote+0x156/0x280 [gfs2]
Sep 20 16:02:15 node004 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
Sep 20 16:02:15 node004 kernel: [<ffffffffa0946551>] run_queue+0xf1/0x1d0 [gfs2]
Sep 20 16:02:15 node004 kernel: [<ffffffffa0946d2a>] glock_work_func+0x7a/0x1b0 [gfs2]
Sep 20 16:02:15 node004 kernel: [<ffffffffa0946cb0>] ? glock_work_func+0x0/0x1b0 [gfs2]
Sep 20 16:02:15 node004 kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0
Sep 20 16:02:15 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 20 16:02:15 node004 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
Sep 20 16:02:15 node004 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
Sep 20 16:02:15 node004 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Sep 20 16:02:15 node004 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Sep 20 16:02:15 node004 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e c0 00 00 20 00
Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117976
Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117977
Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117978
Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117979
Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e b8 00 00 08 00
Sep 20 16:02:22 node004 kernel: Buffer I/O error on device dm-6, logical block 117975
Sep 20 16:02:22 node004 kernel: lost page write due to I/O error on dm-6
Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e e0 00 00 08 00
Sep 20 16:05:22 node004 kernel: Buffer I/O error on device dm-6, logical block 117980
Sep 20 16:05:22 node004 kernel: lost page write due to I/O error on dm-6
Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: fatal: I/O error
Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3:   block = 117980
Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3:   function = log_write_header, file = fs/gfs2/log.c, line = 616
Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: about to withdraw this file system
Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: telling LM to unmount
Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: withdrawn
Sep 20 16:05:22 node004 kernel: Pid: 9820, comm: glock_workqueue Not tainted 2.6.32-279.el6.x86_64 #1
Sep 20 16:05:22 node004 kernel: Call Trace:
Sep 20 16:05:22 node004 kernel: [<ffffffffa0962062>] ? gfs2_lm_withdraw+0x102/0x130 [gfs2]
Sep 20 16:05:22 node004 kernel: [<ffffffff814fea28>] ? out_of_line_wait_on_bit+0x78/0x90
Sep 20 16:05:22 node004 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
Sep 20 16:05:22 node004 kernel: [<ffffffffa09620d0>] ? gfs2_io_error_bh_i+0x40/0x50 [gfs2]
Sep 20 16:05:22 node004 kernel: [<ffffffff811adfb6>] ? __wait_on_buffer+0x26/0x30
Sep 20 16:05:22 node004 kernel: [<ffffffffa094a288>] ? log_write_header+0x3a8/0x490 [gfs2]
Sep 20 16:05:22 node004 kernel: [<ffffffffa094a951>] ? gfs2_log_flush+0x301/0x6f0 [gfs2]
Sep 20 16:05:22 node004 kernel: [<ffffffff810629d3>] ? dequeue_entity+0x113/0x2e0
Sep 20 16:05:22 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 20 16:05:22 node004 kernel: [<ffffffffa09477d0>] ? inode_go_sync+0x80/0x160 [gfs2]
Sep 20 16:05:22 node004 kernel: [<ffffffffa0946336>] ? do_xmote+0x156/0x280 [gfs2]
Sep 20 16:05:22 node004 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
Sep 20 16:05:22 node004 kernel: [<ffffffffa0946551>] ? run_queue+0xf1/0x1d0 [gfs2]
Sep 20 16:05:22 node004 kernel: [<ffffffffa0946d2a>] ? glock_work_func+0x7a/0x1b0 [gfs2]
Sep 20 16:05:22 node004 kernel: [<ffffffffa0946cb0>] ? glock_work_func+0x0/0x1b0 [gfs2]
Sep 20 16:05:22 node004 kernel: [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
Sep 20 16:05:22 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 20 16:05:22 node004 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
Sep 20 16:05:22 node004 kernel: [<ffffffff81091d66>] ? kthread+0x96/0xa0
Sep 20 16:05:22 node004 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
Sep 20 16:05:22 node004 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Sep 20 16:05:22 node004 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 00 08 b0 00 00 08 00
Sep 20 16:08:22 node004 kernel: Buffer I/O error on device dm-6, logical block 22
Sep 20 16:08:22 node004 kernel: lost page write due to I/O error on dm-6
Sep 20 16:16:05 node004 xinetd[4416]: START: node-state pid=14578 from=::ffff:10.141.255.254
Sep 20 16:16:05 node004 xinetd[4416]: EXIT: node-state status=0 pid=14578 duration=0(sec)
Sep 20 16:17:34 node004 xinetd[4416]: START: node-state pid=14653 from=::ffff:10.141.255.254
Sep 20 16:17:34 node004 xinetd[4416]: EXIT: node-state status=0 pid=14653 duration=0(sec)
Sep 20 16:17:36 node004 xinetd[4416]: START: node-state pid=14671 from=::ffff:10.141.255.254
Sep 20 16:17:36 node004 xinetd[4416]: EXIT: node-state status=0 pid=14671 duration=0(sec)
Sep 20 16:17:39 node004 xinetd[4416]: START: node-state pid=14690 from=::ffff:10.141.255.254
Sep 20 16:17:39 node004 xinetd[4416]: EXIT: node-state status=0 pid=14690 duration=0(sec)
Sep 20 16:17:41 node004 xinetd[4416]: START: node-state pid=14708 from=::ffff:10.141.255.254
Sep 20 16:17:41 node004 xinetd[4416]: EXIT: node-state status=0 pid=14708 duration=0(sec)
On Sep 20, 2012, at 4:14 PM, Andrew Holway wrote:

> Aslo,
> 
> IOzone gave this error: Error writing block 29813, fd= 3
> 
> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Trying to acquire journal lock...
> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Looking at journal...
> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Acquiring the transaction lock...
> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Replaying journal...
> 
> 
> GFS seemed to repair itself and things carried on working.
> 
> thanks,
> 
> Andrew
> 
> On Sep 20, 2012, at 4:08 PM, Andrew Holway wrote:
> 
>> Hello,
>> 
>> I have set up a 4 node cluster. They are interconnected with an IPoIB (connected mode)
>> 
>> Whist running a benchmark with IOzone I got the following errors:
>> 
>> IO seems to have halted.
>> 
>> Thanks,
>> 
>> Andrew
>> 
>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15816 blocked for more than 120 seconds.
>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15816  15374 0x00000080
>> Sep 20 16:01:57 node001 kernel: ffff880fd5ebbac8 0000000000000086 ffff880fd5ebba38 ffffffff81276b66
>> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ebba58 ffffffff81091f97
>> Sep 20 16:01:57 node001 kernel: ffff880fe238c638 ffff880fd5ebbfd8 000000000000fb88 ffff880fe238c638
>> Sep 20 16:01:57 node001 kernel: Call Trace:
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15818 blocked for more than 120 seconds.
>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15818  15374 0x00000080
>> Sep 20 16:01:57 node001 kernel: ffff880fbe5b3ac8 0000000000000082 0000000000000000 ffff881ff95587a0
>> Sep 20 16:01:57 node001 kernel: ffff881000000002 ffff88100ee13048 00000000be5b3a58 00000040ffffffff
>> Sep 20 16:01:57 node001 kernel: ffff88100eed1ab8 ffff880fbe5b3fd8 000000000000fb88 ffff88100eed1ab8
>> Sep 20 16:01:57 node001 kernel: Call Trace:
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15820 blocked for more than 120 seconds.
>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000008     0 15820  15374 0x00000080
>> Sep 20 16:01:57 node001 kernel: ffff880ffed7bac8 0000000000000086 0000000000000000 ffffffff81276b66
>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000fed7ba58 00000040ffffffff
>> Sep 20 16:01:57 node001 kernel: ffff880fbdd51ab8 ffff880ffed7bfd8 000000000000fb88 ffff880fbdd51ab8
>> Sep 20 16:01:57 node001 kernel: Call Trace:
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15822 blocked for more than 120 seconds.
>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15822  15374 0x00000080
>> Sep 20 16:01:57 node001 kernel: ffff880fd5dddac8 0000000000000086 0000000000000000 ffffffff81276b66
>> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ddda58 ffffffff81091f97
>> Sep 20 16:01:57 node001 kernel: ffff880fe238d098 ffff880fd5dddfd8 000000000000fb88 ffff880fe238d098
>> Sep 20 16:01:57 node001 kernel: Call Trace:
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15824 blocked for more than 120 seconds.
>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15824  15374 0x00000080
>> Sep 20 16:01:57 node001 kernel: ffff880fbe5edac8 0000000000000086 0000000000000000 ffffffff81276b66
>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000be5eda58 00000040ffffffff
>> Sep 20 16:01:57 node001 kernel: ffff880ff69085f8 ffff880fbe5edfd8 000000000000fb88 ffff880ff69085f8
>> Sep 20 16:01:57 node001 kernel: Call Trace:
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15826 blocked for more than 120 seconds.
>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15826  15374 0x00000080
>> Sep 20 16:01:57 node001 kernel: ffff880fbe7cfac8 0000000000000086 0000000000000000 ffffffff81276b66
>> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fbe7cfa58 ffffffff81091f97
>> Sep 20 16:01:57 node001 kernel: ffff88100ddcbab8 ffff880fbe7cffd8 000000000000fb88 ffff88100ddcbab8
>> Sep 20 16:01:57 node001 kernel: Call Trace:
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15828 blocked for more than 120 seconds.
>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15828  15374 0x00000080
>> Sep 20 16:01:57 node001 kernel: ffff88100684bac8 0000000000000086 0000000000000000 ffffffff81276b66
>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 000000000684ba58 00000040ffffffff
>> Sep 20 16:01:57 node001 kernel: ffff88100edc7af8 ffff88100684bfd8 000000000000fb88 ffff88100edc7af8
>> Sep 20 16:01:57 node001 kernel: Call Trace:
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15830 blocked for more than 120 seconds.
>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000000     0 15830  15374 0x00000080
>> Sep 20 16:01:57 node001 kernel: ffff880fbdd0fac8 0000000000000082 0000000000000000 ffffffff81276b66
>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000bdd0fa58 00000040ffffffff
>> Sep 20 16:01:57 node001 kernel: ffff88100de93098 ffff880fbdd0ffd8 000000000000fb88 ffff88100de93098
>> Sep 20 16:01:57 node001 kernel: Call Trace:
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>> 
>> 
> 


From Ralf.Aumueller at informatik.uni-stuttgart.de  Thu Sep 20 16:21:46 2012
From: Ralf.Aumueller at informatik.uni-stuttgart.de (Ralf Aumueller)
Date: Thu, 20 Sep 2012 18:21:46 +0200
Subject: [Linux-cluster] Problem with rgmanager / rgmanager #37: Error
 receiving header from 2 sz=0 CTX 0x1f5d420
Message-ID: <505B429A.6050701@informatik.uni-stuttgart.de>

Hello,

we have a two node CentOS6.2 Cluster (rgmanager-3.0.12.1-5). After a reboot of
node2 the cluster won't work as expected. On node2 clustat just say's :

clustat:
Cluster Status for cluster1 @ Thu Sep 20 17:06:02 2012
Member Status: Quorate

 Member Name                                                 ID   Status
 ------ ----                                                 ---- ------
 node1                                                       1 Online
 node2                                                       2 Online, Local

No services listed, no rgmanager running. Also it is not possible to
start/migrate any services to node2.

On node1 a clustat lists all configured services + under Status rgmanager on
both nodes. On node1 the rgmanager.log has lots of:
rgmanager #37: Error receiving header from 2 sz=0 CTX 0x1XXXXXX

On node2 the rgmanager.log gives me:
rgmanager #34: Cannot get status for service ...

I did not change the cluster.conf. Only change on node2 was: +48MB and an new
BIOS version -- recommend by Dell Support).

Best regards,
Ralf


From heiko.nardmann at itechnical.de  Thu Sep 20 17:48:50 2012
From: heiko.nardmann at itechnical.de (Heiko Nardmann)
Date: Thu, 20 Sep 2012 19:48:50 +0200
Subject: [Linux-cluster] Problem with rgmanager / rgmanager #37: Error
 receiving header from 2 sz=0 CTX 0x1f5d420
In-Reply-To: <505B429A.6050701@informatik.uni-stuttgart.de>
References: <505B429A.6050701@informatik.uni-stuttgart.de>
Message-ID: <505B5702.60004@itechnical.de>

When I hear "BIOS update" and Dell then some red alarm signs appear in 
front of my mind ...

Okay, what I want to say is: did they also check the NIC firmware 
version? We here had really big trouble with the Broadcom NICs in our 
R610 machines.

Probably that has nothing to do with your problem but that has also been 
our impression while analyzing the strange problems we've seen here.

Coming back to your error messages ... the rgmanager cannot get the 
status for your service? Which kind of service are you trying to setup? 
It sounds like it has run before? Could you manually start the service 
agent(s) on node2 and see what that gives you?

Another thing: did you already activate debug inside cluster.conf?


Kind regards,

     Heiko

Am 20.09.2012 18:21, schrieb Ralf Aumueller:
> Hello,
>
> we have a two node CentOS6.2 Cluster (rgmanager-3.0.12.1-5). After a reboot of
> node2 the cluster won't work as expected. On node2 clustat just say's :
>
> clustat:
> Cluster Status for cluster1 @ Thu Sep 20 17:06:02 2012
> Member Status: Quorate
>
>   Member Name                                                 ID   Status
>   ------ ----                                                 ---- ------
>   node1                                                       1 Online
>   node2                                                       2 Online, Local
>
> No services listed, no rgmanager running. Also it is not possible to
> start/migrate any services to node2.
>
> On node1 a clustat lists all configured services + under Status rgmanager on
> both nodes. On node1 the rgmanager.log has lots of:
> rgmanager #37: Error receiving header from 2 sz=0 CTX 0x1XXXXXX
>
> On node2 the rgmanager.log gives me:
> rgmanager #34: Cannot get status for service ...
>
> I did not change the cluster.conf. Only change on node2 was: +48MB and an new
> BIOS version -- recommend by Dell Support).
>
> Best regards,
> Ralf
>


From lists at alteeve.ca  Thu Sep 20 17:54:08 2012
From: lists at alteeve.ca (Digimer)
Date: Thu, 20 Sep 2012 13:54:08 -0400
Subject: [Linux-cluster] Problem with rgmanager / rgmanager #37: Error
 receiving header from 2 sz=0 CTX 0x1f5d420
In-Reply-To: <505B429A.6050701@informatik.uni-stuttgart.de>
References: <505B429A.6050701@informatik.uni-stuttgart.de>
Message-ID: <505B5840.4030200@alteeve.ca>

On 09/20/2012 12:21 PM, Ralf Aumueller wrote:
> Hello,
>
> we have a two node CentOS6.2 Cluster (rgmanager-3.0.12.1-5). After a reboot of
> node2 the cluster won't work as expected. On node2 clustat just say's :
>
> clustat:
> Cluster Status for cluster1 @ Thu Sep 20 17:06:02 2012
> Member Status: Quorate
>
>   Member Name                                                 ID   Status
>   ------ ----                                                 ---- ------
>   node1                                                       1 Online
>   node2                                                       2 Online, Local
>
> No services listed, no rgmanager running. Also it is not possible to
> start/migrate any services to node2.
>
> On node1 a clustat lists all configured services + under Status rgmanager on
> both nodes. On node1 the rgmanager.log has lots of:
> rgmanager #37: Error receiving header from 2 sz=0 CTX 0x1XXXXXX
>
> On node2 the rgmanager.log gives me:
> rgmanager #34: Cannot get status for service ...
>
> I did not change the cluster.conf. Only change on node2 was: +48MB and an new
> BIOS version -- recommend by Dell Support).
>
> Best regards,
> Ralf
>

Sounds like you hit this bug: 
http://rhn.redhat.com/errata/RHBA-2012-0897.html

Update rgmanager to rgmanager-3.0.12.1-12 and you should be ok.

-- 
Digimer
Papers and Projects: https://alteeve.ca


From swhiteho at redhat.com  Fri Sep 21 08:34:11 2012
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Fri, 21 Sep 2012 09:34:11 +0100
Subject: [Linux-cluster] GFS fail with iozone
In-Reply-To: <AA5A5B0B-8CCB-40D1-8A64-873A55111750@syseleven.de>
References: <6BC17920-7E5C-44A8-B434-C710CBD7F864@syseleven.de>
	<F1DE4351-2CF4-4D96-93DD-F128DB6F4252@syseleven.de>
	<AA5A5B0B-8CCB-40D1-8A64-873A55111750@syseleven.de>
Message-ID: <1348216451.2746.5.camel@menhir>

Hi,

On Thu, 2012-09-20 at 16:25 +0200, Andrew Holway wrote:
> It seems that my node004 is the problem.
> 
> I cannot kill the iozone processes and I find this in the logs.
> 
This looks like there is some problem with the i/o stack below the level
of GFS2. What kind of storage are you using? If this is a JBOD then
perhaps there is a faulty disk or something like that?

Steve.

> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 01 94 88 a0 00 00 20 00
> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 51 ff 90 00 00 20 00
> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 46 d5 c0 00 00 20 00
> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 03 da c7 78 00 00 20 00
> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 f5 8f 60 00 00 20 00
> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 01 30 7c 90 00 00 20 00
> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 05 79 8b e0 00 00 20 00
> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 04 37 13 08 00 00 20 00
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 20 16:02:15 node004 kernel: INFO: task glock_workqueue:9820 blocked for more than 120 seconds.
> Sep 20 16:02:15 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 20 16:02:15 node004 kernel: glock_workque D 000000000000001b     0  9820      2 0x00000080
> Sep 20 16:02:15 node004 kernel: ffff8820150a9c70 0000000000000046 0000000000000004 00000000aa8f20cf
> Sep 20 16:02:15 node004 kernel: ffff881fffd050c8 0000000000000441 ffff8820150a9c10 ffffffff811acd5e
> Sep 20 16:02:15 node004 kernel: ffff882015b39ab8 ffff8820150a9fd8 000000000000fb88 ffff882015b39ab8
> Sep 20 16:02:15 node004 kernel: Call Trace:
> Sep 20 16:02:15 node004 kernel: [<ffffffff811acd5e>] ? submit_bh+0x10e/0x150
> Sep 20 16:02:15 node004 kernel: [<ffffffff8109cd39>] ? ktime_get_ts+0xa9/0xe0
> Sep 20 16:02:15 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
> Sep 20 16:02:15 node004 kernel: [<ffffffffa094aaca>] gfs2_log_flush+0x47a/0x6f0 [gfs2]
> Sep 20 16:02:15 node004 kernel: [<ffffffff810629d3>] ? dequeue_entity+0x113/0x2e0
> Sep 20 16:02:15 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 20 16:02:15 node004 kernel: [<ffffffffa09477d0>] inode_go_sync+0x80/0x160 [gfs2]
> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946336>] do_xmote+0x156/0x280 [gfs2]
> Sep 20 16:02:15 node004 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946551>] run_queue+0xf1/0x1d0 [gfs2]
> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946d2a>] glock_work_func+0x7a/0x1b0 [gfs2]
> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946cb0>] ? glock_work_func+0x0/0x1b0 [gfs2]
> Sep 20 16:02:15 node004 kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0
> Sep 20 16:02:15 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 20 16:02:15 node004 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
> Sep 20 16:02:15 node004 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
> Sep 20 16:02:15 node004 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
> Sep 20 16:02:15 node004 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
> Sep 20 16:02:15 node004 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e c0 00 00 20 00
> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117976
> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117977
> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117978
> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117979
> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e b8 00 00 08 00
> Sep 20 16:02:22 node004 kernel: Buffer I/O error on device dm-6, logical block 117975
> Sep 20 16:02:22 node004 kernel: lost page write due to I/O error on dm-6
> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e e0 00 00 08 00
> Sep 20 16:05:22 node004 kernel: Buffer I/O error on device dm-6, logical block 117980
> Sep 20 16:05:22 node004 kernel: lost page write due to I/O error on dm-6
> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: fatal: I/O error
> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3:   block = 117980
> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3:   function = log_write_header, file = fs/gfs2/log.c, line = 616
> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: about to withdraw this file system
> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: telling LM to unmount
> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: withdrawn
> Sep 20 16:05:22 node004 kernel: Pid: 9820, comm: glock_workqueue Not tainted 2.6.32-279.el6.x86_64 #1
> Sep 20 16:05:22 node004 kernel: Call Trace:
> Sep 20 16:05:22 node004 kernel: [<ffffffffa0962062>] ? gfs2_lm_withdraw+0x102/0x130 [gfs2]
> Sep 20 16:05:22 node004 kernel: [<ffffffff814fea28>] ? out_of_line_wait_on_bit+0x78/0x90
> Sep 20 16:05:22 node004 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> Sep 20 16:05:22 node004 kernel: [<ffffffffa09620d0>] ? gfs2_io_error_bh_i+0x40/0x50 [gfs2]
> Sep 20 16:05:22 node004 kernel: [<ffffffff811adfb6>] ? __wait_on_buffer+0x26/0x30
> Sep 20 16:05:22 node004 kernel: [<ffffffffa094a288>] ? log_write_header+0x3a8/0x490 [gfs2]
> Sep 20 16:05:22 node004 kernel: [<ffffffffa094a951>] ? gfs2_log_flush+0x301/0x6f0 [gfs2]
> Sep 20 16:05:22 node004 kernel: [<ffffffff810629d3>] ? dequeue_entity+0x113/0x2e0
> Sep 20 16:05:22 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 20 16:05:22 node004 kernel: [<ffffffffa09477d0>] ? inode_go_sync+0x80/0x160 [gfs2]
> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946336>] ? do_xmote+0x156/0x280 [gfs2]
> Sep 20 16:05:22 node004 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946551>] ? run_queue+0xf1/0x1d0 [gfs2]
> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946d2a>] ? glock_work_func+0x7a/0x1b0 [gfs2]
> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946cb0>] ? glock_work_func+0x0/0x1b0 [gfs2]
> Sep 20 16:05:22 node004 kernel: [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
> Sep 20 16:05:22 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 20 16:05:22 node004 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
> Sep 20 16:05:22 node004 kernel: [<ffffffff81091d66>] ? kthread+0x96/0xa0
> Sep 20 16:05:22 node004 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
> Sep 20 16:05:22 node004 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
> Sep 20 16:05:22 node004 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 00 08 b0 00 00 08 00
> Sep 20 16:08:22 node004 kernel: Buffer I/O error on device dm-6, logical block 22
> Sep 20 16:08:22 node004 kernel: lost page write due to I/O error on dm-6
> Sep 20 16:16:05 node004 xinetd[4416]: START: node-state pid=14578 from=::ffff:10.141.255.254
> Sep 20 16:16:05 node004 xinetd[4416]: EXIT: node-state status=0 pid=14578 duration=0(sec)
> Sep 20 16:17:34 node004 xinetd[4416]: START: node-state pid=14653 from=::ffff:10.141.255.254
> Sep 20 16:17:34 node004 xinetd[4416]: EXIT: node-state status=0 pid=14653 duration=0(sec)
> Sep 20 16:17:36 node004 xinetd[4416]: START: node-state pid=14671 from=::ffff:10.141.255.254
> Sep 20 16:17:36 node004 xinetd[4416]: EXIT: node-state status=0 pid=14671 duration=0(sec)
> Sep 20 16:17:39 node004 xinetd[4416]: START: node-state pid=14690 from=::ffff:10.141.255.254
> Sep 20 16:17:39 node004 xinetd[4416]: EXIT: node-state status=0 pid=14690 duration=0(sec)
> Sep 20 16:17:41 node004 xinetd[4416]: START: node-state pid=14708 from=::ffff:10.141.255.254
> Sep 20 16:17:41 node004 xinetd[4416]: EXIT: node-state status=0 pid=14708 duration=0(sec)
> On Sep 20, 2012, at 4:14 PM, Andrew Holway wrote:
> 
> > Aslo,
> > 
> > IOzone gave this error: Error writing block 29813, fd= 3
> > 
> > GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Trying to acquire journal lock...
> > GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Looking at journal...
> > GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Acquiring the transaction lock...
> > GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Replaying journal...
> > 
> > 
> > GFS seemed to repair itself and things carried on working.
> > 
> > thanks,
> > 
> > Andrew
> > 
> > On Sep 20, 2012, at 4:08 PM, Andrew Holway wrote:
> > 
> >> Hello,
> >> 
> >> I have set up a 4 node cluster. They are interconnected with an IPoIB (connected mode)
> >> 
> >> Whist running a benchmark with IOzone I got the following errors:
> >> 
> >> IO seems to have halted.
> >> 
> >> Thanks,
> >> 
> >> Andrew
> >> 
> >> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15816 blocked for more than 120 seconds.
> >> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15816  15374 0x00000080
> >> Sep 20 16:01:57 node001 kernel: ffff880fd5ebbac8 0000000000000086 ffff880fd5ebba38 ffffffff81276b66
> >> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ebba58 ffffffff81091f97
> >> Sep 20 16:01:57 node001 kernel: ffff880fe238c638 ffff880fd5ebbfd8 000000000000fb88 ffff880fe238c638
> >> Sep 20 16:01:57 node001 kernel: Call Trace:
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15818 blocked for more than 120 seconds.
> >> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15818  15374 0x00000080
> >> Sep 20 16:01:57 node001 kernel: ffff880fbe5b3ac8 0000000000000082 0000000000000000 ffff881ff95587a0
> >> Sep 20 16:01:57 node001 kernel: ffff881000000002 ffff88100ee13048 00000000be5b3a58 00000040ffffffff
> >> Sep 20 16:01:57 node001 kernel: ffff88100eed1ab8 ffff880fbe5b3fd8 000000000000fb88 ffff88100eed1ab8
> >> Sep 20 16:01:57 node001 kernel: Call Trace:
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15820 blocked for more than 120 seconds.
> >> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000008     0 15820  15374 0x00000080
> >> Sep 20 16:01:57 node001 kernel: ffff880ffed7bac8 0000000000000086 0000000000000000 ffffffff81276b66
> >> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000fed7ba58 00000040ffffffff
> >> Sep 20 16:01:57 node001 kernel: ffff880fbdd51ab8 ffff880ffed7bfd8 000000000000fb88 ffff880fbdd51ab8
> >> Sep 20 16:01:57 node001 kernel: Call Trace:
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15822 blocked for more than 120 seconds.
> >> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15822  15374 0x00000080
> >> Sep 20 16:01:57 node001 kernel: ffff880fd5dddac8 0000000000000086 0000000000000000 ffffffff81276b66
> >> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ddda58 ffffffff81091f97
> >> Sep 20 16:01:57 node001 kernel: ffff880fe238d098 ffff880fd5dddfd8 000000000000fb88 ffff880fe238d098
> >> Sep 20 16:01:57 node001 kernel: Call Trace:
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15824 blocked for more than 120 seconds.
> >> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15824  15374 0x00000080
> >> Sep 20 16:01:57 node001 kernel: ffff880fbe5edac8 0000000000000086 0000000000000000 ffffffff81276b66
> >> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000be5eda58 00000040ffffffff
> >> Sep 20 16:01:57 node001 kernel: ffff880ff69085f8 ffff880fbe5edfd8 000000000000fb88 ffff880ff69085f8
> >> Sep 20 16:01:57 node001 kernel: Call Trace:
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15826 blocked for more than 120 seconds.
> >> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15826  15374 0x00000080
> >> Sep 20 16:01:57 node001 kernel: ffff880fbe7cfac8 0000000000000086 0000000000000000 ffffffff81276b66
> >> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fbe7cfa58 ffffffff81091f97
> >> Sep 20 16:01:57 node001 kernel: ffff88100ddcbab8 ffff880fbe7cffd8 000000000000fb88 ffff88100ddcbab8
> >> Sep 20 16:01:57 node001 kernel: Call Trace:
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15828 blocked for more than 120 seconds.
> >> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15828  15374 0x00000080
> >> Sep 20 16:01:57 node001 kernel: ffff88100684bac8 0000000000000086 0000000000000000 ffffffff81276b66
> >> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 000000000684ba58 00000040ffffffff
> >> Sep 20 16:01:57 node001 kernel: ffff88100edc7af8 ffff88100684bfd8 000000000000fb88 ffff88100edc7af8
> >> Sep 20 16:01:57 node001 kernel: Call Trace:
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15830 blocked for more than 120 seconds.
> >> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000000     0 15830  15374 0x00000080
> >> Sep 20 16:01:57 node001 kernel: ffff880fbdd0fac8 0000000000000082 0000000000000000 ffffffff81276b66
> >> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000bdd0fa58 00000040ffffffff
> >> Sep 20 16:01:57 node001 kernel: ffff88100de93098 ffff880fbdd0ffd8 000000000000fb88 ffff88100de93098
> >> Sep 20 16:01:57 node001 kernel: Call Trace:
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >> 
> >> 
> > 
> 
> 
> 


From a.holway at syseleven.de  Fri Sep 21 08:37:55 2012
From: a.holway at syseleven.de (Andrew Holway)
Date: Fri, 21 Sep 2012 10:37:55 +0200
Subject: [Linux-cluster] GFS fail with iozone
In-Reply-To: <1348216451.2746.5.camel@menhir>
References: <6BC17920-7E5C-44A8-B434-C710CBD7F864@syseleven.de>
	<F1DE4351-2CF4-4D96-93DD-F128DB6F4252@syseleven.de>
	<AA5A5B0B-8CCB-40D1-8A64-873A55111750@syseleven.de>
	<1348216451.2746.5.camel@menhir>
Message-ID: <866F1EEC-9187-4805-B980-39F94384D40C@syseleven.de>

Hi,

I am using a nimble storage box.

It looks like I used the same initiatior name on all of my nodes :)

I will test again.

Thanks,

Andrew


On Sep 21, 2012, at 10:34 AM, Steven Whitehouse wrote:

> Hi,
> 
> On Thu, 2012-09-20 at 16:25 +0200, Andrew Holway wrote:
>> It seems that my node004 is the problem.
>> 
>> I cannot kill the iozone processes and I find this in the logs.
>> 
> This looks like there is some problem with the i/o stack below the level
> of GFS2. What kind of storage are you using? If this is a JBOD then
> perhaps there is a faulty disk or something like that?
> 
> Steve.
> 
>> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 01 94 88 a0 00 00 20 00
>> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 51 ff 90 00 00 20 00
>> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 46 d5 c0 00 00 20 00
>> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 03 da c7 78 00 00 20 00
>> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 f5 8f 60 00 00 20 00
>> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 01 30 7c 90 00 00 20 00
>> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 05 79 8b e0 00 00 20 00
>> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 04 37 13 08 00 00 20 00
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:02:15 node004 kernel: INFO: task glock_workqueue:9820 blocked for more than 120 seconds.
>> Sep 20 16:02:15 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Sep 20 16:02:15 node004 kernel: glock_workque D 000000000000001b     0  9820      2 0x00000080
>> Sep 20 16:02:15 node004 kernel: ffff8820150a9c70 0000000000000046 0000000000000004 00000000aa8f20cf
>> Sep 20 16:02:15 node004 kernel: ffff881fffd050c8 0000000000000441 ffff8820150a9c10 ffffffff811acd5e
>> Sep 20 16:02:15 node004 kernel: ffff882015b39ab8 ffff8820150a9fd8 000000000000fb88 ffff882015b39ab8
>> Sep 20 16:02:15 node004 kernel: Call Trace:
>> Sep 20 16:02:15 node004 kernel: [<ffffffff811acd5e>] ? submit_bh+0x10e/0x150
>> Sep 20 16:02:15 node004 kernel: [<ffffffff8109cd39>] ? ktime_get_ts+0xa9/0xe0
>> Sep 20 16:02:15 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
>> Sep 20 16:02:15 node004 kernel: [<ffffffffa094aaca>] gfs2_log_flush+0x47a/0x6f0 [gfs2]
>> Sep 20 16:02:15 node004 kernel: [<ffffffff810629d3>] ? dequeue_entity+0x113/0x2e0
>> Sep 20 16:02:15 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
>> Sep 20 16:02:15 node004 kernel: [<ffffffffa09477d0>] inode_go_sync+0x80/0x160 [gfs2]
>> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946336>] do_xmote+0x156/0x280 [gfs2]
>> Sep 20 16:02:15 node004 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
>> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946551>] run_queue+0xf1/0x1d0 [gfs2]
>> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946d2a>] glock_work_func+0x7a/0x1b0 [gfs2]
>> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946cb0>] ? glock_work_func+0x0/0x1b0 [gfs2]
>> Sep 20 16:02:15 node004 kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0
>> Sep 20 16:02:15 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
>> Sep 20 16:02:15 node004 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
>> Sep 20 16:02:15 node004 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
>> Sep 20 16:02:15 node004 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
>> Sep 20 16:02:15 node004 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
>> Sep 20 16:02:15 node004 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
>> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e c0 00 00 20 00
>> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117976
>> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117977
>> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117978
>> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117979
>> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e b8 00 00 08 00
>> Sep 20 16:02:22 node004 kernel: Buffer I/O error on device dm-6, logical block 117975
>> Sep 20 16:02:22 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e e0 00 00 08 00
>> Sep 20 16:05:22 node004 kernel: Buffer I/O error on device dm-6, logical block 117980
>> Sep 20 16:05:22 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: fatal: I/O error
>> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3:   block = 117980
>> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3:   function = log_write_header, file = fs/gfs2/log.c, line = 616
>> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: about to withdraw this file system
>> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: telling LM to unmount
>> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: withdrawn
>> Sep 20 16:05:22 node004 kernel: Pid: 9820, comm: glock_workqueue Not tainted 2.6.32-279.el6.x86_64 #1
>> Sep 20 16:05:22 node004 kernel: Call Trace:
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa0962062>] ? gfs2_lm_withdraw+0x102/0x130 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffff814fea28>] ? out_of_line_wait_on_bit+0x78/0x90
>> Sep 20 16:05:22 node004 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa09620d0>] ? gfs2_io_error_bh_i+0x40/0x50 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffff811adfb6>] ? __wait_on_buffer+0x26/0x30
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa094a288>] ? log_write_header+0x3a8/0x490 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa094a951>] ? gfs2_log_flush+0x301/0x6f0 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffff810629d3>] ? dequeue_entity+0x113/0x2e0
>> Sep 20 16:05:22 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa09477d0>] ? inode_go_sync+0x80/0x160 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946336>] ? do_xmote+0x156/0x280 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946551>] ? run_queue+0xf1/0x1d0 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946d2a>] ? glock_work_func+0x7a/0x1b0 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946cb0>] ? glock_work_func+0x0/0x1b0 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
>> Sep 20 16:05:22 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
>> Sep 20 16:05:22 node004 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
>> Sep 20 16:05:22 node004 kernel: [<ffffffff81091d66>] ? kthread+0x96/0xa0
>> Sep 20 16:05:22 node004 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
>> Sep 20 16:05:22 node004 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
>> Sep 20 16:05:22 node004 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
>> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 00 08 b0 00 00 08 00
>> Sep 20 16:08:22 node004 kernel: Buffer I/O error on device dm-6, logical block 22
>> Sep 20 16:08:22 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:16:05 node004 xinetd[4416]: START: node-state pid=14578 from=::ffff:10.141.255.254
>> Sep 20 16:16:05 node004 xinetd[4416]: EXIT: node-state status=0 pid=14578 duration=0(sec)
>> Sep 20 16:17:34 node004 xinetd[4416]: START: node-state pid=14653 from=::ffff:10.141.255.254
>> Sep 20 16:17:34 node004 xinetd[4416]: EXIT: node-state status=0 pid=14653 duration=0(sec)
>> Sep 20 16:17:36 node004 xinetd[4416]: START: node-state pid=14671 from=::ffff:10.141.255.254
>> Sep 20 16:17:36 node004 xinetd[4416]: EXIT: node-state status=0 pid=14671 duration=0(sec)
>> Sep 20 16:17:39 node004 xinetd[4416]: START: node-state pid=14690 from=::ffff:10.141.255.254
>> Sep 20 16:17:39 node004 xinetd[4416]: EXIT: node-state status=0 pid=14690 duration=0(sec)
>> Sep 20 16:17:41 node004 xinetd[4416]: START: node-state pid=14708 from=::ffff:10.141.255.254
>> Sep 20 16:17:41 node004 xinetd[4416]: EXIT: node-state status=0 pid=14708 duration=0(sec)
>> On Sep 20, 2012, at 4:14 PM, Andrew Holway wrote:
>> 
>>> Aslo,
>>> 
>>> IOzone gave this error: Error writing block 29813, fd= 3
>>> 
>>> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Trying to acquire journal lock...
>>> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Looking at journal...
>>> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Acquiring the transaction lock...
>>> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Replaying journal...
>>> 
>>> 
>>> GFS seemed to repair itself and things carried on working.
>>> 
>>> thanks,
>>> 
>>> Andrew
>>> 
>>> On Sep 20, 2012, at 4:08 PM, Andrew Holway wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I have set up a 4 node cluster. They are interconnected with an IPoIB (connected mode)
>>>> 
>>>> Whist running a benchmark with IOzone I got the following errors:
>>>> 
>>>> IO seems to have halted.
>>>> 
>>>> Thanks,
>>>> 
>>>> Andrew
>>>> 
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15816 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15816  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880fd5ebbac8 0000000000000086 ffff880fd5ebba38 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ebba58 ffffffff81091f97
>>>> Sep 20 16:01:57 node001 kernel: ffff880fe238c638 ffff880fd5ebbfd8 000000000000fb88 ffff880fe238c638
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15818 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15818  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880fbe5b3ac8 0000000000000082 0000000000000000 ffff881ff95587a0
>>>> Sep 20 16:01:57 node001 kernel: ffff881000000002 ffff88100ee13048 00000000be5b3a58 00000040ffffffff
>>>> Sep 20 16:01:57 node001 kernel: ffff88100eed1ab8 ffff880fbe5b3fd8 000000000000fb88 ffff88100eed1ab8
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15820 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000008     0 15820  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880ffed7bac8 0000000000000086 0000000000000000 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000fed7ba58 00000040ffffffff
>>>> Sep 20 16:01:57 node001 kernel: ffff880fbdd51ab8 ffff880ffed7bfd8 000000000000fb88 ffff880fbdd51ab8
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15822 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15822  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880fd5dddac8 0000000000000086 0000000000000000 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ddda58 ffffffff81091f97
>>>> Sep 20 16:01:57 node001 kernel: ffff880fe238d098 ffff880fd5dddfd8 000000000000fb88 ffff880fe238d098
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15824 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15824  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880fbe5edac8 0000000000000086 0000000000000000 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000be5eda58 00000040ffffffff
>>>> Sep 20 16:01:57 node001 kernel: ffff880ff69085f8 ffff880fbe5edfd8 000000000000fb88 ffff880ff69085f8
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15826 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15826  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880fbe7cfac8 0000000000000086 0000000000000000 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fbe7cfa58 ffffffff81091f97
>>>> Sep 20 16:01:57 node001 kernel: ffff88100ddcbab8 ffff880fbe7cffd8 000000000000fb88 ffff88100ddcbab8
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15828 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15828  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff88100684bac8 0000000000000086 0000000000000000 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 000000000684ba58 00000040ffffffff
>>>> Sep 20 16:01:57 node001 kernel: ffff88100edc7af8 ffff88100684bfd8 000000000000fb88 ffff88100edc7af8
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15830 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000000     0 15830  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880fbdd0fac8 0000000000000082 0000000000000000 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000bdd0fa58 00000040ffffffff
>>>> Sep 20 16:01:57 node001 kernel: ffff88100de93098 ffff880fbdd0ffd8 000000000000fb88 ffff88100de93098
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
> 
> 


From Ralf.Aumueller at informatik.uni-stuttgart.de  Fri Sep 21 10:39:12 2012
From: Ralf.Aumueller at informatik.uni-stuttgart.de (Ralf Aumueller)
Date: Fri, 21 Sep 2012 12:39:12 +0200
Subject: [Linux-cluster] Problem with rgmanager / rgmanager #37: Error
 receiving header from 2 sz=0 CTX 0x1f5d420
In-Reply-To: <505B5840.4030200@alteeve.ca>
References: <505B429A.6050701@informatik.uni-stuttgart.de>
	<505B5840.4030200@alteeve.ca>
Message-ID: <505C43D0.2080201@informatik.uni-stuttgart.de>

On 09/20/2012 07:54 PM, Digimer wrote:
> On 09/20/2012 12:21 PM, Ralf Aumueller wrote:
>> Hello,
>>
>> we have a two node CentOS6.2 Cluster (rgmanager-3.0.12.1-5). After a reboot of
>> node2 the cluster won't work as expected. On node2 clustat just say's :
>>
>> clustat:
>> Cluster Status for cluster1 @ Thu Sep 20 17:06:02 2012
>> Member Status: Quorate
>>
>>   Member Name                                                 ID   Status
>>   ------ ----                                                 ---- ------
>>   node1                                                       1 Online
>>   node2                                                       2 Online, Local
>>
>> No services listed, no rgmanager running. Also it is not possible to
>> start/migrate any services to node2.
>>
>> On node1 a clustat lists all configured services + under Status rgmanager on
>> both nodes. On node1 the rgmanager.log has lots of:
>> rgmanager #37: Error receiving header from 2 sz=0 CTX 0x1XXXXXX
>>
>> On node2 the rgmanager.log gives me:
>> rgmanager #34: Cannot get status for service ...
>>
>> I did not change the cluster.conf. Only change on node2 was: +48MB and an new
>> BIOS version -- recommend by Dell Support).
>>
>> Best regards,
>> Ralf
>>
> 
> Sounds like you hit this bug: 
> http://rhn.redhat.com/errata/RHBA-2012-0897.html
> 
> Update rgmanager to rgmanager-3.0.12.1-12 and you should be ok.

Did an update of rgmanager on both nodes. Just stopping/starting the
cluster-services didn't revolve the problem. A shutdown of both nodes an then a
restart solves the problem.

Thanks and best regards,
Ralf


From a.holway at syseleven.de  Fri Sep 21 13:48:23 2012
From: a.holway at syseleven.de (Andrew Holway)
Date: Fri, 21 Sep 2012 15:48:23 +0200
Subject: [Linux-cluster] GFS fail with iozone
In-Reply-To: <1348216451.2746.5.camel@menhir>
References: <6BC17920-7E5C-44A8-B434-C710CBD7F864@syseleven.de>
	<F1DE4351-2CF4-4D96-93DD-F128DB6F4252@syseleven.de>
	<AA5A5B0B-8CCB-40D1-8A64-873A55111750@syseleven.de>
	<1348216451.2746.5.camel@menhir>
Message-ID: <745313D9-64FC-4E38-ADE7-5967978772F2@syseleven.de>


On Sep 21, 2012, at 10:34 AM, Steven Whitehouse wrote:

> Hi,
> 
> On Thu, 2012-09-20 at 16:25 +0200, Andrew Holway wrote:
>> It seems that my node004 is the problem.
>> 
>> I cannot kill the iozone processes and I find this in the logs.
>> 
> This looks like there is some problem with the i/o stack below the level
> of GFS2. What kind of storage are you using? If this is a JBOD then
> perhaps there is a faulty disk or something like that?

Why do you say that?

It did it again. but I have no indication from my storage brick that I have an issue. It does appear that it was the same node (node004) that caused the issue again.

The other three stopped doing IO for some time and then resumed.

The node004 died completely

I will run with loglevel=TRACE now.

Thanks,

Andrew

node004 messages
Sep 21 11:28:50 node004 dlm_controld[22407]: dlm_controld 3.0.12.1 started
Sep 21 11:28:51 node004 gfs_controld[22456]: gfs_controld 3.0.12.1 started
Sep 21 11:28:59 node004 kernel: dlm: Using TCP for communications
Sep 21 11:29:00 node004 clvmd: Cluster LVM daemon started - connected to CMAN
Sep 21 11:29:00 node004 kernel: dlm: connecting to 2
Sep 21 11:29:00 node004 kernel: dlm: connecting to 1
Sep 21 11:29:01 node004 kernel: dlm: connecting to 3
Sep 21 11:30:06 node004 kernel: GFS2 (built Jun 22 2012 12:21:46) installed
Sep 21 11:30:06 node004 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "nimble_cluster:gfs_test"
Sep 21 11:30:06 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: Joined cluster. Now mounting FS...
Sep 21 11:30:07 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: jid=2, already locked for use
Sep 21 11:30:07 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: jid=2: Looking at journal...
Sep 21 11:30:07 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: jid=2: Done
Sep 21 11:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 11:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 11:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 11:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 11:38:24 node004 xinetd[5015]: START: node-state pid=23019 from=::ffff:10.141.255.254
Sep 21 11:38:24 node004 xinetd[5015]: EXIT: node-state status=0 pid=23019 duration=0(sec)
Sep 21 11:39:23 node004 xinetd[5015]: START: node-state pid=23038 from=::ffff:10.141.255.254
Sep 21 11:39:23 node004 xinetd[5015]: EXIT: node-state status=0 pid=23038 duration=0(sec)
Sep 21 11:39:25 node004 xinetd[5015]: START: node-state pid=23057 from=::ffff:10.141.255.254
Sep 21 11:39:25 node004 xinetd[5015]: EXIT: node-state status=0 pid=23057 duration=0(sec)
Sep 21 11:39:40 node004 xinetd[5015]: START: node-state pid=23075 from=::ffff:10.141.255.254
Sep 21 11:39:40 node004 xinetd[5015]: EXIT: node-state status=0 pid=23075 duration=0(sec)
Sep 21 11:39:45 node004 xinetd[5015]: START: node-state pid=23097 from=::ffff:10.141.255.254
Sep 21 11:39:45 node004 xinetd[5015]: EXIT: node-state status=0 pid=23097 duration=0(sec)
Sep 21 11:39:53 node004 xinetd[5015]: START: node-state pid=23119 from=::ffff:10.141.255.254
Sep 21 11:39:53 node004 xinetd[5015]: EXIT: node-state status=0 pid=23119 duration=0(sec)
Sep 21 11:39:54 node004 rpc.statd[23170]: Version 1.2.3 starting
Sep 21 11:39:54 node004 sm-notify[23171]: Version 1.2.3 starting
Sep 21 11:40:50 node004 rpc.statd[23215]: Version 1.2.3 starting
Sep 21 11:40:50 node004 sm-notify[23216]: Version 1.2.3 starting
Sep 21 12:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 12:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 12:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 12:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 12:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 12:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 12:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 12:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 12:54:43 node004 kernel: INFO: task iozone:23804 blocked for more than 120 seconds.
Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23804  22911 0x00000080
Sep 21 12:54:43 node004 kernel: ffff880dd06db958 0000000000000086 0000000000000000 ffffffffa01bf1fc
Sep 21 12:54:43 node004 kernel: ffff880dd06db928 000000004280d602 0000000000000000 ffff880ff308f380
Sep 21 12:54:43 node004 kernel: ffff88100c25b058 ffff880dd06dbfd8 000000000000fb88 ffff88100c25b058
Sep 21 12:54:43 node004 kernel: Call Trace:
Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 21 12:54:43 node004 kernel: [<ffffffff81179b51>] ? generic_file_llseek_unlocked+0x1/0x80
Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
Sep 21 12:54:43 node004 kernel: [<ffffffff8117c4b9>] ? fget_light+0x19/0x90
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 21 12:54:43 node004 kernel: INFO: task iozone:23805 blocked for more than 120 seconds.
Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23805  22911 0x00000080
Sep 21 12:54:43 node004 kernel: ffff880dd073b958 0000000000000086 0000000000000000 ffffffffa01bf1fc
Sep 21 12:54:43 node004 kernel: ffff880dd073b928 000000002c279ff5 0000000000000000 ffff8810048e10c0
Sep 21 12:54:43 node004 kernel: ffff881015abdaf8 ffff880dd073bfd8 000000000000fb88 ffff881015abdaf8
Sep 21 12:54:43 node004 kernel: Call Trace:
Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
Sep 21 12:54:43 node004 kernel: [<ffffffff811937f0>] ? dput+0x0/0x150
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 21 12:54:43 node004 kernel: INFO: task iozone:23806 blocked for more than 120 seconds.
Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000008     0 23806  22911 0x00000080
Sep 21 12:54:43 node004 kernel: ffff880dd070d958 0000000000000086 0000000000000000 ffffffffa01bf1fc
Sep 21 12:54:43 node004 kernel: ffff880dd070d928 00000000780f0b6e 0000000000000000 ffff882006c04a40
Sep 21 12:54:43 node004 kernel: ffff88100d04d058 ffff880dd070dfd8 000000000000fb88 ffff88100d04d058
Sep 21 12:54:43 node004 kernel: Call Trace:
Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
Sep 21 12:54:43 node004 kernel: [<ffffffff81116681>] ? generic_file_aio_write+0x1/0xe0
Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 21 12:54:43 node004 kernel: INFO: task iozone:23807 blocked for more than 120 seconds.
Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23807  22911 0x00000080
Sep 21 12:54:43 node004 kernel: ffff880dd06ab958 0000000000000082 0000000000000000 ffffffffa01bf1fc
Sep 21 12:54:43 node004 kernel: ffff880dd06ab928 00000000b2eadd67 0000000000000000 ffff881004891ec0
Sep 21 12:54:43 node004 kernel: ffff881015ff8638 ffff880dd06abfd8 000000000000fb88 ffff881015ff8638
Sep 21 12:54:43 node004 kernel: Call Trace:
Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
Sep 21 12:54:43 node004 kernel: [<ffffffff810623da>] ? __cond_resched+0x2a/0x40
Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 21 12:54:43 node004 kernel: [<ffffffff811bb89f>] ? inotify_inode_queue_event+0x2f/0x120
Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 21 12:54:43 node004 kernel: INFO: task iozone:23808 blocked for more than 120 seconds.
Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23808  22911 0x00000080
Sep 21 12:54:43 node004 kernel: ffff88100eb19958 0000000000000082 0000000000000000 ffffffffa01bf1fc
Sep 21 12:54:43 node004 kernel: ffff88100eb19928 00000000291dd6ee 0000000000000000 ffff881004891a40
Sep 21 12:54:43 node004 kernel: ffff88100bb3b098 ffff88100eb19fd8 000000000000fb88 ffff88100bb3b098
Sep 21 12:54:43 node004 kernel: Call Trace:
Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Sep 21 12:54:43 node004 kernel: [<ffffffff814fea54>] ? mutex_unlock+0x14/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 21 12:54:43 node004 kernel: INFO: task iozone:23809 blocked for more than 120 seconds.
Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23809  22911 0x00000080
Sep 21 12:54:43 node004 kernel: ffff880fe76d3958 0000000000000082 0000000000000000 ffffffffa01bf1fc
Sep 21 12:54:43 node004 kernel: ffff880fe76d3928 0000000075c0b5d9 0000000000000000 ffff880eb83d5bc0
Sep 21 12:54:43 node004 kernel: ffff880ff10f65f8 ffff880fe76d3fd8 000000000000fb88 ffff880ff10f65f8
Sep 21 12:54:43 node004 kernel: Call Trace:
Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 21 12:54:43 node004 kernel: INFO: task iozone:23810 blocked for more than 120 seconds.
Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23810  22911 0x00000080
Sep 21 12:54:43 node004 kernel: ffff880e4e045958 0000000000000086 0000000000000000 ffffffffa01bf1fc
Sep 21 12:54:43 node004 kernel: ffff880e4e045928 00000000ab0cddda 0000000000000000 ffff88100d2fd6c0
Sep 21 12:54:43 node004 kernel: ffff8810079ba5f8 ffff880e4e045fd8 000000000000fb88 ffff8810079ba5f8
Sep 21 12:54:43 node004 kernel: Call Trace:
Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff811aa034>] ? generic_write_sync+0x24/0x50
Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdee>] ? reschedule_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 21 12:54:43 node004 kernel: INFO: task iozone:23811 blocked for more than 120 seconds.
Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000008     0 23811  22911 0x00000080
Sep 21 12:54:43 node004 kernel: ffff88100c499958 0000000000000086 0000000000000000 ffffffffa01bf1fc
Sep 21 12:54:43 node004 kernel: ffff88100c499928 000000004b5eff48 0000000000000000 ffff88201678ac80
Sep 21 12:54:43 node004 kernel: ffff88100db61ab8 ffff88100c499fd8 000000000000fb88 ffff88100db61ab8
Sep 21 12:54:43 node004 kernel: Call Trace:
Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff811aa01d>] ? generic_write_sync+0xd/0x50
Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 21 12:54:43 node004 kernel: INFO: task iozone:23813 blocked for more than 120 seconds.
Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23813  22911 0x00000080
Sep 21 12:54:43 node004 kernel: ffff880dd073f958 0000000000000086 0000000000000000 ffffffffa01bf1fc
Sep 21 12:54:43 node004 kernel: ffff880dd073f928 00000000f66794ee 0000000000000000 ffff881007a7ba80
Sep 21 12:54:43 node004 kernel: ffff880fe78c7af8 ffff880dd073ffd8 000000000000fb88 ffff880fe78c7af8
Sep 21 12:54:43 node004 kernel: Call Trace:
Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff814fea41>] ? mutex_unlock+0x1/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
Sep 21 12:54:43 node004 kernel: [<ffffffff811937f0>] ? dput+0x0/0x150
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 21 12:54:43 node004 kernel: INFO: task iozone:23814 blocked for more than 120 seconds.
Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23814  22911 0x00000080
Sep 21 12:54:43 node004 kernel: ffff880e4e119958 0000000000000082 0000000000000000 ffffffffa01bf1fc
Sep 21 12:54:43 node004 kernel: ffff880e4e119928 00000000fa09a3ba 0000000000000000 ffff88100d2fd300
Sep 21 12:54:43 node004 kernel: ffff880fe78c7098 ffff880e4e119fd8 000000000000fb88 ffff880fe78c7098
Sep 21 12:54:43 node004 kernel: Call Trace:
Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff811aa034>] ? generic_write_sync+0x24/0x50
Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Sep 21 12:55:38 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:55:38 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:55:38 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:55:38 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 08 69 0a 88 00 00 20 00
Sep 21 12:55:39 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:55:39 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:55:39 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:55:39 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 03 ce 6a c0 00 00 20 00
Sep 21 12:55:41 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:55:41 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:55:41 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:55:41 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 01 81 f2 28 00 00 20 00
Sep 21 12:55:46 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:55:46 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:55:46 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:55:46 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 06 e6 0b 58 00 00 20 00
Sep 21 12:55:47 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:55:47 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:55:47 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:55:47 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 06 48 cb 48 00 00 20 00
Sep 21 12:55:49 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:55:49 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:55:49 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:55:49 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 07 64 ad 88 00 00 20 00
Sep 21 12:55:50 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:55:50 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:55:50 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:55:50 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 05 79 27 80 00 00 20 00
Sep 21 12:55:52 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:55:52 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:55:52 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:55:52 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 01 6c 26 60 00 00 20 00
Sep 21 12:55:54 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:55:54 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:55:54 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:55:54 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 04 de 20 c8 00 00 20 00
Sep 21 12:55:55 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:55:55 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:55:55 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:55:55 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 03 2a 6c 20 00 00 20 00
Sep 21 12:55:57 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:55:57 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:55:57 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:55:57 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 01 92 25 28 00 00 20 00
Sep 21 12:55:58 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:55:58 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:55:58 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:55:58 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 06 03 33 90 00 00 20 00
Sep 21 12:56:00 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:56:00 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:56:00 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:56:00 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 01 23 7e 40 00 00 20 00
Sep 21 12:56:02 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:56:02 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:56:02 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:56:02 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 07 73 c9 18 00 00 20 00
Sep 21 12:56:03 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:56:03 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:56:03 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:56:03 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 01 4f 6a a8 00 00 20 00
Sep 21 12:56:05 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:56:05 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:56:05 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:56:05 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 04 3e 9c c8 00 00 20 00
Sep 21 12:59:05 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 12:59:05 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 12:59:05 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 12:59:05 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 00 09 ea c8 00 00 28 00
Sep 21 12:59:05 node004 kernel: Buffer I/O error on device dm-0, logical block 80985
Sep 21 12:59:05 node004 kernel: lost page write due to I/O error on dm-0
Sep 21 12:59:05 node004 kernel: Buffer I/O error on device dm-0, logical block 80986
Sep 21 12:59:05 node004 kernel: lost page write due to I/O error on dm-0
Sep 21 12:59:05 node004 kernel: Buffer I/O error on device dm-0, logical block 80987
Sep 21 12:59:05 node004 kernel: lost page write due to I/O error on dm-0
Sep 21 12:59:05 node004 kernel: Buffer I/O error on device dm-0, logical block 80988
Sep 21 12:59:05 node004 kernel: lost page write due to I/O error on dm-0
Sep 21 12:59:05 node004 kernel: Buffer I/O error on device dm-0, logical block 80989
Sep 21 12:59:05 node004 kernel: lost page write due to I/O error on dm-0
Sep 21 13:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 13:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 13:02:05 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
Sep 21 13:02:05 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
Sep 21 13:02:05 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 21 13:02:05 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 00 09 ea f0 00 00 08 00
Sep 21 13:02:05 node004 kernel: Buffer I/O error on device dm-0, logical block 80990
Sep 21 13:02:05 node004 kernel: lost page write due to I/O error on dm-0
Sep 21 13:02:05 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: fatal: I/O error
Sep 21 13:02:05 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2:   block = 80990
Sep 21 13:02:05 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2:   function = log_write_header, file = fs/gfs2/log.c, line = 616
Sep 21 13:02:05 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: about to withdraw this file system
Sep 21 13:02:05 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: telling LM to unmount
Sep 21 13:02:05 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: withdrawn
Sep 21 13:02:05 node004 kernel: Pid: 22758, comm: glock_workqueue Not tainted 2.6.32-279.el6.x86_64 #1
Sep 21 13:02:05 node004 kernel: Call Trace:
Sep 21 13:02:05 node004 kernel: [<ffffffffa09ad062>] ? gfs2_lm_withdraw+0x102/0x130 [gfs2]
Sep 21 13:02:05 node004 kernel: [<ffffffff814fea28>] ? out_of_line_wait_on_bit+0x78/0x90
Sep 21 13:02:05 node004 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
Sep 21 13:02:05 node004 kernel: [<ffffffffa09ad0d0>] ? gfs2_io_error_bh_i+0x40/0x50 [gfs2]
Sep 21 13:02:05 node004 kernel: [<ffffffff811adfb6>] ? __wait_on_buffer+0x26/0x30
Sep 21 13:02:05 node004 kernel: [<ffffffffa0995288>] ? log_write_header+0x3a8/0x490 [gfs2]
Sep 21 13:02:05 node004 kernel: [<ffffffffa0995951>] ? gfs2_log_flush+0x301/0x6f0 [gfs2]
Sep 21 13:02:05 node004 kernel: [<ffffffff810629d3>] ? dequeue_entity+0x113/0x2e0
Sep 21 13:02:05 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 21 13:02:05 node004 kernel: [<ffffffffa09927d0>] ? inode_go_sync+0x80/0x160 [gfs2]
Sep 21 13:02:05 node004 kernel: [<ffffffffa0991336>] ? do_xmote+0x156/0x280 [gfs2]
Sep 21 13:02:05 node004 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
Sep 21 13:02:05 node004 kernel: [<ffffffffa0991551>] ? run_queue+0xf1/0x1d0 [gfs2]
Sep 21 13:02:05 node004 kernel: [<ffffffffa0991d2a>] ? glock_work_func+0x7a/0x1b0 [gfs2]
Sep 21 13:02:05 node004 kernel: [<ffffffffa0991cb0>] ? glock_work_func+0x0/0x1b0 [gfs2]
Sep 21 13:02:05 node004 kernel: [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
Sep 21 13:02:05 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Sep 21 13:02:05 node004 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
Sep 21 13:02:05 node004 kernel: [<ffffffff81091d66>] ? kthread+0x96/0xa0
Sep 21 13:02:05 node004 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
Sep 21 13:02:05 node004 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Sep 21 13:02:05 node004 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Sep 21 13:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 13:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 13:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 13:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 14:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 14:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 14:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 14:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 14:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 14:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 14:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 14:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 15:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 15:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 15:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 15:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 15:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 15:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 15:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
Sep 21 15:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]


> 
> Steve.
> 
>> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 01 94 88 a0 00 00 20 00
>> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 51 ff 90 00 00 20 00
>> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 46 d5 c0 00 00 20 00
>> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 03 da c7 78 00 00 20 00
>> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 f5 8f 60 00 00 20 00
>> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 01 30 7c 90 00 00 20 00
>> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 05 79 8b e0 00 00 20 00
>> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 04 37 13 08 00 00 20 00
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
>> Sep 20 16:02:15 node004 kernel: INFO: task glock_workqueue:9820 blocked for more than 120 seconds.
>> Sep 20 16:02:15 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Sep 20 16:02:15 node004 kernel: glock_workque D 000000000000001b     0  9820      2 0x00000080
>> Sep 20 16:02:15 node004 kernel: ffff8820150a9c70 0000000000000046 0000000000000004 00000000aa8f20cf
>> Sep 20 16:02:15 node004 kernel: ffff881fffd050c8 0000000000000441 ffff8820150a9c10 ffffffff811acd5e
>> Sep 20 16:02:15 node004 kernel: ffff882015b39ab8 ffff8820150a9fd8 000000000000fb88 ffff882015b39ab8
>> Sep 20 16:02:15 node004 kernel: Call Trace:
>> Sep 20 16:02:15 node004 kernel: [<ffffffff811acd5e>] ? submit_bh+0x10e/0x150
>> Sep 20 16:02:15 node004 kernel: [<ffffffff8109cd39>] ? ktime_get_ts+0xa9/0xe0
>> Sep 20 16:02:15 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
>> Sep 20 16:02:15 node004 kernel: [<ffffffffa094aaca>] gfs2_log_flush+0x47a/0x6f0 [gfs2]
>> Sep 20 16:02:15 node004 kernel: [<ffffffff810629d3>] ? dequeue_entity+0x113/0x2e0
>> Sep 20 16:02:15 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
>> Sep 20 16:02:15 node004 kernel: [<ffffffffa09477d0>] inode_go_sync+0x80/0x160 [gfs2]
>> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946336>] do_xmote+0x156/0x280 [gfs2]
>> Sep 20 16:02:15 node004 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
>> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946551>] run_queue+0xf1/0x1d0 [gfs2]
>> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946d2a>] glock_work_func+0x7a/0x1b0 [gfs2]
>> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946cb0>] ? glock_work_func+0x0/0x1b0 [gfs2]
>> Sep 20 16:02:15 node004 kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0
>> Sep 20 16:02:15 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
>> Sep 20 16:02:15 node004 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
>> Sep 20 16:02:15 node004 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
>> Sep 20 16:02:15 node004 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
>> Sep 20 16:02:15 node004 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
>> Sep 20 16:02:15 node004 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
>> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e c0 00 00 20 00
>> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117976
>> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117977
>> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117978
>> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117979
>> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e b8 00 00 08 00
>> Sep 20 16:02:22 node004 kernel: Buffer I/O error on device dm-6, logical block 117975
>> Sep 20 16:02:22 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e e0 00 00 08 00
>> Sep 20 16:05:22 node004 kernel: Buffer I/O error on device dm-6, logical block 117980
>> Sep 20 16:05:22 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: fatal: I/O error
>> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3:   block = 117980
>> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3:   function = log_write_header, file = fs/gfs2/log.c, line = 616
>> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: about to withdraw this file system
>> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: telling LM to unmount
>> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: withdrawn
>> Sep 20 16:05:22 node004 kernel: Pid: 9820, comm: glock_workqueue Not tainted 2.6.32-279.el6.x86_64 #1
>> Sep 20 16:05:22 node004 kernel: Call Trace:
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa0962062>] ? gfs2_lm_withdraw+0x102/0x130 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffff814fea28>] ? out_of_line_wait_on_bit+0x78/0x90
>> Sep 20 16:05:22 node004 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa09620d0>] ? gfs2_io_error_bh_i+0x40/0x50 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffff811adfb6>] ? __wait_on_buffer+0x26/0x30
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa094a288>] ? log_write_header+0x3a8/0x490 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa094a951>] ? gfs2_log_flush+0x301/0x6f0 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffff810629d3>] ? dequeue_entity+0x113/0x2e0
>> Sep 20 16:05:22 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa09477d0>] ? inode_go_sync+0x80/0x160 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946336>] ? do_xmote+0x156/0x280 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946551>] ? run_queue+0xf1/0x1d0 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946d2a>] ? glock_work_func+0x7a/0x1b0 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946cb0>] ? glock_work_func+0x0/0x1b0 [gfs2]
>> Sep 20 16:05:22 node004 kernel: [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
>> Sep 20 16:05:22 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
>> Sep 20 16:05:22 node004 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
>> Sep 20 16:05:22 node004 kernel: [<ffffffff81091d66>] ? kthread+0x96/0xa0
>> Sep 20 16:05:22 node004 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
>> Sep 20 16:05:22 node004 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
>> Sep 20 16:05:22 node004 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
>> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
>> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
>> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 00 08 b0 00 00 08 00
>> Sep 20 16:08:22 node004 kernel: Buffer I/O error on device dm-6, logical block 22
>> Sep 20 16:08:22 node004 kernel: lost page write due to I/O error on dm-6
>> Sep 20 16:16:05 node004 xinetd[4416]: START: node-state pid=14578 from=::ffff:10.141.255.254
>> Sep 20 16:16:05 node004 xinetd[4416]: EXIT: node-state status=0 pid=14578 duration=0(sec)
>> Sep 20 16:17:34 node004 xinetd[4416]: START: node-state pid=14653 from=::ffff:10.141.255.254
>> Sep 20 16:17:34 node004 xinetd[4416]: EXIT: node-state status=0 pid=14653 duration=0(sec)
>> Sep 20 16:17:36 node004 xinetd[4416]: START: node-state pid=14671 from=::ffff:10.141.255.254
>> Sep 20 16:17:36 node004 xinetd[4416]: EXIT: node-state status=0 pid=14671 duration=0(sec)
>> Sep 20 16:17:39 node004 xinetd[4416]: START: node-state pid=14690 from=::ffff:10.141.255.254
>> Sep 20 16:17:39 node004 xinetd[4416]: EXIT: node-state status=0 pid=14690 duration=0(sec)
>> Sep 20 16:17:41 node004 xinetd[4416]: START: node-state pid=14708 from=::ffff:10.141.255.254
>> Sep 20 16:17:41 node004 xinetd[4416]: EXIT: node-state status=0 pid=14708 duration=0(sec)
>> On Sep 20, 2012, at 4:14 PM, Andrew Holway wrote:
>> 
>>> Aslo,
>>> 
>>> IOzone gave this error: Error writing block 29813, fd= 3
>>> 
>>> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Trying to acquire journal lock...
>>> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Looking at journal...
>>> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Acquiring the transaction lock...
>>> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Replaying journal...
>>> 
>>> 
>>> GFS seemed to repair itself and things carried on working.
>>> 
>>> thanks,
>>> 
>>> Andrew
>>> 
>>> On Sep 20, 2012, at 4:08 PM, Andrew Holway wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I have set up a 4 node cluster. They are interconnected with an IPoIB (connected mode)
>>>> 
>>>> Whist running a benchmark with IOzone I got the following errors:
>>>> 
>>>> IO seems to have halted.
>>>> 
>>>> Thanks,
>>>> 
>>>> Andrew
>>>> 
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15816 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15816  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880fd5ebbac8 0000000000000086 ffff880fd5ebba38 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ebba58 ffffffff81091f97
>>>> Sep 20 16:01:57 node001 kernel: ffff880fe238c638 ffff880fd5ebbfd8 000000000000fb88 ffff880fe238c638
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15818 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15818  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880fbe5b3ac8 0000000000000082 0000000000000000 ffff881ff95587a0
>>>> Sep 20 16:01:57 node001 kernel: ffff881000000002 ffff88100ee13048 00000000be5b3a58 00000040ffffffff
>>>> Sep 20 16:01:57 node001 kernel: ffff88100eed1ab8 ffff880fbe5b3fd8 000000000000fb88 ffff88100eed1ab8
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15820 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000008     0 15820  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880ffed7bac8 0000000000000086 0000000000000000 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000fed7ba58 00000040ffffffff
>>>> Sep 20 16:01:57 node001 kernel: ffff880fbdd51ab8 ffff880ffed7bfd8 000000000000fb88 ffff880fbdd51ab8
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15822 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15822  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880fd5dddac8 0000000000000086 0000000000000000 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ddda58 ffffffff81091f97
>>>> Sep 20 16:01:57 node001 kernel: ffff880fe238d098 ffff880fd5dddfd8 000000000000fb88 ffff880fe238d098
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15824 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15824  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880fbe5edac8 0000000000000086 0000000000000000 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000be5eda58 00000040ffffffff
>>>> Sep 20 16:01:57 node001 kernel: ffff880ff69085f8 ffff880fbe5edfd8 000000000000fb88 ffff880ff69085f8
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15826 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15826  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880fbe7cfac8 0000000000000086 0000000000000000 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fbe7cfa58 ffffffff81091f97
>>>> Sep 20 16:01:57 node001 kernel: ffff88100ddcbab8 ffff880fbe7cffd8 000000000000fb88 ffff88100ddcbab8
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15828 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15828  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff88100684bac8 0000000000000086 0000000000000000 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 000000000684ba58 00000040ffffffff
>>>> Sep 20 16:01:57 node001 kernel: ffff88100edc7af8 ffff88100684bfd8 000000000000fb88 ffff88100edc7af8
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15830 blocked for more than 120 seconds.
>>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000000     0 15830  15374 0x00000080
>>>> Sep 20 16:01:57 node001 kernel: ffff880fbdd0fac8 0000000000000082 0000000000000000 ffffffff81276b66
>>>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000bdd0fa58 00000040ffffffff
>>>> Sep 20 16:01:57 node001 kernel: ffff88100de93098 ffff880fbdd0ffd8 000000000000fb88 ffff88100de93098
>>>> Sep 20 16:01:57 node001 kernel: Call Trace:
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
>>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
> 
> 


From swhiteho at redhat.com  Fri Sep 21 13:56:32 2012
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Fri, 21 Sep 2012 14:56:32 +0100
Subject: [Linux-cluster] GFS fail with iozone
In-Reply-To: <745313D9-64FC-4E38-ADE7-5967978772F2@syseleven.de>
References: <6BC17920-7E5C-44A8-B434-C710CBD7F864@syseleven.de>
	<F1DE4351-2CF4-4D96-93DD-F128DB6F4252@syseleven.de>
	<AA5A5B0B-8CCB-40D1-8A64-873A55111750@syseleven.de>
	<1348216451.2746.5.camel@menhir>
	<745313D9-64FC-4E38-ADE7-5967978772F2@syseleven.de>
Message-ID: <1348235792.2746.42.camel@menhir>

Hi,

On Fri, 2012-09-21 at 15:48 +0200, Andrew Holway wrote:
> On Sep 21, 2012, at 10:34 AM, Steven Whitehouse wrote:
> 
> > Hi,
> > 
> > On Thu, 2012-09-20 at 16:25 +0200, Andrew Holway wrote:
> >> It seems that my node004 is the problem.
> >> 
> >> I cannot kill the iozone processes and I find this in the logs.
> >> 
> > This looks like there is some problem with the i/o stack below the level
> > of GFS2. What kind of storage are you using? If this is a JBOD then
> > perhaps there is a faulty disk or something like that?
> 
> Why do you say that?
> 
Based on your logs below....

> It did it again. but I have no indication from my storage brick that I have an issue. It does appear that it was the same node (node004) that caused the issue again.
> 
> The other three stopped doing IO for some time and then resumed.
> 
> The node004 died completely
> 
> I will run with loglevel=TRACE now.
> 
> Thanks,
> 
> Andrew
> 
> node004 messages
> Sep 21 11:28:50 node004 dlm_controld[22407]: dlm_controld 3.0.12.1 started
> Sep 21 11:28:51 node004 gfs_controld[22456]: gfs_controld 3.0.12.1 started
> Sep 21 11:28:59 node004 kernel: dlm: Using TCP for communications
> Sep 21 11:29:00 node004 clvmd: Cluster LVM daemon started - connected to CMAN
> Sep 21 11:29:00 node004 kernel: dlm: connecting to 2
> Sep 21 11:29:00 node004 kernel: dlm: connecting to 1
> Sep 21 11:29:01 node004 kernel: dlm: connecting to 3
> Sep 21 11:30:06 node004 kernel: GFS2 (built Jun 22 2012 12:21:46) installed
> Sep 21 11:30:06 node004 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "nimble_cluster:gfs_test"
> Sep 21 11:30:06 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: Joined cluster. Now mounting FS...
> Sep 21 11:30:07 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: jid=2, already locked for use
> Sep 21 11:30:07 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: jid=2: Looking at journal...
> Sep 21 11:30:07 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: jid=2: Done
> Sep 21 11:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 11:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 11:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 11:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 11:38:24 node004 xinetd[5015]: START: node-state pid=23019 from=::ffff:10.141.255.254
> Sep 21 11:38:24 node004 xinetd[5015]: EXIT: node-state status=0 pid=23019 duration=0(sec)
> Sep 21 11:39:23 node004 xinetd[5015]: START: node-state pid=23038 from=::ffff:10.141.255.254
> Sep 21 11:39:23 node004 xinetd[5015]: EXIT: node-state status=0 pid=23038 duration=0(sec)
> Sep 21 11:39:25 node004 xinetd[5015]: START: node-state pid=23057 from=::ffff:10.141.255.254
> Sep 21 11:39:25 node004 xinetd[5015]: EXIT: node-state status=0 pid=23057 duration=0(sec)
> Sep 21 11:39:40 node004 xinetd[5015]: START: node-state pid=23075 from=::ffff:10.141.255.254
> Sep 21 11:39:40 node004 xinetd[5015]: EXIT: node-state status=0 pid=23075 duration=0(sec)
> Sep 21 11:39:45 node004 xinetd[5015]: START: node-state pid=23097 from=::ffff:10.141.255.254
> Sep 21 11:39:45 node004 xinetd[5015]: EXIT: node-state status=0 pid=23097 duration=0(sec)
> Sep 21 11:39:53 node004 xinetd[5015]: START: node-state pid=23119 from=::ffff:10.141.255.254
> Sep 21 11:39:53 node004 xinetd[5015]: EXIT: node-state status=0 pid=23119 duration=0(sec)
> Sep 21 11:39:54 node004 rpc.statd[23170]: Version 1.2.3 starting
> Sep 21 11:39:54 node004 sm-notify[23171]: Version 1.2.3 starting
> Sep 21 11:40:50 node004 rpc.statd[23215]: Version 1.2.3 starting
> Sep 21 11:40:50 node004 sm-notify[23216]: Version 1.2.3 starting
> Sep 21 12:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 12:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 12:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 12:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 12:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 12:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 12:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 12:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]

This is a SCSI error of some kind....

> Sep 21 12:54:43 node004 kernel: INFO: task iozone:23804 blocked for more than 120 seconds.
> Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23804  22911 0x00000080
> Sep 21 12:54:43 node004 kernel: ffff880dd06db958 0000000000000086 0000000000000000 ffffffffa01bf1fc
> Sep 21 12:54:43 node004 kernel: ffff880dd06db928 000000004280d602 0000000000000000 ffff880ff308f380
> Sep 21 12:54:43 node004 kernel: ffff88100c25b058 ffff880dd06dbfd8 000000000000fb88 ffff88100c25b058
> Sep 21 12:54:43 node004 kernel: Call Trace:
> Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
> Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
> Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
> Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 21 12:54:43 node004 kernel: [<ffffffff81179b51>] ? generic_file_llseek_unlocked+0x1/0x80
> Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
> Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117c4b9>] ? fget_light+0x19/0x90
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

The above is GFS2 getting stuck doing a direct i/o write, and the reason
that it is stuck is that it is waiting for an i/o request to complete.


> Sep 21 12:54:43 node004 kernel: INFO: task iozone:23805 blocked for more than 120 seconds.
> Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23805  22911 0x00000080
> Sep 21 12:54:43 node004 kernel: ffff880dd073b958 0000000000000086 0000000000000000 ffffffffa01bf1fc
> Sep 21 12:54:43 node004 kernel: ffff880dd073b928 000000002c279ff5 0000000000000000 ffff8810048e10c0
> Sep 21 12:54:43 node004 kernel: ffff881015abdaf8 ffff880dd073bfd8 000000000000fb88 ffff881015abdaf8
> Sep 21 12:54:43 node004 kernel: Call Trace:
> Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
> Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
> Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
> Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
> Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
> Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
> Sep 21 12:54:43 node004 kernel: [<ffffffff811937f0>] ? dput+0x0/0x150
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 21 12:54:43 node004 kernel: INFO: task iozone:23806 blocked for more than 120 seconds.
> Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000008     0 23806  22911 0x00000080
> Sep 21 12:54:43 node004 kernel: ffff880dd070d958 0000000000000086 0000000000000000 ffffffffa01bf1fc
> Sep 21 12:54:43 node004 kernel: ffff880dd070d928 00000000780f0b6e 0000000000000000 ffff882006c04a40
> Sep 21 12:54:43 node004 kernel: ffff88100d04d058 ffff880dd070dfd8 000000000000fb88 ffff88100d04d058
> Sep 21 12:54:43 node004 kernel: Call Trace:
> Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
> Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
> Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
> Sep 21 12:54:43 node004 kernel: [<ffffffff81116681>] ? generic_file_aio_write+0x1/0xe0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
> Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 21 12:54:43 node004 kernel: INFO: task iozone:23807 blocked for more than 120 seconds.
> Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23807  22911 0x00000080
> Sep 21 12:54:43 node004 kernel: ffff880dd06ab958 0000000000000082 0000000000000000 ffffffffa01bf1fc
> Sep 21 12:54:43 node004 kernel: ffff880dd06ab928 00000000b2eadd67 0000000000000000 ffff881004891ec0
> Sep 21 12:54:43 node004 kernel: ffff881015ff8638 ffff880dd06abfd8 000000000000fb88 ffff881015ff8638
> Sep 21 12:54:43 node004 kernel: Call Trace:
> Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
> Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
> Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
> Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
> Sep 21 12:54:43 node004 kernel: [<ffffffff810623da>] ? __cond_resched+0x2a/0x40
> Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 21 12:54:43 node004 kernel: [<ffffffff811bb89f>] ? inotify_inode_queue_event+0x2f/0x120
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
> Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
> Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 21 12:54:43 node004 kernel: INFO: task iozone:23808 blocked for more than 120 seconds.
> Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23808  22911 0x00000080
> Sep 21 12:54:43 node004 kernel: ffff88100eb19958 0000000000000082 0000000000000000 ffffffffa01bf1fc
> Sep 21 12:54:43 node004 kernel: ffff88100eb19928 00000000291dd6ee 0000000000000000 ffff881004891a40
> Sep 21 12:54:43 node004 kernel: ffff88100bb3b098 ffff88100eb19fd8 000000000000fb88 ffff88100bb3b098
> Sep 21 12:54:43 node004 kernel: Call Trace:
> Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
> Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
> Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
> Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
> Sep 21 12:54:43 node004 kernel: [<ffffffff814fea54>] ? mutex_unlock+0x14/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
> Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 21 12:54:43 node004 kernel: INFO: task iozone:23809 blocked for more than 120 seconds.
> Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23809  22911 0x00000080
> Sep 21 12:54:43 node004 kernel: ffff880fe76d3958 0000000000000082 0000000000000000 ffffffffa01bf1fc
> Sep 21 12:54:43 node004 kernel: ffff880fe76d3928 0000000075c0b5d9 0000000000000000 ffff880eb83d5bc0
> Sep 21 12:54:43 node004 kernel: ffff880ff10f65f8 ffff880fe76d3fd8 000000000000fb88 ffff880ff10f65f8
> Sep 21 12:54:43 node004 kernel: Call Trace:
> Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
> Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
> Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
> Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
> Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
> Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 21 12:54:43 node004 kernel: INFO: task iozone:23810 blocked for more than 120 seconds.
> Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23810  22911 0x00000080
> Sep 21 12:54:43 node004 kernel: ffff880e4e045958 0000000000000086 0000000000000000 ffffffffa01bf1fc
> Sep 21 12:54:43 node004 kernel: ffff880e4e045928 00000000ab0cddda 0000000000000000 ffff88100d2fd6c0
> Sep 21 12:54:43 node004 kernel: ffff8810079ba5f8 ffff880e4e045fd8 000000000000fb88 ffff8810079ba5f8
> Sep 21 12:54:43 node004 kernel: Call Trace:
> Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
> Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
> Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
> Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff811aa034>] ? generic_write_sync+0x24/0x50
> Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdee>] ? reschedule_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
> Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 21 12:54:43 node004 kernel: INFO: task iozone:23811 blocked for more than 120 seconds.
> Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000008     0 23811  22911 0x00000080
> Sep 21 12:54:43 node004 kernel: ffff88100c499958 0000000000000086 0000000000000000 ffffffffa01bf1fc
> Sep 21 12:54:43 node004 kernel: ffff88100c499928 000000004b5eff48 0000000000000000 ffff88201678ac80
> Sep 21 12:54:43 node004 kernel: ffff88100db61ab8 ffff88100c499fd8 000000000000fb88 ffff88100db61ab8
> Sep 21 12:54:43 node004 kernel: Call Trace:
> Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
> Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
> Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
> Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff811aa01d>] ? generic_write_sync+0xd/0x50
> Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
> Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 21 12:54:43 node004 kernel: INFO: task iozone:23813 blocked for more than 120 seconds.
> Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23813  22911 0x00000080
> Sep 21 12:54:43 node004 kernel: ffff880dd073f958 0000000000000086 0000000000000000 ffffffffa01bf1fc
> Sep 21 12:54:43 node004 kernel: ffff880dd073f928 00000000f66794ee 0000000000000000 ffff881007a7ba80
> Sep 21 12:54:43 node004 kernel: ffff880fe78c7af8 ffff880dd073ffd8 000000000000fb88 ffff880fe78c7af8
> Sep 21 12:54:43 node004 kernel: Call Trace:
> Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
> Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
> Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
> Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff814fea41>] ? mutex_unlock+0x1/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
> Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
> Sep 21 12:54:43 node004 kernel: [<ffffffff811937f0>] ? dput+0x0/0x150
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> Sep 21 12:54:43 node004 kernel: INFO: task iozone:23814 blocked for more than 120 seconds.
> Sep 21 12:54:43 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 12:54:43 node004 kernel: iozone        D 0000000000000011     0 23814  22911 0x00000080
> Sep 21 12:54:43 node004 kernel: ffff880e4e119958 0000000000000082 0000000000000000 ffffffffa01bf1fc
> Sep 21 12:54:43 node004 kernel: ffff880e4e119928 00000000fa09a3ba 0000000000000000 ffff88100d2fd300
> Sep 21 12:54:43 node004 kernel: ffff880fe78c7098 ffff880e4e119fd8 000000000000fb88 ffff880fe78c7098
> Sep 21 12:54:43 node004 kernel: Call Trace:
> Sep 21 12:54:43 node004 kernel: [<ffffffffa01bf1fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
> Sep 21 12:54:43 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b663e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
> Sep 21 12:54:43 node004 kernel: [<ffffffffa098e570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0990ca8>] ? do_promote+0x208/0x330 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff811b6b2e>] __blockdev_direct_IO+0x5e/0xd0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998990>] gfs2_direct_IO+0x100/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa0998760>] ? gfs2_get_block_direct+0x0/0x20 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffffa09988ec>] ? gfs2_direct_IO+0x5c/0x110 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff81114d32>] generic_file_direct_write+0xc2/0x190
> Sep 21 12:54:43 node004 kernel: [<ffffffff81116545>] __generic_file_aio_write+0x345/0x480
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bdae>] ? call_function_single_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff811aa034>] ? generic_write_sync+0x24/0x50
> Sep 21 12:54:43 node004 kernel: [<ffffffff811166ef>] generic_file_aio_write+0x6f/0xe0
> Sep 21 12:54:43 node004 kernel: [<ffffffffa099b8be>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ac70>] ? do_sync_write+0x0/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ad6a>] do_sync_write+0xfa/0x140
> Sep 21 12:54:43 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 21 12:54:43 node004 kernel: [<ffffffff8121fd8b>] ? selinux_file_permission+0xfb/0x150
> Sep 21 12:54:43 node004 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
> Sep 21 12:54:43 node004 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
> Sep 21 12:54:43 node004 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b


These bits below are SCSI errors from sdj

> Sep 21 12:55:38 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:55:38 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:55:38 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:55:38 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 08 69 0a 88 00 00 20 00
> Sep 21 12:55:39 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:55:39 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:55:39 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:55:39 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 03 ce 6a c0 00 00 20 00
> Sep 21 12:55:41 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:55:41 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:55:41 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:55:41 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 01 81 f2 28 00 00 20 00
> Sep 21 12:55:46 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:55:46 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:55:46 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:55:46 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 06 e6 0b 58 00 00 20 00
> Sep 21 12:55:47 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:55:47 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:55:47 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:55:47 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 06 48 cb 48 00 00 20 00
> Sep 21 12:55:49 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:55:49 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:55:49 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:55:49 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 07 64 ad 88 00 00 20 00
> Sep 21 12:55:50 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:55:50 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:55:50 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:55:50 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 05 79 27 80 00 00 20 00
> Sep 21 12:55:52 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:55:52 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:55:52 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:55:52 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 01 6c 26 60 00 00 20 00
> Sep 21 12:55:54 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:55:54 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:55:54 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:55:54 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 04 de 20 c8 00 00 20 00
> Sep 21 12:55:55 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:55:55 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:55:55 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:55:55 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 03 2a 6c 20 00 00 20 00
> Sep 21 12:55:57 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:55:57 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:55:57 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:55:57 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 01 92 25 28 00 00 20 00
> Sep 21 12:55:58 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:55:58 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:55:58 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:55:58 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 06 03 33 90 00 00 20 00
> Sep 21 12:56:00 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:56:00 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:56:00 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:56:00 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 01 23 7e 40 00 00 20 00
> Sep 21 12:56:02 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:56:02 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:56:02 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:56:02 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 07 73 c9 18 00 00 20 00
> Sep 21 12:56:03 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:56:03 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:56:03 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:56:03 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 01 4f 6a a8 00 00 20 00
> Sep 21 12:56:05 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:56:05 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:56:05 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:56:05 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 04 3e 9c c8 00 00 20 00
> Sep 21 12:59:05 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 12:59:05 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 12:59:05 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 12:59:05 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 00 09 ea c8 00 00 28 00
> Sep 21 12:59:05 node004 kernel: Buffer I/O error on device dm-0, logical block 80985
> Sep 21 12:59:05 node004 kernel: lost page write due to I/O error on dm-0
> Sep 21 12:59:05 node004 kernel: Buffer I/O error on device dm-0, logical block 80986
> Sep 21 12:59:05 node004 kernel: lost page write due to I/O error on dm-0
> Sep 21 12:59:05 node004 kernel: Buffer I/O error on device dm-0, logical block 80987
> Sep 21 12:59:05 node004 kernel: lost page write due to I/O error on dm-0
> Sep 21 12:59:05 node004 kernel: Buffer I/O error on device dm-0, logical block 80988
> Sep 21 12:59:05 node004 kernel: lost page write due to I/O error on dm-0
> Sep 21 12:59:05 node004 kernel: Buffer I/O error on device dm-0, logical block 80989
> Sep 21 12:59:05 node004 kernel: lost page write due to I/O error on dm-0

Now the buffer layer is complaining that writes are failing to dm-0
presumably as a result of the previously seen errors.


> Sep 21 13:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 13:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 13:02:05 node004 kernel: sd 6:0:0:0: timing out command, waited 180s
> Sep 21 13:02:05 node004 kernel: sd 6:0:0:0: [sdj] Unhandled error code
> Sep 21 13:02:05 node004 kernel: sd 6:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Sep 21 13:02:05 node004 kernel: sd 6:0:0:0: [sdj] CDB: Write(10): 2a 00 00 09 ea f0 00 00 08 00
> Sep 21 13:02:05 node004 kernel: Buffer I/O error on device dm-0, logical block 80990
> Sep 21 13:02:05 node004 kernel: lost page write due to I/O error on dm-0
> Sep 21 13:02:05 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: fatal: I/O error
> Sep 21 13:02:05 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2:   block = 80990
> Sep 21 13:02:05 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2:   function = log_write_header, file = fs/gfs2/log.c, line = 616
> Sep 21 13:02:05 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: about to withdraw this file system
> Sep 21 13:02:05 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: telling LM to unmount
> Sep 21 13:02:05 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.2: withdrawn
> Sep 21 13:02:05 node004 kernel: Pid: 22758, comm: glock_workqueue Not tainted 2.6.32-279.el6.x86_64 #1
> Sep 21 13:02:05 node004 kernel: Call Trace:
> Sep 21 13:02:05 node004 kernel: [<ffffffffa09ad062>] ? gfs2_lm_withdraw+0x102/0x130 [gfs2]
> Sep 21 13:02:05 node004 kernel: [<ffffffff814fea28>] ? out_of_line_wait_on_bit+0x78/0x90
> Sep 21 13:02:05 node004 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> Sep 21 13:02:05 node004 kernel: [<ffffffffa09ad0d0>] ? gfs2_io_error_bh_i+0x40/0x50 [gfs2]
> Sep 21 13:02:05 node004 kernel: [<ffffffff811adfb6>] ? __wait_on_buffer+0x26/0x30
> Sep 21 13:02:05 node004 kernel: [<ffffffffa0995288>] ? log_write_header+0x3a8/0x490 [gfs2]
> Sep 21 13:02:05 node004 kernel: [<ffffffffa0995951>] ? gfs2_log_flush+0x301/0x6f0 [gfs2]
> Sep 21 13:02:05 node004 kernel: [<ffffffff810629d3>] ? dequeue_entity+0x113/0x2e0
> Sep 21 13:02:05 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 21 13:02:05 node004 kernel: [<ffffffffa09927d0>] ? inode_go_sync+0x80/0x160 [gfs2]
> Sep 21 13:02:05 node004 kernel: [<ffffffffa0991336>] ? do_xmote+0x156/0x280 [gfs2]
> Sep 21 13:02:05 node004 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
> Sep 21 13:02:05 node004 kernel: [<ffffffffa0991551>] ? run_queue+0xf1/0x1d0 [gfs2]
> Sep 21 13:02:05 node004 kernel: [<ffffffffa0991d2a>] ? glock_work_func+0x7a/0x1b0 [gfs2]
> Sep 21 13:02:05 node004 kernel: [<ffffffffa0991cb0>] ? glock_work_func+0x0/0x1b0 [gfs2]
> Sep 21 13:02:05 node004 kernel: [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
> Sep 21 13:02:05 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> Sep 21 13:02:05 node004 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
> Sep 21 13:02:05 node004 kernel: [<ffffffff81091d66>] ? kthread+0x96/0xa0
> Sep 21 13:02:05 node004 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
> Sep 21 13:02:05 node004 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
> Sep 21 13:02:05 node004 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
> Sep 21 13:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 13:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 13:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 13:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 14:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 14:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 14:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 14:00:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 14:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 14:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 14:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 14:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 15:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 15:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 15:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 15:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 15:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 15:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 15:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> Sep 21 15:30:10 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> 
> 
> 
> > 
> > Steve.
> > 
> >> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> >> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> >> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> Sep 20 15:59:09 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 01 94 88 a0 00 00 20 00
> >> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> >> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> >> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> Sep 20 15:59:11 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 51 ff 90 00 00 20 00
> >> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> >> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> >> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> Sep 20 15:59:13 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 46 d5 c0 00 00 20 00
> >> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> >> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> >> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> Sep 20 15:59:14 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 03 da c7 78 00 00 20 00
> >> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> >> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> >> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> Sep 20 15:59:16 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 06 f5 8f 60 00 00 20 00
> >> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> >> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> >> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> Sep 20 15:59:17 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 01 30 7c 90 00 00 20 00
> >> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> >> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> >> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> Sep 20 15:59:19 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 05 79 8b e0 00 00 20 00
> >> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> >> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> >> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> Sep 20 15:59:20 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 04 37 13 08 00 00 20 00
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[4d 00 40 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:00:11 node004 kernel: hpsa 0000:02:00.0: cp ffff88007f900000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[37 00 0c 00 00 00 00 00 04 00 00 00 00 00 00 00]
> >> Sep 20 16:02:15 node004 kernel: INFO: task glock_workqueue:9820 blocked for more than 120 seconds.
> >> Sep 20 16:02:15 node004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> Sep 20 16:02:15 node004 kernel: glock_workque D 000000000000001b     0  9820      2 0x00000080
> >> Sep 20 16:02:15 node004 kernel: ffff8820150a9c70 0000000000000046 0000000000000004 00000000aa8f20cf
> >> Sep 20 16:02:15 node004 kernel: ffff881fffd050c8 0000000000000441 ffff8820150a9c10 ffffffff811acd5e
> >> Sep 20 16:02:15 node004 kernel: ffff882015b39ab8 ffff8820150a9fd8 000000000000fb88 ffff882015b39ab8
> >> Sep 20 16:02:15 node004 kernel: Call Trace:
> >> Sep 20 16:02:15 node004 kernel: [<ffffffff811acd5e>] ? submit_bh+0x10e/0x150
> >> Sep 20 16:02:15 node004 kernel: [<ffffffff8109cd39>] ? ktime_get_ts+0xa9/0xe0
> >> Sep 20 16:02:15 node004 kernel: [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
> >> Sep 20 16:02:15 node004 kernel: [<ffffffffa094aaca>] gfs2_log_flush+0x47a/0x6f0 [gfs2]
> >> Sep 20 16:02:15 node004 kernel: [<ffffffff810629d3>] ? dequeue_entity+0x113/0x2e0
> >> Sep 20 16:02:15 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> >> Sep 20 16:02:15 node004 kernel: [<ffffffffa09477d0>] inode_go_sync+0x80/0x160 [gfs2]
> >> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946336>] do_xmote+0x156/0x280 [gfs2]
> >> Sep 20 16:02:15 node004 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
> >> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946551>] run_queue+0xf1/0x1d0 [gfs2]
> >> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946d2a>] glock_work_func+0x7a/0x1b0 [gfs2]
> >> Sep 20 16:02:15 node004 kernel: [<ffffffffa0946cb0>] ? glock_work_func+0x0/0x1b0 [gfs2]
> >> Sep 20 16:02:15 node004 kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0
> >> Sep 20 16:02:15 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> >> Sep 20 16:02:15 node004 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
> >> Sep 20 16:02:15 node004 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
> >> Sep 20 16:02:15 node004 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
> >> Sep 20 16:02:15 node004 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
> >> Sep 20 16:02:15 node004 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
> >> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> >> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> >> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> Sep 20 16:02:21 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e c0 00 00 20 00
> >> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117976
> >> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
> >> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117977
> >> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
> >> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117978
> >> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
> >> Sep 20 16:02:21 node004 kernel: Buffer I/O error on device dm-6, logical block 117979
> >> Sep 20 16:02:21 node004 kernel: lost page write due to I/O error on dm-6
> >> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> >> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> >> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> Sep 20 16:02:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e b8 00 00 08 00
> >> Sep 20 16:02:22 node004 kernel: Buffer I/O error on device dm-6, logical block 117975
> >> Sep 20 16:02:22 node004 kernel: lost page write due to I/O error on dm-6
> >> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> >> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> >> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> Sep 20 16:05:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 0e 6e e0 00 00 08 00
> >> Sep 20 16:05:22 node004 kernel: Buffer I/O error on device dm-6, logical block 117980
> >> Sep 20 16:05:22 node004 kernel: lost page write due to I/O error on dm-6
> >> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: fatal: I/O error
> >> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3:   block = 117980
> >> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3:   function = log_write_header, file = fs/gfs2/log.c, line = 616
> >> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: about to withdraw this file system
> >> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: telling LM to unmount
> >> Sep 20 16:05:22 node004 kernel: GFS2: fsid=nimble_cluster:gfs_test.3: withdrawn
> >> Sep 20 16:05:22 node004 kernel: Pid: 9820, comm: glock_workqueue Not tainted 2.6.32-279.el6.x86_64 #1
> >> Sep 20 16:05:22 node004 kernel: Call Trace:
> >> Sep 20 16:05:22 node004 kernel: [<ffffffffa0962062>] ? gfs2_lm_withdraw+0x102/0x130 [gfs2]
> >> Sep 20 16:05:22 node004 kernel: [<ffffffff814fea28>] ? out_of_line_wait_on_bit+0x78/0x90
> >> Sep 20 16:05:22 node004 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >> Sep 20 16:05:22 node004 kernel: [<ffffffffa09620d0>] ? gfs2_io_error_bh_i+0x40/0x50 [gfs2]
> >> Sep 20 16:05:22 node004 kernel: [<ffffffff811adfb6>] ? __wait_on_buffer+0x26/0x30
> >> Sep 20 16:05:22 node004 kernel: [<ffffffffa094a288>] ? log_write_header+0x3a8/0x490 [gfs2]
> >> Sep 20 16:05:22 node004 kernel: [<ffffffffa094a951>] ? gfs2_log_flush+0x301/0x6f0 [gfs2]
> >> Sep 20 16:05:22 node004 kernel: [<ffffffff810629d3>] ? dequeue_entity+0x113/0x2e0
> >> Sep 20 16:05:22 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> >> Sep 20 16:05:22 node004 kernel: [<ffffffffa09477d0>] ? inode_go_sync+0x80/0x160 [gfs2]
> >> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946336>] ? do_xmote+0x156/0x280 [gfs2]
> >> Sep 20 16:05:22 node004 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
> >> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946551>] ? run_queue+0xf1/0x1d0 [gfs2]
> >> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946d2a>] ? glock_work_func+0x7a/0x1b0 [gfs2]
> >> Sep 20 16:05:22 node004 kernel: [<ffffffffa0946cb0>] ? glock_work_func+0x0/0x1b0 [gfs2]
> >> Sep 20 16:05:22 node004 kernel: [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
> >> Sep 20 16:05:22 node004 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> >> Sep 20 16:05:22 node004 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
> >> Sep 20 16:05:22 node004 kernel: [<ffffffff81091d66>] ? kthread+0x96/0xa0
> >> Sep 20 16:05:22 node004 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
> >> Sep 20 16:05:22 node004 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
> >> Sep 20 16:05:22 node004 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
> >> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: timing out command, waited 180s
> >> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] Unhandled error code
> >> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> Sep 20 16:08:22 node004 kernel: sd 3:0:0:0: [sdi] CDB: Write(10): 2a 00 00 00 08 b0 00 00 08 00
> >> Sep 20 16:08:22 node004 kernel: Buffer I/O error on device dm-6, logical block 22
> >> Sep 20 16:08:22 node004 kernel: lost page write due to I/O error on dm-6
> >> Sep 20 16:16:05 node004 xinetd[4416]: START: node-state pid=14578 from=::ffff:10.141.255.254
> >> Sep 20 16:16:05 node004 xinetd[4416]: EXIT: node-state status=0 pid=14578 duration=0(sec)
> >> Sep 20 16:17:34 node004 xinetd[4416]: START: node-state pid=14653 from=::ffff:10.141.255.254
> >> Sep 20 16:17:34 node004 xinetd[4416]: EXIT: node-state status=0 pid=14653 duration=0(sec)
> >> Sep 20 16:17:36 node004 xinetd[4416]: START: node-state pid=14671 from=::ffff:10.141.255.254
> >> Sep 20 16:17:36 node004 xinetd[4416]: EXIT: node-state status=0 pid=14671 duration=0(sec)
> >> Sep 20 16:17:39 node004 xinetd[4416]: START: node-state pid=14690 from=::ffff:10.141.255.254
> >> Sep 20 16:17:39 node004 xinetd[4416]: EXIT: node-state status=0 pid=14690 duration=0(sec)
> >> Sep 20 16:17:41 node004 xinetd[4416]: START: node-state pid=14708 from=::ffff:10.141.255.254
> >> Sep 20 16:17:41 node004 xinetd[4416]: EXIT: node-state status=0 pid=14708 duration=0(sec)
> >> On Sep 20, 2012, at 4:14 PM, Andrew Holway wrote:
> >> 
> >>> Aslo,
> >>> 
> >>> IOzone gave this error: Error writing block 29813, fd= 3
> >>> 
> >>> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Trying to acquire journal lock...
> >>> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Looking at journal...
> >>> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Acquiring the transaction lock...
> >>> GFS2: fsid=nimble_cluster:gfs_test.0: jid=3: Replaying journal...
> >>> 
> >>> 
> >>> GFS seemed to repair itself and things carried on working.
> >>> 
> >>> thanks,
> >>> 
> >>> Andrew
> >>> 
> >>> On Sep 20, 2012, at 4:08 PM, Andrew Holway wrote:
> >>> 
> >>>> Hello,
> >>>> 
> >>>> I have set up a 4 node cluster. They are interconnected with an IPoIB (connected mode)
> >>>> 
> >>>> Whist running a benchmark with IOzone I got the following errors:
> >>>> 
> >>>> IO seems to have halted.
> >>>> 
> >>>> Thanks,
> >>>> 
> >>>> Andrew
> >>>> 
> >>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15816 blocked for more than 120 seconds.
> >>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15816  15374 0x00000080
> >>>> Sep 20 16:01:57 node001 kernel: ffff880fd5ebbac8 0000000000000086 ffff880fd5ebba38 ffffffff81276b66
> >>>> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ebba58 ffffffff81091f97
> >>>> Sep 20 16:01:57 node001 kernel: ffff880fe238c638 ffff880fd5ebbfd8 000000000000fb88 ffff880fe238c638
> >>>> Sep 20 16:01:57 node001 kernel: Call Trace:
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15818 blocked for more than 120 seconds.
> >>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15818  15374 0x00000080
> >>>> Sep 20 16:01:57 node001 kernel: ffff880fbe5b3ac8 0000000000000082 0000000000000000 ffff881ff95587a0
> >>>> Sep 20 16:01:57 node001 kernel: ffff881000000002 ffff88100ee13048 00000000be5b3a58 00000040ffffffff
> >>>> Sep 20 16:01:57 node001 kernel: ffff88100eed1ab8 ffff880fbe5b3fd8 000000000000fb88 ffff88100eed1ab8
> >>>> Sep 20 16:01:57 node001 kernel: Call Trace:
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15820 blocked for more than 120 seconds.
> >>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000008     0 15820  15374 0x00000080
> >>>> Sep 20 16:01:57 node001 kernel: ffff880ffed7bac8 0000000000000086 0000000000000000 ffffffff81276b66
> >>>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000fed7ba58 00000040ffffffff
> >>>> Sep 20 16:01:57 node001 kernel: ffff880fbdd51ab8 ffff880ffed7bfd8 000000000000fb88 ffff880fbdd51ab8
> >>>> Sep 20 16:01:57 node001 kernel: Call Trace:
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15822 blocked for more than 120 seconds.
> >>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15822  15374 0x00000080
> >>>> Sep 20 16:01:57 node001 kernel: ffff880fd5dddac8 0000000000000086 0000000000000000 ffffffff81276b66
> >>>> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fd5ddda58 ffffffff81091f97
> >>>> Sep 20 16:01:57 node001 kernel: ffff880fe238d098 ffff880fd5dddfd8 000000000000fb88 ffff880fe238d098
> >>>> Sep 20 16:01:57 node001 kernel: Call Trace:
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15824 blocked for more than 120 seconds.
> >>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15824  15374 0x00000080
> >>>> Sep 20 16:01:57 node001 kernel: ffff880fbe5edac8 0000000000000086 0000000000000000 ffffffff81276b66
> >>>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000be5eda58 00000040ffffffff
> >>>> Sep 20 16:01:57 node001 kernel: ffff880ff69085f8 ffff880fbe5edfd8 000000000000fb88 ffff880ff69085f8
> >>>> Sep 20 16:01:57 node001 kernel: Call Trace:
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15826 blocked for more than 120 seconds.
> >>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15826  15374 0x00000080
> >>>> Sep 20 16:01:57 node001 kernel: ffff880fbe7cfac8 0000000000000086 0000000000000000 ffffffff81276b66
> >>>> Sep 20 16:01:57 node001 kernel: 0000000000000096 ffff881ff95588d8 ffff880fbe7cfa58 ffffffff81091f97
> >>>> Sep 20 16:01:57 node001 kernel: ffff88100ddcbab8 ffff880fbe7cffd8 000000000000fb88 ffff88100ddcbab8
> >>>> Sep 20 16:01:57 node001 kernel: Call Trace:
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15828 blocked for more than 120 seconds.
> >>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000011     0 15828  15374 0x00000080
> >>>> Sep 20 16:01:57 node001 kernel: ffff88100684bac8 0000000000000086 0000000000000000 ffffffff81276b66
> >>>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 000000000684ba58 00000040ffffffff
> >>>> Sep 20 16:01:57 node001 kernel: ffff88100edc7af8 ffff88100684bfd8 000000000000fb88 ffff88100edc7af8
> >>>> Sep 20 16:01:57 node001 kernel: Call Trace:
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >>>> Sep 20 16:01:57 node001 kernel: INFO: task iozone:15830 blocked for more than 120 seconds.
> >>>> Sep 20 16:01:57 node001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>> Sep 20 16:01:57 node001 kernel: iozone        D 0000000000000000     0 15830  15374 0x00000080
> >>>> Sep 20 16:01:57 node001 kernel: ffff880fbdd0fac8 0000000000000082 0000000000000000 ffffffff81276b66
> >>>> Sep 20 16:01:57 node001 kernel: 0000000000000002 ffff881ff95588d8 00000000bdd0fa58 00000040ffffffff
> >>>> Sep 20 16:01:57 node001 kernel: ffff88100de93098 ffff880fbdd0ffd8 000000000000fb88 ffff88100de93098
> >>>> Sep 20 16:01:57 node001 kernel: Call Trace:
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81276b66>] ? __prop_inc_single+0x46/0x60
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa094357e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fe97f>] __wait_on_bit+0x5f/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa0943570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff814fea28>] out_of_line_wait_on_bit+0x78/0x90
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09454f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09468f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540ec>] gfs2_permission+0xec/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffffa09540e4>] ? gfs2_permission+0xe4/0x100 [gfs2]
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118982d>] __link_path_walk+0xad/0x1030
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118aa3a>] path_walk+0x6a/0xe0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118ac0b>] do_path_lookup+0x5b/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118b877>] user_path_at+0x57/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff811804ac>] vfs_fstatat+0x3c/0x80
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8118061b>] vfs_stat+0x1b/0x20
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff81180644>] sys_newstat+0x24/0x50
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
> >>>> Sep 20 16:01:57 node001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> >>>> 
> >>>> 
> >>> 
> >> 
> >> 
> >> 
> > 
> > 
> 
> 


From lists at alteeve.ca  Fri Sep 21 14:52:58 2012
From: lists at alteeve.ca (Digimer)
Date: Fri, 21 Sep 2012 10:52:58 -0400
Subject: [Linux-cluster] Problem with rgmanager / rgmanager #37: Error
 receiving header from 2 sz=0 CTX 0x1f5d420
In-Reply-To: <505C43D0.2080201@informatik.uni-stuttgart.de>
References: <505B429A.6050701@informatik.uni-stuttgart.de>
	<505B5840.4030200@alteeve.ca>
	<505C43D0.2080201@informatik.uni-stuttgart.de>
Message-ID: <505C7F4A.9080808@alteeve.ca>

On 09/21/2012 06:39 AM, Ralf Aumueller wrote:
> On 09/20/2012 07:54 PM, Digimer wrote:
>> On 09/20/2012 12:21 PM, Ralf Aumueller wrote:
>>> Hello,
>>>
>>> we have a two node CentOS6.2 Cluster (rgmanager-3.0.12.1-5). After a reboot of
>>> node2 the cluster won't work as expected. On node2 clustat just say's :
>>>
>>> clustat:
>>> Cluster Status for cluster1 @ Thu Sep 20 17:06:02 2012
>>> Member Status: Quorate
>>>
>>>    Member Name                                                 ID   Status
>>>    ------ ----                                                 ---- ------
>>>    node1                                                       1 Online
>>>    node2                                                       2 Online, Local
>>>
>>> No services listed, no rgmanager running. Also it is not possible to
>>> start/migrate any services to node2.
>>>
>>> On node1 a clustat lists all configured services + under Status rgmanager on
>>> both nodes. On node1 the rgmanager.log has lots of:
>>> rgmanager #37: Error receiving header from 2 sz=0 CTX 0x1XXXXXX
>>>
>>> On node2 the rgmanager.log gives me:
>>> rgmanager #34: Cannot get status for service ...
>>>
>>> I did not change the cluster.conf. Only change on node2 was: +48MB and an new
>>> BIOS version -- recommend by Dell Support).
>>>
>>> Best regards,
>>> Ralf
>>>
>>
>> Sounds like you hit this bug:
>> http://rhn.redhat.com/errata/RHBA-2012-0897.html
>>
>> Update rgmanager to rgmanager-3.0.12.1-12 and you should be ok.
>
> Did an update of rgmanager on both nodes. Just stopping/starting the
> cluster-services didn't revolve the problem. A shutdown of both nodes an then a
> restart solves the problem.
>
> Thanks and best regards,
> Ralf

If I recall correctly, I had to also reboot after the update was 
applied. Now that I've been able to remember better, I think this was 
caused by the leep second some time back. That leep second hit a lot of 
programs, and I believe this includes stuff in the kernel itself.

Glad it's resolved!

-- 
Digimer
Papers and Projects: https://alteeve.ca


From bentech4you at gmail.com  Sat Sep 22 21:53:44 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Sun, 23 Sep 2012 00:53:44 +0300
Subject: [Linux-cluster] How can i configure fence_cisco_ucs with redhat
	cluster suite
Message-ID: <CA+C_GOWXeOH0-+hNd3bDZf9gfCv+FvtN5AfGc=SQcyZdTz_Pzw@mail.gmail.com>

HI


How can i configure fence_cisco_ucs with redhat cluster suite.


i was trying to configure fencing aganet with red ahat cluster suite


i issued command


#  /usr/sbin/fence_vmware_soap -a 172.22.90.61 -l admin -p duc2Cisco -o
reboot


then i got below error:


Failed: You have to enter plug number

Please use '-h' for usage


what is this plug number and from where can i get this plug number.?


I am new to this Cisco UCS, becuase this hardware is provided my some other
team and i am the one installing Red Hat cluster suite. I requested fencing
agent configuration with Cisco, and that guy dosen't know what is fencing..


please help me, because of this issue, production got delayed


regards,

Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120923/1746f7db/attachment.htm>

From arpittolani at gmail.com  Sat Sep 22 22:24:57 2012
From: arpittolani at gmail.com (Arpit Tolani)
Date: Sun, 23 Sep 2012 03:54:57 +0530
Subject: [Linux-cluster] How can i configure fence_cisco_ucs with redhat
 cluster suite
In-Reply-To: <CA+C_GOWXeOH0-+hNd3bDZf9gfCv+FvtN5AfGc=SQcyZdTz_Pzw@mail.gmail.com>
References: <CA+C_GOWXeOH0-+hNd3bDZf9gfCv+FvtN5AfGc=SQcyZdTz_Pzw@mail.gmail.com>
Message-ID: <CAD3MydAc-+FhABQ1rTZpQEV0yH+RWMB1tPdn_0SZbAgW5G=-tg@mail.gmail.com>

Hello


On Sun, Sep 23, 2012 at 3:23 AM, Ben .T.George <bentech4you at gmail.com>wrote:

> HI
>
>
>
> How can i configure fence_cisco_ucs with redhat cluster suite.
>
>
>
> i was trying to configure fencing aganet with red ahat cluster suite
>
>
>
> i issued command
>
>
>
> #  /usr/sbin/fence_vmware_soap -a 172.22.90.61 -l admin -p duc2Cisco -o
> reboot
>
>
Use

$ fence_cisco_ucs -z --ip=1.1.1.1 --username=admin
--password=some_password --plug=SomePlug_1 --suborg=org-RHEL -v

If you dont want to use SSL, then remove -z Here is a sample fence devices
section:

<fencedevices>
                <fencedevice agent="fence_cisco_ucs" ipaddr="<IP
ADDR>" login="ucs-login" name="ucs" passwd="12345"
suborg="<organization>"/>
</fencedevice>


>
>
> then i got below error:
>
>
>
> Failed: You have to enter plug number
>
> Please use '-h' for usage
>
>
>
> what is this plug number and from where can i get this plug number.?
>

I am not sure, Someone will peek in & update.


>
>
> I am new to this Cisco UCS, becuase this hardware is provided my some
> other team and i am the one installing Red Hat cluster suite. I requested
> fencing agent configuration with Cisco, and that guy dosen't know what is
> fencing..
>
>
>
> please help me, because of this issue, production got delayed
>
>
>
> regards,
>
> Ben
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

Regards
Arpit Tolani
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120923/5aadae06/attachment.htm>

From bentech4you at gmail.com  Sun Sep 23 02:26:17 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Sun, 23 Sep 2012 05:26:17 +0300
Subject: [Linux-cluster] How can i configure fence_cisco_ucs with redhat
 cluster suite
In-Reply-To: <CAD3MydAc-+FhABQ1rTZpQEV0yH+RWMB1tPdn_0SZbAgW5G=-tg@mail.gmail.com>
References: <CA+C_GOWXeOH0-+hNd3bDZf9gfCv+FvtN5AfGc=SQcyZdTz_Pzw@mail.gmail.com>
	<CAD3MydAc-+FhABQ1rTZpQEV0yH+RWMB1tPdn_0SZbAgW5G=-tg@mail.gmail.com>
Message-ID: <CA+C_GOVfAxc65A_eu16bYQXwapd5NEkzfW3AJZ9qSYD5rLoWHQ@mail.gmail.com>

HI


thanks for your replay


actually i tried both fence_cisco_ucs and fence_vmware_soap. which one is
used for fencing configuration..


i think both method asking for plug number.


i have access to UCS control panel. from where can i get these service or
plug number.


i tried te below, i don't i took one profile name from UCS cpanel and
checked:


/usr/sbin/fence_cisco_ucs -a 172.22.90.61 -l admin -p duc2Cisco -n
org-root/ls-cgceccprd1 -o reboot

Traceback (most recent call last):

  File "/usr/sbin/fence_cisco_ucs", line 147, in <module>

    main()

  File "/usr/sbin/fence_cisco_ucs", line 120, in main

    res = send_command(options, "<aaaLogin inName=\"" + options["-l"] + "\"
inPassword=\"" + options["-p"] + "\" />")

  File "/usr/sbin/fence_cisco_ucs", line 92, in send_command

    c.perform()

pycurl.error: (28, 'connect() timed out!')


that also got error


Thanks & regards,

Ben

On Sun, Sep 23, 2012 at 1:24 AM, Arpit Tolani <arpittolani at gmail.com> wrote:

> Hello
>
>
> On Sun, Sep 23, 2012 at 3:23 AM, Ben .T.George <bentech4you at gmail.com>wrote:
>
>>  HI
>>
>>
>>
>> How can i configure fence_cisco_ucs with redhat cluster suite.
>>
>>
>>
>> i was trying to configure fencing aganet with red ahat cluster suite
>>
>>
>>
>> i issued command
>>
>>
>>
>> #  /usr/sbin/fence_vmware_soap -a 172.22.90.61 -l admin -p duc2Cisco -o
>> reboot
>>
>>
> Use
>
> $ fence_cisco_ucs -z --ip=1.1.1.1 --username=admin --password=some_password --plug=SomePlug_1 --suborg=org-RHEL -v
>
> If you dont want to use SSL, then remove -z Here is a sample fence devices
> section:
>
> <fencedevices>
>                 <fencedevice agent="fence_cisco_ucs" ipaddr="<IP ADDR>" login="ucs-login" name="ucs" passwd="12345" suborg="<organization>"/>
> </fencedevice>
>
>
>
>
>>
>>
>> then i got below error:
>>
>>
>>
>> Failed: You have to enter plug number
>>
>> Please use '-h' for usage
>>
>>
>>
>> what is this plug number and from where can i get this plug number.?
>>
>
> I am not sure, Someone will peek in & update.
>
>
>
>>
>>
>> I am new to this Cisco UCS, becuase this hardware is provided my some
>> other team and i am the one installing Red Hat cluster suite. I requested
>> fencing agent configuration with Cisco, and that guy dosen't know what is
>> fencing..
>>
>>
>>
>> please help me, because of this issue, production got delayed
>>
>>
>>
>> regards,
>>
>> Ben
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
> Regards
> Arpit Tolani
>


-- 
Yours Sincerely

*#!/usr/bin/env python
#Mysignature.py :)*

Signature = " " " Ben.T.George \n
                  Linux System Administrator \n
                  Diyar United Company \n
                  kuwait \n
                  Phone : +965 - 50629829 \n " " "

Print Signature
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120923/74650cb7/attachment.htm>

From bentech4you at gmail.com  Sun Sep 23 08:54:10 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Sun, 23 Sep 2012 11:54:10 +0300
Subject: [Linux-cluster] How can i configure fence_cisco_ucs with redhat
 cluster suite
In-Reply-To: <CAD3MydAc-+FhABQ1rTZpQEV0yH+RWMB1tPdn_0SZbAgW5G=-tg@mail.gmail.com>
References: <CA+C_GOWXeOH0-+hNd3bDZf9gfCv+FvtN5AfGc=SQcyZdTz_Pzw@mail.gmail.com>
	<CAD3MydAc-+FhABQ1rTZpQEV0yH+RWMB1tPdn_0SZbAgW5G=-tg@mail.gmail.com>
Message-ID: <CA+C_GOVLM0U7d53Fg2w0dEaYaE9bXBV1d_LBf8NEiWv=0MkhfA@mail.gmail.com>

HI

fence_cisco_ucs is using with one Ip address, from where can i get that IP
address.before i used same KVM IP..it it ok or i need to
configure separate IP


On Sun, Sep 23, 2012 at 1:24 AM, Arpit Tolani <arpittolani at gmail.com> wrote:

> Hello
>
>
> On Sun, Sep 23, 2012 at 3:23 AM, Ben .T.George <bentech4you at gmail.com>wrote:
>
>>  HI
>>
>>
>>
>> How can i configure fence_cisco_ucs with redhat cluster suite.
>>
>>
>>
>> i was trying to configure fencing aganet with red ahat cluster suite
>>
>>
>>
>> i issued command
>>
>>
>>
>> #  /usr/sbin/fence_vmware_soap -a 172.22.90.61 -l admin -p duc2Cisco -o
>> reboot
>>
>>
> Use
>
> $ fence_cisco_ucs -z --ip=1.1.1.1 --username=admin --password=some_password --plug=SomePlug_1 --suborg=org-RHEL -v
>
> If you dont want to use SSL, then remove -z Here is a sample fence devices
> section:
>
> <fencedevices>
>                 <fencedevice agent="fence_cisco_ucs" ipaddr="<IP ADDR>" login="ucs-login" name="ucs" passwd="12345" suborg="<organization>"/>
> </fencedevice>
>
>
>
>
>>
>>
>> then i got below error:
>>
>>
>>
>> Failed: You have to enter plug number
>>
>> Please use '-h' for usage
>>
>>
>>
>> what is this plug number and from where can i get this plug number.?
>>
>
> I am not sure, Someone will peek in & update.
>
>
>
>>
>>
>> I am new to this Cisco UCS, becuase this hardware is provided my some
>> other team and i am the one installing Red Hat cluster suite. I requested
>> fencing agent configuration with Cisco, and that guy dosen't know what is
>> fencing..
>>
>>
>>
>> please help me, because of this issue, production got delayed
>>
>>
>>
>> regards,
>>
>> Ben
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
> Regards
> Arpit Tolani
>


-- 
Yours Sincerely

*#!/usr/bin/env python
#Mysignature.py :)*

Signature = " " " Ben.T.George \n
                  Linux System Administrator \n
                  Diyar United Company \n
                  kuwait \n
                  Phone : +965 - 50629829 \n " " "

Print Signature
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120923/8700835f/attachment.htm>

From bentech4you at gmail.com  Mon Sep 24 05:29:22 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Mon, 24 Sep 2012 08:29:22 +0300
Subject: [Linux-cluster] fence_cisco_ucs working from command line.but not
	from cluster.conf
Message-ID: <CA+C_GOUG0uG4mVLcsBUv1GHgw71DJMt8ao=OhdXen5svVCoUaw@mail.gmail.com>

Hi

from command line i issued :

/usr/sbin/fence_cisco_ucs -a 172.22.90.10 -l admin -p duc2Cisco -o status
-v -n node1 -z

is working and i got the output, even i give reboot also working

but i configured in the cluster.conf file with same configuration, like:

<clusternode name="node1.test.net" nodeid="1">
 <fence>
<method name="ucs-node1">
<device name="ucs-node1" port="node1"/>
 </method>
</fence>
</clusternode>


<fencedevices>
<fencedevice agent="fence_cisco_ucs" ipaddr="172.22.90.10" login="admin"
name="ucs-node1" passwd="duc2Cisco"/>
</fencedevices>


but to test fencing property.i turned off heartbeat network on node node1.
but i am getting error on /var/log/messages.


Sep 24 07:20:13 node1 fenced[6589]: fence node2.test.net dev 0.0 agent
fence_cisco_ucs result: error from agent Sep 24 07:20:13 node1
fenced[6589]: fence
node2 . test .net failed Sep 24 07:20:16 node1  fenced[6589]: fencing node
node2 . test .net Sep 24 07:20:17 node1  fenced[6589]: fence node2 . test .net
dev 0.0 agent fence_cisco_ucs result: error from agent Sep 24 07:20:17 node1
  fenced[6589]: fence node2 . test .net failed Sep 24 07:20:20 node1
fenced[6589]: fencing node
node2 . test .net Sep 24 07:20:20 node1  fenced[6589]: fence node2 . test .net
dev 0.0 agent fence_cisco_ucs result: error from agent Sep 24 07:20:20 node1
  fenced[6589]: fence node2 . test .net failed

please help me to solve this

regards,
Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120924/bd59f8c1/attachment.htm>

From lists at alteeve.ca  Mon Sep 24 14:52:17 2012
From: lists at alteeve.ca (Digimer)
Date: Mon, 24 Sep 2012 10:52:17 -0400
Subject: [Linux-cluster] fence_cisco_ucs working from command line.but
 not from cluster.conf
In-Reply-To: <CA+C_GOUG0uG4mVLcsBUv1GHgw71DJMt8ao=OhdXen5svVCoUaw@mail.gmail.com>
References: <CA+C_GOUG0uG4mVLcsBUv1GHgw71DJMt8ao=OhdXen5svVCoUaw@mail.gmail.com>
Message-ID: <506073A1.3000202@alteeve.ca>

What is the exit code from the manual fence call?

/usr/sbin/fence_cisco_ucs -a 172.22.90.10 -l admin -p duc2Cisco -o 
status -v -n node1 -z; echo $?

The exit codes must follow the API rules. Read; 
https://fedorahosted.org/cluster/wiki/FenceAgentAPI#agent_ops

digimer

On 09/24/2012 01:29 AM, Ben .T.George wrote:
> Hi
>
> from command line i issued :
>
> /usr/sbin/fence_cisco_ucs -a 172.22.90.10 -l admin -p duc2Cisco -o
> status -v -n node1 -z
>
> is working and i got the output, even i give reboot also working
>
> but i configured in the cluster.conf file with same configuration, like:
>
> <clusternode name="node1.test.net <http://node1.test.net>" nodeid="1">
> <fence>
> <method name="ucs-node1">
> <device name="ucs-node1" port="node1"/>
> </method>
> </fence>
> </clusternode>
>
>
> <fencedevices>
> <fencedevice agent="fence_cisco_ucs" ipaddr="172.22.90.10" login="admin"
> name="ucs-node1" passwd="duc2Cisco"/>
> </fencedevices>
>
>
> but to test fencing property.i turned off heartbeat network on node
> node1. but i am getting error on /var/log/messages.
>
>
> Sep 24 07:20:13 node1 fenced[6589]: fence node2.test.net
> <http://node2.test.net> dev 0.0 agent fence_cisco_ucs result: error from
> agent Sep 24 07:20:13 node1 fenced[6589]: fence node2 . test .net failed
> Sep 24 07:20:16 node1 fenced[6589]: fencing node node2 . test .net Sep
> 24 07:20:17 node1 fenced[6589]: fence node2 . test .net dev 0.0 agent
> fence_cisco_ucs result: error from agent Sep 24 07:20:17 node1
> fenced[6589]: fence node2 . test .net failed Sep 24 07:20:20 node1
> fenced[6589]: fencing node node2 . test .net Sep 24 07:20:20 node1
> fenced[6589]: fence node2 . test .net dev 0.0 agent fence_cisco_ucs
> result: error from agent Sep 24 07:20:20 node1 fenced[6589]: fence node2
> . test .net failed
>
> please help me to solve this
>
> regards,
> Ben
>
>


-- 
Digimer
Papers and Projects: https://alteeve.ca


From bentech4you at gmail.com  Mon Sep 24 15:28:43 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Mon, 24 Sep 2012 18:28:43 +0300
Subject: [Linux-cluster] fence_cisco_ucs working from command line.but
 not from cluster.conf
In-Reply-To: <506073A1.3000202@alteeve.ca>
References: <CA+C_GOUG0uG4mVLcsBUv1GHgw71DJMt8ao=OhdXen5svVCoUaw@mail.gmail.com>
	<506073A1.3000202@alteeve.ca>
Message-ID: <CA+C_GOWMFHvXJSHFbeaDgLCXRJEi_T5HOTd+521FqgNaLCYkQg@mail.gmail.com>

Hi

issued that status command. Please check below output, [root at cgceccprd2 ~]#
/usr/sbin/fence_cisco_ucs -a 172.22.90.10 -l admin -p duc2Cisco -o status
-v -n cgceccprd1 -z <aaaLogin inName="admin" inPassword="duc2Cisco" />
<aaaLogin cookie="" response="yes"
outCookie="1348431691/9aae0d0f-7c05-4d47-a540-a4db8b3c6d45"
outRefreshPeriod="600" outPriv="admin,read-only" outDomains=""
outChannel="noencssl" outEvtChannel="noencssl" outSessionId="web_46809_B"
outVersion="2.0(2q)"> </aaaLogin> <configResolveDn
cookie="1348431691/9aae0d0f-7c05-4d47-a540-a4db8b3c6d45"
inHierarchical="false" dn="org-root/ls-cgceccprd1/power"/> <configResolveDn
dn="org-root/ls-cgceccprd1/power"
cookie="1348431691/9aae0d0f-7c05-4d47-a540-a4db8b3c6d45" response="yes">
<outConfig> <lsPower dn="org-root/ls-cgceccprd1/power" state="up" />
</outConfig> </configResolveDn> Status: ON <aaaLogout
inCookie="1348431691/9aae0d0f-7c05-4d47-a540-a4db8b3c6d45" /> <aaaLogout
cookie="" response="yes" outStatus="success"> </aaaLogout>


even i tried reboot option. That also worked.but from cluster.conf it's not
working


Regards,
Ben

On Mon, Sep 24, 2012 at 5:52 PM, Digimer <lists at alteeve.ca> wrote:

> What is the exit code from the manual fence call?
>
> /usr/sbin/fence_cisco_ucs -a 172.22.90.10 -l admin -p duc2Cisco -o status
> -v -n node1 -z; echo $?
>
> The exit codes must follow the API rules. Read; https://fedorahosted.org/*
> *cluster/wiki/FenceAgentAPI#**agent_ops<https://fedorahosted.org/cluster/wiki/FenceAgentAPI#agent_ops>
>
> digimer
>
>
> On 09/24/2012 01:29 AM, Ben .T.George wrote:
>
>> Hi
>>
>> from command line i issued :
>>
>> /usr/sbin/fence_cisco_ucs -a 172.22.90.10 -l admin -p duc2Cisco -o
>> status -v -n node1 -z
>>
>> is working and i got the output, even i give reboot also working
>>
>> but i configured in the cluster.conf file with same configuration, like:
>>
>> <clusternode name="node1.test.net <http://node1.test.net>" nodeid="1">
>>
>> <fence>
>> <method name="ucs-node1">
>> <device name="ucs-node1" port="node1"/>
>> </method>
>> </fence>
>> </clusternode>
>>
>>
>> <fencedevices>
>> <fencedevice agent="fence_cisco_ucs" ipaddr="172.22.90.10" login="admin"
>> name="ucs-node1" passwd="duc2Cisco"/>
>> </fencedevices>
>>
>>
>> but to test fencing property.i turned off heartbeat network on node
>> node1. but i am getting error on /var/log/messages.
>>
>>
>> Sep 24 07:20:13 node1 fenced[6589]: fence node2.test.net
>> <http://node2.test.net> dev 0.0 agent fence_cisco_ucs result: error from
>>
>> agent Sep 24 07:20:13 node1 fenced[6589]: fence node2 . test .net failed
>> Sep 24 07:20:16 node1 fenced[6589]: fencing node node2 . test .net Sep
>> 24 07:20:17 node1 fenced[6589]: fence node2 . test .net dev 0.0 agent
>> fence_cisco_ucs result: error from agent Sep 24 07:20:17 node1
>> fenced[6589]: fence node2 . test .net failed Sep 24 07:20:20 node1
>> fenced[6589]: fencing node node2 . test .net Sep 24 07:20:20 node1
>> fenced[6589]: fence node2 . test .net dev 0.0 agent fence_cisco_ucs
>> result: error from agent Sep 24 07:20:20 node1 fenced[6589]: fence node2
>> . test .net failed
>>
>> please help me to solve this
>>
>> regards,
>> Ben
>>
>>
>>
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca
>


-- 
Yours Sincerely

*#!/usr/bin/env python
#Mysignature.py :)*

Signature = " " " Ben.T.George \n
                  Linux System Administrator \n
                  Diyar United Company \n
                  kuwait \n
                  Phone : +965 - 50629829 \n " " "

Print Signature
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120924/54f137bb/attachment.htm>

From lists at alteeve.ca  Mon Sep 24 15:44:10 2012
From: lists at alteeve.ca (Digimer)
Date: Mon, 24 Sep 2012 11:44:10 -0400
Subject: [Linux-cluster] fence_cisco_ucs working from command line.but
 not from cluster.conf
In-Reply-To: <CA+C_GOWMFHvXJSHFbeaDgLCXRJEi_T5HOTd+521FqgNaLCYkQg@mail.gmail.com>
References: <CA+C_GOUG0uG4mVLcsBUv1GHgw71DJMt8ao=OhdXen5svVCoUaw@mail.gmail.com>
	<506073A1.3000202@alteeve.ca>
	<CA+C_GOWMFHvXJSHFbeaDgLCXRJEi_T5HOTd+521FqgNaLCYkQg@mail.gmail.com>
Message-ID: <50607FCA.6030807@alteeve.ca>

Please re-read my last email. You didn't answer the question I asked.

digimer

On 09/24/2012 11:28 AM, Ben .T.George wrote:
> Hi
>
> issued that status command. Please check below output, [root at cgceccprd2
> ~]# /usr/sbin/fence_cisco_ucs -a 172.22.90.10 -l admin -p duc2Cisco -o
> status -v -n cgceccprd1 -z <aaaLogin inName="admin"
> inPassword="duc2Cisco" /> <aaaLogin cookie="" response="yes"
> outCookie="1348431691/9aae0d0f-7c05-4d47-a540-a4db8b3c6d45"
> outRefreshPeriod="600" outPriv="admin,read-only" outDomains=""
> outChannel="noencssl" outEvtChannel="noencssl"
> outSessionId="web_46809_B" outVersion="2.0(2q)"> </aaaLogin>
> <configResolveDn
> cookie="1348431691/9aae0d0f-7c05-4d47-a540-a4db8b3c6d45"
> inHierarchical="false" dn="org-root/ls-cgceccprd1/power"/>
> <configResolveDn dn="org-root/ls-cgceccprd1/power"
> cookie="1348431691/9aae0d0f-7c05-4d47-a540-a4db8b3c6d45" response="yes">
> <outConfig> <lsPower dn="org-root/ls-cgceccprd1/power" state="up" />
> </outConfig> </configResolveDn> Status: ON <aaaLogout
> inCookie="1348431691/9aae0d0f-7c05-4d47-a540-a4db8b3c6d45" /> <aaaLogout
> cookie="" response="yes" outStatus="success"> </aaaLogout>
>
>
> even i tried reboot option. That also worked.but from cluster.conf it's
> not working
>
>
> Regards,
> Ben
>
> On Mon, Sep 24, 2012 at 5:52 PM, Digimer <lists at alteeve.ca
> <mailto:lists at alteeve.ca>> wrote:
>
>     What is the exit code from the manual fence call?
>
>     /usr/sbin/fence_cisco_ucs -a 172.22.90.10 -l admin -p duc2Cisco -o
>     status -v -n node1 -z; echo $?
>
>     The exit codes must follow the API rules. Read;
>     https://fedorahosted.org/__cluster/wiki/FenceAgentAPI#__agent_ops
>     <https://fedorahosted.org/cluster/wiki/FenceAgentAPI#agent_ops>
>
>     digimer
>
>
>     On 09/24/2012 01:29 AM, Ben .T.George wrote:
>
>         Hi
>
>         from command line i issued :
>
>         /usr/sbin/fence_cisco_ucs -a 172.22.90.10 -l admin -p duc2Cisco -o
>         status -v -n node1 -z
>
>         is working and i got the output, even i give reboot also working
>
>         but i configured in the cluster.conf file with same
>         configuration, like:
>
>         <clusternode name="node1.test.net <http://node1.test.net>
>         <http://node1.test.net>" nodeid="1">
>
>         <fence>
>         <method name="ucs-node1">
>         <device name="ucs-node1" port="node1"/>
>         </method>
>         </fence>
>         </clusternode>
>
>
>         <fencedevices>
>         <fencedevice agent="fence_cisco_ucs" ipaddr="172.22.90.10"
>         login="admin"
>         name="ucs-node1" passwd="duc2Cisco"/>
>         </fencedevices>
>
>
>         but to test fencing property.i turned off heartbeat network on node
>         node1. but i am getting error on /var/log/messages.
>
>
>         Sep 24 07:20:13 node1 fenced[6589]: fence node2.test.net
>         <http://node2.test.net>
>         <http://node2.test.net> dev 0.0 agent fence_cisco_ucs result:
>         error from
>
>         agent Sep 24 07:20:13 node1 fenced[6589]: fence node2 . test
>         .net failed
>         Sep 24 07:20:16 node1 fenced[6589]: fencing node node2 . test
>         .net Sep
>         24 07:20:17 node1 fenced[6589]: fence node2 . test .net dev 0.0
>         agent
>         fence_cisco_ucs result: error from agent Sep 24 07:20:17 node1
>         fenced[6589]: fence node2 . test .net failed Sep 24 07:20:20 node1
>         fenced[6589]: fencing node node2 . test .net Sep 24 07:20:20 node1
>         fenced[6589]: fence node2 . test .net dev 0.0 agent fence_cisco_ucs
>         result: error from agent Sep 24 07:20:20 node1 fenced[6589]:
>         fence node2
>         . test .net failed
>
>         please help me to solve this
>
>         regards,
>         Ben
>
>
>
>
>     --
>     Digimer
>     Papers and Projects: https://alteeve.ca
>
>
>
>
> --
> Yours Sincerely
>
> *#!/usr/bin/env python
> #Mysignature.py :)*
>
> Signature = " " " Ben.T.George \n
>                    Linux System Administrator \n
>                    Diyar United Company \n
>                    kuwait \n
>                    Phone : +965 - 50629829 \n " ""
>
> Print Signature
>


-- 
Digimer
Papers and Projects: https://alteeve.ca


From bentech4you at gmail.com  Thu Sep 27 09:36:33 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Thu, 27 Sep 2012 12:36:33 +0300
Subject: [Linux-cluster] How can i Monitor Red Hat Cluster suite with 2 nodes
Message-ID: <CA+C_GOWHG-riGV+naiJK_NgUs5_Ru7kogShx8aLgz+GXajFsog@mail.gmail.com>

HI

How can i Monitor Red Hat Cluster suite with 2 nodes. i am looking for web
based monitoring tools.

regards,
Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120927/217321a2/attachment.htm>

From misch at schwartzkopff.org  Thu Sep 27 09:59:26 2012
From: misch at schwartzkopff.org (Michael Schwartzkopff)
Date: Thu, 27 Sep 2012 11:59:26 +0200
Subject: [Linux-cluster] How can i Monitor Red Hat Cluster suite with 2
	nodes
In-Reply-To: <CA+C_GOWHG-riGV+naiJK_NgUs5_Ru7kogShx8aLgz+GXajFsog@mail.gmail.com>
References: <CA+C_GOWHG-riGV+naiJK_NgUs5_Ru7kogShx8aLgz+GXajFsog@mail.gmail.com>
Message-ID: <201209271159.26653.misch@schwartzkopff.org>

> HI
> 
> How can i Monitor Red Hat Cluster suite with 2 nodes. i am looking for web
> based monitoring tools.
> 
> regards,
> Ben

OpenSource based Network Management tools:
- nagios
- Zabbix
- OpenNMS

All three have a WebGUI

I would use SNMP protocol to monitor the state of the cluster. You can use the 
"extend" feature of the net-snmp agent of Linux.

Greetings,

-- 
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 M?nchen

Tel: (0163) 172 50 98
Fax: (089) 620 304 13
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part.
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120927/5ab8be2f/attachment.sig>

From a.holway at syseleven.de  Thu Sep 27 11:05:38 2012
From: a.holway at syseleven.de (Andrew Holway)
Date: Thu, 27 Sep 2012 13:05:38 +0200
Subject: [Linux-cluster] How can i Monitor Red Hat Cluster suite with 2
	nodes
In-Reply-To: <CA+C_GOWHG-riGV+naiJK_NgUs5_Ru7kogShx8aLgz+GXajFsog@mail.gmail.com>
References: <CA+C_GOWHG-riGV+naiJK_NgUs5_Ru7kogShx8aLgz+GXajFsog@mail.gmail.com>
Message-ID: <0C580157-343C-4BC4-B0FE-357058EE50E4@syseleven.de>

Hi,

Did you have a look at luci?

Thanks,

Andrew
On Sep 27, 2012, at 11:36 AM, Ben .T.George wrote:

> HI
> 
> How can i Monitor Red Hat Cluster suite with 2 nodes. i am looking for web based monitoring tools.
> 
> regards,
> Ben
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From bentech4you at gmail.com  Thu Sep 27 13:26:13 2012
From: bentech4you at gmail.com (Ben .T.George)
Date: Thu, 27 Sep 2012 16:26:13 +0300
Subject: [Linux-cluster] How can i Monitor Red Hat Cluster suite with 2
	nodes
In-Reply-To: <0C580157-343C-4BC4-B0FE-357058EE50E4@syseleven.de>
References: <CA+C_GOWHG-riGV+naiJK_NgUs5_Ru7kogShx8aLgz+GXajFsog@mail.gmail.com>
	<0C580157-343C-4BC4-B0FE-357058EE50E4@syseleven.de>
Message-ID: <CA+C_GOWafWLu3JuUC4zeOs-x9kbRGPoLeH9gxYiVJKYrhbUmxA@mail.gmail.com>

HI

yes i have luci .but on luci i can view cluster nodes belongs to that
cluster only noe.?

i have 2 different clusters. I am looking some thing like nagios..

apart from that i have 14 more linux machines..i think nagios is the best
option

but on nagios how can i configure cluster service. i know only basics
of that. I have already a simple configuration with default scripts.

which script i need to use with check_nrpe to fetch the information from
cluster nodes.

i don't have much idea about SNMP and other stuffs.So please provide me
more details

Regards,
Ben


On Thu, Sep 27, 2012 at 2:05 PM, Andrew Holway <a.holway at syseleven.de>wrote:

> Hi,
>
> Did you have a look at luci?
>
> Thanks,
>
> Andrew
> On Sep 27, 2012, at 11:36 AM, Ben .T.George wrote:
>
> > HI
> >
> > How can i Monitor Red Hat Cluster suite with 2 nodes. i am looking for
> web based monitoring tools.
> >
> > regards,
> > Ben
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>


-- 
Yours Sincerely

*#!/usr/bin/env python
#Mysignature.py :)*

Signature = " " " Ben.T.George \n
                  Linux System Administrator \n
                  Diyar United Company \n
                  kuwait \n
                  Phone : +965 - 50629829 \n " " "

Print Signature
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120927/e0bb2778/attachment.htm>

From jpokorny at redhat.com  Thu Sep 27 17:52:32 2012
From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=)
Date: Thu, 27 Sep 2012 19:52:32 +0200
Subject: [Linux-cluster] How can i Monitor Red Hat Cluster suite with 2
 nodes
In-Reply-To: <CA+C_GOWafWLu3JuUC4zeOs-x9kbRGPoLeH9gxYiVJKYrhbUmxA@mail.gmail.com>
References: <CA+C_GOWHG-riGV+naiJK_NgUs5_Ru7kogShx8aLgz+GXajFsog@mail.gmail.com>
	<0C580157-343C-4BC4-B0FE-357058EE50E4@syseleven.de>
	<CA+C_GOWafWLu3JuUC4zeOs-x9kbRGPoLeH9gxYiVJKYrhbUmxA@mail.gmail.com>
Message-ID: <20120927175232.GD775@redhat.com>

>> On Sep 27, 2012, at 11:36 AM, Ben .T.George wrote:
>>> How can i Monitor Red Hat Cluster suite with 2 nodes. i am looking for
>>> web based monitoring tools.
>
> On Thu, Sep 27, 2012 at 2:05 PM, Andrew Holway <a.holway at syseleven.de> wrote:
>> Did you have a look at luci?

On 27/09/12 16:26 +0300, Ben .T.George wrote:
> yes i have luci .but on luci i can view cluster nodes belongs to that cluster
> only noe.?
> 
> i have 2 different clusters.

In luci, you can manage both, provided that the nodes are reachable
from the machine running this luci instance.

> I am looking some thing like nagios..
> 
> apart from that i have 14 more linux machines..i think nagios is the best
> option
> 
> but on nagios how can i configure cluster service. i know only basics of that.
> I have already a simple configuration with default scripts.
> 
> which script i need to use with check_nrpe to fetch the information from
> cluster nodes.

You may want to have a look at command-line utilities such as clustat
and parse the output.

> i don't have much idea about SNMP and other stuffs.So please provide me more
> details

For very basic snmp get-based monitoring, you may be interested in
cluster-snmp package (for setup, you may have a look at respective README:
"rpm -qd cluster-snmp | grep snmpd").  Or there is Foghorn.


Hope this helps,
Jan


From Ralph.Grothe at itdz-berlin.de  Fri Sep 28 08:18:36 2012
From: Ralph.Grothe at itdz-berlin.de (Ralph.Grothe at itdz-berlin.de)
Date: Fri, 28 Sep 2012 10:18:36 +0200
Subject: [Linux-cluster] How can i Monitor Red Hat Cluster suite with 2
	nodes
In-Reply-To: <CA+C_GOWHG-riGV+naiJK_NgUs5_Ru7kogShx8aLgz+GXajFsog@mail.gmail.com>
References: <CA+C_GOWHG-riGV+naiJK_NgUs5_Ru7kogShx8aLgz+GXajFsog@mail.gmail.com>
Message-ID: <A789DDB53ED7E94396E842EE2AC9B5FF01432CE0@itdzex101.ITDZ.verwalt-berlin.de>

_______________________________

	From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ben
.T.George
	Sent: Thursday, September 27, 2012 11:37
	To: linux-cluster
	Subject: [Linux-cluster] How can i Monitor Red Hat
Cluster suite with 2 nodes
	
	
	HI
	
	How can i Monitor Red Hat Cluster suite with 2 nodes. i
am looking for web based monitoring tools.
	
	regards,
	Ben
	

Hi Ben,


have you had a search at the Monitoring Exchange website yet?

It is a kind of repository / exchange platform for third party
Nagios, Icinga, Shinken etc. plug-ins for virtually every
conceivable kind of service check.

I have just looked and were output several plug-ins for RHCS
clusters.

For instance this plug-in written in Python: 
https://www.monitoringexchange.org/inventory/Check-Plugins/Softwa
re/Red-Hat-Cluster-Suite-Check

I haven't deployed it myself, but just now had a short glimps at
it.
Since I am more into Perl I some time ago installed another
similar Plug-in from that source, and had written my own stuff,
which I haven't shared since I don't want to support it or being
held liable for the code.
But I know sufficient Python to read what it's doing.
Basically, it simply parses the clustat output dumped in XML
(i.e. using "-f").
Therefore the plug-in needs to pull in an XML parser from another
Python module.
My older Perl variant simply parsed the usual textual output of
clustat.
With my own stuff I made use of the abundance of features of the
check_multi Plug-in (of course, also written in Perl).
Also, as its name implies, it collects several service checks on
a Nagios monitored host (such as cluster a node or cluster
service (i.e. a VIP)) to reduce the overhead of connecting and
querying a daemon like nrpe for each check individually.

I would suggest to give the several third party Nagios plug-ins
from Monitoring Exchange a try and stay with the one that meets
your requirements best.
Without having looked at all the others I would assume that they
generally rely on either a Perl, Python or Ruby interpreter on
the monitored host.
These (maybe apart from Ruby) are usally already installed on a
RHEL host per default.
Sometimes, if a plug-in requires more exotic modules/libraries 
(see the list of modules following the "import" or "use"
statements in the plug-in's header, but usually the interpreter
will complain about missing modules when invoked on the plug-in)
you need to install these on the monitored host as well.

Have you had a look at the NRPE section in the Nagios
documentation yet?
If you installed and configured your Nagios web interface
correctly you will always have this documentation readily
available with no need for an Internet connection.
If not, I should advise you to read it to find out how to set up
nrpe on your cluster nodes (only a matter of minutes).

Another option to have a Nagios plug-in executed on a remote host
would be to make use of the check_by_ssh plug-in that comes with
the standard Nagios plug-ins.
This requires the distribution of public RSA or DSA keys in the
$HOME/.ssh of an account on your cluster nodes that has
sufficient privilages to execute the system commands of the
plug-in.
If elevated privileges are required for that, usually sudo is an
option.

Yet another way to get check results of plug-ins to the Nagios
server would be, as Michael already referred to, to use SNMP.
That's usually fiddling with the snmpd on the cluster nodes to
extend it to run some script or the like, I guess.
But I cannot further comment on that as in our network
environment SNMP for us mere OS sysadmins is frowned upon, or
rather made impossible (viz. by firewalls).
Honestly, I find this a bit schizophrenic since our network admin
folks make ample use of SNMP to monitor their network gear.
But in a company the choice of monitoring tool and which plug-ins
to use for it often is dictated by policy.


Hope that would help a little

Ralph


From gianluca.cecchi at gmail.com  Fri Sep 28 09:40:30 2012
From: gianluca.cecchi at gmail.com (Gianluca Cecchi)
Date: Fri, 28 Sep 2012 11:40:30 +0200
Subject: [Linux-cluster] Info on quorum rename on rhcs 5.8
Message-ID: <CAG2kNCwrgWOSsUNMUetfe5GkCDSTagApLuyDtHRQB7JaE1L5kw@mail.gmail.com>

Hello,
I saw some discussions in the past regarding the subject, such as:
https://www.redhat.com/archives/cluster-devel/2011-September/msg00027.html

My config is with an rhcs based on  rhel 5.8 two nodes + quorum disk cluster and
cman-2.0.115-96.el5_8.3
openais-0.80.6-36.el5_8.2
rgmanager-2.0.52-28.el5_8.2

I'm evaluating SAN migration from a storage array to another,
including data disks and quorum disks, limiting downtime and
considering pros&cons of several scenarios.
One of the problems is quorum disk moving.

I start in a situation where the nodes are seeing both the SANs and
cluster data disks (in HA-LVM config) are already on new storage while
quorum not.

My test approach is this:
- stop qdiskd on first node and wait for stabilization
second node becomes master
- stop qdiskd on second node and wait for stabilization

- cluster maintains quorum (2 votes instead of expected 3 for each server)

- format the new quorum lun with a different label
- change cluster.conf accordingly and update/distribute the config with
<quorumd device="/dev/mapper/mpnquorum" interval="5" label="nquorum"
log_facility="local4" log_level="7" tko="16" votes="1">
- start qdiskd on first node and wait for stabilization
it becomes master
- start qdiskd on second node and wait for stabilization

>From qdiskd point view all is gong well but openais shows:
Sep 28 11:21:00 noracs1 openais[7671]: [CMAN ] unable to re-register
quorum device: device names do not match

Complete messages in qdiskd.log I set up are these, after start of
qdiskd on the two nodes:

Sep 28 11:16:35 noracs2 qdiskd[9422]: <debug> Heuristic: 'ping -c1 -w1
10.4.5.250' score=1 interval=2 tko=200
Sep 28 11:16:35 noracs2 qdiskd[9422]: <debug> 1 heuristics loaded
Sep 28 11:16:35 noracs2 qdiskd[9422]: <debug> Quorum Daemon: 1
heuristics, 5 interval, 16 tko, 1 votes
Sep 28 11:16:35 noracs2 qdiskd[9422]: <debug> Run Flags: 00000031
Sep 28 11:16:35 noracs2 qdiskd[9422]: <info> Quorum Partition:
/dev/dm-6 Label: nquorum
Sep 28 11:16:35 noracs2 qdiskd[9423]: <info> Quorum Daemon Initializing
Sep 28 11:16:35 noracs2 qdiskd[9423]: <debug> I/O Size: 512  Page Size: 4096
Sep 28 11:16:36 noracs2 qdiskd[9423]: <info> Heuristic: 'ping -c1 -w1
10.4.5.250' UP
Sep 28 11:17:55 noracs2 qdiskd[9423]: <info> Initial score 1/1
Sep 28 11:17:55 noracs2 qdiskd[9423]: <info> Initialization complete
Sep 28 11:17:55 noracs2 openais[7653]: [CMAN ] unable to re-register
quorum device: device names do not match
Sep 28 11:17:55 noracs2 qdiskd[9423]: <notice> Score sufficient for
master operation (1/1; required=1); upgrading
Sep 28 11:18:05 noracs2 qdiskd[9423]: <debug> Making bid for master
Sep 28 11:18:40 noracs2 qdiskd[9423]: <info> Assuming master role
Sep 28 11:20:05 noracs2 qdiskd[9423]: <debug> Node 1 is UP

Sep 28 11:19:40 noracs1 qdiskd[3779]: <debug> Heuristic: 'ping -c1 -w1
10.4.5.250' score=1 interval=2 tko=200
Sep 28 11:19:40 noracs1 qdiskd[3779]: <debug> 1 heuristics loaded
Sep 28 11:19:40 noracs1 qdiskd[3779]: <debug> Quorum Daemon: 1
heuristics, 5 interval, 16 tko, 1 votes
Sep 28 11:19:40 noracs1 qdiskd[3779]: <debug> Run Flags: 00000031
Sep 28 11:19:40 noracs1 qdiskd[3779]: <info> Quorum Partition:
/dev/dm-5 Label: nquorum
Sep 28 11:19:40 noracs1 qdiskd[3780]: <info> Quorum Daemon Initializing
Sep 28 11:19:40 noracs1 qdiskd[3780]: <debug> I/O Size: 512  Page Size: 4096
Sep 28 11:19:41 noracs1 qdiskd[3780]: <info> Heuristic: 'ping -c1 -w1
10.4.5.250' UP
Sep 28 11:20:05 noracs1 qdiskd[3780]: <debug> Node 2 is UP
Sep 28 11:20:10 noracs1 qdiskd[3780]: <info> Node 2 is the master
Sep 28 11:21:00 noracs1 qdiskd[3780]: <info> Initial score 1/1
Sep 28 11:21:00 noracs1 qdiskd[3780]: <info> Initialization complete
Sep 28 11:21:00 noracs1 openais[7671]: [CMAN ] unable to re-register
quorum device: device names do not match
Sep 28 11:21:00 noracs1 qdiskd[3780]: <notice> Score sufficient for
master operation (1/1; required=1); upgrading

status reported is this:
[root at noracs1 ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: clnacs
Cluster Id: 6477
Cluster Member: Yes
Cluster Generation: 56
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 9
Flags: Dirty
Ports Bound: 0 177
Node name: iclnacs1
Node ID: 1
Multicast addresses: 239.192.25.102
Node addresses: 192.168.16.42

[root at noracs2 ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: clnacs
Cluster Id: 6477
Cluster Member: Yes
Cluster Generation: 56
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 9
Flags: Dirty
Ports Bound: 0 177
Node name: iclnacs2
Node ID: 2
Multicast addresses: 239.192.25.102
Node addresses: 192.168.16.43

So it seems ok.
This cluster is for testing with one service running now on node 1 so
I can shutdown the node 2 and verify if it maintains correctly  the
quorum...

But from an approach point of view are my steps correct/supported in
my 5.8 config?
Did it change anything in later versions such as rh el 6.3 or Fedora 17?
Thanks in advance

Gianluca


From fdinitto at redhat.com  Fri Sep 28 10:35:26 2012
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Fri, 28 Sep 2012 12:35:26 +0200
Subject: [Linux-cluster] Info on quorum rename on rhcs 5.8
In-Reply-To: <CAG2kNCwrgWOSsUNMUetfe5GkCDSTagApLuyDtHRQB7JaE1L5kw@mail.gmail.com>
References: <CAG2kNCwrgWOSsUNMUetfe5GkCDSTagApLuyDtHRQB7JaE1L5kw@mail.gmail.com>
Message-ID: <50657D6E.2060803@redhat.com>

On 9/28/2012 11:40 AM, Gianluca Cecchi wrote:
> Hello,
> I saw some discussions in the past regarding the subject, such as:
> https://www.redhat.com/archives/cluster-devel/2011-September/msg00027.html
> 
> My config is with an rhcs based on  rhel 5.8 two nodes + quorum disk cluster and
> cman-2.0.115-96.el5_8.3
> openais-0.80.6-36.el5_8.2
> rgmanager-2.0.52-28.el5_8.2
> 
> I'm evaluating SAN migration from a storage array to another,
> including data disks and quorum disks, limiting downtime and
> considering pros&cons of several scenarios.
> One of the problems is quorum disk moving.
> 
> I start in a situation where the nodes are seeing both the SANs and
> cluster data disks (in HA-LVM config) are already on new storage while
> quorum not.
> 
> My test approach is this:
> - stop qdiskd on first node and wait for stabilization
> second node becomes master
> - stop qdiskd on second node and wait for stabilization
> 
> - cluster maintains quorum (2 votes instead of expected 3 for each server)
> 
> - format the new quorum lun with a different label
> - change cluster.conf accordingly and update/distribute the config with
> <quorumd device="/dev/mapper/mpnquorum" interval="5" label="nquorum"
> log_facility="local4" log_level="7" tko="16" votes="1">
> - start qdiskd on first node and wait for stabilization
> it becomes master
> - start qdiskd on second node and wait for stabilization
> 
>>From qdiskd point view all is gong well but openais shows:
> Sep 28 11:21:00 noracs1 openais[7671]: [CMAN ] unable to re-register
> quorum device: device names do not match

It?s a lot simpler if you do:

- drop <quorumd device="..." and just use label=
  (this is a general thing)
- stop qdiskd on the nodes as you also suggested
- format the new qdiskd device/LUN with the current label
- format the old qdiskd device/LUN with another label
- restart qdiskd

qdiskd will find the device by label and start using the new one
automatically.

No need for cluster.conf updates, qdiskd downtime is minimal.

Fabio


From gianluca.cecchi at gmail.com  Fri Sep 28 12:47:42 2012
From: gianluca.cecchi at gmail.com (Gianluca Cecchi)
Date: Fri, 28 Sep 2012 14:47:42 +0200
Subject: [Linux-cluster] Info on quorum rename on rhcs 5.8
In-Reply-To: <CAG2kNCwrgWOSsUNMUetfe5GkCDSTagApLuyDtHRQB7JaE1L5kw@mail.gmail.com>
References: <CAG2kNCwrgWOSsUNMUetfe5GkCDSTagApLuyDtHRQB7JaE1L5kw@mail.gmail.com>
Message-ID: <CAG2kNCza9Ne+bPYASzD62D_efK-K=xmOjZ=sdFCg+sve4MwaYw@mail.gmail.com>

On Fri, 28 Sep 2012 12:35:26 +0200 Fabio M. Di Nitto wrote:
> - drop <quorumd device="..." and just use label=
>  (this is a general thing)

thanks for your answer Fabio.
what do you exactly mean with the sentence above? That the "device=.."
part is optional and I had better not to have in my cluster.conf?

So passing from
<quorumd device="/dev/mapper/mpnquorum" interval="5" label="nquorum"
log_facility="local4" log_level="7" tko="16" votes="1">

to
<quorumd interval="5" label="nquorum" log_facility="local4"
log_level="7" tko="16" votes="1">

in cluster.conf?

Gianluca


From fdinitto at redhat.com  Fri Sep 28 13:13:37 2012
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Fri, 28 Sep 2012 15:13:37 +0200
Subject: [Linux-cluster] Info on quorum rename on rhcs 5.8
In-Reply-To: <CAG2kNCza9Ne+bPYASzD62D_efK-K=xmOjZ=sdFCg+sve4MwaYw@mail.gmail.com>
References: <CAG2kNCwrgWOSsUNMUetfe5GkCDSTagApLuyDtHRQB7JaE1L5kw@mail.gmail.com>
	<CAG2kNCza9Ne+bPYASzD62D_efK-K=xmOjZ=sdFCg+sve4MwaYw@mail.gmail.com>
Message-ID: <5065A281.7010208@redhat.com>

On 9/28/2012 2:47 PM, Gianluca Cecchi wrote:
> On Fri, 28 Sep 2012 12:35:26 +0200 Fabio M. Di Nitto wrote:
>> - drop <quorumd device="..." and just use label=
>>  (this is a general thing)
> 
> thanks for your answer Fabio.
> what do you exactly mean with the sentence above? That the "device=.."
> part is optional and I had better not to have in my cluster.conf?
> 
> So passing from
> <quorumd device="/dev/mapper/mpnquorum" interval="5" label="nquorum"
> log_facility="local4" log_level="7" tko="16" votes="1">
> 
> to
> <quorumd interval="5" label="nquorum" log_facility="local4"
> log_level="7" tko="16" votes="1">
> 
> in cluster.conf?

Yes that?s correct. device is unnecessary.

When both are present, label is used.

It?s not problematic for the rename operation, but it?s pointless and
can cause admin/user confusion specially when label is not on that device.

Fabio