[Linux-cluster] Linux-cluster Digest, Vol 92, Issue 16

Wed Dec 21 19:45:00 UTC 2011

On Dec 21, 2011, at 12:34 PM, SATHYA - IT wrote:

> Hi Adam,
> 
> Thanks for your response. We do not currently have any Red Hat support for
> HA and RS. We have support only for the server OS. Two nodes are running
> RHEL 6.2 in the cluster environment. The withdraw messages from the log
> file are as follows:
> 
> Dec 21 10:32:43 filesrv2 avahi-daemon[9585]: Registering new address record
> for 192.168.129.15 on bond0.IPv4.
> Dec 21 10:33:10 filesrv2 kernel: GFS2: fsid=samba:hadata01.1: fatal:
> filesystem consistency error
> Dec 21 10:33:10 filesrv2 kernel: GFS2: fsid=samba:hadata01.1:   RG =
> 160469200
> Dec 21 10:33:10 filesrv2 kernel: GFS2: fsid=samba:hadata01.1:   function =
> gfs2_setbit, file = fs/gfs2/rgrp.c, line = 95
> Dec 21 10:33:10 filesrv2 kernel: GFS2: fsid=samba:hadata01.1: about to
> withdraw this file system
> Dec 21 10:33:10 filesrv2 kernel: GFS2: fsid=samba:hadata01.1: telling LM to
> unmount
> Dec 21 10:33:10 filesrv2 kernel: GFS2: fsid=samba:hadata01.1: withdrawn
> Dec 21 10:33:10 filesrv2 kernel: Pid: 26976, comm: smbd Not tainted
> 2.6.32-220.el6.x86_64 #1
> Dec 21 10:33:10 filesrv2 kernel: Call Trace:
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffffa05508e2>] ?
> gfs2_lm_withdraw+0x102/0x130 [gfs2]
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffff81090bdf>] ?
> wake_up_bit+0x2f/0x40
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffffa0550a8a>] ?
> gfs2_consist_rgrpd_i+0x4a/0x50 [gfs2]
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffffa054b5d0>] ?
> rgblk_free+0x1f0/0x200 [gfs2]
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffffa054b992>] ?
> gfs2_free_data+0x42/0x130 [gfs2]
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffffa0524f80>] ? do_strip+0x450/0x470
> [gfs2]
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffffa05251bf>] ?
> recursive_scan.clone.0+0xbf/0x280 [gfs2]
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffff81111aa7>] ?
> find_lock_page+0x37/0x80
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffff8115efb5>] ?
> kmem_cache_alloc_notrace+0x115/0x130
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffffa052548d>] ?
> trunc_dealloc+0x10d/0x130 [gfs2]
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffffa0537be1>] ?
> gfs2_log_commit+0x1c1/0x300 [gfs2]
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffffa0526df3>] ?
> gfs2_truncatei+0x4b3/0x820 [gfs2]
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffffa0543569>] ?
> gfs2_setattr+0x119/0x3d0 [gfs2]
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffffa0543496>] ?
> gfs2_setattr+0x46/0x3d0 [gfs2]
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffff81192698>] ?
> notify_change+0x168/0x340
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffff81174de4>] ?
> do_truncate+0x64/0xa0
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffff811750b0>] ?
> sys_ftruncate+0xf0/0x100
> Dec 21 10:33:10 filesrv2 kernel: [<ffffffff8100b308>] ? tracesys+0xd9/0xde
> Dec 21 10:33:16 filesrv2 avahi-daemon[9585]: Withdrawing address record for
> 192.168.129.15 on bond0.
> Dec 21 10:36:20 filesrv2 kernel: INFO: task gfs2_logd:9769 blocked for more
> than 120 seconds.
> Dec 21 10:36:20 filesrv2 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 21 10:36:20 filesrv2 kernel: gfs2_logd     D ffff8808a7824100     0
> 9769      2 0x00000000
> Dec 21 10:36:20 filesrv2 kernel: ffff88087fe51dd0 0000000000000046
> 0000000000000000 000000004db7b07d
> Dec 21 10:36:20 filesrv2 kernel: ffff88084d820cf8 0000000000000441
> ffff88087fe51d70 ffffffff811a81be
> Dec 21 10:36:20 filesrv2 kernel: ffff880888437af8 ffff88087fe51fd8
> 000000000000f4e8 ffff880888437af8
> Dec 21 10:36:20 filesrv2 kernel: Call Trace:
> Dec 21 10:36:20 filesrv2 kernel: [<ffffffff811a81be>] ?
> submit_bh+0x10e/0x150
> Dec 21 10:36:20 filesrv2 kernel: [<ffffffff814ed1e3>] io_schedule+0x73/0xc0
> Dec 21 10:36:20 filesrv2 kernel: [<ffffffffa05389ca>]
> gfs2_log_flush+0x46a/0x6e0 [gfs2]
> Dec 21 10:36:20 filesrv2 kernel: [<ffffffffa053736f>] ?
> gfs2_ail1_empty+0x2f/0x1b0 [gfs2]
> Dec 21 10:36:20 filesrv2 kernel: [<ffffffff81090bf0>] ?
> autoremove_wake_function+0x0/0x40
> Dec 21 10:36:20 filesrv2 kernel: [<ffffffffa0538d17>] gfs2_logd+0xd7/0x140
> [gfs2]
> Dec 21 10:36:20 filesrv2 kernel: [<ffffffffa0538c40>] ? gfs2_logd+0x0/0x140
> [gfs2]
> Dec 21 10:36:20 filesrv2 kernel: [<ffffffff81090886>] kthread+0x96/0xa0
> Dec 21 10:36:20 filesrv2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
> Dec 21 10:36:20 filesrv2 kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
> Dec 21 10:36:20 filesrv2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
> Dec 21 10:38:20 filesrv2 kernel: INFO: task gfs2_logd:9769 blocked for more
> than 120 seconds.
> Dec 21 10:38:20 filesrv2 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 21 10:38:20 filesrv2 kernel: gfs2_logd     D ffff8808a7824100     0
> 9769      2 0x00000000
> Dec 21 10:38:20 filesrv2 kernel: ffff88087fe51dd0 0000000000000046
> 0000000000000000 000000004db7b07d
> Dec 21 10:38:20 filesrv2 kernel: ffff88084d820cf8 0000000000000441
> ffff88087fe51d70 ffffffff811a81be
> Dec 21 10:38:20 filesrv2 kernel: ffff880888437af8 ffff88087fe51fd8
> 000000000000f4e8 ffff880888437af8
> Dec 21 10:38:20 filesrv2 kernel: Call Trace:
> Dec 21 10:38:20 filesrv2 kernel: [<ffffffff811a81be>] ?
> submit_bh+0x10e/0x150
> Dec 21 10:38:20 filesrv2 kernel: [<ffffffff814ed1e3>] io_schedule+0x73/0xc0
> Dec 21 10:38:20 filesrv2 kernel: [<ffffffffa05389ca>]
> gfs2_log_flush+0x46a/0x6e0 [gfs2]
> Dec 21 10:38:20 filesrv2 kernel: [<ffffffffa053736f>] ?
> gfs2_ail1_empty+0x2f/0x1b0 [gfs2]
> Dec 21 10:38:20 filesrv2 kernel: [<ffffffff81090bf0>] ?
> autoremove_wake_function+0x0/0x40
> Dec 21 10:38:20 filesrv2 kernel: [<ffffffffa0538d17>] gfs2_logd+0xd7/0x140
> [gfs2]
> Dec 21 10:38:20 filesrv2 kernel: [<ffffffffa0538c40>] ? gfs2_logd+0x0/0x140
> [gfs2]
> Dec 21 10:38:20 filesrv2 kernel: [<ffffffff81090886>] kthread+0x96/0xa0
> Dec 21 10:38:20 filesrv2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
> Dec 21 10:38:20 filesrv2 kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
> Dec 21 10:38:20 filesrv2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
> Dec 21 10:40:20 filesrv2 kernel: INFO: task gfs2_logd:9769 blocked for more
> than 120 seconds.
> Dec 21 10:40:20 filesrv2 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 21 10:40:20 filesrv2 kernel: gfs2_logd     D ffff8808a7824100     0
> 9769      2 0x00000000
> Dec 21 10:40:20 filesrv2 kernel: ffff88087fe51dd0 0000000000000046
> 0000000000000000 000000004db7b07d
> Dec 21 10:40:20 filesrv2 kernel: ffff88084d820cf8 0000000000000441
> ffff88087fe51d70 ffffffff811a81be
> Dec 21 10:40:20 filesrv2 kernel: ffff880888437af8 ffff88087fe51fd8
> 000000000000f4e8 ffff880888437af8
> Dec 21 10:40:20 filesrv2 kernel: Call Trace:
> Dec 21 10:40:20 filesrv2 kernel: [<ffffffff811a81be>] ?
> submit_bh+0x10e/0x150
> Dec 21 10:40:20 filesrv2 kernel: [<ffffffff814ed1e3>] io_schedule+0x73/0xc0
> Dec 21 10:40:20 filesrv2 kernel: [<ffffffffa05389ca>]
> gfs2_log_flush+0x46a/0x6e0 [gfs2]
> Dec 21 10:40:20 filesrv2 kernel: [<ffffffffa053736f>] ?
> gfs2_ail1_empty+0x2f/0x1b0 [gfs2]
> Dec 21 10:40:20 filesrv2 kernel: [<ffffffff81090bf0>] ?
> autoremove_wake_function+0x0/0x40
> Dec 21 10:40:20 filesrv2 kernel: [<ffffffffa0538d17>] gfs2_logd+0xd7/0x140
> [gfs2]
> Dec 21 10:40:20 filesrv2 kernel: [<ffffffffa0538c40>] ? gfs2_logd+0x0/0x140
> [gfs2]
> Dec 21 10:40:20 filesrv2 kernel: [<ffffffff81090886>] kthread+0x96/0xa0
> Dec 21 10:40:20 filesrv2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
> Dec 21 10:40:20 filesrv2 kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
> Dec 21 10:40:20 filesrv2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
> Dec 21 10:42:20 filesrv2 kernel: INFO: task gfs2_logd:9769 blocked for more
> than 120 seconds.
> Dec 21 10:42:20 filesrv2 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 21 10:42:20 filesrv2 kernel: gfs2_logd     D ffff8808a7824100     0
> 9769      2 0x00000000
> Dec 21 10:42:20 filesrv2 kernel: ffff88087fe51dd0 0000000000000046
> 0000000000000000 000000004db7b07d
> Dec 21 10:42:20 filesrv2 kernel: ffff88084d820cf8 0000000000000441
> ffff88087fe51d70 ffffffff811a81be
> Dec 21 10:42:20 filesrv2 kernel: ffff880888437af8 ffff88087fe51fd8
> 000000000000f4e8 ffff880888437af8
> Dec 21 10:42:20 filesrv2 kernel: Call Trace:
> Dec 21 10:42:20 filesrv2 kernel: [<ffffffff811a81be>] ?
> submit_bh+0x10e/0x150
> Dec 21 10:42:20 filesrv2 kernel: [<ffffffff814ed1e3>] io_schedule+0x73/0xc0
> Dec 21 10:42:20 filesrv2 kernel: [<ffffffffa05389ca>]
> gfs2_log_flush+0x46a/0x6e0 [gfs2]
> Dec 21 10:42:20 filesrv2 kernel: [<ffffffffa053736f>] ?
> gfs2_ail1_empty+0x2f/0x1b0 [gfs2]
> Dec 21 10:42:20 filesrv2 kernel: [<ffffffff81090bf0>] ?
> autoremove_wake_function+0x0/0x40
> Dec 21 10:42:20 filesrv2 kernel: [<ffffffffa0538d17>] gfs2_logd+0xd7/0x140
> [gfs2]
> Dec 21 10:42:20 filesrv2 kernel: [<ffffffffa0538c40>] ? gfs2_logd+0x0/0x140
> [gfs2]
> Dec 21 10:42:20 filesrv2 kernel: [<ffffffff81090886>] kthread+0x96/0xa0
> Dec 21 10:42:20 filesrv2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
> Dec 21 10:42:20 filesrv2 kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
> Dec 21 10:42:20 filesrv2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
> Dec 21 10:43:42 filesrv2 kernel: GFS2: fsid=samba:gen01.1: fatal: invalid
> metadata block
> Dec 21 10:43:42 filesrv2 kernel: GFS2: fsid=samba:gen01.1:   bh = 51194408
> (magic number)
> Dec 21 10:43:42 filesrv2 kernel: GFS2: fsid=samba:gen01.1:   function =
> gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 401
> Dec 21 10:43:42 filesrv2 kernel: GFS2: fsid=samba:gen01.1: about to withdraw
> this file system
> Dec 21 10:43:42 filesrv2 kernel: GFS2: fsid=samba:gen01.1: telling LM to
> unmount
> Dec 21 10:43:42 filesrv2 kernel: GFS2: fsid=samba:gen01.1: withdrawn
> Dec 21 10:43:42 filesrv2 kernel: Pid: 9710, comm: glock_workqueue Not
> tainted 2.6.32-220.el6.x86_64 #1
> Dec 21 10:43:42 filesrv2 kernel: Call Trace:
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffffa05508e2>] ?
> gfs2_lm_withdraw+0x102/0x130 [gfs2]
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffff81090c30>] ?
> wake_bit_function+0x0/0x50
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffffa0550a35>] ?
> gfs2_meta_check_ii+0x45/0x50 [gfs2]
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffffa053b4a5>] ?
> gfs2_meta_indirect_buffer+0x185/0x190 [gfs2]
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffffa0535e49>] ?
> gfs2_inode_refresh+0x29/0x340 [gfs2]
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffff810ea694>] ?
> rb_reserve_next_event+0xb4/0x370
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffffa0535488>] ?
> inode_go_lock+0x88/0xf0 [gfs2]
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffffa0533c07>] ?
> do_promote+0x1c7/0x340 [gfs2]
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffffa0533ef8>] ?
> finish_xmote+0x178/0x410 [gfs2]
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffffa0534d03>] ?
> glock_work_func+0x133/0x1b0 [gfs2]
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffffa0534bd0>] ?
> glock_work_func+0x0/0x1b0 [gfs2]
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffff8108b2b0>] ?
> worker_thread+0x170/0x2a0
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffff81090bf0>] ?
> autoremove_wake_function+0x0/0x40
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffff8108b140>] ?
> worker_thread+0x0/0x2a0
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffff81090886>] ? kthread+0x96/0xa0
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
> Dec 21 10:43:42 filesrv2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
> 
> Thanks
> 
> Sathya Narayanan V
> Solution Architect	
> M +91 9940680173 |T +91 44 42199500  | Service Desk +91 44 42199521
> SERVICE - In PRECISION IT is a PASSION
> ----------------------------------------------------------------------------
> -----------------------------
> Precision Infomatic (M) Pvt Ltd
> 22, 1st Floor, Habibullah Road, T. Nagar, Chennai - 600 017. India.
> www.precisionit.co.in
> 
> 
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of
> linux-cluster-request at redhat.com
> Sent: Wednesday, December 21, 2011 10:30 PM
> To: linux-cluster at redhat.com
> Subject: Linux-cluster Digest, Vol 92, Issue 16
> 
> Send Linux-cluster mailing list submissions to
> 	linux-cluster at redhat.com
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> 	https://www.redhat.com/mailman/listinfo/linux-cluster
> or, via email, send a message with subject or body 'help' to
> 	linux-cluster-request at redhat.com
> 
> You can reach the person managing the list at
> 	linux-cluster-owner at redhat.com
> 
> When replying, please edit your Subject line so it is more specific than
> "Re: Contents of Linux-cluster digest..."
> 
> 
> Today's Topics:
> 
>   1. GFS2 Consistency... (SATHYA - IT)
>   2. Re: GFS2 Consistency... (Adam Drew)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Wed, 21 Dec 2011 16:11:40 +0530
> From: "SATHYA - IT" <sathyanarayanan.varadharajan at precisionit.co.in>
> To: <linux-cluster at redhat.com>
> Subject: [Linux-cluster] GFS2 Consistency...
> Message-ID: <00e701ccbfcd$20d1c100$62754300$@precisionit.co.in>
> Content-Type: text/plain; charset="us-ascii"
> 
> Hi,
> 
> 
> 
> We have a cluster environment running GFS2 + CTDB + Samba. Due to
> some unavoidable circumstances we were forced to hard reboot the server 2 to
> 3 times. After the third restart, everything worked fine without any
> issues. But after 4 to 5 hours online, we got an alert reporting a file
> system consistency error in one of the GFS2 partitions. Can hard rebooting
> a server 2 to 3 times affect the GFS2 file system? Is the file system
> really that sensitive? We did not have any such issues with ext3/ext4 file
> systems earlier in similar scenarios. Can anyone comment on GFS2
> consistency and whether it is recommended for production environments?
> 
> 
> 
> 
> 
> Thanks
> 
> 
> 
> Sathya Narayanan V
> 
> Solution Architect    
> 
> M +91 9940680173 |T +91 44 42199500  | Service Desk +91 44 42199521 SERVICE
> - In PRECISION IT is a PASSION
> ----------------------------------------------------------------------------
> -----------------------------
> Precision Infomatic (M) Pvt Ltd
> 22, 1st Floor, Habibullah Road, T. Nagar, Chennai - 600 017. India.
> www.precisionit.co.in
> 
> 
> 
> 
> This communication may contain confidential information. 
> If you are not the intended recipient it may be unlawful for you to read,
> copy, distribute, disclose or otherwise use the information contained within
> this communication.. 
> Errors and Omissions may occur in the contents of this Email arising out of
> or in connection with data transmission, network malfunction or failure,
> machine or software error, malfunction, or operator errors by the person who
> is sending the email. 
> Precision Group accepts no responsibility for any such errors or omissions.
> The information, views and comments within this communication are those of
> the individual and not necessarily those of Precision Group. 
> All email that is sent from/to Precision Group is scanned for the presence
> of computer viruses, security issues and inappropriate content. However, it
> is the recipient's responsibility to check any attachments for viruses
> before use.
> 
> ------------------------------
> 
> Message: 2
> Date: Wed, 21 Dec 2011 10:09:46 -0500
> From: Adam Drew <adrew at redhat.com>
> To: linux clustering <linux-cluster at redhat.com>
> Subject: Re: [Linux-cluster] GFS2 Consistency...
> Message-ID: <C052F150-E6BA-41C5-B4C8-EC719105B73B at redhat.com>
> Content-Type: text/plain; charset="windows-1252"
> 
> 
> Hello Sathya,
> 
> If you are experiencing GFS2 withdraws you may be running into a bug,
> filesystem corruption, or both. If you have a Red Hat support contract I
> suggest opening a support case with Red Hat as soon as possible. When you
> open the support case you'll want to attach sosreports from all nodes (run
> the sosreport command on every node in the cluster and attach the resultant
> tarballs to the support case.) If you've hit a withdraw you are likely to
> keep hitting them and data loss or corruption is a tangible possibility; Red
> Hat support can help identify the source of the issue and provide relief.
> 
> If you don't have a Red Hat support contract then please reply to the thread
> with the kernel versions you are running on all nodes and the full withdraw
> message and call traces from the messages logs on the affected cluster.
> You'll be able to identify the withdraw easily in the logs. We'll want the
> withdraw messages which will include a pointer to the position in code where
> the error occurred and the nature of the withdraw. We'll also need the stack
> trace that follows the withdraw as it will allow us to understand the code
> path involved.
> 
> Thanks,
> Adam
> 
> --
> Adam Drew
> Software Maintenance Engineer
> Support Engineering Group
> Red Hat, Inc.
> Desk: (919) 754-4126
> Cell: (919) 389-5334
> 
> 
> 
> 
> 
> 
> ------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> End of Linux-cluster Digest, Vol 92, Issue 16
> *********************************************
> 

We're withdrawing in gfs2_meta_indirect_buffer, which is a function that loads metadata into a buffer for use. Where we are specifically failing is in a call to gfs2_metatype_check, which does the work of ensuring that the metadata we're loading into a buffer is of the type we expect. There's a macro and another function we pass through, but ultimately we end up in gfs2_metatype_check_i, which compares the expected metadata type with the type found in the buffer loaded from disk. If they don't match, we withdraw.
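
As a rough illustration of that check (a simplified model in Python, not the kernel code), the sketch below reads the first two big-endian 32-bit fields of a metadata block header and compares them with what the caller expects. The function name check_metatype and the returned strings are hypothetical; the magic and type values are the ones I believe are defined in the kernel's gfs2_ondisk.h. The "(magic number)" note in the log above suggests that it was the first of the two comparisons that failed here.

```python
import struct

# Values as defined (to my knowledge) in the kernel's gfs2_ondisk.h.
GFS2_MAGIC = 0x01161970   # magic number at the start of every metadata block
GFS2_METATYPE_DI = 4      # dinode
GFS2_METATYPE_IN = 5      # indirect block


def check_metatype(block: bytes, expected_type: int) -> str:
    """Hypothetical helper: return "ok", or a short reason for a mismatch.

    Models only the header comparison; the real code withdraws the
    filesystem when the comparison fails.
    """
    if len(block) < 8:
        return "short block"
    # The header starts with two big-endian 32-bit fields: magic, then type.
    magic, found_type = struct.unpack(">II", block[:8])
    if magic != GFS2_MAGIC:
        return "magic number"
    if found_type != expected_type:
        return f"type: exp={expected_type}, found={found_type}"
    return "ok"
```

A block of zeros, or a block overwritten with file data, would fail the magic comparison in this model, which is why both on-disk and in-memory corruption can produce the same message.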

So what does this all mean? It means that either the data on disk is corrupt (a section of what we expect to be metadata is not, or is the wrong kind of metadata) or there is some kind of memory corruption, where the data is damaged in memory such that when we examine the buffer it appears to be the wrong type. From what I have here to analyze I cannot say which it is.

Your first action should be to unmount the filesystem in question from all nodes, update gfs2-utils, and run a gfs2_fsck on the filesystem. After the filesystem check is completed you can mount the filesystem back up and return to production. If the issue goes away then it was some anomalous sort of on-disk corruption. If the issue comes back then it is quite likely to be either a bug in GFS2 or something very wrong with the environment or workload (such as the filesystem being mounted without locking, something doing block-level writes to metadata areas on disk, or something of that nature).
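
To see whether the issue comes back after the fsck, one option is to scan the messages log for new withdraws. The following is only a minimal sketch in Python: it assumes each syslog entry sits on a single line (the quoted log above was re-wrapped by the mail client) and that the messages look like the ones in this thread. The function name find_withdraws is hypothetical.

```python
import re

# Matches kernel lines such as
#   "... kernel: GFS2: fsid=samba:gen01.1: fatal: invalid metadata block"
# The fsid itself contains a colon, so it ends at the first ": " that follows.
GFS2_LINE = re.compile(r"kernel: GFS2: fsid=(?P<fsid>\S+?):\s+(?P<msg>.*)$")


def find_withdraws(lines):
    """Return a list of (fsid, reason, location) tuples, one per withdraw."""
    pending = {}   # details collected so far, keyed by fsid
    found = []
    for line in lines:
        match = GFS2_LINE.search(line)
        if not match:
            continue
        fsid = match.group("fsid")
        msg = match.group("msg").strip()
        event = pending.setdefault(fsid, {})
        if msg.startswith("fatal:"):
            event["reason"] = msg[len("fatal:"):].strip()
        elif msg.startswith("function ="):
            event["location"] = msg
        elif msg == "withdrawn":
            found.append((fsid, event.get("reason"), event.get("location")))
            del pending[fsid]
    return found
```

It could be run as find_withdraws(open("/var/log/messages")), where the path is an assumption about where kernel messages are logged on the node. The "function = ..." part of each result is the pointer into the code that a bug report would need.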

If you find that you encounter further difficulties with the filesystem post-fsck I would advise, if you can, purchasing support for the Resilient Storage add-on entitlement and engaging support so that my group and I can assist you further. If you are unable to do so then you can create a bug report at bugzilla.redhat.com; but note that there are no production SLAs on bugzilla.

Good luck. I hope this helps in some capacity.

Thanks,
Adam

--
Adam Drew
Software Maintenance Engineer
Support Engineering Group
Red Hat, Inc.
Desk: (919) 754-4126
Cell: (919) 389-5334
