[Linux-cluster] (no subject)

Tue Jan 10 11:12:08 UTC 2012

RHCS/GFS2 support team,

I would like to inform you about a serious GFS2 problem we encountered last week.
Please find a detailed description below. I have enclosed a tarfile containing
detailed information about this problem.

Description
        The two-node cluster is used as a test cluster without any load.
        Only functionality is tested, no performance tests. The RHCS services
        that run on this cluster are rather standard services.
        In a 2-day timeframe we had two occurrences of this problem which were
        both very similar.
        On the 2nd node, a Perl script tried to write some info to a file on
        the GFS2 filesystem, but the process hung at that time. From the GFS2
        lockdump info we saw one W-lock associated with an inode and it
        turned out that the inode was a directory on GFS2. Every command executed on
        that file (e.g. ls -l) or on this directory (e.g. du <dirname>) resulted
        in a hang of that process.
        The processes that hung were all in the D state (uninterruptible sleep).
        However, from the 1st node all files and directories were accessible without
        any problem. Even ls -lR executed on the 1st node from top of the GFS2
        filesystem traversed the full directory tree without problems.
        We suspect that the offending directory has got a W-lock and that there is
        no lock owner anymore.
        So, it does not look like a 'global' file system hang, but it seems to
        be a local problem on the 2nd node: the major part of the GFS2 filesystem
        is still accessible from the 2nd node, except the directory with the lock.
        Needless to say that this causes the application to be unavailable.
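
For reference, the hung processes described above can be listed without
running ps. What follows is a minimal sketch, not part of the collected
tarball: it reads the state field from /proc/<pid>/stat and reports every
process in the D state. The function names (parse_stat_line, find_d_state)
are hypothetical, and the script assumes a Python 3 interpreter, which is
not installed by default on CentOS 5.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split on the first "(" and the last ")".
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren].strip())
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """Return a list of (pid, comm) for processes in uninterruptible sleep."""
    hung = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited during the scan, or unparsable line
        if state == "D":
            hung.append((pid, comm))
    return hung


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

Run on the affected node, this would be expected to print the Perl script
and any ls or du commands that touched the locked directory. A process that
is briefly in the D state during normal I/O will also show up, so the scan
is best repeated a few times.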

        We are unable to reproduce the problem.

        1st occurrence. After collecting information, we rebooted the 2nd node and after
        the reboot it joined the 1st node in the cluster without any problem.

        2nd occurrence. This happened 2 days later in the same way on the same node. After
        collecting information, we now also ran gfs2_fsck on the GFS2 filesystem
        before letting it join the cluster. No errors, orphans, or corruption were reported.

        After the fsck we started the cluster software on the 2nd node and the 2nd
        node joined the cluster without any problem.
        Additional information (gfs2_lockdump, gfs2_hangalyzer, sysrq-t info, etc.) was
        collected in a tarball (enov_additional_info.tar).

Additional information in enov_additional_info.tar
        - enov_clusterinfo_app2.txt.gz, containing
                - /etc/cluster.conf
                - gfs2_hangalyzer output from 2nd node
                - cman_tool <version, status, services, -af nodes>
                - group_tool <-v, dump, dump fence, dump gfs2>
                - ccs_tool <lsnode, lsfence>
                - openais-cfgtool -s
                - clustat -fl
                - Process status information of all processes
                - gfs2_tool gettune /gfsdata
        - enov_sysrq-t_app2.txt.gz
        - enov_glocks_app2.txt.gz
        - enov_debugfs_dlm_app2.tar.gz, a compressed tarball of the dlm
          directory from the debugfs filesystem on the 2nd node

Environment
        2-node cluster running CentOS 5.7, with Red Hat Cluster Suite and GFS2.
        Latest updates for OS and RHCS/GFS2 (as of Jan 8, 2012) are installed.
        Kernel version 2.6.18-274.12.1.el5PAE.
        One GFS2 filesystem (20G) on HP/LeftHand Networks iSCSI SAN volume.
        iSCSI initiator version 6.2.0.872-10.el5.

Thanking you in advance for your cooperation.
If you need additional information to help solve this problem, please let me know.

With kind regards,
G. Wieberdink
Sr. Engineer at E.Novation

gert.wieberdink at enovation.nl

-------------- next part --------------
A non-text attachment was scrubbed...
Name: enov_additional_info.tar
Type: application/x-tar
Size: 102400 bytes
Desc: enov_additional_info.tar
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120110/0d9ffd30/attachment.tar>