[Linux-cluster] Continuing gfs2 problems: Am I doing something wrong????
Steven Whitehouse
swhiteho at redhat.com
Fri Sep 17 09:19:34 UTC 2010
Hi,
On Thu, 2010-09-16 at 14:43 -0600, Jeff Howell wrote:
> I'm having an identical problem.
>
> I have 2 nodes running a Wordpress instance with a TCP load balancer in
> front of them distributing http requests between them.
>
> In the last 2 days, I've had 10+ instances where the GFS2 volume hangs
> with:
>
> Sep 16 14:05:10 wordpress3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 16 14:05:10 wordpress3 kernel: delete_workqu D 00000272  2676  3687     19          3688  3686 (L-TLB)
> Sep 16 14:05:10 wordpress3 kernel:  f7839e38 00000046 3f1c322e 00000272 00000000 f57ab400 f7839df8 0000000a
> Sep 16 14:05:10 wordpress3 kernel:  c3217aa0 3f1dcca8 00000272 00019a7a 00000001 c3217bac c3019744 f57c5ac0
> Sep 16 14:05:10 wordpress3 kernel:  f8afa21c 00000003 f26162f0 00000000 f2213df8 00000018 c3019c00 f7839e6c
> Sep 16 14:05:10 wordpress3 kernel: Call Trace:
> Sep 16 14:05:10 wordpress3 kernel:  [<f8afa21c>] gdlm_bast+0x0/0x78 [lock_dlm]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c3910e>] just_schedule+0x5/0x8 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<c061d2f5>] __wait_on_bit+0x33/0x58
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c39109>] just_schedule+0x0/0x8 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c39109>] just_schedule+0x0/0x8 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<c061d37c>] out_of_line_wait_on_bit+0x62/0x6a
> Sep 16 14:05:10 wordpress3 kernel:  [<c0436098>] wake_bit_function+0x0/0x3c
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c39102>] gfs2_glock_wait+0x27/0x2e [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c4c667>] gfs2_check_blk_type+0xbc/0x18c [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<c061d312>] __wait_on_bit+0x50/0x58
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c39109>] just_schedule+0x0/0x8 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c4c660>] gfs2_check_blk_type+0xb5/0x18c [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c4c3c8>] gfs2_rindex_hold+0x2b/0x148 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c48273>] gfs2_delete_inode+0x6f/0x1a1 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c4823b>] gfs2_delete_inode+0x37/0x1a1 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c48204>] gfs2_delete_inode+0x0/0x1a1 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<c048cb02>] generic_delete_inode+0xa5/0x10f
> Sep 16 14:05:10 wordpress3 kernel:  [<c048c5a6>] iput+0x64/0x66
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c3a8bb>] delete_work_func+0x49/0x53 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<c04332da>] run_workqueue+0x78/0xb5
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c3a872>] delete_work_func+0x0/0x53 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<c0433b8e>] worker_thread+0xd9/0x10b
> Sep 16 14:05:10 wordpress3 kernel:  [<c041f81b>] default_wake_function+0x0/0xc
> Sep 16 14:05:10 wordpress3 kernel:  [<c0433ab5>] worker_thread+0x0/0x10b
> Sep 16 14:05:10 wordpress3 kernel:  [<c0435fa7>] kthread+0xc0/0xed
> Sep 16 14:05:10 wordpress3 kernel:  [<c0435ee7>] kthread+0x0/0xed
> Sep 16 14:05:10 wordpress3 kernel:  [<c0405c53>] kernel_thread_helper+0x7/0x10
>
> And then a bunch more for the httpd processes. I can pretty much
> reproduce this consistently by untarring a large tarball on the
> volume. Seems like anything I/O-intensive triggers this behavior.
>
> Running CentOS 5.5 with kernel 2.6.18-194.11.1.el5 #1 SMP Tue Aug 10
> 19:09:06 EDT 2010 i686 i686 i386 GNU/Linux
>
> I tried the hangalizer program and it always came back with:
> /bin/ls: /gfs2/: No such file or directory
> hb.medianewsgroup.com "/bin/ls /gfs2/"
> /bin/ls: /gfs2/: No such file or directory
> hb.medianewsgroup.com "/bin/ls /gfs2/"
> No waiting glocks found on any node.
>
> Any ideas?
>
Can you report this via our support team, or, if you don't have a support
contract, at least via bugzilla, so that we have a record of the problem
and it won't get missed?
That doesn't look at all right to me. The trace appears to show the
delete workqueue blocked in gfs2_rindex_hold(), waiting on a glock, and
that shouldn't stall for minutes at a time, so I'd like to get to the
bottom of what is going on here.
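In the meantime, it looks as though hangalizer never actually found your
mountpoint (its "/bin/ls /gfs2/" is failing on hb.medianewsgroup.com), so
the "No waiting glocks" result doesn't tell us much. You can gather the
same information by hand. This is only a sketch, assuming the debugfs
glock interface is available on your kernel, and mycluster:gfs2vol is a
made-up name; substitute the real cluster:fsname pair, which you can find
by listing /sys/kernel/debug/gfs2/ once debugfs is mounted:

  # Mount debugfs if it isn't already mounted
  mount -t debugfs none /sys/kernel/debug

  # While the hang is in progress, capture the glock state on every node
  cat /sys/kernel/debug/gfs2/mycluster:gfs2vol/glocks > /tmp/glocks.$(hostname)

If you can do that on all nodes while the tar extract is wedged, attaching
the dumps and the hung task traces to the bug should let us see which lock
is blocked and which node is holding it.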
> On 08/03/2010 01:38 PM, Scooter Morris wrote:
> > Hi all,
> > We continue to have gfs2 crashes and hangs on our production
> > cluster, so I'm beginning to think that we've done something really
> > wrong. Here is our set-up:
> >
> > * 4 node cluster, only 3 participate in gfs2 filesystems
> > * Running several services on multiple nodes using gfs2:
> > o IMAP (dovecot)
> > o Web (apache with lots of python)
> > o Samba (using ctdb)
> > * GFS2 partitions are multipathed on an HP EVA-based SAN (no LVM)
> > -- here is the fstab from one node (all three nodes are the same):
> >
> > LABEL=/1                      /                     ext3    defaults         1 1
> > LABEL=/boot1                  /boot                 ext3    defaults         1 2
> > tmpfs                         /dev/shm              tmpfs   defaults         0 0
> > devpts                        /dev/pts              devpts  gid=5,mode=620   0 0
> > sysfs                         /sys                  sysfs   defaults         0 0
> > proc                          /proc                 proc    defaults         0 0
> > LABEL=SW-cciss/c0d0p2         swap                  swap    defaults         0 0
> > LABEL=plato:Mail              /var/spool/mail       gfs2    noatime,_netdev
> > LABEL=plato:VarTmp            /var/tmp              gfs2    _netdev
> > LABEL=plato:UsrLocal          /usr/local            gfs2    noatime,_netdev
> > LABEL=plato:UsrLocalProjects  /usr/local/projects   gfs2    noatime,_netdev
> > LABEL=plato:Home2             /home/socr            gfs2    noatime,_netdev
> > LABEL=plato:HomeNoBackup      /home/socr/nobackup   gfs2    _netdev
> > LABEL=plato:DbBackup          /databases/backups    gfs2    noatime,_netdev
> > LABEL=plato:DbMol             /databases/mol        gfs2    noatime,_netdev
> > LABEL=plato:MolDbBlast        /databases/mol/blast  gfs2    noatime,_netdev
> > LABEL=plato:MolDbEmboss       /databases/mol/emboss gfs2    noatime,_netdev
> >
> > * Kernel version is 2.6.18-194.3.1.el5 and all nodes are x86_64.
> > * What's happening is that every so often we start seeing
> > gfs2-related task hangs in the logs. In the last instance (last
> > Friday) we got this:
> >
> > Node 0:
> >
> > [2010-07-30 13:23:25] INFO: task imap:25716 blocked for more than 120 seconds.
> > [2010-07-30 13:23:25] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [2010-07-30 13:23:25] imap D ffff8100010825a0 0 25716 9217 24080 25667 (NOTLB)
> > [2010-07-30 13:23:25]  ffff810619b59bc8 0000000000000086 ffff810113233f10 ffffffff00000000
> > [2010-07-30 13:23:26]  ffff81000f8c5cd0 000000000000000a ffff810233416040 ffff81082fd05100
> > [2010-07-30 13:23:26]  00012196d153c88e 0000000000008b81 ffff810233416228 0000000f6a949180
> > [2010-07-30 13:23:26] Call Trace:
> > [2010-07-30 13:23:26]  [<ffffffff887d0be6>] :gfs2:gfs2_dirent_find+0x0/0x4e
> > [2010-07-30 13:23:26]  [<ffffffff887d0c18>] :gfs2:gfs2_dirent_find+0x32/0x4e
> > [2010-07-30 13:23:26]  [<ffffffff887d5ee7>] :gfs2:just_schedule+0x0/0xe
> > [2010-07-30 13:23:26]  [<ffffffff887d5ef0>] :gfs2:just_schedule+0x9/0xe
> > [2010-07-30 13:23:26]  [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
> > [2010-07-30 13:23:26]  [<ffffffff887d5ee7>] :gfs2:just_schedule+0x0/0xe
> > [2010-07-30 13:23:26]  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
> > [2010-07-30 13:23:26]  [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
> > [2010-07-30 13:23:26]  [<ffffffff887d5ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
> > [2010-07-30 13:23:26]  [<ffffffff887e579e>] :gfs2:gfs2_permission+0x83/0xd5
> > [2010-07-30 13:23:26]  [<ffffffff887e5796>] :gfs2:gfs2_permission+0x7b/0xd5
> > [2010-07-30 13:23:26]  [<ffffffff8000ce97>] do_lookup+0x65/0x1e6
> > [2010-07-30 13:23:26]  [<ffffffff8000d918>] permission+0x81/0xc8
> > [2010-07-30 13:23:26]  [<ffffffff8000997f>] __link_path_walk+0x173/0xf42
> > [2010-07-30 13:23:26]  [<ffffffff8000e9e2>] link_path_walk+0x42/0xb2
> > [2010-07-30 13:23:26]  [<ffffffff8000ccb2>] do_path_lookup+0x275/0x2f1
> > [2010-07-30 13:23:26]  [<ffffffff8001280e>] getname+0x15b/0x1c2
> > [2010-07-30 13:23:27]  [<ffffffff80023876>] __user_walk_fd+0x37/0x4c
> > [2010-07-30 13:23:27]  [<ffffffff80028846>] vfs_stat_fd+0x1b/0x4a
> > [2010-07-30 13:23:27]  [<ffffffff800638b3>] schedule_timeout+0x92/0xad
> > [2010-07-30 13:23:27]  [<ffffffff80097dab>] process_timeout+0x0/0x5
> > [2010-07-30 13:23:27]  [<ffffffff800f8435>] sys_epoll_wait+0x3b8/0x3f9
> > [2010-07-30 13:23:27]  [<ffffffff800235a8>] sys_newstat+0x19/0x31
> > [2010-07-30 13:23:27]  [<ffffffff8005d229>] tracesys+0x71/0xe0
> > [2010-07-30 13:23:27]  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
> >
> > Node 1:
> >
> > [2010-07-30 13:23:59] INFO: task pdflush:623 blocked for more than 120 seconds.
> > [2010-07-30 13:23:59] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [2010-07-30 13:23:59] pdflush D ffff810407069aa0 0 623 291 624 622 (L-TLB)
> > [2010-07-30 13:23:59]  ffff8106073c1bd0 0000000000000046 0000000000000001 ffff8103fea899a8
> > [2010-07-30 13:23:59]  ffff8106073c1c30 000000000000000a ffff8105fff7c0c0 ffff8107fff4c820
> > [2010-07-30 13:24:00]  0000ed85d9d7a027 0000000000011b50 ffff8105fff7c2a8 00000006f0a9d0d0
> > [2010-07-30 13:24:00] Call Trace:
> > [2010-07-30 13:24:00]  [<ffffffff8001a927>] submit_bh+0x10a/0x111
> > [2010-07-30 13:24:00]  [<ffffffff88802ee7>] :gfs2:just_schedule+0x0/0xe
> > [2010-07-30 13:24:00]  [<ffffffff88802ef0>] :gfs2:just_schedule+0x9/0xe
> > [2010-07-30 13:24:00]  [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
> > [2010-07-30 13:24:00]  [<ffffffff88802ee7>] :gfs2:just_schedule+0x0/0xe
> > [2010-07-30 13:24:00]  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
> > [2010-07-30 13:24:00]  [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
> > [2010-07-30 13:24:00]  [<ffffffff88802ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
> > [2010-07-30 13:24:00]  [<ffffffff88813269>] :gfs2:gfs2_write_inode+0x5f/0x152
> > [2010-07-30 13:24:00]  [<ffffffff88813261>] :gfs2:gfs2_write_inode+0x57/0x152
> > [2010-07-30 13:24:00]  [<ffffffff8002fbf8>] __writeback_single_inode+0x1e9/0x328
> > [2010-07-30 13:24:00]  [<ffffffff80020ec9>] sync_sb_inodes+0x1b5/0x26f
> > [2010-07-30 13:24:00]  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
> > [2010-07-30 13:24:00]  [<ffffffff8005123a>] writeback_inodes+0x82/0xd8
> > [2010-07-30 13:24:00]  [<ffffffff800c97b5>] wb_kupdate+0xd4/0x14e
> > [2010-07-30 13:24:00]  [<ffffffff80056879>] pdflush+0x0/0x1fb
> > [2010-07-30 13:24:00]  [<ffffffff800569ca>] pdflush+0x151/0x1fb
> > [2010-07-30 13:24:00]  [<ffffffff800c96e1>] wb_kupdate+0x0/0x14e
> > [2010-07-30 13:24:01]  [<ffffffff80032894>] kthread+0xfe/0x132
> > [2010-07-30 13:24:01]  [<ffffffff8009d734>] request_module+0x0/0x14d
> > [2010-07-30 13:24:01]  [<ffffffff8005dfb1>] child_rip+0xa/0x11
> > [2010-07-30 13:24:01]  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
> > [2010-07-30 13:24:01]  [<ffffffff80032796>] kthread+0x0/0x132
> > [2010-07-30 13:24:01]  [<ffffffff8005dfa7>] child_rip+0x0/0x11
> >
> > Node 2:
> >
> > [2010-07-30 13:24:46] INFO: task delete_workqueu:7175 blocked for more than 120 seconds.
> > [2010-07-30 13:24:46] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [2010-07-30 13:24:46] delete_workqu D ffff81082b5cf860 0 7175 329 7176 7174 (L-TLB)
> > [2010-07-30 13:24:46]  ffff81081ed6dbf0 0000000000000046 0000000000000018 ffffffff887a84f3
> > [2010-07-30 13:24:46]  0000000000000286 000000000000000a ffff81082dd477e0 ffff81082b5cf860
> > [2010-07-30 13:24:46]  00012166bf7ec21d 000000000002ed0b ffff81082dd479c8 00000007887a9e5a
> > [2010-07-30 13:24:46] Call Trace:
> > [2010-07-30 13:24:46]  [<ffffffff887a84f3>] :dlm:request_lock+0x93/0xa0
> > [2010-07-30 13:24:47]  [<ffffffff8884f556>] :lock_dlm:gdlm_ast+0x0/0x311
> > [2010-07-30 13:24:47]  [<ffffffff8884f2c1>] :lock_dlm:gdlm_bast+0x0/0x8d
> > [2010-07-30 13:24:47]  [<ffffffff887d3ee7>] :gfs2:just_schedule+0x0/0xe
> > [2010-07-30 13:24:47]  [<ffffffff887d3ef0>] :gfs2:just_schedule+0x9/0xe
> > [2010-07-30 13:24:47]  [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
> > [2010-07-30 13:24:47]  [<ffffffff887d3ee7>] :gfs2:just_schedule+0x0/0xe
> > [2010-07-30 13:24:47]  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
> > [2010-07-30 13:24:47]  [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
> > [2010-07-30 13:24:47]  [<ffffffff887d3ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
> > [2010-07-30 13:24:47]  [<ffffffff887e82cf>] :gfs2:gfs2_check_blk_type+0xd7/0x1c9
> > [2010-07-30 13:24:47]  [<ffffffff887e82c7>] :gfs2:gfs2_check_blk_type+0xcf/0x1c9
> > [2010-07-30 13:24:47]  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
> > [2010-07-30 13:24:47]  [<ffffffff887e804f>] :gfs2:gfs2_rindex_hold+0x32/0x12b
> > [2010-07-30 13:24:47]  [<ffffffff887d5a29>] :gfs2:delete_work_func+0x0/0x65
> > [2010-07-30 13:24:47]  [<ffffffff887d5a29>] :gfs2:delete_work_func+0x0/0x65
> > [2010-07-30 13:24:47]  [<ffffffff887e3e3a>] :gfs2:gfs2_delete_inode+0x76/0x1b4
> > [2010-07-30 13:24:47]  [<ffffffff887e3e01>] :gfs2:gfs2_delete_inode+0x3d/0x1b4
> > [2010-07-30 13:24:47]  [<ffffffff8000d3ba>] dput+0x2c/0x114
> > [2010-07-30 13:24:48]  [<ffffffff887e3dc4>] :gfs2:gfs2_delete_inode+0x0/0x1b4
> > [2010-07-30 13:24:48]  [<ffffffff8002f35e>] generic_delete_inode+0xc6/0x143
> > [2010-07-30 13:24:48]  [<ffffffff887d5a83>] :gfs2:delete_work_func+0x5a/0x65
> > [2010-07-30 13:24:48]  [<ffffffff8004d8f0>] run_workqueue+0x94/0xe4
> > [2010-07-30 13:24:48]  [<ffffffff8004a12b>] worker_thread+0x0/0x122
> > [2010-07-30 13:24:48]  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
> > [2010-07-30 13:24:48]  [<ffffffff8004a21b>] worker_thread+0xf0/0x122
> > [2010-07-30 13:24:48]  [<ffffffff8008d087>] default_wake_function+0x0/0xe
> > [2010-07-30 13:24:48]  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
> > [2010-07-30 13:24:48]  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
> > [2010-07-30 13:24:48]  [<ffffffff80032894>] kthread+0xfe/0x132
> > [2010-07-30 13:24:48]  [<ffffffff8005dfb1>] child_rip+0xa/0x11
> > [2010-07-30 13:24:48]  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
> > [2010-07-30 13:24:48]  [<ffffffff80032796>] kthread+0x0/0x132
> > [2010-07-30 13:24:48]  [<ffffffff8005dfa7>] child_rip+0x0/0x11
> >
> > * Various messages related to hung_task_timeouts repeated on each
> > node (usually related to imap).
> > * Within a minute or two, the cluster was completely hung. Root
> > could log into the console, but commands (like dmesg) would just
> > hang.
> >
> > So, my major question: is there something wrong with my
> > configuration? Have we done something really stupid? The initial
> > response from Red Hat was that we shouldn't run services on multiple
> > nodes that access gfs2, which seems a little confusing, since we
> > would use ext3 or ext4 if we were going to node-lock (or fail over)
> > the partitions. Have we missed something somewhere?
> >
That doesn't sound quite right... our guidance is not to run NFS and
Samba together on the same GFS2 directory tree, nor to run either of
them in combination with local applications on that tree. Otherwise
there shouldn't be any issues with running multiple applications on the
same GFS2 tree/mount.
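One other thing I noticed in the fstab posted earlier in the thread: two
of the gfs2 mounts (plato:VarTmp and plato:HomeNoBackup) are mounted
without noatime. With atime enabled, even read-mostly access generates
periodic inode updates, and hence extra glock traffic between nodes, so
we generally recommend noatime on GFS2. Just a sketch of the two lines
as I'd write them, reusing the labels from that fstab:

  LABEL=plato:VarTmp            /var/tmp              gfs2    noatime,_netdev
  LABEL=plato:HomeNoBackup      /home/socr/nobackup   gfs2    noatime,_netdev

That won't explain a delete workqueue getting stuck, but it does remove
a source of unnecessary lock contention.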
Steve.
> > Thanks in advance for any help anyone can give. We're getting pretty
> > desperate here since the downtime is starting to have a significant
> > impact on our credibility.
> >
> > -- scooter
> >
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>