[Linux-cluster] Processes in D state

Wed Jan 5 12:56:25 UTC 2011

Hi Adam,

Thanks for your help. One problem was that we had not mounted the GFS2
file system with the noatime and nodiratime options.

We still have a problem with postfix. The gfs2 hang analyzer says:

There is 1 glock with waiters.
node4, pid 20902 is waiting for glock 6/11486739, which is held by pid 12382
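
To check whether output like this describes a true deadlock, the lines can be turned into a waits-for graph and searched for a cycle. A minimal Python sketch, assuming the analyzer always prints lines in the format shown above and that the holder runs on the same node as the waiter; parse_waits and find_cycle are made-up names, not part of the hang analyzer:

```python
import re

# Matches lines shaped like the analyzer output quoted above, e.g.
# "node4, pid 20902 is waiting for glock 6/11486739, which is held by pid 12382"
WAIT_RE = re.compile(
    r"(?P<node>\S+), pid (?P<waiter>\d+) is waiting for glock "
    r"(?P<glock>\S+), which is held by pid (?P<holder>\d+)"
)


def parse_waits(text):
    """Return a list of (node, waiter_pid, glock, holder_pid) tuples."""
    waits = []
    for line in text.splitlines():
        m = WAIT_RE.search(line)
        if m:
            waits.append(
                (m["node"], int(m["waiter"]), m["glock"], int(m["holder"]))
            )
    return waits


def find_cycle(waits):
    """Return one cycle of (node, pid) keys in the waits-for graph, or None."""
    # PIDs are only unique per node, so each process is keyed by (node, pid).
    # Assumption: the holder lives on the node named at the start of the line.
    graph = {}
    for node, waiter, _glock, holder in waits:
        graph.setdefault((node, waiter), set()).add((node, holder))

    done = set()  # processes already explored without finding a cycle

    def visit(proc, path):
        if proc in path:
            return path[path.index(proc):]
        if proc in done:
            return None
        for nxt in sorted(graph.get(proc, ())):
            cycle = visit(nxt, path + [proc])
            if cycle:
                return cycle
        done.add(proc)
        return None

    for start in sorted(graph):
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None
```

For the single line above this finds no cycle: pid 20902 waits on pid 12382, but nothing reports pid 12382 as waiting on anything.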

Both PIDs are on the same node:

root     12382  0.0  0.0  36844  2300 ?        Ss   12:39   0:00 /usr/lib/postfix/master
root     20902  0.0  0.0  36844  2156 ?        Ds   12:45   0:00 /usr/lib/postfix/master -t

I have no idea what Postfix is trying to do here?!
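
One way to dig further is to ask the kernel where each D state process is sleeping. A rough Python sketch that scans /proc for processes in state D and collects their wchan and kernel stack; it assumes a Linux /proc layout, the stack file is normally readable only by root and may not exist on every kernel, and the function names are made up for illustration:

```python
from pathlib import Path


def parse_stat(stat_text):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def read_optional(path):
    """Return the stripped file contents, or None if it cannot be read."""
    try:
        return path.read_text(errors="replace").strip()
    except OSError:
        # The process may have exited, or the file may be root-only.
        return None


def d_state_report(proc_root="/proc"):
    """Yield a dict for every process currently in state D."""
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        stat_text = read_optional(entry / "stat")
        if not stat_text:
            continue
        pid, comm, state = parse_stat(stat_text)
        if state != "D":
            continue
        yield {
            "pid": pid,
            "comm": comm,
            "wchan": read_optional(entry / "wchan"),
            "stack": read_optional(entry / "stack"),
        }
```

Looping over d_state_report() as root and printing each entry should show the kernel function each hung process is blocked in.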

Mario

On 04.01.11 16:27, Adam Drew wrote:
> Hello,
> 
> Processes accessing a GFS2 filesystem falling into D state is typically indicative of lock contention; however, other causes are also possible. D state is uninterruptible sleep, waiting on I/O. With regard to GFS2, this means that a PID has requested access to some object on disk and has not yet gained access to that object. As the PID cannot proceed until granted access, it is hung in D state.
> 
> The most common cause of D state PIDs on GFS2 is lock contention. GFS2's shared locking system is more complex than that of traditional single-node filesystems. You can run into a situation where a given PID holds a lock on one resource and is waiting in line for a lock on a second resource, while the holder of that second resource is in turn waiting for the first PID to release its lock. This causes a deadlock where neither process can make progress, both end up in D state, and so will any process that requests access to either of those resources. In other cases PIDs requesting access to a resource on disk may build up faster than the resource is released. In this case the queue of waiters will build and build until the filesystem grinds to a halt and appears to "hang." In other cases bugs or design issues may lead to locking bottlenecks.
> 
> GFS2 locks are arbitrated in the glock (pronounced gee-lock) layer. The glock subsystem is exposed via debugfs. You can mount debugfs, look in the gfs2 directory, and view the glocks. You can then match up the glocks to the process list on the system and to the messages logs. Doing this for every node in the cluster can reveal problems. If you have Red Hat support I encourage you to engage them, as learning to read glocks can be a non-trivial process, but it is not impossible. They are documented to a degree in the following documents:
> 
> "Testing and verification of cluster filesystems" by Steven Whitehouse
> http://www.kernel.org/doc/ols/2009/ols2009-pages-311-318.pdf
> 
> Global File System 2, Edition 7, section 1.4. "GFS2 Node Locking"
> http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html-single/Global_File_System_2/index.html#s1-ov-lockbounce
> 
> More information is available out on the web.
> 
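
As a rough illustration of matching glocks to PIDs, here is a Python sketch that picks out glocks with queued requests from a glock dump. It assumes the dump format described in the kernel's gfs2-glocks documentation ("G:" lines for glocks, indented "H:" lines for holders, with flag W for a waiting request and H for a granted one) and the usual debugfs location /sys/kernel/debug/gfs2/<fsname>/glocks. The function names are invented for this example, and the format may differ between kernel versions:

```python
import re
from pathlib import Path

# Assumed line shapes (from the kernel's gfs2-glocks documentation):
#   G:  s:SH n:5/75320 f:I t:SH d:EX/0 a:0 r:3
#    H: s:SH f:EH e:0 p:4466 [postmark] gfs2_inode_lookup+0x14e/0x260 [gfs2]
GLOCK_RE = re.compile(r"^G:\s+.*?\bn:(?P<name>\d+/[0-9a-fA-F]+)")
HOLDER_RE = re.compile(
    r"^\s*H:\s+s:(?P<state>\S+)\s+f:(?P<flags>\S*)\s+"
    r".*?\bp:(?P<pid>\d+)\s+\[(?P<comm>[^\]]*)\]"
)


def read_glock_dump(fs_name, debugfs="/sys/kernel/debug"):
    """Read the glock dump for one filesystem (path layout is an assumption)."""
    return (Path(debugfs) / "gfs2" / fs_name / "glocks").read_text()


def glocks_with_waiters(dump_text):
    """Return {glock: {"holders": [...], "waiters": [...]}} for contended glocks."""
    result = {}
    current = None
    for line in dump_text.splitlines():
        if line.startswith("G:"):
            g = GLOCK_RE.match(line)
            current = None
            if g:
                current = {"holders": [], "waiters": []}
                result[g["name"]] = current
            continue
        h = HOLDER_RE.match(line)
        if h and current is not None:
            entry = (int(h["pid"]), h["comm"])
            # Assumption: flag W marks a queued request, H a granted one.
            if "W" in h["flags"]:
                current["waiters"].append(entry)
            elif "H" in h["flags"]:
                current["holders"].append(entry)
    # Keep only glocks that have at least one waiting request.
    return {name: info for name, info in result.items() if info["waiters"]}
```

The PIDs returned can then be looked up in the process list on that node. A holder on another node will not appear in the local dump, so the same check has to be repeated on every node.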
> Regards,
> Adam Drew
> 
> ----- Original Message -----
> From: "Emilio Arjona" <emilio at ugr.es>
> To: "linux clustering" <linux-cluster at redhat.com>
> Sent: Tuesday, January 4, 2011 6:27:52 AM
> Subject: Re: [Linux-cluster] Processes in D state
> 
> 
> Same problem here: in a webserver cluster, httpd sometimes runs into D state. I have to restart the node, or even the whole cluster if more than one node is locked. I'm using Red Hat 5.4 and HP hardware.
> 
> 
> Regards, 
> 
> 
> 2011/1/4 Paras pradhan < pradhanparas at gmail.com > 
> 
> 
> I had the same problem. It locked the whole GFS cluster and I had to
> reboot the node. After the reboot all is fine, but I am still trying to
> find out what caused it.
> 
> Paras 
> 
> On Monday, January 3, 2011, InterNetworX | Hostmaster < hostmaster at inwx.de > wrote:
>> Hello, 
>>
>> we are using GFS2 but sometimes there are processes hanging in D state: 
>>
>> # ps axl | grep D
>> F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
>> 0 0 14220 14219 20 0 19624 1916 - Ds ? 0:00 /usr/lib/postfix/master -t
>> 0 0 14555 14498 20 0 16608 1716 - D+ /mnt/storage/openvz/root/129/dev/pts/0 0:00 apt-get install less
>> 0 0 15068 15067 19 -1 36844 2156 - D<s ? 0:00 /usr/lib/postfix/master -t
>> 0 0 16603 16602 19 -1 36844 2156 - D<s ? 0:00 /usr/lib/postfix/master -t
>> 4 101 19534 13238 19 -1 33132 2984 - D< ? 0:00 smtpd -n smtp -t inet -u -c
>> 4 101 19542 13238 19 -1 33116 2976 - D< ? 0:00 smtpd -n smtp -t inet -u -c
>> 0 0 19735 13068 20 0 7548 880 - S+ pts/0 0:00 grep D
>>
>> dmesg shows this message many times: 
>>
>> [11142.334229] INFO: task master:14220 blocked for more than 120 seconds.
>> [11142.334266] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [11142.334310] master D ffff88032b644800 0 14220 14219 0x00000000
>> [11142.334315] ffff88062dd40000 0000000000000086 0000000000000000 ffffffffa02628d9
>> [11142.334318] ffff88017a517ef8 000000000000fa40 ffff88017a517fd8 0000000000016940
>> [11142.334322] 0000000000016940 ffff88032b644800 ffff88032b644af8 0000000b7a517cd8
>> [11142.334325] Call Trace: 
>> [11142.334340] [<ffffffffa02628d9>] ? gfs2_glock_put+0xf9/0x118 [gfs2] 
>> [11142.334347] [<ffffffffa0261db0>] ? gfs2_glock_holder_wait+0x0/0xd [gfs2] 
>> [11142.334353] [<ffffffffa0261db9>] ? gfs2_glock_holder_wait+0x9/0xd [gfs2] 
>> [11142.334358] [<ffffffff812e9897>] ? __wait_on_bit+0x41/0x70 
>> [11142.334363] [<ffffffffa0261db0>] ? gfs2_glock_holder_wait+0x0/0xd [gfs2] 
>> [11142.334367] [<ffffffff812e9931>] ? out_of_line_wait_on_bit+0x6b/0x77 
>> [11142.334370] [<ffffffff81066808>] ? wake_bit_function+0x0/0x23 
>> [11142.334376] [<ffffffffa0261d9e>] ? gfs2_glock_wait+0x23/0x28 [gfs2] 
>> [11142.334383] [<ffffffffa026b2b0>] ? gfs2_flock+0x17c/0x1f9 [gfs2] 
>> [11142.334386] [<ffffffff810e735d>] ? virt_to_head_page+0x9/0x2a 
>> [11142.334389] [<ffffffff810e743e>] ? ub_slab_ptr+0x22/0x65 
>> [11142.334393] [<ffffffff8112221b>] ? sys_flock+0xff/0x12a 
>> [11142.334396] [<ffffffff81010c12>] ? system_call_fastpath+0x16/0x1b 
>>
>> Any idea what is going wrong? Do you need any more information?
>>
>> Mario 
>>
>> -- 
>> Linux-cluster mailing list 
>> Linux-cluster at redhat.com 
>> https://www.redhat.com/mailman/listinfo/linux-cluster 
>>
>