[Linux-cluster] problem with deadlocked processes (D)

Mark Hlawatschek hlawatschek at atix.de
Wed Apr 4 13:58:41 UTC 2007


Hi,

I observed much the same problem some time ago.
Here is the Bugzilla entry I opened:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=228916
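
When comparing listings like the one quoted below, it can help to tally the D-state processes by wait channel rather than read them line by line. The following is only a minimal sketch, assuming the four-column output of `ps ax -ocomm,pid,state,wchan`. The function names `count_blocked` and `ps_snapshot` are made up for illustration and are not part of any cluster tool.

```python
import subprocess
from collections import Counter


def count_blocked(ps_output: str) -> Counter:
    """Count D-state processes per (command, wchan) pair.

    Expects text in the four-column layout produced by
    `ps ax -ocomm,pid,state,wchan` (COMMAND, PID, S, WCHAN).
    """
    counts = Counter()
    for line in ps_output.splitlines():
        # Split from the right so a command name containing spaces
        # still ends up in the first field.
        fields = line.rsplit(None, 3)
        if len(fields) != 4:
            continue  # blank lines, "...." elisions, truncated rows
        comm, _pid, state, wchan = fields
        if state == "D":  # the header row has "S" here, so it is skipped too
            counts[(comm, wchan)] += 1
    return counts


def ps_snapshot() -> str:
    """Run ps with the same columns used in this thread (hypothetical helper)."""
    result = subprocess.run(
        ["ps", "ax", "-ocomm,pid,state,wchan"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `count_blocked(ps_snapshot())` twice, a few seconds apart, and comparing the two results gives a quick indication of whether the same processes are still waiting on the same channel.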

Mark

On Wednesday 04 April 2007 15:18:18 Peter Sopko wrote:
> Hi,
>
> thanks for your reply Bryn.
>
> The output of the ps command you suggested (I've omitted the standard
> system processes):
>
> [root@mail1 subsys]# ps ax -ocomm,pid,state,wchan |more
> COMMAND            PID S WCHAN
> ccsd              2258 S -
> cman_comms        2310 S cluster_kthread
> cman_serviced     2312 S serviced
> cman_memb         2311 S membership_kthread
> cman_hbeat        2315 S hello_kthread
> fenced            2336 S rt_sigsuspend
> dlm_astd          2354 S dlm_astd
> dlm_recvd         2355 S dlm_recvd
> dlm_sendd         2356 S dlm_sendd
> lock_dlm1         2358 S dlm_async
> lock_dlm2         2359 S dlm_async
> gfs_scand         2360 S -
> gfs_glockd        2361 S gfs_glockd
> gfs_recoverd      2362 S -
> gfs_logd          2363 S -
> gfs_quotad        2364 D glock_wait_internal
> gfs_inoded        2365 D dlm_lock_sync
> syslogd           2374 S -
> klogd             2394 S syslog
> heartbeat         2503 S -
> courierlogger     2526 S pipe_wait
> authdaemond       2527 S -
> authdaemond       2551 S -
> authdaemond       2552 S -
> authdaemond       2553 S -
> authdaemond       2554 S -
> authdaemond       2555 S -
> heartbeat         2586 S pipe_wait
> heartbeat         2587 S -
> heartbeat         2588 S -
> heartbeat         2589 S -
> heartbeat         2590 S -
> acpid             2595 S -
> ipfail            2608 S -
> nod32d            2609 S -
> nod32smtp         2618 S -
> sshd              2627 S -
> ntpd              2642 S -
> courierlogger     2654 S pipe_wait
> couriertcpd       2655 S -
> courierlogger     2661 S pipe_wait
> couriertcpd       2662 S -
> courierlogger     2667 S pipe_wait
> couriertcpd       2668 S wait
> courierlogger     2673 S pipe_wait
> couriertcpd       2674 S -
> master            2815 S -
> master            3024 S -
> httpd             3039 S -
> crond             3048 S -
> rhnsd             3067 S -
> mingetty          3074 S -
> mingetty          3075 S -
> mingetty          3076 S -
> mingetty          3077 S -
> mingetty          3078 S -
> mingetty          3079 S -
> ntpd              3888 S rt_sigsuspend
> tlsmgr            4544 S -
> tlsmgr            1585 S -
> anvil             1699 S -
> spamd            29941 S -
> httpd            15674 D glock_wait_internal
> httpd            15675 D glock_wait_internal
> httpd            15676 D glock_wait_internal
> httpd            15677 D glock_wait_internal
> httpd            15678 D glock_wait_internal
> httpd            15679 D glock_wait_internal
> httpd            15680 D glock_wait_internal
> httpd            15681 D glock_wait_internal
> httpd            30808 D glock_wait_internal
> httpd            30809 D glock_wait_internal
> httpd            30810 D glock_wait_internal
> httpd            30825 D glock_wait_internal
> httpd            30827 D glock_wait_internal
> httpd            30828 D glock_wait_internal
> httpd            30829 D glock_wait_internal
> httpd            30830 D glock_wait_internal
> httpd            30831 D glock_wait_internal
> httpd            30832 D glock_wait_internal
> httpd            30835 D glock_wait_internal
> httpd            30840 D glock_wait_internal
> spamd            17341 S -
> proxymap         24868 S -
> proxymap         27542 S -
> mysqld_safe      30617 S wait
> mysqld           30650 S -
> trivial-rewrite  30735 S -
> proxymap         30742 S -
> sshd               517 S -
> sshd               519 S -
> bash               520 S wait
> su                 740 S wait
> bash               741 S -
> imapd            15018 D lock_on_glock
> virtual          15699 D lock_on_glock
> trivial-rewrite  15918 S -
> proxymap         15922 S -
> virtual          15943 D lock_on_glock
> virtual          15952 D lock_on_glock
> virtual          15965 D lock_on_glock
> pop3d            15966 D lock_on_glock
> pop3d            15967 D lock_on_glock
> virtual          15968 D lock_on_glock
> pop3d            15971 D lock_on_glock
> pop3d            15983 D lock_on_glock
> virtual          16046 D lock_on_glock
> pop3d            16049 D lock_on_glock
> pop3d            16053 D lock_on_glock
> pop3d            16068 D glock_wait_internal
> pop3d            16074 D glock_wait_internal
> virtual          16077 D lock_on_glock
> spamd            16112 S -
> virtual          16129 D lock_on_glock
> virtual          16133 D lock_on_glock
> pop3d            16143 D glock_wait_internal
> virtual          16153 D lock_on_glock
> virtual          16160 D glock_wait_internal
> virtual          16163 D lock_on_glock
> pop3d            16164 D glock_wait_internal
> virtual          16179 D lock_on_glock
> pop3d            16183 D glock_wait_internal
> pop3d            16186 D glock_wait_internal
> pop3d            16187 D glock_wait_internal
> virtual          16191 D lock_on_glock
> pop3d            16192 D lock_on_glock
> virtual          16194 D lock_on_glock
> pop3d            16202 D glock_wait_internal
> virtual          16207 D lock_on_glock
> virtual          16217 D lock_on_glock
> virtual          16222 D lock_on_glock
> ....
> smtp             21150 S -
> smtp             21162 S flock_lock_file_wait
> cleanup          21181 S flock_lock_file_wait
> smtpd            21213 S -
> spamfilter.sh    21224 S wait
> cat              21225 S pipe_wait
> spamfilter.sh    21226 D -
> spamfilter.sh    21229 S wait
> pipe             21230 S -
> cat              21231 S pipe_wait
> spamfilter.sh    21232 D -
> spamfilter.sh    21235 S wait
> cat              21236 S pipe_wait
> spamfilter.sh    21237 D -
> spamfilter.sh    21239 S wait
> spamfilter.sh    21240 S wait
> cat              21242 S pipe_wait
> spamfilter.sh    21243 D -
> virtual          21244 D lock_on_glock
> cat              21245 S pipe_wait
> spamfilter.sh    21246 D -
> spamfilter.sh    21249 S wait
> cat              21250 S pipe_wait
> spamfilter.sh    21251 D -
> spamfilter.sh    21252 S wait
> cat              21253 S pipe_wait
> spamfilter.sh    21254 D -
> spamfilter.sh    21257 S wait
> cat              21258 S pipe_wait
> spamfilter.sh    21259 D -
> spamfilter.sh    21261 S wait
> spamfilter.sh    21262 S wait
> spamfilter.sh    21263 S wait
> cat              21264 S pipe_wait
> spamfilter.sh    21265 D -
> spamfilter.sh    21267 D -
> cat              21268 S pipe_wait
> spamfilter.sh    21269 D -
> spamfilter.sh    21273 S wait
> ...
> etc....
>
>
> The sysrq-t output can be found at http://www.backbone.sk/sysrq.tar.
> It is 400k in size, so I have chosen not to attach it here. There are
> two files in the .tar: one was taken at 15:04 and the other at 15:08.
>
> Again, I would be very thankful for any help.
>
> Peter Sopko, IT Security Consultant
> Tempest a.s.
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bryn M. Reeves
> Sent: Wednesday, April 04, 2007 2:45 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] problem with deadlocked processes (D)
>
> Peter Sopko wrote:
> > Hi,
> >
> > today a strange thing occurred - on both of our cluster nodes a lot of
> > processes suddenly started to become locked in the D state (i/o lock).
> > This has already happened once before (six months ago), but a simple
> > reboot helped to solve the issue. As it has appeared again, I don't
> > want to solve it this way again; I would like to find the reason why
> > this is happening, but have no idea where to start. In
> > /var/log/messages there is nothing unusual; the only thing is that some
> > directories are unremovable and a lot of processes are locked.
>
> For problems where processes are getting stuck in D state it's usually
> helpful to get sysrq-t data to see where the threads are stuck. Grab two
> sets of data a few seconds apart so that you can see if things are
> really stuck or just making slow progress.
>
> You can also get some information from the wchan data exposed in /proc -
> it's easiest to view with ps:
>
> $ ps ax -ocomm,pid,state,wchan
> COMMAND           PID S WCHAN
> vim             22322 S -
> bash            22471 S -
> man             22817 S wait
> sh              22820 S wait
> sh              22821 S wait
> less            22826 S -
> bash            22839 S wait
> screen          23435 S pause
> [...]
>
> Regards,
> Bryn.
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards,

Dipl.-Ing. Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany




