[Linux-cluster] problem with deadlocked processes (D)
Mark Hlawatschek
hlawatschek at atix.de
Wed Apr 4 13:58:41 UTC 2007
Hi,
I have observed much the same problem before.
Here is the Bugzilla entry I opened:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=228916
Mark
On Wednesday 04 April 2007 15:18:18 Peter Sopko wrote:
> Hi,
>
> thanks for your reply Bryn.
>
> The output of the ps command you suggested (I've omitted the standard
> system processes):
>
> [root@mail1 subsys]# ps ax -ocomm,pid,state,wchan | more
> COMMAND PID S WCHAN
> ccsd 2258 S -
> cman_comms 2310 S cluster_kthread
> cman_serviced 2312 S serviced
> cman_memb 2311 S membership_kthread
> cman_hbeat 2315 S hello_kthread
> fenced 2336 S rt_sigsuspend
> dlm_astd 2354 S dlm_astd
> dlm_recvd 2355 S dlm_recvd
> dlm_sendd 2356 S dlm_sendd
> lock_dlm1 2358 S dlm_async
> lock_dlm2 2359 S dlm_async
> gfs_scand 2360 S -
> gfs_glockd 2361 S gfs_glockd
> gfs_recoverd 2362 S -
> gfs_logd 2363 S -
> gfs_quotad 2364 D glock_wait_internal
> gfs_inoded 2365 D dlm_lock_sync
> syslogd 2374 S -
> klogd 2394 S syslog
> heartbeat 2503 S -
> courierlogger 2526 S pipe_wait
> authdaemond 2527 S -
> authdaemond 2551 S -
> authdaemond 2552 S -
> authdaemond 2553 S -
> authdaemond 2554 S -
> authdaemond 2555 S -
> heartbeat 2586 S pipe_wait
> heartbeat 2587 S -
> heartbeat 2588 S -
> heartbeat 2589 S -
> heartbeat 2590 S -
> acpid 2595 S -
> ipfail 2608 S -
> nod32d 2609 S -
> nod32smtp 2618 S -
> sshd 2627 S -
> ntpd 2642 S -
> courierlogger 2654 S pipe_wait
> couriertcpd 2655 S -
> courierlogger 2661 S pipe_wait
> couriertcpd 2662 S -
> courierlogger 2667 S pipe_wait
> couriertcpd 2668 S wait
> courierlogger 2673 S pipe_wait
> couriertcpd 2674 S -
> master 2815 S -
> master 3024 S -
> httpd 3039 S -
> crond 3048 S -
> rhnsd 3067 S -
> mingetty 3074 S -
> mingetty 3075 S -
> mingetty 3076 S -
> mingetty 3077 S -
> mingetty 3078 S -
> mingetty 3079 S -
> ntpd 3888 S rt_sigsuspend
> tlsmgr 4544 S -
> tlsmgr 1585 S -
> anvil 1699 S -
> spamd 29941 S -
> httpd 15674 D glock_wait_internal
> httpd 15675 D glock_wait_internal
> httpd 15676 D glock_wait_internal
> httpd 15677 D glock_wait_internal
> httpd 15678 D glock_wait_internal
> httpd 15679 D glock_wait_internal
> httpd 15680 D glock_wait_internal
> httpd 15681 D glock_wait_internal
> httpd 30808 D glock_wait_internal
> httpd 30809 D glock_wait_internal
> httpd 30810 D glock_wait_internal
> httpd 30825 D glock_wait_internal
> httpd 30827 D glock_wait_internal
> httpd 30828 D glock_wait_internal
> httpd 30829 D glock_wait_internal
> httpd 30830 D glock_wait_internal
> httpd 30831 D glock_wait_internal
> httpd 30832 D glock_wait_internal
> httpd 30835 D glock_wait_internal
> httpd 30840 D glock_wait_internal
> spamd 17341 S -
> proxymap 24868 S -
> proxymap 27542 S -
> mysqld_safe 30617 S wait
> mysqld 30650 S -
> trivial-rewrite 30735 S -
> proxymap 30742 S -
> sshd 517 S -
> sshd 519 S -
> bash 520 S wait
> su 740 S wait
> bash 741 S -
> imapd 15018 D lock_on_glock
> virtual 15699 D lock_on_glock
> trivial-rewrite 15918 S -
> proxymap 15922 S -
> virtual 15943 D lock_on_glock
> virtual 15952 D lock_on_glock
> virtual 15965 D lock_on_glock
> pop3d 15966 D lock_on_glock
> pop3d 15967 D lock_on_glock
> virtual 15968 D lock_on_glock
> pop3d 15971 D lock_on_glock
> pop3d 15983 D lock_on_glock
> virtual 16046 D lock_on_glock
> pop3d 16049 D lock_on_glock
> pop3d 16053 D lock_on_glock
> pop3d 16068 D glock_wait_internal
> pop3d 16074 D glock_wait_internal
> virtual 16077 D lock_on_glock
> spamd 16112 S -
> virtual 16129 D lock_on_glock
> virtual 16133 D lock_on_glock
> pop3d 16143 D glock_wait_internal
> virtual 16153 D lock_on_glock
> virtual 16160 D glock_wait_internal
> virtual 16163 D lock_on_glock
> pop3d 16164 D glock_wait_internal
> virtual 16179 D lock_on_glock
> pop3d 16183 D glock_wait_internal
> pop3d 16186 D glock_wait_internal
> pop3d 16187 D glock_wait_internal
> virtual 16191 D lock_on_glock
> pop3d 16192 D lock_on_glock
> virtual 16194 D lock_on_glock
> pop3d 16202 D glock_wait_internal
> virtual 16207 D lock_on_glock
> virtual 16217 D lock_on_glock
> virtual 16222 D lock_on_glock
> ....
> smtp 21150 S -
> smtp 21162 S flock_lock_file_wait
> cleanup 21181 S flock_lock_file_wait
> smtpd 21213 S -
> spamfilter.sh 21224 S wait
> cat 21225 S pipe_wait
> spamfilter.sh 21226 D -
> spamfilter.sh 21229 S wait
> pipe 21230 S -
> cat 21231 S pipe_wait
> spamfilter.sh 21232 D -
> spamfilter.sh 21235 S wait
> cat 21236 S pipe_wait
> spamfilter.sh 21237 D -
> spamfilter.sh 21239 S wait
> spamfilter.sh 21240 S wait
> cat 21242 S pipe_wait
> spamfilter.sh 21243 D -
> virtual 21244 D lock_on_glock
> cat 21245 S pipe_wait
> spamfilter.sh 21246 D -
> spamfilter.sh 21249 S wait
> cat 21250 S pipe_wait
> spamfilter.sh 21251 D -
> spamfilter.sh 21252 S wait
> cat 21253 S pipe_wait
> spamfilter.sh 21254 D -
> spamfilter.sh 21257 S wait
> cat 21258 S pipe_wait
> spamfilter.sh 21259 D -
> spamfilter.sh 21261 S wait
> spamfilter.sh 21262 S wait
> spamfilter.sh 21263 S wait
> cat 21264 S pipe_wait
> spamfilter.sh 21265 D -
> spamfilter.sh 21267 D -
> cat 21268 S pipe_wait
> spamfilter.sh 21269 D -
> spamfilter.sh 21273 S wait
> ...
> etc....
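A listing this long is easier to read when the D-state processes are counted per wait channel. The following is only a minimal sketch: the function name summarise_d_state is hypothetical, and it assumes the four-column output of the ps command above with a COMMAND field that contains no spaces.

```shell
#!/bin/sh
# Hypothetical helper: count processes in state D (uninterruptible sleep)
# per wait channel. Reads `ps ax -ocomm,pid,state,wchan` output on stdin
# and assumes the COMMAND column contains no embedded spaces.
summarise_d_state() {
    # Skip the header line; field 3 is the state, field 4 the wait channel.
    awk 'NR > 1 && $3 == "D" { count[$4]++ }
         END { for (w in count) printf "%6d %s\n", count[w], w }' |
        sort -rn    # most common wait channel first
}
```

For example: ps ax -ocomm,pid,state,wchan | summarise_d_state. On the listing above, this would show that nearly all of the blocked processes sit in glock_wait_internal or lock_on_glock, which by their names look like waits on GFS glocks.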
>
>
> The sysrq-t output can be found at http://www.backbone.sk/sysrq.tar. It is
> about 400 KB, so I chose not to attach it here. The archive contains two
> files: one taken at 15:04 and the other at 15:08.
>
> Again, I would be very grateful for any help.
>
> Peter Sopko, IT Security Consultant
> Tempest a.s.
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bryn M. Reeves
> Sent: Wednesday, April 04, 2007 2:45 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] problem with deadlocked processes (D)
>
> Peter Sopko wrote:
> > Hi,
> >
> > today a strange thing occurred: on both of our cluster nodes a lot of
> > processes suddenly became stuck in the D state (uninterruptible sleep,
> > usually waiting for I/O). This has happened once before (six months ago),
> > and a simple reboot solved the issue then. But since it has appeared
> > again, I don't want to solve it that way again; I would like to find out
> > why this is happening, but I have no idea where to start. There is
> > nothing unusual in /var/log/messages; the only symptoms are that some
> > directories cannot be removed and a lot of processes are locked.
>
> For problems where processes are getting stuck in the D state, it's usually
> helpful to get sysrq-t data (a dump of every task's state and kernel stack)
> to see where the threads are stuck. Grab two sets of data a few seconds
> apart so that you can see whether things are really stuck or just making
> slow progress.
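The same "two samples a few seconds apart" idea can also be applied to the much cheaper ps output. As a minimal sketch (the helper name stuck_in_both is hypothetical, and comparing ps snapshots is a lighter-weight stand-in for comparing full sysrq-t dumps, not a replacement for them), this prints the processes that are in state D with the same wait channel in both snapshots:

```shell
#!/bin/sh
# Hypothetical helper: given two files of `ps ax -ocomm,pid,state,wchan`
# output taken a few seconds apart, print "PID COMMAND WCHAN" for every
# process that is in state D with the same wait channel in both files.
stuck_in_both() {
    awk 'FNR > 1 && $3 == "D" {          # skip the header line of each file
             key = $2 " " $1 " " $4
             if (FILENAME == ARGV[1]) {
                 first[key] = 1          # remember D-state entries of snapshot 1
             } else if (key in first) {
                 print key               # still blocked at the same place
             }
         }' "$1" "$2"
}
```

For example, save one snapshot with ps ax -ocomm,pid,state,wchan > /tmp/ps.1, wait a few seconds, save /tmp/ps.2 the same way, then run stuck_in_both /tmp/ps.1 /tmp/ps.2. A process that appears in the output is only a candidate for being really stuck (it may have blocked again at the same place); one that has left the D state or moved to another wait channel is at least making some progress.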
>
> You can also get some information from the wchan (wait channel) data
> exposed in /proc; it's easiest to view with ps:
>
> $ ps ax -ocomm,pid,state,wchan
> COMMAND PID S WCHAN
> vim 22322 S -
> bash 22471 S -
> man 22817 S wait
> sh 22820 S wait
> sh 22821 S wait
> less 22826 S -
> bash 22839 S wait
> screen 23435 S pause
> [...]
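The same state and wait-channel information can also be read from /proc without ps. This is a sketch under stated assumptions: the function name list_d_state is hypothetical, and it assumes a kernel that provides /proc/PID/status and /proc/PID/wchan (Linux 2.6 and later); depending on kernel configuration, wchan may only contain 0, in which case the output is less useful.

```shell
#!/bin/sh
# Hypothetical helper: list processes in state D straight from /proc,
# printing "PID NAME WCHAN" for each one. Assumes /proc/PID/status and
# /proc/PID/wchan exist; processes that exit mid-scan are silently skipped.
list_d_state() {
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        # The State: line looks like "State:  D (disk sleep)".
        state=$(awk '/^State:/ { print $2 }' "$dir/status" 2>/dev/null)
        [ "$state" = "D" ] || continue
        name=$(sed -n 's/^Name:[[:space:]]*//p' "$dir/status" 2>/dev/null)
        wchan=$(cat "$dir/wchan" 2>/dev/null)
        # Print "-" when the wait channel could not be read.
        printf '%s %s %s\n' "$pid" "$name" "${wchan:--}"
    done
}
```

Running list_d_state as root on an affected node should list the same blocked PIDs that ps reports with state D.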
>
> Regards,
> Bryn.
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
--
Gruss / Regards,
Dipl.-Ing. Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany