[Linux-cluster] GFS2 with IMAP Maildir server

Fri Jul 3 19:40:11 UTC 2009

Sounds like you are running into the same bug that I ran into with GFS2 
on a similar setup nearly 2 years ago, except I could produce a lock-up 
in under 2 seconds every time. The solution is to use GFS1 if you really 
want to stick with that setup, but bear in mind that, regardless of the 
cluster file system (GFS1, GFS2, OCFS2), performance will scale 
_inversely_ with the number of nodes. Cluster file systems really don't 
work well with millions of small files.
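
If you want to check whether POSIX locks are what is getting stuck (the 
dlm_posix_lock frames in the quoted trace point that way), something 
like the following might help. It is only a minimal sketch, not a tool 
from the cluster suite: probe_posix_lock and the child script are names 
made up for illustration. The lock attempt runs in a child process so 
that the caller can give up after a timeout even if the child ends up 
in D state.

```python
import subprocess
import sys

# Child script (hypothetical helper): take and release an exclusive
# POSIX (fcntl) lock without blocking. Exit code 0 = lock granted,
# 1 = lock held by another process.
_CHILD = """
import fcntl, sys
with open(sys.argv[1], "a") as fh:
    try:
        fcntl.lockf(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        sys.exit(1)
    fcntl.lockf(fh, fcntl.LOCK_UN)
"""


def probe_posix_lock(path, timeout=5.0):
    """Return 'ok', 'busy', 'timeout' or 'error' for a lock attempt on path."""
    proc = subprocess.Popen([sys.executable, "-c", _CHILD, path])
    try:
        code = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: a process in uninterruptible sleep does not
        # act on the signal until it leaves that state, so do not wait.
        proc.kill()
        return "timeout"
    return {0: "ok", 1: "busy"}.get(code, "error")


if __name__ == "__main__" and len(sys.argv) > 1:
    print(probe_posix_lock(sys.argv[1]))
```

Run it on each node against a scratch file on the GFS2 mount (the file 
name is up to you). A "timeout" result on a file nobody else should be 
locking would suggest the lock request itself is not being answered.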

You might, instead, want to look into something like DBMail with a MySQL 
proxy to serialize all writes to a single node.
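
To illustrate what "serialize all writes to a single node" means, here 
is a rough sketch of the routing decision. This is not MySQL Proxy's 
API (that is scripted in Lua) and not anything DBMail ships; the Router 
class and the host names are assumptions made up for the example.

```python
import itertools

# Statements treated as reads; anything else goes to the single writer,
# which is the safe default for statements this sketch does not know.
READ_VERBS = {"SELECT", "SHOW"}


class Router:
    """Send every write to one node; spread reads over the others."""

    def __init__(self, writer, readers):
        self.writer = writer
        # Fall back to the writer if no read-only nodes are given.
        self._readers = itertools.cycle(readers or [writer])

    def pick(self, sql):
        words = sql.split(None, 1)
        verb = words[0].upper() if words else ""
        if verb in READ_VERBS:
            return next(self._readers)
        return self.writer


router = Router("db1", ["db2", "db3"])  # hypothetical host names
print(router.pick("INSERT INTO t VALUES (1)"))
print(router.pick("SELECT 1"))
```

A real proxy also has to deal with transactions and things like 
SELECT ... FOR UPDATE, which this sketch ignores.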

You can, of course, still use GFS1 for the root file system to share the 
OS install. Look at the Open Shared Root project if this is of interest.

Gordan

Flavio Junior wrote:
> Hi folks....
> 
> I'm trying to use GFS2 in a mail server scenario with:
> 
> - CentOS 5.3 updated
> - Dovecot IMAP/Maildir
> - Postfix
> 
> To make the servers active/active I'm using CTDB (http://ctdb.samba.org).
> 
> Some info that could be relevant:
> [root@pinky ~]# uname -a
> Linux pinky 2.6.18-128.1.16.el5 #1 SMP Tue Jun 30 06:07:26 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
> [root@pinky ~]# rpm -qa | grep -E 'gfs2|clust|kernel|cman|openais'
> kernel-2.6.18-128.1.16.el5
> gfs2-utils-0.1.53-1.el5_3.3
> modcluster-0.12.1-2.el5.centos
> cluster-cim-0.12.1-2.el5.centos
> kernel-devel-2.6.18-128.1.10.el5
> openais-0.80.3-22.el5_3.8
> system-config-cluster-1.0.55-1.0
> kernel-2.6.18-128.1.6.el5
> kernel-2.6.18-128.1.10.el5
> kernel-devel-2.6.18-128.1.16.el5
> lvm2-cluster-2.02.40-7.el5
> cluster-snmp-0.12.1-2.el5.centos
> kernel-headers-2.6.18-128.1.16.el5
> kernel-devel-2.6.18-128.1.6.el5
> cman-2.0.98-1.el5_3.4
> [root@pinky ~]# grep /home /etc/fstab
> /dev/homeClusterVG/home_vmail   /home           gfs2    auto,noatime,quota=off,noexec,nodev,_netdev       0 0
> 
> 
> Everything works fine for some time, but two or three times a day some 
> dovecot/deliver processes hang in D state, and the only way to clear 
> them is to reboot the node.
> 
> I'm not a developer and don't know much about debugging. Because of 
> earlier problems I learned to use "sysrq-t", and here is the output 
> for two of these processes:
> 
> Pastebin: http://pastebin.ca/1483264
> 
> Jul  3 15:45:20 cerebro kernel: deliver       D ffff81007e442800     0 24420  23846                     (NOTLB)
> Jul  3 15:45:20 cerebro kernel:  ffff810013885e08 0000000000000082 ffff810013885d68 0000000000000092
> Jul  3 15:45:20 cerebro kernel:  ffff810013885e20 0000000000000001 ffff8100141870c0 ffff81000904b0c0
> Jul  3 15:45:20 cerebro kernel:  0000052a72ff2a70 000000000000034a ffff8100141872a8 000000036caf5000
> Jul  3 15:45:20 cerebro kernel: Call Trace:
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff88562a7d>] :dlm:dlm_posix_lock+0x172/0x210
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff88591c7a>] :gfs2:gfs2_lock+0xc3/0xcf
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff8003a39e>] fcntl_setlk+0x11e/0x273
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff800b5659>] audit_syscall_entry+0x16e/0x1a1
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff8002ea66>] sys_fcntl+0x269/0x2dc
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0
> 
> 
> Jul  3 15:45:21 cerebro kernel: deliver       D ffff81000238f480     0  1358  32225                     (NOTLB)
> Jul  3 15:45:21 cerebro kernel:  ffff8100086cfe08 0000000000000082 ffff8100086cfd68 0000000000000092
> Jul  3 15:45:21 cerebro kernel:  ffff8100086cfe20 0000000000000001 ffff81000904b0c0 ffff81007ff28100
> Jul  3 15:45:21 cerebro kernel:  0000052a72ff2ca2 0000000000000232 ffff81000904b2a8 000000037ed68a00
> Jul  3 15:45:21 cerebro kernel: Call Trace:
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff88562a7d>] :dlm:dlm_posix_lock+0x172/0x210
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff88591c7a>] :gfs2:gfs2_lock+0xc3/0xcf
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff8003a39e>] fcntl_setlk+0x11e/0x273
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff800b5659>] audit_syscall_entry+0x16e/0x1a1
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff8002ea66>] sys_fcntl+0x269/0x2dc
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0
> 
> 
> Before rebooting the node I went into this user's directory and ran 
> "ls" a few times, and everything worked as expected. I was pretty sure 
> the command would hang, but it didn't.
> Here is the "ps ax" output:
> cicero   24420  0.0  0.0   8960  1220 ?        Ds   14:46   0:00 /usr/libexec/dovecot/deliver -f cicero -d cicero
> 
> I've already rebooted that node, but if there is a way to debug this 
> more deeply, just let me know; I will probably hit the same situation 
> again by the end of the day.
> 
> 
> Thanks in advance.
> 
> --
> 
> Flávio do Carmo Júnior aka waKKu
> 
> 