[Linux-cluster] CentOS 4.8, nfs, and fence every 4 days

Eric Schneider eschneid at uccs.edu
Tue Apr 27 01:31:55 UTC 2010


2 node CentOS 4.8 cluster on ESX 4 cluster (cluster across boxes)

[root@host ~]# uname -a

Linux hostname 2.6.9-89.0.19.ELlargesmp

2 GB RAM

2 vCPU

1 200 GB RDM - GFS1

VMware fencing

 

Member Status: Quorate

 

  Member Name                              Status

  ------ ----                              ------

  Host1                                    Online, Local, rgmanager

  Host2                                    Online, rgmanager

 

  Service Name         Owner (Last)                   State

  ------- ----         ----- ------                   -----

  www-http             host1                          started

  www-nfs              host2                          started

  vhostip-http         host2                          started

  vhost-http           host2                          started

 

[root@host ~]# rpm -qa | grep cman

cman-kernel-2.6.9-56.7.el4_8.10

cman-kernel-smp-2.6.9-56.7.el4_8.10

cman-devel-1.0.24-1

cman-kernel-largesmp-2.6.9-56.7.el4_8.10

cman-1.0.24-1

cman-kernheaders-2.6.9-56.7.el4_8.10

 

/var/log/messages

Apr 26 18:45:32 tesla kernel: oom-killer: gfp_mask=0xd0

Apr 26 18:45:32 tesla kernel: Mem-info:

Apr 26 18:45:32 tesla kernel: Node 0 DMA per-cpu:

Apr 26 18:45:32 tesla kernel: cpu 0 hot: low 2, high 6, batch 1

Apr 26 18:45:32 tesla kernel: cpu 0 cold: low 0, high 2, batch 1

Apr 26 18:45:32 tesla kernel: cpu 1 hot: low 2, high 6, batch 1

Apr 26 18:45:32 tesla kernel: cpu 1 cold: low 0, high 2, batch 1

Apr 26 18:45:32 tesla kernel: Node 0 Normal per-cpu:

Apr 26 18:45:32 tesla kernel: cpu 0 hot: low 32, high 96, batch 16

Apr 26 18:45:32 tesla kernel: cpu 0 cold: low 0, high 32, batch 16

Apr 26 18:45:32 tesla kernel: cpu 1 hot: low 32, high 96, batch 16

Apr 26 18:45:32 tesla kernel: cpu 1 cold: low 0, high 32, batch 16

Apr 26 18:45:32 tesla kernel: Node 0 HighMem per-cpu: empty

Apr 26 18:45:32 tesla kernel:

Apr 26 18:45:32 tesla kernel: Free pages:        6352kB (0kB HighMem)

Apr 26 18:45:32 tesla kernel: Active:3245 inactive:3129 dirty:0 writeback:0
unstable:0 free:1588 slab:499421 mapped:4514 pagetables:914

Apr 26 18:45:32 tesla kernel: Node 0 DMA free:752kB min:44kB low:88kB
high:132kB active:0kB inactive:0kB present:15996kB pages_scanned:0
all_unreclaimable? yes

Apr 26 18:45:32 tesla kernel: protections[]: 0 286000 286000

Apr 26 18:45:32 tesla kernel: Node 0 Normal free:5600kB min:5720kB
low:11440kB high:17160kB active:12980kB inactive:12516kB present:2080704kB
pages_scanned:20031 all_unreclaimable? yes

Apr 26 18:45:32 tesla kernel: protections[]: 0 0 0

Apr 26 18:45:32 tesla kernel: Node 0 HighMem free:0kB min:128kB low:256kB
high:384kB active:0kB inactive:0kB present:0kB pages_scanned:0
all_unreclaimable? no

Apr 26 18:45:32 tesla kernel: protections[]: 0 0 0

Apr 26 18:45:32 tesla kernel: Node 0 DMA: 4*4kB 4*8kB 2*16kB 3*32kB 3*64kB
1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 752kB

Apr 26 18:45:32 tesla kernel: Node 0 Normal: 0*4kB 0*8kB 0*16kB 1*32kB
1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB = 5600kB

Apr 26 18:45:32 tesla kernel: Node 0 HighMem: empty

Apr 26 18:45:32 tesla kernel: 6192 pagecache pages
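One thing worth noting in the report above: the slab count is in pages, not kB. A quick conversion (assuming the usual 4 kB page size on this kernel) suggests slab alone accounts for almost all of the 2 GB:

```shell
# Convert the slab figure from the Mem-info dump (499421 pages) to kB,
# assuming 4 kB pages (the x86/x86_64 default on 2.6.9 kernels).
slab_pages=499421
slab_kb=$((slab_pages * 4))
echo "slab = ${slab_kb} kB (~$((slab_kb / 1024)) MB of 2048 MB RAM)"
# prints: slab = 1997684 kB (~1950 MB of 2048 MB RAM)
```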

 

 

Every 4 days the host2 system (running the NFS service) starts running the
oom-killer, goes brain dead, and gets fenced.  The http processes are
restarted every morning at 4:00 AM for log rotation, so I don't think they
are the problem.
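If slab growth is really what eats the box, tracking it over the 4-day window would confirm it before the next fence. A minimal sketch (the helper name and log path are my own, not from this thread) that could be run from cron:

```shell
# Hypothetical helper: pull the Slab value (kB) out of meminfo-format
# input, so slab growth can be sampled and logged over the 4-day cycle.
slab_kb() {
    awk '/^Slab:/ {print $2}' "$1"
}

# Example cron use (assumption): append timestamped samples, e.g.
#   echo "$(date '+%b %d %H:%M:%S') $(slab_kb /proc/meminfo) kB" \
#       >> /var/log/slab-watch.log
```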

 

Attempts to fix:

http://kbase.redhat.com/faq/docs/DOC-3993

http://kbase.redhat.com/faq/docs/DOC-7317

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1002704

 

Release Found: Red Hat Enterprise Linux 4 Update 4

Symptom:

The top command shows that a lot of memory is being cached while swap is
hardly being used.

Solution:

On Red Hat Enterprise Linux 4 Update 4, a workaround for the oom-killer
killing random processes while there is still memory available is to issue
the following command.

This will cause page reclamation to happen sooner, thus providing more
'protection' for the zones.

Changes made on Tesla:
[root@host ~]# echo 100 > /proc/sys/vm/lower_zone_protection
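As written, that setting will not survive a reboot. A sketch of making it persistent (assuming the sysctl name mirrors the /proc path, as it normally does):

```shell
# Assumption: vm.lower_zone_protection is the sysctl name corresponding
# to /proc/sys/vm/lower_zone_protection on this RHEL4 kernel.
echo "vm.lower_zone_protection = 100" >> /etc/sysctl.conf
sysctl -p    # re-apply /etc/sysctl.conf immediately (requires root)
```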

Anybody have any ideas?

 

Thanks,

 

Eric 

 

 
