[Linux-cluster] CentOS 4.8, nfs, and fence every 4 days
Eric Schneider
eschneid at uccs.edu
Tue Apr 27 01:31:55 UTC 2010
2 node CentOS 4.8 cluster on ESX 4 cluster (cluster across boxes)
[root at host ~]# uname -a
Linux hostname 2.6.9-89.0.19.ELlargesmp
2 GB RAM
2 vCPU
1 200 GB RDM - GFS1
VMware fencing
Member Status: Quorate
Member Name Status
------ ---- ------
Host1 Online, Local, rgmanager
Host2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
www-http host1 started
www-nfs host2 started
vhostip-http host2 started
vhost-http host2 started
[root at host ~]# rpm -qa | grep cman
cman-kernel-2.6.9-56.7.el4_8.10
cman-kernel-smp-2.6.9-56.7.el4_8.10
cman-devel-1.0.24-1
cman-kernel-largesmp-2.6.9-56.7.el4_8.10
cman-1.0.24-1
cman-kernheaders-2.6.9-56.7.el4_8.10
/var/log/messages
Apr 26 18:45:32 tesla kernel: oom-killer: gfp_mask=0xd0
Apr 26 18:45:32 tesla kernel: Mem-info:
Apr 26 18:45:32 tesla kernel: Node 0 DMA per-cpu:
Apr 26 18:45:32 tesla kernel: cpu 0 hot: low 2, high 6, batch 1
Apr 26 18:45:32 tesla kernel: cpu 0 cold: low 0, high 2, batch 1
Apr 26 18:45:32 tesla kernel: cpu 1 hot: low 2, high 6, batch 1
Apr 26 18:45:32 tesla kernel: cpu 1 cold: low 0, high 2, batch 1
Apr 26 18:45:32 tesla kernel: Node 0 Normal per-cpu:
Apr 26 18:45:32 tesla kernel: cpu 0 hot: low 32, high 96, batch 16
Apr 26 18:45:32 tesla kernel: cpu 0 cold: low 0, high 32, batch 16
Apr 26 18:45:32 tesla kernel: cpu 1 hot: low 32, high 96, batch 16
Apr 26 18:45:32 tesla kernel: cpu 1 cold: low 0, high 32, batch 16
Apr 26 18:45:32 tesla kernel: Node 0 HighMem per-cpu: empty
Apr 26 18:45:32 tesla kernel:
Apr 26 18:45:32 tesla kernel: Free pages: 6352kB (0kB HighMem)
Apr 26 18:45:32 tesla kernel: Active:3245 inactive:3129 dirty:0 writeback:0
unstable:0 free:1588 slab:499421 mapped:4514 pagetables:914
Apr 26 18:45:32 tesla kernel: Node 0 DMA free:752kB min:44kB low:88kB
high:132kB active:0kB inactive:0kB present:15996kB pages_scanned:0
all_unreclaimable? yes
Apr 26 18:45:32 tesla kernel: protections[]: 0 286000 286000
Apr 26 18:45:32 tesla kernel: Node 0 Normal free:5600kB min:5720kB
low:11440kB high:17160kB active:12980kB inactive:12516kB present:2080704kB
pages_scanned:20031 all_unreclaimable? yes
Apr 26 18:45:32 tesla kernel: protections[]: 0 0 0
Apr 26 18:45:32 tesla kernel: Node 0 HighMem free:0kB min:128kB low:256kB
high:384kB active:0kB inactive:0kB present:0kB pages_scanned:0
all_unreclaimable? no
Apr 26 18:45:32 tesla kernel: protections[]: 0 0 0
Apr 26 18:45:32 tesla kernel: Node 0 DMA: 4*4kB 4*8kB 2*16kB 3*32kB 3*64kB
1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 752kB
Apr 26 18:45:32 tesla kernel: Node 0 Normal: 0*4kB 0*8kB 0*16kB 1*32kB
1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB = 5600kB
Apr 26 18:45:32 tesla kernel: Node 0 HighMem: empty
Apr 26 18:45:32 tesla kernel: 6192 pagecache pages
Every 4 days the host2 system (running the NFS service) starts running the
oom-killer, goes brain dead, and gets fenced. The http processes are
restarted every morning at 4:00 AM for log rotation, so I don't think they
are the problem.
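One thing that stands out in the log above: slab:499421 pages is roughly 1.9 GB, nearly all of the box's 2 GB of RAM, so it may be worth trending slab usage between fence events to see whether a kernel cache is growing steadily. A minimal sketch for doing that from cron; the log path and interval are just examples, not anything from this setup:

```shell
#!/bin/sh
# Append a timestamped slab/memory snapshot to a log file. Run it from
# cron (e.g. */5 * * * *) so the growth leading up to the next OOM kill
# can be charted afterwards. LOG is a hypothetical path; adjust as needed.
LOG=/var/tmp/meminfo-trend.log
{
  date '+%F %T'
  grep -E '^(MemFree|Cached|Slab):' /proc/meminfo
  echo
} >> "$LOG"
```

If the Slab line climbs toward total RAM over the 4 days, `slabtop` (or /proc/slabinfo) on the live host should show which cache is responsible.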
Attempts to fix:
http://kbase.redhat.com/faq/docs/DOC-3993
http://kbase.redhat.com/faq/docs/DOC-7317
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1002704
Release Found: Red Hat Enterprise Linux 4 Update 4
Symptom:
The top command shows that a large amount of memory is being cached while
swap is hardly being used.
Solution:
On Red Hat Enterprise Linux 4 Update 4, a workaround for the oom-killer
killing random processes while there is still memory available is to issue
the following command. This causes page reclamation to happen sooner, thus
providing more 'protection' for the zones.
Changes made on tesla:
[root at host ~]# echo 100 > /proc/sys/vm/lower_zone_protection
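A note on the change above: writing into /proc only lasts until the next reboot. Assuming the standard sysctl mechanism on RHEL4 (nothing specific to this host), the same value can be persisted in /etc/sysctl.conf, applied at boot or immediately with `sysctl -p`:

```
# /etc/sysctl.conf addition — persists the workaround across reboots.
# vm.lower_zone_protection only exists on RHEL4-era (2.6.9) kernels.
vm.lower_zone_protection = 100
```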
Anybody have any ideas?
Thanks,
Eric