[Linux-cluster] RHEL3 Cluster network hangup
Lon Hohberger
lhh at redhat.com
Thu Jul 7 15:49:17 UTC 2005
On Wed, 2005-07-06 at 08:30 +0200, Gunther Schlegel wrote:
> The clustered application does a lot of printing (lprng),
> faxing (hylafax) and mailing (sendmail). It uses shell scripts to pass the
> jobs to the operating system's daemons.
> The client programs of these daemons, which pass jobs to the daemons
> using network connections to localhost, start to behave irregularly when
> the cluster has been up for about 2 weeks.
> Examples:
> - hylafax faxstat stops listing the transmitted faxes in the middle of
> the list ( but always at the same job )
> - sendmail opens a connection to the local daemon but does not transfer
> the message. Both processes sit there and wait; after some time the
> server closes the connection because of missing input from the client's side.
> - same with lpr.
>
> I assume that something locks up in the ip stack. Not all services are
> affected at the same time.
>
> I guess this is related to the cluster software, because we run that
> application on many other servers, none of which are clustered and
> none of which show this behaviour.
I doubt it, but it's not out of the realm of possibility. The cluster
software does three things mostly:
(a) figures out who's online
(b) shoots nodes
(c) manages services using shell scripts
The shell scripts call standard utilities (ifconfig, route, etc.).
Now -- here's the thing. Earlier versions of clumanager (<1.2.22) had a
problem where sometimes (and randomly!), services would get a bogus
status return and restart on the same node. Also, the most recent
errata fixed a signal-handling problem that prevented JVMs from running
under it. Either of these may have caused the problems on your cluster;
I don't know. The former would have associated log messages; the latter
wouldn't.
I'd try the latest release from RHN (clumanager-1.2.26.1-1).
If that doesn't work, I'd call Red Hat Support...
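For anyone wanting to check whether a node still runs a clumanager older than the 1.2.22 threshold mentioned above, here is a rough sketch. It assumes a Python 3.7+ interpreter is available (a stock RHEL3 install would not have one, so it may need to run elsewhere or be adapted), and it compares only the dotted version numbers, which is simpler than rpm's own comparison rules. The function names are made up for illustration; the only external command used is the standard `rpm -q --queryformat` query.

```python
import re
import subprocess

# Threshold taken from the message above: versions below this had the
# bogus-status-return problem.
FIXED_IN = "1.2.22"


def version_key(version):
    """Turn '1.2.26.1' into (1, 2, 26, 1) so versions compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def is_affected(version, fixed_in=FIXED_IN):
    """Return True if `version` is older than `fixed_in`.

    Simplified numeric comparison; not a full reimplementation of
    rpm's version ordering.
    """
    return version_key(version) < version_key(fixed_in)


def installed_version(package="clumanager"):
    """Ask rpm for the installed version; return None if the query fails."""
    try:
        result = subprocess.run(
            ["rpm", "-q", "--queryformat", "%{VERSION}", package],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # rpm is missing, or the package is not installed.
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("clumanager not found via rpm")
    elif is_affected(version):
        print(f"clumanager {version} predates {FIXED_IN}; upgrade recommended")
    else:
        print(f"clumanager {version} is at or past {FIXED_IN}")
```

A result at or past 1.2.22 only rules out the bogus-status problem; the signal-handling fix is in the most recent errata, so upgrading to the latest release is still the safer route.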
-- Lon