[Linux-cluster] Error: ClientSocket(String): connect() failed: No such file or directory

Mon Jun 22 17:26:14 UTC 2015

Hello Megan,

On 04/06/15 08:23 -0400, Megan . wrote:
> On Wed, Jun 3, 2015 at 10:31 AM, Megan . <nagemnna at gmail.com> wrote:

[...]

> FYI - i talked to our network folks and it looks like they were doing some
> testing last night with port failover which may or may not have caused this

Unlikely, unless you were "lucky" enough to reach a different actual
machine under the network address than you intended, or modclusterd
was fragile enough to break on these intermittent changes (TBH, I am
not exactly sure what you mean by "port failover").

Indicated error:
>> Error: ClientSocket(String): connect() failed: No such file or directory

means that modclusterd on the particular node was not running (by itself,
this is still OK) and could not be started within 5 seconds or so, which
is what modcluster (ricci's helper, but from the clustermon package) tries
to do if the socket /var/run/clumond.sock (the indication of a running
modclusterd) cannot be reached (for whatever reason, including SELinux,
although that should not be a problem either).
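
To illustrate where the message text comes from: "No such file or
directory" is simply how ENOENT is rendered, and that is what connect()
reports when the socket file is not there.  A minimal Python sketch (this
is not modcluster's actual code; the helper name try_unix_connect and the
temporary path are made up for the demonstration):

```python
import errno
import os
import socket
import tempfile


def try_unix_connect(path, timeout=1.0):
    """Connect to a local (AF_UNIX) stream socket.

    Returns 0 on success, otherwise the errno reported by connect().
    Hypothetical helper, for illustration only.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return 0
    except OSError as exc:
        # A timeout may carry no errno; map it to ETIMEDOUT.
        return exc.errno if exc.errno is not None else errno.ETIMEDOUT
    finally:
        sock.close()


if __name__ == "__main__":
    # A path that certainly does not exist (stand-in for a missing
    # /var/run/clumond.sock).
    missing = os.path.join(tempfile.mkdtemp(), "clumond.sock")
    err = try_unix_connect(missing)
    print(errno.errorcode.get(err, err), "-", os.strerror(err))
```

A present but dead socket file would typically yield ECONNREFUSED instead,
so ENOENT specifically points at the file being absent.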

So if the problem recurs, definitely check on the troubled node whether:
- the modclusterd service is running and/or is able to start (i.e., to
  provide the /var/run/clumond.sock socket) within 5 seconds or so under
  the typical workload (timing may be an issue in a virtualized environment)
- once modclusterd is started, /var/run/clumond.sock exists and has
  the expected properties (file-like socket, expected permissions)
- the SELinux audit log (if SELinux is enabled) contains any reference
  to clumond.sock or modclusterd
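
The first two checks can be roughly scripted.  The following is only a
sketch under assumptions: the function names are hypothetical, the 5 second
deadline merely mirrors the "5 seconds or so" above, and since the expected
permissions are not spelled out here, the script just reports what it
finds.  It does not start modclusterd and does not look at the SELinux
audit log.

```python
import os
import socket
import stat
import time

SOCK_PATH = "/var/run/clumond.sock"  # socket path mentioned above


def inspect_socket(path):
    """Describe the file at path, or return None if it does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    return {
        "is_socket": stat.S_ISSOCK(st.st_mode),
        "mode": oct(stat.S_IMODE(st.st_mode)),
        "uid": st.st_uid,
        "gid": st.st_gid,
    }


def wait_for_socket(path, deadline=5.0, interval=0.25):
    """Retry connect() until it succeeds or the deadline passes.

    Returns the elapsed seconds on success, None on failure.
    The default deadline is an assumption, not a documented value.
    """
    start = time.monotonic()
    while True:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(1.0)
        try:
            sock.connect(path)
            return time.monotonic() - start
        except OSError:
            pass  # not there (yet) or not accepting connections
        finally:
            sock.close()
        if time.monotonic() - start >= deadline:
            return None
        time.sleep(interval)


if __name__ == "__main__":
    print("socket file:", inspect_socket(SOCK_PATH))
    elapsed = wait_for_socket(SOCK_PATH)
    if elapsed is None:
        print("could not connect within the deadline")
    else:
        print("connected after %.2f s" % elapsed)
```

Run it on the affected node right after (re)starting modclusterd; a None
for the socket file or a missed deadline matches the failure described
above.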

> issue.  However, I was able to correct it by fencing the problem nodes.

Provided that those "port failover" disturbances had settled down by that
time, perhaps modclusterd simply became happy again and stopped
failing, if that was indeed the case previously.

>> Anybody ever seen "Error: ClientSocket(String): connect() failed: No such
>> file or directory" when doing a start all?  Something seems to have
>> broken with our cluster.  Our UAT setup works as expected.  I looked at
>> tcpdumps the best that i could (i'm not a network person though) and i
>> didn't see anything obvious.  I shutdown iptables on all nodes.

FWIW, most if not all packet sniffing tools cannot hook into local
file-like sockets, so a tcpdump capture would not show the failing
connection to /var/run/clumond.sock.

>> We are running CentOS 6.6, ccs-0.16.2-75.el6_6.1.x86_64

Good, this excludes all known (and fixed!) bugs that prevented modclusterd
from operating (IPv4-only environment, huge cluster.conf).

>> cman-3.0.12.1-68.el6.x86_64.  We have a 12 node cluster in production that
>> allows us to share gfs2 iscsi mounts.  no other services are used.  clvmd
>> -R runs fine at this time.  ccs -h node --sync --activate also runs fine.
>> 
>> 
>> [root at admin1 ~]# ccs -h admin1-ops --startall
>> Unable to start map1-ops, possibly due to lack of quorum, try --startall
>> Error: ClientSocket(String): connect() failed: No such file or directory
>> Started cache2-ops
>> Unable to start data1-ops, possibly due to lack of quorum, try --startall
>> Error: ClientSocket(String): connect() failed: No such file or directory
>> Started map2-ops
>> Unable to start archive1-ops, possibly due to lack of quorum, try
>> --startall
>> Error: ClientSocket(String): connect() failed: No such file or directory
>> Started data3-ops
>> Started mgmt1-ops
>> Unable to start admin1-ops, possibly due to lack of quorum, try --startall
>> Error: ClientSocket(String): connect() failed: No such file or directory
>> Started data2-ops
>> Started cache1-ops

The hilariously out-of-context hint (try --startall when that is exactly
what you are running) led me to file a bug:
<https://bugzilla.redhat.com/1234515>.
Thanks for indirectly bringing this to light!

-- 
Jan