[Linux-cluster] ccs/ricci cluster operation design

Tue Aug 9 00:43:47 UTC 2011

Let me know your thoughts on the ccs/ricci cluster operation design. 

The bottom line is that it's bad design to have a failed node rejoin the cluster automatically, and I think ccs/ricci should have options (in addition to --start/--stop) that just start/stop the services without changing the chkconfig status.

Here are the details of the problem:

You can start/stop the cluster with ccs --start/--stop, but my customer cannot adopt it for the following reason.

In the customer's cluster:

- They start/stop the cluster by starting/stopping the services directly (not using the ccs/ricci interface at the moment).
- They set chkconfig off for the cluster services (cman, rgmanager, etc.).
- They force-reboot a failed node with the fence device.

In this setting, when a node is force-rebooted because of a problem such as a kernel panic, the node doesn't automatically rejoin the cluster. The customer then logs in to the node and investigates the problem. Once they are sure the problem is resolved, they start the cluster services on this node again.
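To make the customer's manual procedure concrete, here is a rough sketch in Python that only builds the command lines and does not run anything. The helper names and the service list (just cman and rgmanager, started in that order and stopped in reverse) are my own assumptions for illustration. A real cluster may run more services, and none of this is part of ccs/ricci.

```python
# Sketch of the manual workflow described above. It only builds command
# lines; nothing is executed. The service list and the function names are
# illustrative assumptions, not part of ccs/ricci.

# Assumed start order; the stop order is the reverse.
SERVICES = ["cman", "rgmanager"]


def boot_disable_commands():
    """Commands that keep the cluster services from starting at boot."""
    return [["chkconfig", name, "off"] for name in SERVICES]


def manual_commands(action):
    """Commands to start or stop the services by hand.

    The chkconfig status is left untouched, so a fenced node stays out
    of the cluster after reboot until an operator starts it again.
    """
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    order = SERVICES if action == "start" else list(reversed(SERVICES))
    return [["service", name, action] for name in order]


if __name__ == "__main__":
    # One-time setup, then the manual start after the node is investigated.
    for cmd in boot_disable_commands() + manual_commands("start"):
        print(" ".join(cmd))
```

The point of the sketch is that start/stop and the boot-time setting are two separate steps, which is the separation I would like ccs/ricci to offer as well.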

Now, the problem is that this customer cannot adopt the ccs tool for cluster operation. When the cluster is operated with ccs, a failed node that has been force-rebooted automatically tries to rejoin the cluster because chkconfig is on, even though the customer has not yet investigated and resolved the underlying problem.

Here's the related discussion on bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=728041

-- Etsuji