[dm-devel] Multipath failover issues

Tue Mar 17 12:30:00 UTC 2009

Hi,

>> 257 Critical 2009-03-11 10:38:43 ALERT:Redundant Controller Failure
>> Detected (Slot B)
>>
>> I also found additional logs from /var/log/messages which I did not
>> check earlier.
>>
>> Mar 11 10:32:46 multipathd: sdc: readsector0 checker reports path is down
>> Mar 11 10:32:46 multipathd: checker failed path 8:32 in map infortrend01
>> Mar 11 10:32:46 multipathd: infortrend01: remaining active paths: 1
>> Mar 11 10:32:46 multipathd: sdd: readsector0 checker reports path is down
>> Mar 11 10:32:46 multipathd: checker failed path 8:48 in map infortrend01
> 
> Does this timing correspond to when you turned off the controller ?

This is when the controller failed. The controller shutdown happened 
much later.

>> I am assuming it must have been busy for a few secs during the
>> switchover and the multipath config doesn't wait long enough for the
>> switchover to work.
> 
> Answer to your previous question would help here :)
> 
> Set no_path_retry to "queue", which would queue the I/Os when "all" the
> paths fail.

I am not sure I can do this either. Aren't we creating an illusion
that the storage subsystem is fine by queuing requests when the
subsystem is actually gone? What exactly happens to the queued I/O, and
surely there must be some limit on the queue as well?
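
On the limits part: as far as I can tell from multipath.conf(5),
no_path_retry also accepts a number (retries before queueing is
disabled) or "fail", not only "queue". Below is a rough sketch of the
arithmetic, assuming one retry per polling_interval and a default
interval of 5 seconds. The helper name queue_window_seconds is mine and
is not part of multipath-tools; please correct me if the assumption is
wrong.

```python
# Assumed default checker interval in seconds; verify against your multipath.conf.
DEFAULT_POLLING_INTERVAL = 5


def queue_window_seconds(no_path_retry, polling_interval=DEFAULT_POLLING_INTERVAL):
    """Estimate how long I/O stays queued once all paths in a map are down.

    Returns None for "queue" (queue with no time limit), 0 for "fail"
    (fail I/O immediately), and retries * polling_interval for a number.
    This is an approximation, not the exact multipathd timing.
    """
    if no_path_retry == "queue":
        return None
    if no_path_retry == "fail":
        return 0
    retries = int(no_path_retry)
    if retries < 0:
        raise ValueError("no_path_retry must be 'queue', 'fail' or a non-negative number")
    return retries * polling_interval


if __name__ == "__main__":
    for setting in ("queue", "fail", 12):
        print(setting, queue_window_seconds(setting))
```

If that reading is right, a numeric value would ride out a short
controller switchover without hiding a dead array forever.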

> If the behavior seen above was caused by the storage and will be
> rectified in an acceptable (to the user) time, then this parameter
> setting would solve your problem.

I am checking this with Infortrend.

> BTW, have you seen the I/O successfully being sent to the lun (both paths
> - you can use iostat to check it) before you failed the controller ? (I
> am trying to see if your config settings are proper).

I am doing a post mortem of the redundant controller failure here :). I
dug out what was done after the controller failure.

* The primary controller failed and failover to the secondary did not work.
* Multipath failed both paths and ext3 went read-only.
* Postgres crashed.
* When they logged in and ran (multipath -v2 -ll), they saw both paths
active. I cannot find any multipath log entries showing the paths
reinstated until 11:50, which was after the controller shutdown and
power cycle.
* The filesystem was mounted again (without fsck) and the database was
started (this answers your question about I/O to the LUNs, I think).
* Postgres recovered, was shut down immediately, and /data was unmounted.
* After this the controllers on the Infortrend were shut down and the
device was power cycled.

PS: I am digging up the entire multipath logs instead of posting
snippets here. I will add them to a pastebin and send the link over.
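
To line the events up while going through the full log, here is a small
sketch that pulls the path-failure lines out of /var/log/messages. It
only knows the two message formats quoted above ("checker failed path"
and "remaining active paths"); the function name path_events is my own,
and other multipathd messages would need extra patterns.

```python
import re

# Syslog line from multipathd. The hostname field is optional because the
# snippets quoted in this thread omit it.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?:\S+\s+)?"
    r"multipathd(?:\[\d+\])?:\s+(?P<msg>.*)$"
)
FAILED_RE = re.compile(r"checker failed path (?P<dev>\d+:\d+) in map (?P<map>\S+)")
REMAINING_RE = re.compile(r"(?P<map>\S+): remaining active paths: (?P<n>\d+)")


def path_events(lines):
    """Return (timestamp, map, event) tuples for path failures and path counts."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a multipathd line
        ts, msg = m.group("ts"), m.group("msg")
        failed = FAILED_RE.search(msg)
        remaining = REMAINING_RE.search(msg)
        if failed:
            events.append((ts, failed.group("map"), "failed " + failed.group("dev")))
        elif remaining:
            events.append((ts, remaining.group("map"), "active paths: " + remaining.group("n")))
    return events


if __name__ == "__main__":
    sample = [
        "Mar 11 10:32:46 multipathd: sdc: readsector0 checker reports path is down",
        "Mar 11 10:32:46 multipathd: checker failed path 8:32 in map infortrend01",
        "Mar 11 10:32:46 multipathd: infortrend01: remaining active paths: 1",
    ]
    for event in path_events(sample):
        print(*event)
```

Feeding it open("/var/log/messages") instead of the sample list should
give one line per path event, in log order.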

TIA
Dushyanth