[dm-devel] Multipath failover issues
Dushyanth Harinath
dushyanth.h at directi.com
Tue Mar 17 12:30:00 UTC 2009
Hi,
>> 257 Critical 2009-03-11 10:38:43 ALERT:Redundant Controller Failure
>> Detected (Slot B)
>>
>> I also found additional logs in /var/log/messages which I did not
>> check earlier.
>>
>> Mar 11 10:32:46 multipathd: sdc: readsector0 checker reports path is down
>> Mar 11 10:32:46 multipathd: checker failed path 8:32 in map infortrend01
>> Mar 11 10:32:46 multipathd: infortrend01: remaining active paths: 1
>> Mar 11 10:32:46 multipathd: sdd: readsector0 checker reports path is down
>> Mar 11 10:32:46 multipathd: checker failed path 8:48 in map infortrend01
>
> Does this timing correspond to when you turned off the controller ?
This is when the controller failed. The controller shutdown happened
much later.
>> I am assuming it must have been busy for a few seconds during the
>> switchover, and the multipath config doesn't wait long enough for the
>> switchover to complete.
>
> Answer to your previous question would help here :)
>
> Set no_path_retry to "queue", which would queue the I/Os when "all" the
> paths fail.
I am not sure I can do this either. Aren't we creating an illusion
that the storage subsystem is fine by queuing requests when the
subsystem is actually gone? What exactly happens while I/O is queued,
and there must be some limit on the queue as well, right?
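For what it's worth, my reading of the multipath.conf man page is that
no_path_retry also accepts a number of retries (or "fail"), and that a
numeric value bounds the queuing to roughly that many checker intervals.
A rough sketch of that arithmetic follows; the function name is mine, and
the values 12 and 5 are only illustrative assumptions, not settings from
this setup.

```python
def queue_window_seconds(no_path_retry, polling_interval):
    """Approximate how long I/O stays queued after the last path fails.

    Assumption: a numeric no_path_retry means "retry this many checker
    intervals, then fail the queued I/O"; "queue" never gives up and
    "fail" does not queue at all.
    """
    if no_path_retry == "queue":
        return float("inf")   # queue forever until a path comes back
    if no_path_retry == "fail":
        return 0              # fail I/O immediately, no queuing
    return int(no_path_retry) * polling_interval


# Illustrative values only: 12 retries at a 5 second polling interval.
print(queue_window_seconds(12, 5))
```

If that reading is right, a numeric value would ride out a short
controller switchover without hiding a storage subsystem that is really
gone.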
> If the behavior seen above was caused by the storage and will be
> rectified in an acceptable (to the user) time, then this parameter
> setting would solve your problem.
I am checking this with Infortrend.
> BTW, have you seen the I/O successfully been sent to the lun (both paths
> - you can use iostat to check it) before you failed the controller ? (I
> am trying to see if your config settings are proper).
I am doing a post-mortem of the redundant controller failure here :). I
dug out what happened after the controller failure:
* The primary controller failed and failover to the secondary did not work
* Multipath failed both paths and ext3 went read-only
* Postgres crashed
* When they logged in and ran multipath -v2 -ll, they saw both paths
active. However, I cannot find any multipath log entries showing the
paths reinstated until 11:50, which was after the controller shutdown
and power cycle.
* The filesystem was mounted again (without fsck) and the database was
started (this answers your question about I/O to the LUNs, I think)
* Postgres recovered, was shut down immediately, and /data was unmounted.
* After this, the controllers on the Infortrend were shut down and the
device was power cycled.
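
To line up the path failures against the events above, something like
the following could pull a timeline out of /var/log/messages. This is a
minimal sketch: it assumes lines shaped like the snippets quoted earlier,
and the "reinstated" keyword is my guess at how multipathd logs a
recovered path, so it should be checked against the real logs.

```python
import re

# Matches syslog lines as quoted above: "Mar 11 10:32:46 multipathd: <msg>".
# An optional hostname and an optional [pid] are tolerated.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?:\S+\s+)?multipathd(?:\[\d+\])?:\s+(?P<msg>.*)$"
)


def classify(msg):
    """Map a multipathd message to a coarse event type, or None."""
    if "checker failed path" in msg or "checker reports path is down" in msg:
        return "FAILED"
    if "remaining active paths" in msg:
        return "ACTIVE_COUNT"
    if "reinstated" in msg:  # assumption: wording of the recovery message
        return "REINSTATED"
    return None


def timeline(lines):
    """Return (timestamp, event, message) tuples for path-state changes."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        kind = classify(m.group("msg"))
        if kind:
            events.append((m.group("ts"), kind, m.group("msg")))
    return events


sample = [
    "Mar 11 10:32:46 multipathd: sdc: readsector0 checker reports path is down",
    "Mar 11 10:32:46 multipathd: checker failed path 8:32 in map infortrend01",
    "Mar 11 10:32:46 multipathd: infortrend01: remaining active paths: 1",
    "Mar 11 10:32:46 multipathd: sdd: readsector0 checker reports path is down",
    "Mar 11 10:32:46 multipathd: checker failed path 8:48 in map infortrend01",
]
for ts, kind, msg in timeline(sample):
    print(ts, kind, msg)
```

Running it over the whole log and filtering for REINSTATED should show
whether any path really came back before 11:50.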
PS: I am digging up the complete multipath logs instead of posting
snippets here. I will add them to pastebin and send the link over.
TIA
Dushyanth