[dm-devel] Media failures cause Path/Device Failures in dm-multipath?

Paul Smith paul at mad-scientist.net
Tue Jul 7 21:26:45 UTC 2009


I have a configuration with raid1 mirrors (md-raid1) built on top of
linear segments of multipath'd SCSI disks (dm-multipath).  This is
Linux 2.6.27.25, FYI.  Unfortunately, because this is an embedded
environment, it's not easy for us to jump to a newer kernel.
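
Concretely, the stack is built along these lines (a simplified
sketch; the "leg" map names and the second enclosure are made up for
illustration):

    # each multipath map (e.g. /dev/mapper/encl3Slot4) has two SCSI
    # paths behind it; a linear dm target sits on top of each map...
    dmsetup create leg0 --table \
      "0 $(blockdev --getsz /dev/mapper/encl3Slot4) linear /dev/mapper/encl3Slot4 0"
    dmsetup create leg1 --table \
      "0 $(blockdev --getsz /dev/mapper/encl4Slot4) linear /dev/mapper/encl4Slot4 0"
    # ... and md-raid1 mirrors the two legs
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/mapper/leg0 /dev/mapper/leg1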

In this configuration, when a SCSI disk reports a media failure (SCSI
Sense/ASC/ASCQ: 3/11/0), I would expect md-raid1 to handle the error:
read the data from the other mirror and then re-write the failed
sector on the original leg (the fix_read_error() path in
drivers/md/raid1.c, as I understand it).

I have tried this with the no_path_retry attribute set to 'fail',
i.e. roughly this multipath.conf fragment (a sketch; the rest of our
config is omitted):
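
    defaults {
            # fail the map as soon as no usable path remains, instead
            # of queueing ("3" for the second experiment further down)
            no_path_retry    fail
    }

With that in place I observe the following...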

dm-multipath reports the path failure. 
Then it tries the request on the other path, which also gets a path
failure.
The failure of both paths fails the device.

md-raid1 gets the error and reads from the other mirror.
When md-raid1 tries to re-write the data, the write hits the
now-failed multipath device, so the leg is kicked out of the array.

From syslog:

Jul 1 20:46:27 hostname user.info kernel: sd 3:0:3:0: [sdaz] Result: hostbyte=0x00 driverbyte=0x08
Jul 1 20:46:27 hostname user.info kernel: sd 3:0:3:0: [sdaz] Sense Key : 0x3 [current]
Jul 1 20:46:27 hostname user.warn kernel: Info fld=0x217795c
Jul 1 20:46:27 hostname user.info kernel: sd 3:0:3:0: [sdaz] ASC=0x11 ASCQ=0x0
Jul 1 20:46:27 hostname user.warn kernel: device-mapper: multipath: Failing path 67:48.
Jul 1 20:46:27 hostname daemon.notice multipathd: 67:48: mark as failed
Jul 1 20:46:27 hostname daemon.notice multipathd: encl3Slot4: remaining active paths: 1
Jul 1 20:46:29 hostname user.info kernel: sd 2:0:29:0: [sdab] Result: hostbyte=0x00 driverbyte=0x08
Jul 1 20:46:29 hostname user.info kernel: sd 2:0:29:0: [sdab] Sense Key : 0x3 [current]
Jul 1 20:46:29 hostname user.warn kernel: Info fld=0x217795c
Jul 1 20:46:29 hostname user.info kernel: sd 2:0:29:0: [sdab] ASC=0x11 ASCQ=0x0
Jul 1 20:46:29 hostname user.warn kernel: device-mapper: multipath: Failing path 65:176.
Jul 1 20:46:30 hostname daemon.notice multipathd: 65:176: mark as failed
Jul 1 20:46:30 hostname daemon.notice multipathd: encl3Slot4: remaining active paths: 0
Jul 1 20:46:30 hostname user.err kernel: raid1: dm-36: rescheduling sector 35092739
Jul 1 20:46:30 hostname user.alert kernel: raid1: Disk failure on dm-38, disabling device.
Jul 1 20:46:30 hostname user.alert kernel: raid1: Operation continuing on 1 devices.
Jul 1 20:46:30 hostname user.warn kernel: md: super_written gets error=-5, uptodate=0
Jul 1 20:46:30 hostname user.alert kernel: raid1: Disk failure on dm-36, disabling device.
Jul 1 20:46:30 hostname user.alert kernel: raid1: Operation continuing on 1 devices.
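
Incidentally, I suspect the raid1 half of this could be reproduced
without multipath at all by splicing a dm 'error' target over a few
sectors of one leg (a sketch; offsets are illustrative, and the error
target fails the re-write as well, so the leg still gets kicked):

    SZ=$(blockdev --getsz /dev/mapper/leg0)   # leg size in 512-byte sectors
    BAD=35092736                              # start of the "bad" range
    {
      echo "0 $BAD linear /dev/mapper/encl3Slot4 0"
      echo "$BAD 8 error"
      echo "$((BAD + 8)) $((SZ - BAD - 8)) linear /dev/mapper/encl3Slot4 $((BAD + 8))"
    } > /tmp/leg0.table
    dmsetup suspend leg0
    dmsetup reload leg0 /tmp/leg0.table
    dmsetup resume leg0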

When the no_path_retry attribute is set to '3':

dm-multipath reports the path failure.
Then it retries the request on the other path, which also gets a path
failure.
On the failure of the second path, the device queues the I/O and
enters recovery mode.
On the subsequent poll of the paths, the readsector0 checker reads
sector 0 rather than the bad sector, so both paths look healthy, get
reinstated, and the queued I/Os are 'resumed'...
... and of course fail with the media error again, restarting the
whole cycle and effectively hanging the I/O.
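
When it gets wedged like that, the only recourse I can see is to
flush the queued I/O by hand (map name taken from the logs below):

    # disable queue_if_no_path on the stuck map so the queued I/O
    # completes with an error instead of waiting forever
    dmsetup message encl2Slot7 0 fail_if_no_path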

From syslog:

Jul  7 10:54:35 hostname user.info kernel: sd 2:0:19:0: [sds] Result: hostbyte=0x00 driverbyte=0x08
Jul  7 10:54:35 hostname user.info kernel: sd 2:0:19:0: [sds] Sense Key : 0x3 [current]
Jul  7 10:54:35 hostname user.warn kernel: Info fld=0x123c016d
Jul  7 10:54:35 hostname user.info kernel: sd 2:0:19:0: [sds] ASC=0x11 ASCQ=0x0
Jul  7 10:54:35 hostname user.warn kernel: device-mapper: multipath: Failing path 65:32.
Jul  7 10:54:35 hostname daemon.notice multipathd: 65:32: mark as failed
Jul  7 10:54:35 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 1
Jul  7 10:54:37 hostname user.info kernel: sd 3:0:45:0: [sdcm] Result: hostbyte=0x00 driverbyte=0x08
Jul  7 10:54:37 hostname user.info kernel: sd 3:0:45:0: [sdcm] Sense Key : 0x3 [current]
Jul  7 10:54:37 hostname user.warn kernel: Info fld=0x123c016d
Jul  7 10:54:37 hostname user.info kernel: sd 3:0:45:0: [sdcm] ASC=0x11 ASCQ=0x0
Jul  7 10:54:37 hostname user.warn kernel: device-mapper: multipath: Failing path 69:160.
Jul  7 10:54:37 hostname daemon.notice multipathd: 69:160: mark as failed
Jul  7 10:54:37 hostname daemon.warn multipathd: encl2Slot7: Entering recovery mode: max_retries=3
Jul  7 10:54:37 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 0
Jul  7 10:54:39 hostname daemon.warn multipathd: sds: readsector0 checker reports path is up
Jul  7 10:54:39 hostname daemon.notice multipathd: 65:32: reinstated
Jul  7 10:54:42 hostname daemon.notice multipathd: encl2Slot7: queue_if_no_path enabled
Jul  7 10:54:42 hostname daemon.warn multipathd: encl2Slot7: Recovered to normal mode
Jul  7 10:54:42 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 1
Jul  7 10:54:42 hostname daemon.warn multipathd: sdcm: readsector0 checker reports path is up
Jul  7 10:54:42 hostname daemon.notice multipathd: 69:160: reinstated
Jul  7 10:54:42 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 2
Jul  7 10:54:42 hostname user.info kernel: sd 2:0:19:0: [sds] Result: hostbyte=0x00 driverbyte=0x08
Jul  7 10:54:42 hostname user.info kernel: sd 2:0:19:0: [sds] Sense Key : 0x3 [current]
Jul  7 10:54:42 hostname user.warn kernel: Info fld=0x123c016d
Jul  7 10:54:42 hostname user.info kernel: sd 2:0:19:0: [sds] ASC=0x11 ASCQ=0x0
Jul  7 10:54:42 hostname user.warn kernel: device-mapper: multipath: Failing path 65:32.
Jul  7 10:54:43 hostname daemon.notice multipathd: 65:32: mark as failed
Jul  7 10:54:43 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 1


Should dm-multipath distinguish media failures, which will follow the
LUN down every path, from genuine path/device errors?

Is there a different no_path_retry policy that would fail this request
but queue subsequent requests?




