[dm-devel] puzzling scsi return code 20000

Fri Jul 9 10:10:52 UTC 2004

Hello. This is a follow-up to a discussion on the linux-scsi list.

Quick summary:
I have a SAN with 2 RAIDs (IFT 7250F) and a farm of servers attached to
it. All servers have identical hardware (dual Xeon 2.8 GHz, Qla 2310F, 2 GB RAM)
and run the very same kernel (2.4.26 + dm 0.17 + qla 6.06.64).

The data are on LVM volumes managed with EVMS. Everything had been running
fine for months, with gigabytes upon gigabytes of data moved every day.
But at the beginning of the week I began to get errors like these:

SCSI disk error : host 3 channel 0 id 2 lun 1 return code = 20000

to the point where the partition became totally unavailable.
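For reference, the "return code" above is a packed result word. A minimal sketch to decode it, assuming the kernel prints it in hex and uses the usual Linux SCSI layout (driver byte << 24 | host byte << 16 | message byte << 8 | status byte), with host-byte names taken from the kernel's DID_* constants. The function name decode_scsi_result is hypothetical, just for illustration:

```python
# Host-byte names (DID_*) from the Linux SCSI headers; only the first
# few codes are listed here.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def decode_scsi_result(text: str) -> dict:
    """Split the hex 'return code' printed by the kernel into its four
    bytes, assuming the driver<<24 | host<<16 | msg<<8 | status layout."""
    result = int(text, 16)
    host = (result >> 16) & 0xFF
    return {
        "driver": (result >> 24) & 0xFF,  # driver byte
        "host": host,                     # host (adapter) byte
        "host_name": HOST_BYTE_NAMES.get(host, "unknown"),
        "msg": (result >> 8) & 0xFF,      # message byte
        "status": result & 0xFF,          # raw SCSI status byte
    }


print(decode_scsi_result("20000"))
```

Under those assumptions, 20000 means host byte 0x02 (DID_BUS_BUSY) with a zero SCSI status: the adapter reported the bus as busy, and the disk itself never returned an error status.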

After a lot of searching and testing (every component was suspected: the
RAID itself, its firmware, the QLogic driver), I finally managed to get a
system working again:

2.4.26 + dm 0.17 + qla 6.06.64 = scsi error
2.4.26 + dm 0.17 + qla 7.00.03 = hang (no errors, but all SCSI/qla
operations just hang), then (after a long time) SCSI errors

-> upgraded the SAN RAID to the very latest firmware:

same kernels as above: same errors

2.6.7 (with built-in dm & qla 8.xx.x) = SCSI errors too (tried that
because I know a lot of work has been done on the SCSI, qla & dm layers)

2.4.27-rc3 + dm 0.19 + qla 7.00.03 = No errors.
2.4.27-rc3 + dm 0.19 + qla 6.06.64 = No errors.

As my problem has only arisen on LVM volumes that have been resized by
EVMS, and the only difference between the failing and non-failing setups
is the device-mapper version, I am beginning to wonder whether the
culprit lies there.

Is it possible that something changed between dm 1.0.17 and 1.0.19 that could explain this behaviour?

Or between 2.4.26 and 2.4.27-rc3 (in which case I'll have to switch mailing lists again ;-)

Anyway, I'm very pleased my problem is solved, but I'd like to find a definitive explanation...

-- 
Yann Dupont, Cri de l'université de Nantes
Tel: 02.51.12.53.91 - Fax: 02.51.12.58.60 - Yann.Dupont at univ-nantes.fr