[dm-devel] netapp and removed lun sometimes breaks multipath

Gianluca Cecchi gianluca.cecchi at gmail.com
Thu Mar 7 07:42:32 UTC 2013


Hello,
Sometimes I get this kind of problem when using snapdrive from NetApp
to disconnect and delete LUNs that are on multipath.
"Sometimes" means that the process runs every night and the problem
shows up about once every 20 days.
From a snapdrive point of view the snapshot LUNs were disconnected and
then deleted and the command apparently completes, but from the
operating system's point of view the multipath layer is corrupted and
I have to power off the system to recover.

$ sudo /sbin/multipath -l
360a9800037543544465d424130543773 dm-10 ,
[size=30G][features=3 queue_if_no_path pg_init_retries 50][hwhandler=1 alua][rw]
\_ round-robin 0 [prio=0][enabled]
 \_ #:#:#:# -   #:#   [failed][undef]

Every command that touches LVM now blocks.
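
For reference, this is what I can still query while the system is stuck
(assuming multipathd and dmsetup still answer; the WWID is the one shown
above):

# dmsetup info 360a9800037543544465d424130543773
# multipathd -k"show paths"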

I can see that the problematic commands run by snapdrive are these:

# ps -ef|grep 16140
root     16140     1  0 01:26 ?        00:00:00 /bin/bash -c
/sbin/mpath_wait /dev/mapper/360a9800037543544465d424130543773;
/sbin/kpartx -a -p p /dev/mapper/360a9800037543544465d424130543773
root     16148 16140  0 01:26 ?        00:00:00 /sbin/kpartx -a -p p
/dev/mapper/360a9800037543544465d424130543773

These two commands show up about 40 times, once every 3 minutes,
presumably because snapdrive has some kind of retry logic.
Even so, I don't understand why, from its point of view, snapdrive
considers the command completed.

In /var/log/messages:
Mar  7 01:26:59 noracs3 multipathd: 66:0: mark as failed
Mar  7 01:26:59 noracs3 kernel: ata2: EH complete
Mar  7 01:26:59 noracs3 kernel: device-mapper: multipath: Could not
failover the device: Handler scsi_dh_alua Error 15.
Mar  7 01:26:59 noracs3 kernel: device-mapper: multipath: Failing path 66:0.
Mar  7 01:30:22 noracs3 kernel: INFO: task kpartx:16148 blocked for
more than 120 seconds.
Mar  7 01:30:22 noracs3 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  7 01:30:22 noracs3 kernel: kpartx        D ffffffff80157cde     0
16148  16140                     (NOTLB)
Mar  7 01:30:22 noracs3 kernel:  ffff8102d2b73bc8 0000000000000082
0000000000000001 ffffffff800e6b26
Mar  7 01:30:22 noracs3 kernel:  ffff81012755ad30 0000000000000002
ffff81016d0ef080 ffff81033fead7a0
Mar  7 01:30:22 noracs3 kernel:  0006d3269771a874 000000000000223b
ffff81016d0ef268 0000000f00000003
Mar  7 01:30:22 noracs3 kernel: Call Trace:
Mar  7 01:30:22 noracs3 kernel:  [<ffffffff800e6b26>]
block_read_full_page+0x252/0x26f
Mar  7 01:30:22 noracs3 kernel:  [<ffffffff8006ed48>] do_gettimeofday+0x40/0x90
Mar  7 01:30:22 noracs3 kernel:  [<ffffffff80029173>] sync_page+0x0/0x43
Mar  7 01:30:22 noracs3 kernel:  [<ffffffff800637de>] io_schedule+0x3f/0x67
Mar  7 01:30:22 noracs3 kernel:  [<ffffffff800291b1>] sync_page+0x3e/0x43
Mar  7 01:30:22 noracs3 kernel:  [<ffffffff80063922>]
__wait_on_bit_lock+0x36/0x66
Mar  7 01:30:22 noracs3 kernel:  [<ffffffff8003ff85>] __lock_page+0x5e/0x64
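
If it helps, I can also dump the state of every task, not only the ones
the hung task detector flags (assuming sysrq is enabled on this box):

# echo t > /proc/sysrq-trigger     # logs all task states to dmesg
# dmesg | tail -100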


My multipath.conf is like this:

defaults {
        user_friendly_names no
        polling_interval 30
        rr_min_io 100
        no_path_retry queue
        queue_without_daemon    no
        flush_on_last_del       yes
        max_fds                 max
        pg_prio_calc            avg
}

devices {
        device {
                vendor                  "NETAPP"
                product                 "LUN"
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout            "/sbin/mpath_prio_alua /dev/%n"
                features                "3 queue_if_no_path pg_init_retries 50"
                hardware_handler        "1 alua"
                path_grouping_policy    group_by_prio
                failback                immediate
                rr_weight               uniform
                rr_min_io               128
                path_checker            tur
        }
}
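
With no_path_retry queue plus features "3 queue_if_no_path
pg_init_retries 50" I expect the live map to queue I/O forever once the
last path fails, which would explain the kpartx hang. The running table
should confirm it (WWID from above; the features field is printed at
the start of the multipath table line):

# dmsetup table 360a9800037543544465d424130543773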

Tried both on 5.8 and on the (not yet supported) 5.9:
kernel-2.6.18-348.1.1.el5
device-mapper-multipath-0.4.7-54.el5

Right now the system is blocked:

# uptime
 08:29:50 up 22 days, 13:39,  2 users,  load average: 38.09, 38.03, 38.01

The 22 days of uptime above date back to the reboot I had to do for the
previous occurrence...

I can run any suggested command to check.
Is there any dm command I can run to debug and recover without a
poweroff / poweron cycle?
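
For example, I wonder whether something along these lines would be safe
here (just a sketch, using the WWID from above; fail_if_no_path is the
dm-multipath message that turns off queue_if_no_path on a live map):

# dmsetup message 360a9800037543544465d424130543773 0 fail_if_no_path
# multipath -f 360a9800037543544465d424130543773

My worry is what happens to the I/O that kpartx already has queued once
the map stops queueing.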


Let me know if you need more information.
Thanks,
Gianluca

Other NetApp related info:

- snapdrive.conf parameters different from the shipped default (diff
  output against the stock file, only my side shown):

  default-transport="fcp"         # Transport type to use for storage
                                  # provisioning, when a decision is needed
  multipathing-type="NativeMPIO"  # Multipathing software to use when more
                                  # than one multipathing solution is
                                  # available. Possible values are
                                  # 'NativeMPIO' or 'none'
  san-clone-method="optimal"      # Clone methods for snap connect


- the commands below run in sequence, once for each of the 4 volumes
the system has mounted for the backup (a loop sketch follows the
snippet):

snapdrive snap disconnect -fs $MM
sleep 3
snapdrive snap delete -snapname
NA01-1:/vol/${VOLBASE}_${DBID}_${VV}_vol:${DBID}_SNAP_1
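
i.e. roughly like this (the volume list and the mountpoint variable are
placeholders; the real values come from my backup script):

for VV in v1 v2 v3 v4; do                  # placeholder volume names
    MM=/oradata/${DBID}_${VV}              # placeholder mountpoint
    snapdrive snap disconnect -fs $MM
    sleep 3
    snapdrive snap delete -snapname \
        NA01-1:/vol/${VOLBASE}_${DBID}_${VV}_vol:${DBID}_SNAP_1
done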

- the problem occurs both with
netapp_linux_host_utilities-6-0
netapp.snapdrive-5.0-1

and with the versions installed now:
netapp_linux_host_utilities-6-1
netapp.snapdrive-5.1-1



