[dm-devel] system hang issues with mirror target

Tim Burgess tim.burgess at anu.edu.au
Wed Feb 16 10:37:47 UTC 2005


Hi folks,

As promised, here are the details of the system hangs I've experienced
using the dm mirror target.  In addition, there are some serious
performance issues I've noticed using the vanilla kernel (as compared
with SuSE SLES9 SP1).

I have been testing two configurations; both fail under heavy i/o
activity.  The system under test has 16 fibre channel JBOD drives
connected to 4 QLogic 23xx HBAs on a 4-processor SGI sn2 Itanium system.

We can reliably see ~680MB/s write performance from this setup using the
stripe target across all 16 drives.  The issue comes with mirroring.

Note - we REALLY have an interest in resolving these issues, so I'm more
than happy to test any changes on this system with relatively quick (as
much as timezone differences permit) turnaround.

The desired configuration is a single stripe over 8 mirror pairs.  Each
of the underlying devices should eventually be a multipath target, but
for illustrative purposes we aren't bothering (in addition, the mirror
should prevent applications from seeing a failure if a single path fails).
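
For readers without the attachment, the layering described above can be
sketched as device-mapper tables.  This is only an illustration: the
device names, the per-disk length (2**27 sectors), the 128-sector chunk
size and the 1024-sector mirror region size are all assumptions, not the
values used on the test system.  The table syntax follows the standard
"mirror" (core log) and "striped" targets.

```python
# Sketch: build dm tables for one stripe over N two-way mirrors.
# All device names and sizes here are hypothetical placeholders.

def mirror_table(length, dev_a, dev_b, region_size=1024):
    """One 'mirror' table line: in-memory (core) log, two legs at offset 0."""
    return f"0 {length} mirror core 1 {region_size} 2 {dev_a} 0 {dev_b} 0"


def striped_table(length, devices, chunk=128):
    """One 'striped' table line across the given devices, each at offset 0."""
    legs = " ".join(f"{dev} 0" for dev in devices)
    return f"0 {length} striped {len(devices)} {chunk} {legs}"


def stripe_over_mirrors(disks, disk_sectors, chunk=128):
    """Pair up consecutive disks as mirrors, then stripe over the mirrors."""
    pairs = [(disks[i], disks[i + 1]) for i in range(0, len(disks), 2)]
    tables = {}
    for n, (a, b) in enumerate(pairs):
        tables[f"m{n}"] = mirror_table(disk_sectors, a, b)
    mirror_devs = [f"/dev/mapper/m{n}" for n in range(len(pairs))]
    tables["stripe"] = striped_table(disk_sectors * len(pairs), mirror_devs, chunk)
    return tables


if __name__ == "__main__":
    disks = [f"/dev/sd{c}" for c in "abcdefghijklmnop"]  # 16 assumed names
    for name, table in stripe_over_mirrors(disks, 2**27).items():
        print(name, "->", table)
```

Each printed table could then be loaded with something like
`dmsetup create m0 --table "<table>"`, mirrors first and the stripe last.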

Under SuSE SLES9 SP1, this configuration can sustain ~340MB/s to the
mirror target (i.e. 680MB/s to disk).  However, after a seemingly
random period of time (minutes, not hours) of heavy i/o (repeated dd's
of the whole stripe device), the kernel panics (dump attached).

The behaviour can also be reproduced with only a single pair of mirrors,
no stripe, by running one dd to each mirror simultaneously.

The configuration info and sysrq-p outputs are attached in the file
stack_trace_dm_deadhang.

Using the vanilla (2.6.11-rc4) kernel, however, the performance we see
from this configuration is dismal - about 120MB/s to the mirror.
Watching iostat, it appears that i/o is occurring to only one mirror pair
at a time, usually for 5-10 seconds, before another pair becomes active
(there is clearly some overlap, since 120MB/s is more than a single
mirrored pair of disks is able to sustain, but the observation holds).
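
The throughput figures quoted in this message can be cross-checked with
a little arithmetic.  The sketch below assumes that each drive
contributes an equal share of the ~680MB/s stripe rate and that a
mirrored pair writes at the rate of a single drive (every block goes to
both legs); both are simplifying assumptions, not measurements.

```python
# Back-of-envelope check of the quoted rates (all values in MB/s).

def per_drive_rate(stripe_rate, drives):
    """Assumed even share of the aggregate stripe rate per drive."""
    return stripe_rate / drives


def ideal_mirror_rate(stripe_rate, drives, pairs):
    """Logical write rate if every mirror pair runs at one drive's rate."""
    return per_drive_rate(stripe_rate, drives) * pairs


def equivalent_busy_pairs(observed_rate, stripe_rate, drives):
    """How many fully busy pairs the observed logical rate corresponds to."""
    return observed_rate / per_drive_rate(stripe_rate, drives)


if __name__ == "__main__":
    print(per_drive_rate(680, 16))                       # share per drive
    print(ideal_mirror_rate(680, 16, 8))                 # expected mirror rate
    print(round(equivalent_busy_pairs(120, 680, 16), 2)) # pairs' worth of i/o
```

Under these assumptions the ideal mirror rate comes out at 340MB/s,
matching the SLES9 figure, while 120MB/s corresponds to fewer than three
of the eight pairs being busy on average, which is consistent with the
iostat observation.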

As a result (???) of the poor performance I have not reproduced the
error on 2.6.11-rc4.

HOWEVER (stay tuned folks), by reordering the layers (mirror over two
stripes) I was able to restore the full performance observed under SuSE,
as well as the crash observed under SuSE (not a panic this time, just a
complete system lock).
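
The reordered layout (one mirror over two stripes) can be sketched the
same way.  As before this is illustrative only: the device names, the
split of the 16 drives into the first eight and the last eight, the
per-disk length, the chunk size and the region size are assumptions; the
actual tables are in the attached file.

```python
# Sketch: build dm tables for one two-way mirror over two stripes.
# All device names and sizes here are hypothetical placeholders.

def striped_table(length, devices, chunk=128):
    """One 'striped' table line across the given devices, each at offset 0."""
    legs = " ".join(f"{dev} 0" for dev in devices)
    return f"0 {length} striped {len(devices)} {chunk} {legs}"


def mirror_over_stripes(disks, disk_sectors, chunk=128, region_size=1024):
    """Stripe each half of the disks, then mirror the two stripes."""
    half = len(disks) // 2
    stripe_len = disk_sectors * half
    tables = {
        "s0": striped_table(stripe_len, disks[:half], chunk),
        "s1": striped_table(stripe_len, disks[half:], chunk),
    }
    # Mirror with an in-memory (core) log over the two stripe devices.
    tables["mirror"] = (
        f"0 {stripe_len} mirror core 1 {region_size} "
        "2 /dev/mapper/s0 0 /dev/mapper/s1 0"
    )
    return tables


if __name__ == "__main__":
    disks = [f"/dev/sd{c}" for c in "abcdefghijklmnop"]  # 16 assumed names
    for name, table in mirror_over_stripes(disks, 2**27).items():
        print(name, "->", table)
```

Here a single mirror target carries all of the i/o, whereas in the
stripe-over-mirrors layout the load is spread over eight mirror targets.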

The configuration information, SysRq-p and SysRq-t outputs are attached
in the file stack_trace_dm_livehang_vanilla.

NOTE: sysrq-t removed - I suspect my first post didn't get through 
because it was too large...

Thanks again for your work,

Tim





-- 
--------------------------------------------------------------------------
                                      ANU Supercomputer Facility
     tim.burgess at anu.edu.au           and APAC National Facility
     Phone: +61 2 6125 1431           Leonard Huxley Bldg (No. 56)
     Fax:   +61 2 6125 8199           Australian National University
                                      Canberra, ACT, 0200, Australia
--------------------------------------------------------------------------
    "Money can buy bandwidth, but latency is forever" -- John Mashey
--------------------------------------------------------------------------






-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: stack_trace_dm_deadhang
URL: <http://listman.redhat.com/archives/dm-devel/attachments/20050216/2b54ce82/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: s_t_dm_livehang_vanilla_noT
URL: <http://listman.redhat.com/archives/dm-devel/attachments/20050216/2b54ce82/attachment-0001.ksh>

