[dm-devel] Shell Scripts or Arbitrary Priority Callouts?

John A. Sullivan III jsullivan at opensourcedevel.com
Tue Mar 24 11:02:41 UTC 2009


On Tue, 2009-03-24 at 09:39 +0200, Pasi Kärkkäinen wrote:
> On Mon, Mar 23, 2009 at 05:46:36AM -0400, John A. Sullivan III wrote:
> > On Sun, 2009-03-22 at 17:27 +0200, Pasi Kärkkäinen wrote:
> > > On Fri, Mar 20, 2009 at 06:01:23AM -0400, John A. Sullivan III wrote:
> > > > > 
> > > > > John:
> > > > > 
> > > > > Thanks for the reply.
> > > > > 
> > > > > I ended up writing a small C program to do the priority computation for me.
> > > > > 
> > > > > I have two sets of FC-AL shelves attached to two dual-channel Qlogic
> > > > > cards. That gives me two paths to each disk. I have about 56 spindles
> > > > > in the current configuration, and am tying them together with md
> > > > > software raid.
> > > > > 
> > > > > Now, even though each disk says it handles concurrent I/O on each
> > > > > port, my testing indicates that throughput drops by about half with
> > > > > multibus (from ~60 MB/sec sustained I/O with failover to ~35 MB/sec
> > > > > with multibus).
> > > > > 
> > > > > However, with failover, I am effectively using only one channel on
> > > > > each card. With my custom priority callout, I more or less match the
> > > > > even-numbered disks to the even-numbered SCSI channels with a higher
> > > > > priority, and likewise the odd-numbered disks to the odd-numbered
> > > > > channels. The odds are secondary on even and vice versa. It seems to
> > > > > work rather well, and appears to spread the load nicely.
> > > > > 
> > > > > Thanks again for your help!
> > > > > 
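
The even/odd mapping above could be expressed as a callout along these
lines. This is only a rough shell sketch of the idea, not the actual C
program; it assumes multipath.conf invokes it with the kernel device name
(e.g. prio_callout "/usr/local/sbin/evenodd_prio %n"), that the disk
"number" is the SCSI target ID, and that a higher printed number means a
more preferred path:

    #!/bin/sh
    # Rough sketch only: prefer paths whose SCSI channel parity matches
    # the target ID parity; the mismatched pairing becomes secondary.
    dev=$1                                   # e.g. "sdc"

    # host:channel:id:lun of this path, taken from sysfs
    hctl=$(basename "$(readlink /sys/block/$dev/device)")
    channel=$(echo "$hctl" | cut -d: -f2)
    target=$(echo "$hctl" | cut -d: -f3)

    if [ $((channel % 2)) -eq $((target % 2)) ]; then
            echo 50          # preferred path
    else
            echo 10          # secondary path
    fi
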
> > > > I'm really glad you brought up the performance problem. I had posted
> > > > about it a few days ago but it seems to have gotten lost.  We are really
> > > > struggling with performance issues when attempting to combine multiple
> > > > paths (in the case of multipath to one big target) or targets (in the
> > > > case of software RAID0 across several targets) rather than using, in
> > > > effect, JBODs.  In our case, we are using iSCSI.
> > > > 
> > > > Like you, we found that using multibus caused an almost linear drop in
> > > > performance.  Round-robin across two paths gave half the aggregate
> > > > throughput of two separate disks; across four paths, one fourth.
> > > > 
> > > > We also tried striping across the targets with software RAID0 combined
> > > > with failover multipath - roughly the same effect.
> > > > 
> > > > We really don't want to be forced to treat SAN-attached disks as
> > > > JBODs.  Has anyone cracked this problem of using them in either multibus
> > > > or RAID0, so we can present them as a single device to the OS and still
> > > > load balance across multiple paths?  This is a HUGE problem for us so
> > > > any help is greatly appreciated.  Thanks - John
> > > 
> > > Hello.
> > > 
> > > Hmm.. just a guess, but could this be related to the fact that if your paths
> > > to the storage are different iSCSI sessions (open-iscsi _doesn't_ support
> > > multiple connections per session, aka MC/s), then there is a separate SCSI
> > > command queue per path, and if SCSI requests are split across those queues
> > > they can arrive out of order, which causes the performance drop?
> > > 
> > > See:
> > > http://www.nabble.com/round-robin-with-vmware-initiator-and-iscsi-target-td21958346.html
> > > 
> > > Especially the reply from Ross (CC). Maybe he has some comments :) 
> > > 
> > > -- Pasi
> > <snip>
> > I'm trying to spend a little time on this today and am really feeling my
> > ignorance on the way iSCSI works :(  It looks like linux-iscsi supports
> > MC/S but has not been in active development and will not even compile on
> > my 2.6.27 kernel.
> > 
> > To simplify matters, I did put each SAN interface on a separate network.
> > Thus, all the different sessions.  If I place them all on the same
> > network and use the iface parameters of open-iscsi, does that eliminate
> > the out-of-order problem and allow me to achieve the performance
> > scalability I'm seeking from dm-multipath in multibus mode? Thanks -
> > John
> 
> If you use the ifaces feature of open-iscsi, you still get separate sessions.
> 
> open-iscsi just does not support MC/s :(
> 
> I think core-iscsi does support MC/s.. 
> 
> Then again, you should play with the different multipath settings and
> tweak how often I/Os are split across the different paths, etc. Maybe that helps.
> 
> -- Pasi
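
As an aside on the ifaces point above, binding sessions to individual
NICs looks roughly like the commands below (the interface names, portal
address and target IQN are only examples); note that each bound iface
still logs in as its own session rather than as an additional connection
within one session:

    # One iface record per NIC (names and addresses are examples only)
    iscsiadm -m iface -I iface.eth2 -o new
    iscsiadm -m iface -I iface.eth2 -o update -n iface.net_ifacename -v eth2
    iscsiadm -m iface -I iface.eth3 -o new
    iscsiadm -m iface -I iface.eth3 -o update -n iface.net_ifacename -v eth3

    # Discovery records the target once per iface; login then creates
    # one session per iface
    iscsiadm -m discovery -t st -p 192.168.10.1 -I iface.eth2
    iscsiadm -m discovery -t st -p 192.168.10.1 -I iface.eth3
    iscsiadm -m node -T iqn.2009-03.com.example:san0 -l
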
<snip>
I think we're pretty much at the end of our options here, but I'll
document what I've found thus far for closure.

Indeed, there seems to be no way around the session problem.  Core-iscsi
does seem to support MC/s but has not been updated in years.  It did not
compile with my 2.6.27 kernel and, given that others seem to have had
the same problem, I did not spend a lot of time troubleshooting it.

We did play with the multipath rr_min_io setting, and smaller always
seemed to be better until we got into very large numbers of sessions.  We
were testing on a dual quad-core AMD Shanghai 2378 system with 32 GB of
RAM, a quad-port Intel e1000 card and two on-board NVIDIA forcedeth
ports, running disktest with 4K blocks (to mimic the file system) doing
sequential reads (and some sequential writes).
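
For reference, the rr values below are the multipath.conf rr_min_io
setting, which controls how many I/Os are sent down one path before
round-robin moves to the next.  A minimal sketch of the kind of stanza
being varied (the values and policy shown are illustrative, not a
recommendation):

    defaults {
            path_grouping_policy    multibus
            path_selector           "round-robin 0"
            rr_min_io               10      # 1, 10 and 100 were compared
    }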

With a single thread, there was no difference at all - only about 12.79
MB/s no matter what we did.  With 10 threads and only two interfaces,
there was only a slight difference between rr=1 (81.2 MB/s), rr=10
(78.87 MB/s) and rr=100 (80 MB/s).

However, when we opened it up to three and four interfaces, there was a
huge jump for rr=1 (100.4 and 105.95 MB/s) versus rr=10 (80.5 and 80.75
MB/s) and rr=100 (74.3 and 77.6 MB/s).

At 100 threads on three or four ports, the best performance shifted to
rr=10 (327 and 335 MB/s) rather than rr=1 (291.7 and 290.1 MB/s) or
rr=100 (216.3 MB/s).  At 400 threads, rr=100 started to overtake rr=10
slightly.

This was using all e1000 interfaces.  Our first four-port test included
one of the on-board ports, and performance was dramatically lower than
with three e1000 ports.  Subsequent testing that tweaked the forcedeth
parameters away from their defaults yielded no improvement.

After solving the I/O scheduler problem, dm RAID0 behaved better.  It
still did not give us anywhere near a fourfold increase (four disks on
four separate ports), only a marginal improvement (14.3 MB/s), using an
8 KB chunk size (small enough to fit into a jumbo frame, matching the
zvol block size on the back end, and equal to two file system blocks).
It did, however, give the best overall balance: just slightly slower
than rr=1 at 10 threads and slightly slower than rr=10 at 100 threads,
though it did not scale as well to 400 threads.
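
That sort of stripe can be built along the following lines.  This is
only a sketch: the device names are placeholders, dmsetup (rather than
mdadm) is an assumption, and an 8 KB chunk is 16 sectors in
device-mapper terms:

    # Sketch: 4-way stripe with an 8 KiB (16-sector) chunk over four
    # multipath devices.  The mpath names are placeholders; all four
    # devices are assumed to be the same size.
    SIZE=$(( $(blockdev --getsz /dev/mapper/mpath0) * 4 ))
    TABLE="0 $SIZE striped 4 16"
    for d in 0 1 2 3; do
            TABLE="$TABLE /dev/mapper/mpath$d 0"
    done
    echo "$TABLE" | dmsetup create stripe0

The md equivalent would be something like mdadm --create /dev/md0
--level=0 --raid-devices=4 --chunk=8 over the same four devices, since
mdadm's --chunk is in KiB.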

Thus, collective throughput is acceptable but individual throughput is
still awful.

Thanks, all - John
-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan at opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society




