[dm-devel] Shell Scripts or Arbitrary Priority Callouts?

Pasi Kärkkäinen pasik at iki.fi
Tue Mar 24 16:36:04 UTC 2009


On Tue, Mar 24, 2009 at 11:43:20AM -0400, John A. Sullivan III wrote:
> I greatly appreciate the help.  I'll answer in the thread below as well
> as consolidating answers to the questions posed in your other email.
> 
> On Tue, 2009-03-24 at 17:01 +0200, Pasi Kärkkäinen wrote:
> > On Tue, Mar 24, 2009 at 08:21:45AM -0400, John A. Sullivan III wrote:
> > > > 
> > > > The core-iscsi developer seems to be actively developing at least the
> > > > new iSCSI target (LIO target).. I think he has been testing it with
> > > > core-iscsi, so maybe there's a newer version somewhere?
> > > > 
> > > > > We did play with the multipath rr_min_io settings and smaller always
> > > > > seemed to be better until we got into very large numbers of session.  We
> > > > > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > > > > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > > > > ports with disktest using 4K blocks to mimic the file system using
> > > > > sequential reads (and some sequential writes).
> > > > > 
> > > > 
> > > > Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> > > > traffic? 
> > > > 
> > 
> > Dunno if you noticed this.. :) 
> We are actually quite enthusiastic about the environment and the
> project.  We hope to have many of these hosting about 400 VServer guests
> running virtual desktops from the X2Go project.  It's not my project but
> I don't mind plugging them as I think it is a great technology.
> 
> We are using jumbo frames.  The ProCurve 2810 switches explicitly state
> to NOT use flow control and jumbo frames simultaneously.  We tried it
> anyway but with poor results.

Ok. 

IIRC the 2810 does not have very big per-port buffers, so you might be better
off using flow control instead of jumbo frames.. then again, I'm not sure how
good HP's flow control implementation is.

The whole point of flow control is to prevent packet loss/drops.. it works by
sending pause frames before the port buffers get full. If the port buffers do
fill up, the switch has no choice but to drop packets.. that causes TCP
retransmits -> added delay, and TCP slows down to prevent further packet drops.

flow control "pause frames" cause less delay than tcp-retransmits. 

Do you see TCP retransmits with "netstat -s"? Check both the target and the initiators.
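
Something like this on both ends gives a quick view:

netstat -s | grep -i retrans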

> > 
> > 
> > > > > 
> > > > 
> > > > When you used dm RAID0 you didn't have any multipath configuration, right? 
> > > Correct although we also did test successfully with multipath in
> > > failover mode and RAID0.
> > > > 
> > 
> > OK.
> > 
> > > > What kind of stripe size and other settings did you have for RAID0?
> > > Chunk size was 8KB with four disks.  
> > > > 
> > 
> > Did you try with much bigger sizes.. 128 kB ?
> We tried slightly larger sizes - 16KB and 32KB I believe and observed
> performance degradation.  In fact, in some scenarios 4KB chunk sizes
> gave us better performance than 8KB.

Ok. 
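
If you want to retest with a bigger chunk, a dm stripe with a 128 kB
(256-sector) chunk can be built roughly like this - the device names are only
examples and this sketch assumes four equally sized disks:

SECTORS=$(( $(blockdev --getsz /dev/sdb) / 256 * 256 ))   # round down to a chunk multiple
dmsetup create iscsi-stripe --table \
  "0 $(( SECTORS * 4 )) striped 4 256 /dev/sdb 0 /dev/sdc 0 /dev/sdd 0 /dev/sde 0"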

> > 
> > > > What kind of performance do you get using just a single iscsi session (and
> > > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > > directly on top of the iscsi /dev/sd? device.
> > > Miserable - same roughly 12 MB/s.
> > 
> > OK, here's your problem. Was this btw reads or writes? Did you tune
> > readahead settings? 
> 12MBps is sequential reading but sequential writing is not much
> different.  We did tweak readahead to 1024. We did not want to go much
> larger in order to maintain balance with the various data patterns -
> some of which are random and some of which may not read linearly.
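
For reference, readahead can be checked and changed per device at runtime with
blockdev (the device name is just an example; the value is in 512-byte
sectors):

blockdev --getra /dev/sdc
blockdev --setra 1024 /dev/sdc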

I did some benchmarking earlier between two servers; one running an ietd
target with 'nullio' and the other running the open-iscsi initiator, both
using a single gigabit NIC.

I remember getting very close to full gigabit speed, at least with bigger
block sizes. I can't remember how much I got with 4 kB blocks.

Those tests were made with dd.

A nullio target is a good way to benchmark your network and initiator and
verify that everything is set up correctly.
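
In ietd.conf that's roughly the following - the IQN and size are only
examples, so check the ietd.conf man page for the exact Lun syntax of your IET
version:

Target iqn.2009-03.fi.example:nullio
        Lun 0 Sectors=20971520,Type=nullio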

It's also good to first test with, for example, FTP and iperf to verify that
the network between the target and the initiator works properly and that all
the other basic settings are correct.
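
For example:

iperf -s                                   # on the target
iperf -c <ip_of_the_iscsi_target> -t 30    # on the initiator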

Btw, have you tuned the TCP stacks of the servers? Bigger default TCP window
size, bigger maximum TCP window size, etc..
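
Something along these lines - the values are only a starting point and should
be tuned for your RAM and RTT:

sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"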

> > 
> > Can you paste your iSCSI session settings negotiated with the target? 
> Pardon my ignorance :( but, other than packet traces, how do I show the
> final negotiated settings?

Try:

iscsiadm -i -m session
iscsiadm -m session -P3
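
The -P3 output should include the negotiated parameters; the ones worth
checking for throughput are things like MaxRecvDataSegmentLength,
MaxBurstLength, FirstBurstLength, ImmediateData and InitialR2T, e.g.:

iscsiadm -m session -P3 | grep -iE 'MaxRecv|MaxBurst|FirstBurst|ImmediateData|InitialR2T'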


> > 
> > > > 
> > > > Sounds like there's some other problem if individual throughput is bad? Or did
> > > > you mean performance with a single disktest IO thread is bad, but using multiple
> > > > disktest threads it's good.. that would make more sense :) 
> > > Yes, the latter.  Single thread (I assume mimicking a single disk
> > > operation, e.g., copying a large file) is miserable - much slower than
> > > local disk despite the availability of huge bandwidth.  We start
> > > utilizing the bandwidth when multiplying concurrent disk activity into
> > > the hundreds.
> > > 
> > > I am guessing the single thread performance problem is an open-iscsi
> > > issue but I was hoping multipath would help us work around it by
> > > utilizing multiple sessions per disk operation.  I suppose that is where
> > > we run into the command ordering problem unless there is something else
> > > afoot.  Thanks - John
> > 
> > You should be able to get many times the throughput you get now.. just with
> > a single path/session.
> > 
> > What kind of latency do you have from the initiator to the target/storage? 
> > 
> > Try with for example 4 kB ping:
> > ping -s 4096 <ip_of_the_iscsi_target>
> We have about 400 microseconds - that seems a bit high :(
> rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms
> 

Yeah.. that's a bit high. 

> > 
> > 1000ms divided by the roundtrip you get from ping should give you maximum
> > possible IOPS using a single path.. 
> > 
> 1000 / 0.4 = 2500
> > 4 kB * IOPS == max bandwidth you can achieve.
> 2500 * 4KB = 10 MBps
> Hmm . . . seems like what we are getting.  Is that an abnormally high
> latency? We have tried playing with interrupt coalescing on the
> initiator side but without significant effect.  Thanks for putting
> together the formula for me.  Not only does it help me understand but it
> means I can work on addressing the latency issue without setting up and
> running disk tests.
> 

I think Ross suggested in some other thread the following settings for e1000
NICs:

"Set the e1000s InterruptThrottleRate=1 and their TxRingBufferSize=4096
and RxRingBufferSize=4096 (verify those option names with a modinfo)
and add those to modprobe.conf."
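
On most distros that means checking the exact parameter names with modinfo and
then adding an options line to modprobe.conf, something like the following
(the throttle value is from Ross's suggestion, one value per e1000 port; take
the ring size parameter names from the modinfo output before using them):

modinfo -p e1000

and then in /etc/modprobe.conf:

options e1000 InterruptThrottleRate=1,1,1,1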

> I would love to use larger block sizes as you suggest in your other
> email but, on AMD64, I believe we are stuck with 4KB.  I've not seen any
> way to change it and would gladly do so if someone knows how.
> 

Are we talking about filesystem block sizes? That shouldn't be a problem as
long as your application uses larger block sizes for its read/write operations..

Try for example with:
dd if=/dev/zero of=/iscsilun/file.bin bs=1024k count=1024

and optionally add "oflag=direct" (or iflag=direct) if you want to make sure 
caches do not mess up the results. 

> CFQ was indeed a problem.  It would not scale with increasing the number
> of threads.  noop, deadline, and anticipatory all fared much better.  We
> are currently using noop for the iSCSI targets.  Thanks again - John
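
noop (or deadline) is usually a good choice on top of iSCSI, since the target
end does its own reordering anyway. For reference, the scheduler can be
checked and switched per device at runtime (the device name is just an
example):

cat /sys/block/sdc/queue/scheduler
echo noop > /sys/block/sdc/queue/scheduler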

Yep. And no problem.. hopefully I'm able to help and point you in the right
direction :)

-- Pasi



