[Linux-cluster] Finding the bottleneck between SAN and GFS2

Steven Whitehouse swhiteho at redhat.com
Wed Jul 1 08:06:21 UTC 2015


Hi,

On 30/06/15 20:37, Daniel Dehennin wrote:
> Hello,
>
> We are experiencing slow VMs on our OpenNebula architecture:
>
> - two Dell PowerEdge M620
>    + Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
>    + 96GB RAM
>    + 2x 146GB SAS drives
>
> - 2TB SAN LUN to store qcow2 images with GFS2 over cLVM
>
> We ran some tests, installing the Linux OS in parallel, and we did not find
> any performance issues.
>
> For the last three weeks, 17 users have been running around 60 VMs and
> everything has become slow.
>
> The SAN administrator complains about very high IOPS, so we limited each
> VM to 80 IOPS with the following libvirt configuration:
>
> #+begin_src xml
> <total_iops_sec>80</total_iops_sec>
> #+end_src
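>
> A quick way to confirm that the limit is actually in effect at runtime is
> something like the following (a sketch; the domain name and disk target
> below are placeholders):
>
> #+begin_src sh
> # "one-42" and "vda" are hypothetical names; substitute the real domain
> # name and disk target reported by "virsh domblklist <domain>"
> virsh domblklist one-42
> virsh blkdeviotune one-42 vda
> #+end_src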
>
> But it did not get better.
>
> Today I ran some benchmarks to try to find out what is happening.
>
> Checking plocks/s
> =================
>
> I started with ping_pong[1] to see how many locks per second the GFS2
> filesystem can sustain.
>
> I used it as described on the Samba wiki[2]; here are the results,
> followed by a sketch of the exact recipe:
>
> - starting ”ping_pong /var/lib/one/datastores/test_plock 3” on the first
>    node shows around 4k plocks/s
>
> - then also starting ”ping_pong /var/lib/one/datastores/test_plock 3” on
>    the second node shows around 2k plocks/s on each node
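>
> For reference, the wiki recipe amounts to roughly the following (a sketch;
> it assumes ping_pong builds standalone from the source in [1]):
>
> #+begin_src sh
> # build ping_pong from the ctdb source file referenced in [1]
> gcc -o ping_pong ping_pong.c
>
> # run with <number of nodes> + 1 locks, i.e. 3 for a two-node cluster;
> # start this on the first node, then run the same command on the second
> # node while the first instance is still running
> ./ping_pong /var/lib/one/datastores/test_plock 3
> #+end_src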
>
> For the single-node case I was expecting a much higher rate; the wiki
> speaks of 500k to 1M locks/s.
>
> Do my numbers look strange?
>
> Checking fileio
> ===============
>
> I used “sysbench --test=fileio” to test inside the VM and outside (on the
> bare metal node), with the files either in the cache or with the caches
> dropped.
>
> The short result is that bare metal access to the GFS2 filesystem without
> any cache is terribly slow, around 2Mb/s and 90 requests/s.
>
> Is there a way to find out if the problem comes from my
> GFS2/corosync/pacemaker configuration or from the SAN?
>
> Regards.
>
>
>
> Following are the full sysbench results
>
> In the VM, qemu disk cache disabled, total_iops_sec = 0
> -------------------------------------------------------
>
> I also tried with the IO limit enabled, but the difference is minimal:
>
> - the requests/s drop to ±80
> - the throughput is around 1.2Mb/s
>
>      root at vm:~# sysbench --num-threads=16 --test=fileio --file-total-size=9G --file-test-mode=rndrw prepare
>      sysbench 0.4.12:  multi-threaded system evaluation benchmark
>      
>      128 files, 73728Kb each, 9216Mb total
>      Creating files for the test...
>
>      root at vm:~# sysbench --num-threads=16 --test=fileio --file-total-size=9G --file-test-mode=rndrw run
>      sysbench 0.4.12:  multi-threaded system evaluation benchmark
>      
>      Running the test with following options:
>      Number of threads: 16
>      
>      Extra file open flags: 0
>      128 files, 72Mb each
>      9Gb total file size
>      Block size 16Kb
>      Number of random requests for random IO: 10000
>      Read/Write ratio for combined random IO test: 1.50
>      Periodic FSYNC enabled, calling fsync() each 100 requests.
>      Calling fsync() at the end of test, Enabled.
>      Using synchronous I/O mode
>      Doing random r/w test
>      Threads started!
>      Done.
>      
>      Operations performed:  6034 Read, 4019 Write, 12808 Other = 22861 Total
>      Read 94.281Mb  Written 62.797Mb  Total transferred 157.08Mb  (1.4318Mb/sec)
>         91.64 Requests/sec executed
>      
>      Test execution summary:
>          total time:                          109.7050s
>          total number of events:              10053
>          total time taken by event execution: 464.7600
>          per-request statistics:
>               min:                                  0.01ms
>               avg:                                 46.23ms
>               max:                              11488.59ms
>               approx.  95 percentile:             125.81ms
>      
>      Threads fairness:
>          events (avg/stddev):           628.3125/59.81
>          execution time (avg/stddev):   29.0475/6.34
>
> On the bare metal node, with the caches dropped
> -----------------------------------------------
>
> After creating the 128 files, I dropped the caches to get “from the SAN” results.
>
>      root at nebula1:/var/lib/one/datastores/bench# sysbench --num-threads=16 --test=fileio --file-total-size=9G --file-test-mode=rndrw prepare
>      sysbench 0.4.12:  multi-threaded system evaluation benchmark
>      
>      128 files, 73728Kb each, 9216Mb total
>      Creating files for the test...
>
>      # DROP CACHES
>      root at nebula1: echo 3 > /proc/sys/vm/drop_caches
>      
>      root at nebula1:/var/lib/one/datastores/bench# sysbench --num-threads=16 --test=fileio --file-total-size=9G --file-test-mode=rndrw run
>      sysbench 0.4.12:  multi-threaded system evaluation benchmark
>      
>      Running the test with following options:
>      Number of threads: 16
>      
>      Extra file open flags: 0
>      128 files, 72Mb each
>      9Gb total file size
>      Block size 16Kb
>      Number of random requests for random IO: 10000
>      Read/Write ratio for combined random IO test: 1.50
>      Periodic FSYNC enabled, calling fsync() each 100 requests.
>      Calling fsync() at the end of test, Enabled.
>      Using synchronous I/O mode
>      Doing random r/w test
>      Threads started!
>      Done.
>      
>      Operations performed:  6013 Read, 3999 Write, 12800 Other = 22812 Total
>      Read 93.953Mb  Written 62.484Mb  Total transferred 156.44Mb  (1.5465Mb/sec)
>         98.98 Requests/sec executed
>      
>      Test execution summary:
>          total time:                          101.1559s
>          total number of events:              10012
>          total time taken by event execution: 1109.0862
>          per-request statistics:
>               min:                                  0.01ms
>               avg:                                110.78ms
>               max:                              13098.27ms
>               approx.  95 percentile:             164.52ms
>      
>      Threads fairness:
>          events (avg/stddev):           625.7500/114.50
>          execution time (avg/stddev):   69.3179/6.54
>
>
> On the bare metal node, with the test files in the page cache
> -------------------------------------------------------------
>
> I ran md5sum on all the files to let the kernel cache them.
>
>      # Load files in cache
>      root at nebula1:/var/lib/one/datastores/bench# md5sum test*
>
>      root at nebula1:/var/lib/one/datastores/bench# sysbench --num-threads=16 --test=fileio --file-total-size=9G --file-test-mode=rndrw run
>      sysbench 0.4.12:  multi-threaded system evaluation benchmark
>      
>      Running the test with following options:
>      Number of threads: 16
>      
>      Extra file open flags: 0
>      128 files, 72Mb each
>      9Gb total file size
>      Block size 16Kb
>      Number of random requests for random IO: 10000
>      Read/Write ratio for combined random IO test: 1.50
>      Periodic FSYNC enabled, calling fsync() each 100 requests.
>      Calling fsync() at the end of test, Enabled.
>      Using synchronous I/O mode
>      Doing random r/w test
>      Threads started!
>      Done.
>      
>      Operations performed:  6069 Read, 4061 Write, 12813 Other = 22943 Total
>      Read 94.828Mb  Written 63.453Mb  Total transferred 158.28Mb  (54.896Mb/sec)
>       3513.36 Requests/sec executed
>      
>      Test execution summary:
>          total time:                          2.8833s
>          total number of events:              10130
>          total time taken by event execution: 16.3824
>          per-request statistics:
>               min:                                  0.01ms
>               avg:                                  1.62ms
>               max:                                760.53ms
>               approx.  95 percentile:               5.51ms
>      
>      Threads fairness:
>          events (avg/stddev):           633.1250/146.90
>          execution time (avg/stddev):   1.0239/0.33
>
>
> Footnotes:
> [1]  https://git.samba.org/?p=ctdb.git;a=blob;f=utils/ping_pong/ping_pong.c
>
> [2]  https://wiki.samba.org/index.php/Ping_pong
>
>
>

Why are you using the ping_pong test? Does qemu use fcntl locks? Are you 
trying to share any of those images across nodes? (i.e. mounted on more 
than one node at once?)

What is the raw speed of the block device? I'd also suggest checking the 
files that are created to see if they are being fragmented (the filefrag 
tool will tell you), in case that is the problem.
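
For example, something along these lines on one of the nodes (the LV path
and test file name below are only placeholders for the real ones):

    # raw sequential read speed of the shared block device, bypassing the
    # page cache (the device path is hypothetical)
    dd if=/dev/mapper/vg_san-lv_datastore of=/dev/null bs=1M count=4096 iflag=direct

    # check one of the sysbench test files (or a qcow2 image) for fragmentation
    filefrag -v /var/lib/one/datastores/bench/test_file.0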

Steve.
