[linux-lvm] cache on SSD makes system unresponsive

John Stoffel john at stoffel.org
Fri Oct 20 19:35:00 UTC 2017


>>>>> "Oleg" == Oleg Cherkasov <o1e9 at member.fsf.org> writes:

Oleg> On 19. okt. 2017 21:09, John Stoffel wrote:
>> 
Oleg> Recently I decided to try out the LVM cache feature on one of
Oleg> our Dell NX3100 servers running CentOS 7.4.1708 with a 110TB disk
Oleg> array (hardware RAID5 with H710 and H830 Dell adapters).  Two
Oleg> 256GB SSD disks are in hardware RAID1 on the H710 adapter
Oleg> with primary and extended partitions, so I decided to make a ~240GB
Oleg> LVM cache to see if system I/O might be improved.  The server is
Oleg> running the Bareos storage daemon and, besides sshd and Dell
Oleg> OpenManage monitoring, does not have any other services.
Oleg> Unfortunately testing did not go as I expected; nonetheless, in
Oleg> the end the system is up and running with no data corrupted.
>> 
>> Can you give more details about the system?  Is it providing storage
>> services (NFS) or is it just a backup server?

Oleg> It is just a backup server, Bareos Storage Daemon + Dell
Oleg> OpenManage for the LSI RAID cards (Dell's H7XX and H8XX are LSI
Oleg> based).  That host deliberately does not share any files or
Oleg> resources for security reasons, so no NFS or SMB.

Well... if it's a backup server, then I suspect caching won't help
much, because you're mostly doing streaming writes with very few
reads.  The cache is designed to help the *read* case more.  And for
a backup server, you're writing one or just a couple of streams at
once, which is a fairly ideal workload for RAID5.
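
If you do keep a cache on a write-heavy box, it's worth being
deliberate about the cache mode.  A quick sketch of switching and
inspecting it (the VG/LV names here are placeholders, not yours):

  # Switch an existing cached LV between modes (names are placeholders):
  lvchange --cachemode writethrough vg/cached_lv
  lvchange --cachemode writeback    vg/cached_lv

  # The kernel's view of the cache target, which includes the active mode:
  dmsetup status vg-cached_lv

Writethrough only speeds up re-reads of cached blocks; writeback can
absorb write bursts, but dirty blocks then live on the SSD pair.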

Oleg> The server has 2x 256GB SSD drives and 10x 3TB drives.  In
Oleg> addition there are two MD1200 disk arrays attached, with 12x 4TB
Oleg> disks each.  All disks are exposed to CentOS as virtual disks,
Oleg> so there are 4 disks in total:

Oleg> NAME                                      MAJ:MIN RM   SIZE RO TYPE
Oleg> sda                                         8:0    0 278.9G  0 disk
Oleg> ├─sda1                                      8:1    0   500M  0 part /boot
Oleg> ├─sda2                                      8:2    0  36.1G  0 part
Oleg> │ ├─centos-swap                           253:0    0  11.7G  0 lvm  [SWAP]
Oleg> │ └─centos-root                           253:1    0  24.4G  0 lvm
Oleg> ├─sda3                                      8:3    0     1K  0 part
Oleg> └─sda5                                      8:5    0 242.3G  0 part
Oleg> sdb                                         8:16   0    30T  0 disk
Oleg> └─primary_backup_vg-primary_backup_lv     253:5    0 110.1T  0 lvm
Oleg> sdc                                         8:32   0    40T  0 disk
Oleg> └─primary_backup_vg-primary_backup_lv     253:5    0 110.1T  0 lvm
Oleg> sdd                                         8:48   0    40T  0 disk
Oleg> └─primary_backup_vg-primary_backup_lv     253:5    0 110.1T  0 lvm

Oleg> RAM is 12GB, with around 12GB of swap as well.  /dev/sda is a
Oleg> hardware RAID1, the rest are RAID5.

Interesting, it's all hardware RAID devices from what I can see.  

Oleg> I made a cache and cache_meta on /dev/sda5.  It had been the
Oleg> Bareos spool partition for quite some time, but after upgrading
Oleg> to a 10Gb network I no longer need that spooler, so I decided to
Oleg> try LVM cache.

Can you show the *exact* commands you used to make the cache?  Are
you using lvcache, or bcache?  They're two totally different beasts.
I looked into bcache in the past, but since you can't remove it from
an LV, I decided not to use it.  I use lvcache like this:

> sudo lvs data
  LV          VG   Attr       LSize   Pool           Origin        Data%  Meta%  Move Log Cpy%Sync Convert
  home        data Cwi-aoC--- 650.00g home_cache     [home_corig]
  home_cache  data Cwi---C--- 130.00g
  local       data Cwi-aoC--- 335.00g [localcacheLV] [local_corig]
so I'm wondering exactly which caching setup you're using. 
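
For reference, the usual lvmcache recipe looks something like the
following.  This is a sketch only, with made-up VG/LV names and sizes
(yours will differ):

  # Add the SSD partition to the existing VG:
  pvcreate /dev/sda5
  vgextend backup_vg /dev/sda5

  # Carve out cache data and metadata LVs on the SSD:
  lvcreate -L 239G -n cache0      backup_vg /dev/sda5
  lvcreate -L 256M -n cache0_meta backup_vg /dev/sda5

  # Bind them into a cache pool, then attach it to the origin LV:
  lvconvert --type cache-pool --poolmetadata backup_vg/cache0_meta backup_vg/cache0
  lvconvert --type cache --cachepool backup_vg/cache0 backup_vg/backup_lv

If your commands differed much from that shape, it would be good to
know where.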

>> How did you set up your LVM config and your cache config?  Did you
>> mirror the two SSDs using MD, then add the device into your VG and use
>> that to set up the lvcache?

Oleg> All configs are stock CentOS 7.4 at the moment (incrementally
Oleg> upgraded from 7.0, of course), so I have not customized or tried
Oleg> to make any optimizations to the config.

Ok, good to know.

>> I ask because I'm running lvcache at home on my main file/kvm server
>> and I've never seen this problem.  But!  I suspect you're running a
>> much older kernel, lvm config, etc.  Please post the full details of
>> your system if you can.
Oleg> 3.10.0-693.2.2.el7.x86_64

Oleg> CentOS 7.4, as was pointed out by Xen, was released about a
Oleg> month ago, and I updated about a week ago while doing planned
Oleg> maintenance on the network, so I had a good excuse to reboot it.

Oleg> Initially I tried the default writethrough mode, and after
Oleg> running a dd read test with a 250GB file the system became
Oleg> unresponsive for roughly 15 minutes with cache allocation around
Oleg> 50%.  Writing to disks did seem to speed up the system, though
Oleg> only marginally, around 10% in my tests, and I did manage to
Oleg> pull more than 32TB via backup from different hosts; once the
Oleg> system became unresponsive to ssh and ICMP requests, though only
Oleg> for a very short time.

This isn't good.  Can you post more details about your LV setup please?  
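
Something like this would show the whole stack, including the hidden
cache sub-LVs:

  # List all LVs, hidden cache components included, with their devices:
  lvs -a -o +devices

  # And the device-mapper tables for good measure:
  dmsetup table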

>> Can you run 'top' or 'vmstat -admt 10' on the console while you're
>> running your tests to see what the system does?  How does memory look
>> on this system when you're NOT running lvcache?

Oleg> Well, it is a production system and I am not planning to cache
Oleg> it again for testing; however, if any patches become available I
Oleg> will try a similar test on a spare box before converting it to
Oleg> FreeBSD with ZFS.

How was the performance before your caching tests?  Are you looking
for better compression of your backups?  I've used bacula (which
Bareos is based on) for years, but recently gave up because doing
restores sucked.  Sorry for the side note.  :-)
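
If you ever get a window on that spare box, a rough streaming baseline
without the cache would be useful for comparison; something like this
(the file path is a placeholder; oflag/iflag=direct bypasses the page
cache so you measure the disks, not RAM):

  dd if=/dev/zero of=/backup/ddtest bs=1M count=10240 oflag=direct
  dd if=/backup/ddtest of=/dev/null bs=1M iflag=direct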

Oleg> Nonetheless I tried to run top during the dd read test;
Oleg> however, within the first few minutes I did not notice any
Oleg> issues with RAM.  The system was using less than 2GB of 12GB and
Oleg> the rest was cache/buffers.  After a few minutes the system
Oleg> became unresponsive, even dropping ICMP ping requests; the ssh
Oleg> session froze and then was dropped after a timeout, so there was
Oleg> no way to check the top measurements.

Any messages from the console?  
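
If the journal or syslog survived the reboot, the previous boot may
still have a trace; for example:

  # Kernel messages from the previous boot (needs a persistent journal):
  journalctl -k -b -1

  # Hung-task warnings often land in syslog too:
  grep -i 'hung task' /var/log/messages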

Oleg> I have recovered some of the SAR records, and I can see that SAR
Oleg> did not manage to log anything for the last 20 minutes, from
Oleg> 2:40pm to 3:00pm, before the system was rebooted and came back
Oleg> online at 3:10pm:

Oleg> User stat:
Oleg> 02:00:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
Oleg> 02:10:01 PM     all      0.22      0.00      0.08      0.05      0.00     99.64
Oleg> 02:20:35 PM     all      0.21      0.00      5.23     20.58      0.00     73.98
Oleg> 02:30:51 PM     all      0.23      0.00      0.43     31.06      0.00     68.27
Oleg> 02:40:02 PM     all      0.06      0.00      0.15     18.55      0.00     81.24
Oleg> Average:        all      0.19      0.00      1.54     17.67      0.00     80.61

That looks ok to me... nothing obvious there at all.

Oleg> I/O stat:
Oleg> 02:00:01 PM       tps      rtps      wtps   bread/s   bwrtn/s
Oleg> 02:10:01 PM      5.27      3.19      2.08    109.29    195.38
Oleg> 02:20:35 PM   4404.80   3841.22    563.58 971542.00 140195.66
Oleg> 02:30:51 PM   1110.49    586.67    523.83 148206.31 131721.52
Oleg> 02:40:02 PM    510.72    211.29    299.43  51321.12  76246.81
Oleg> Average:      1566.86   1214.43    352.43 306453.67  88356.03


Are you writing to a spool disk before you then write the data into
bacula's backup system?


Oleg> DMs:
Oleg> 02:00:01 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
Oleg> Average:       dev8-0    370.04    853.43  88355.91    241.08     85.32    230.56      1.61     59.54
Oleg> Average:      dev8-16      0.02      0.14      0.02      8.18      0.00      3.71      3.71      0.01
Oleg> Average:      dev8-32   1196.77 305599.78      0.04    255.35      4.26      3.56      0.09     11.28
Oleg> Average:      dev8-48      0.02      0.35      0.06     18.72      0.00     17.77     17.77      0.04
Oleg> Average:     dev253-0    151.59    118.15   1094.56      8.00     13.60     89.71      2.07     31.36
Oleg> Average:     dev253-1     15.01    722.81     53.73     51.73      3.08    204.85     28.35     42.56
Oleg> Average:     dev253-2   1259.48 218411.68      0.07    173.41      0.21      0.16      0.08      9.98
Oleg> Average:     dev253-3    681.29      1.27  87189.52    127.98    163.02    239.29      0.84     57.12
Oleg> Average:     dev253-4      3.83     11.09     18.09      7.61      0.09     22.59     10.72      4.11
Oleg> Average:     dev253-5   1940.54 305599.86      0.07    157.48      8.47      4.36      0.06     11.24


That's really bursty traffic... 


Oleg> dev253-2 is the cache, or actually was...

Oleg> Queue stat:
Oleg> 02:00:01 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
Oleg> 02:10:01 PM         1       302      0.09      0.05      0.05         0
Oleg> 02:20:35 PM         0       568      6.87      9.72      5.28         3
Oleg> 02:30:51 PM         1       569      5.46      6.83      5.83         2
Oleg> 02:40:02 PM         0       568      0.18      2.41      4.26         1
Oleg> Average:            0       502      3.15      4.75      3.85         2

Oleg> RAM stat:
Oleg> 02:00:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
Oleg> 02:10:01 PM    256304  11866580     97.89     66860   9181100   2709288     11.10   5603576   5066808        32
Oleg> 02:20:35 PM    185160  11937724     98.47     56712     39104   2725476     11.17    299256    292604        16
Oleg> 02:30:51 PM    175220  11947664     98.55     56712     29640   2730732     11.19    113912    113552        24
Oleg> 02:40:02 PM  11195028    927856      7.65     57504     62416   2696248     11.05    119488    164076        16
Oleg> Average:      2952928   9169956     75.64     59447   2328065   2715436     11.12   1534058   1409260        22

Oleg> SWAP stat:
Oleg> 02:00:01 PM kbswpfree kbswpused  %swpused  kbswpcad   %swpcad
Oleg> 02:10:01 PM  12010984    277012      2.25     71828     25.93
Oleg> 02:20:35 PM  11048040   1239956     10.09     88696      7.15
Oleg> 02:30:51 PM  10723456   1564540     12.73     38272      2.45
Oleg> 02:40:02 PM  10716884   1571112     12.79     77928      4.96
Oleg> Average:     11124841   1163155      9.47     69181      5.95

I think you're running into a Red Hat bug at this point.  I'd probably
move to Debian and run my own kernel with the latest patches for MD, etc.

You might even be running into problems with your HW RAID controllers
and how Linux talks to them.
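
Since you already have OpenManage on there, it might be worth pulling
the controller and vdisk state to help rule that out; something like:

  # Dell OpenManage CLI, assuming it's installed as you describe:
  omreport storage controller
  omreport storage vdisk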

Any chance you could post more details?

Good luck!
John



