[linux-lvm] Terrible read performance with LVM
Bernhard Beck
bernhard at gmail.com
Mon May 31 18:07:09 UTC 2004
Hi all,
I'm fighting with terrible read performance on some RedHat 9 based server
systems when dealing with large log files. The symptoms are as follows:
- logrotate on large (> ~10MB) log files in /usr/local/logs takes
forever (on the order of minutes), even without compression
- sendmail processes doing I/O on the same volume (chrooted sendmail
in /usr/local/sendmail) hang in disk I/O while logrotate is running
(see the ps check below)
- detailed output from pv/vg/lvdisplay is at the end of the email
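As a quick sanity check on the second symptom, this is how I confirm
the sendmail processes are sitting in uninterruptible disk sleep
(state D) while logrotate runs; the wchan column shows where in the
kernel they are blocked:
s35:~# ps -eo pid,stat,wchan,args | grep sendmail
Anything with a D in the STAT column is stuck waiting on I/O.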
ENVIRONMENT:
- hardware
dual xeon 2.4
2GB memory
2x 120GB IDE drives (each the master on its own channel)
s35:~# hdparm /dev/hda
/dev/hda:
multcount = 16 (on)
IO_support = 3 (32-bit w/sync)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 14593/255/63, sectors = 234441648, start = 0
s35:~# hdparm /dev/hdc
/dev/hdc:
multcount = 16 (on)
IO_support = 3 (32-bit w/sync)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 14593/255/63, sectors = 234441648, start = 0
s35:~# hdparm -t /dev/hda
/dev/hda:
Timing buffered disk reads: 64 MB in 1.18 seconds = 54.24 MB/sec
s35:~# hdparm -t /dev/hdc
/dev/hdc:
Timing buffered disk reads: 64 MB in 1.16 seconds = 55.17 MB/sec
These reported speeds are consistent with speed tests I ran on
non-LVM filesystems.
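Since hdparm's timed-read test only needs read() on a block device
(no IDE-specific ioctls), it should also work against the logical
volumes directly, which would take the filesystem out of the picture
entirely. I haven't captured numbers for these yet:
s35:~# hdparm -t /dev/sysvg/usrlocal
s35:~# hdparm -t /dev/datavg/lvol1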
- disk layout:
/dev/hda1 on / type ext3 (rw)
none on /proc type proc (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
none on /dev/shm type tmpfs (rw)
/dev/hda5 on /usr type ext3 (rw)
/dev/sysvg/usrlocal on /usr/local type ext3 (rw)
/dev/sysvg/var on /var type ext3 (rw)
REPRODUCING THE PROBLEM:
I tried to distill this down to a few simple test cases:
4 locations:
/tmp (root partition on hda)
/usr/local/logs (usrlocal volume of the sysvg volume group on hda)
/export/hdc1 (partition on hdc)
/export/hdc2 (lvol1 volume of the datavg volume group on hdc)
All locations are on an ext3 filesystem with journaling. Here are the
additional mount points:
/dev/hdc1 on /export/hdc1 type ext3 (rw)
/dev/datavg/lvol1 on /export/hdc2 type ext3 (rw)
These partitions were created with the minimal set of parameters
needed. All files are 65MB, and each location holds two copies (so
that I can mix and match without worrying too much about filesystem
caching).
I rebooted the system before the tests to make sure the filesystem
cache was empty, and I ran each test multiple times with different
files. The results below are typical for each test case.
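One caveat with this methodology: time cp only measures until cp
returns, and the written data may still be sitting in the page cache.
For the write direction the numbers can be double-checked by timing a
sync along with the copy, e.g.:
s35:~# sync; time sh -c 'cp -f /tmp/1/bi2.log.1 /export/hdc1/1 && sync'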
Test 1: ext3/hda -> ext3/hdc
s35:~# time cp -f /tmp/1/bi2.log.1 /export/hdc1/1
real 0m1.477s
user 0m0.010s
sys 0m0.530s
about 40MB/s. That's fine.
Test 2: ext3/hda -> ext3/lvm/hdc
s35:~# time cp -f /tmp/1/ni1.log.1 /export/hdc2/1
real 0m1.487s
user 0m0.010s
sys 0m0.580s
about 40MB/s. That's fine.
Test 3: ext3/hdc -> ext3/hda
s35:~# time cp -f /export/hdc1/1/ni1.log.1 /tmp/1
real 0m1.235s
user 0m0.010s
sys 0m0.760s
about 50MB/s throughput. OK.
Test 4: ext3/hdc -> ext3/lvm/hda
s35:~# time cp -f /export/hdc1/1/ni1.log.1 /usr/local/logs/
real 0m3.863s
user 0m0.010s
sys 0m0.560s
17MB/s. uhm, mediocre.
Test 5: ext3/lvm/hda -> ext3/hdc
s35:~# time cp -f /usr/local/logs/bi1.log.1 /export/hdc1/1
real 1m27.139s
user 0m0.000s
sys 0m0.610s
~800kB/s. That's just flat-out pathetic.
BUT these next runs are interesting. I did them right after the previous test run:
s35:~# time cp -f /usr/local/logs/bi2.log.1 /export/hdc1/1
real 0m3.012s
user 0m0.020s
sys 0m0.590s
s35:~# time cp -f /usr/local/logs/ni2.log.1 /export/hdc1/1
real 0m0.583s
user 0m0.000s
sys 0m0.580s
s35:~# time cp -f /usr/local/logs/ni1.log.1 /export/hdc1/1
real 0m0.583s
user 0m0.000s
sys 0m0.580s
However, this might be an anomaly caused by the particular files I
chose. It doesn't match my real-world observations on the affected
servers, and I could reproduce it only once in testing. Did the files
somehow end up in the filesystem cache?
Also, picking another file of similar size brings us right back to:
s35:~# time cp /usr/local/logs/ns4.log.1 /export/hdc1/1
real 1m25.599s
user 0m0.020s
sys 0m0.720s
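To rule out the drive quietly remapping bad sectors somewhere inside
hda3 (which could explain why only that region of the disk is slow),
I still need to check the kernel log for IDE errors or a DMA
downgrade right after one of the slow copies:
s35:~# dmesg | grep -i hda
s35:~# grep hda /var/log/messages | tail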
I suspect there is some over-aggressive read caching at work here
that is caching the wrong data, or double-caching, or that something
is inhibiting access to the LVM volume on a regular basis (e.g. the
sendmail processes mentioned earlier completed immediately when I
Ctrl-Z'ed logrotate).
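If read-ahead is involved, it should be cheap to test: the drives
report readahead = 8 above, while every LV reports 1024 read-ahead
sectors (see the lvdisplay output below), and LVM1 lets me change the
per-LV value at runtime with lvchange (see man lvchange for the
allowed range), then re-run the copy tests:
s35:~# lvdisplay /dev/sysvg/usrlocal | grep -i 'read ahead'
s35:~# lvchange -r 8 /dev/sysvg/usrlocal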
I can't figure out why only /dev/sysvg on hda would be affected by this.
I verified this behaviour on 4 different servers that were installed
from the same RH9 image, with kernels 2.4.20-20.9 and 2.4.20-30.9, so
I doubt it is hardware-specific; it seems more likely to be a matter
of how the volume group is configured.
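Next time I reproduce the slow copy I'll also leave vmstat running in
a second shell, to see from the bi/bo columns whether the disk is
actually transferring at ~800kB/s the whole time or stalling
completely between bursts:
s35:~# vmstat 1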
The only differences I am aware of between "sysvg" on hda and "datavg" on hdc:
- sysvg has three logical volumes, datavg has only one
- I created datavg myself for testing a few days ago, while sysvg has
been on the system since it was installed several months ago (see the
extent-map check below)
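Since sysvg is older and has had three volumes created on it, the
extent layout itself might differ. If I read the LVM1 man pages
right, lvdisplay -v dumps the LE-to-PE mapping, so comparing the two
should show whether usrlocal is fragmented across the PV while lvol1
is one contiguous run:
s35:~# lvdisplay -v /dev/sysvg/usrlocal | less
s35:~# lvdisplay -v /dev/datavg/lvol1 | less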
If anyone could help shed some light on this I'd be really grateful.
I'd be happy to provide more information if needed.
Thanks,
Bernhard
Here is the output from [pv,vg,lv]display:
s35:~# pvscan
pvscan -- reading all physical volumes (this may take a while...)
pvscan -- ACTIVE PV "/dev/hdc2" of VG "datavg" [37.25 GB / 17.25 GB free]
pvscan -- ACTIVE PV "/dev/hda3" of VG "sysvg" [100.07 GB / 40.50 GB free]
pvscan -- total: 2 [137.34 GB] / in use: 2 [137.34 GB] / in no VG: 0 [0]
s35:~# pvdisplay /dev/hda3
--- Physical volume ---
PV Name /dev/hda3
VG Name sysvg
PV Size 100.08 GB [209873160 secs] / NOT usable 4.25 MB [LVM: 228 KB]
PV# 1
PV Status available
Allocatable yes
Cur LV 3
PE Size (KByte) 4096
Total PE 25618
Free PE 10368
Allocated PE 15250
PV UUID 8rssk7-asXx-WB7X-gSFq-eYC0-2b6R-8lJi1i
s35:~# pvdisplay /dev/hdc2
--- Physical volume ---
PV Name /dev/hdc2
VG Name datavg
PV Size 37.26 GB [78140160 secs] / NOT usable 4.19 MB [LVM: 165 KB]
PV# 1
PV Status available
Allocatable yes
Cur LV 1
PE Size (KByte) 4096
Total PE 9537
Free PE 4417
Allocated PE 5120
PV UUID ug7pK9-PVu1-1oUI-QakH-yn12-OVbs-5j9C4i
s35:~# vgscan
vgscan -- reading all physical volumes (this may take a while...)
vgscan -- found active volume group "datavg"
vgscan -- found active volume group "sysvg"
vgscan -- "/etc/lvmtab" and "/etc/lvmtab.d" successfully created
vgscan -- WARNING: This program does not do a VGDA backup of your volume groups
s35:~# vgdisplay sysvg
--- Volume group ---
VG Name sysvg
VG Access read/write
VG Status available/resizable
VG # 0
MAX LV 256
Cur LV 3
Open LV 2
MAX LV Size 255.99 GB
Max PV 256
Cur PV 1
Act PV 1
VG Size 100.07 GB
PE Size 4 MB
Total PE 25618
Alloc PE / Size 15250 / 59.57 GB
Free PE / Size 10368 / 40.50 GB
VG UUID UnR7jD-QX7T-4UAR-xqhb-zrtQ-xxbv-0o03qY
s35:~# vgdisplay datavg
--- Volume group ---
VG Name datavg
VG Access read/write
VG Status available/resizable
VG # 1
MAX LV 256
Cur LV 1
Open LV 1
MAX LV Size 255.99 GB
Max PV 256
Cur PV 1
Act PV 1
VG Size 37.25 GB
PE Size 4 MB
Total PE 9537
Alloc PE / Size 5120 / 20 GB
Free PE / Size 4417 / 17.25 GB
VG UUID IAnx3t-0pyn-8ZPB-bh07-YJFp-X57A-mN4JRW
s35:~# lvscan
lvscan -- ACTIVE "/dev/datavg/lvol1" [20 GB]
lvscan -- ACTIVE "/dev/sysvg/freespacetouse" [7.81 GB]
lvscan -- ACTIVE "/dev/sysvg/usrlocal" [43.95 GB]
lvscan -- ACTIVE "/dev/sysvg/var" [7.81 GB]
lvscan -- 4 logical volumes with 79.57 GB total in 2 volume groups
lvscan -- 4 active logical volumes
s35:~# lvdisplay /dev/sysvg/var
--- Logical volume ---
LV Name /dev/sysvg/var
VG Name sysvg
LV Write Access read/write
LV Status available
LV # 3
# open 1
LV Size 7.81 GB
Current LE 2000
Allocated LE 2000
Allocation next free
Read ahead sectors 1024
Block device 58:2
s35:~# lvdisplay /dev/sysvg/usrlocal
--- Logical volume ---
LV Name /dev/sysvg/usrlocal
VG Name sysvg
LV Write Access read/write
LV Status available
LV # 2
# open 1
LV Size 43.95 GB
Current LE 11250
Allocated LE 11250
Allocation next free
Read ahead sectors 1024
Block device 58:1
s35:~# lvdisplay /dev/sysvg/freespacetouse
--- Logical volume ---
LV Name /dev/sysvg/freespacetouse
VG Name sysvg
LV Write Access read/write
LV Status available
LV # 1
# open 0
LV Size 7.81 GB
Current LE 2000
Allocated LE 2000
Allocation next free
Read ahead sectors 1024
Block device 58:0
s35:~# lvdisplay /dev/datavg/lvol1
--- Logical volume ---
LV Name /dev/datavg/lvol1
VG Name datavg
LV Write Access read/write
LV Status available
LV # 1
# open 1
LV Size 20 GB
Current LE 5120
Allocated LE 5120
Allocation next free
Read ahead sectors 1024
Block device 58:3