From daytooner at gmail.com  Wed Apr 24 00:14:52 2013
From: daytooner at gmail.com (Ken Bass)
Date: Tue, 23 Apr 2013 17:14:52 -0700
Subject: (LONG) Delay when writing to ext4 LVM after boot
Message-ID:

(I previously asked this question in the LVM list, and they suggested
I ask here.)

I have a large LV, about 6.5T, consisting of 4 physical drives of
various sizes. The LV is formatted as ext4. There is no RAID involved
(hardware or software).

After I first boot, if I try to write a large file (>~ 80M) to this LV,
the write hangs for about 1 minute or more, then continues at full
speed and finishes successfully. Writes of small files don't show this
delay. After that first write and its delay, all subsequent writes to
other large files proceed at full speed.

I am currently running Fedora 17 64-bit (kernel 3.8.4-102.fc17.x86_64),
but have noticed this on previous systems as well (both 64- and
32-bit). With smaller file systems (< 1T) there was a delay, but it was
small, and it increased significantly as I increased the LV size.

I have run e2fsck with the -D option (before attempting a write), which
made no difference. Also, FWIW, I am mounting with the default options;
I've tried other options that were suggested to tweak ext4, but, again,
to no effect. This LV is also not my system (root) partition - that is
on a separate physical drive.

Any ideas? Suggestions? (I will gladly supply additional info as
requested.)

TIA

ken

From adilger at dilger.ca  Wed Apr 24 15:58:21 2013
From: adilger at dilger.ca (Andreas Dilger)
Date: Wed, 24 Apr 2013 09:58:21 -0600
Subject: (LONG) Delay when writing to ext4 LVM after boot
In-Reply-To:
References:
Message-ID: <57F24075-09C7-4D65-A951-AF69CF8CA824@dilger.ca>

On 2013-04-23, at 18:14, Ken Bass wrote:
> I have a large LV, about 6.5T, consisting of 4 physical drives of
> various sizes. The LV is formatted as ext4. There is no RAID involved
> (hardware or software).
>
> After I first boot, if I try to write a large file (>~ 80M) to this
> LV, the write hangs for about 1 minute or more, then continues at
> full speed and finishes successfully. Writes of small files don't
> show this delay. After that first write and its delay, all subsequent
> writes to other large files proceed at full speed.

This is a problem that I am very familiar with on large filesystems.
The issue is that if the filesystem is relatively full, the first write
needs to load and search a lot of the block bitmaps to find enough free
space to allocate blocks for the write. Depending on how the filesystem
was formatted, each block bitmap read needs a seek.

> I am currently running Fedora 17 64-bit (kernel
> 3.8.4-102.fc17.x86_64), but have noticed this on previous systems as
> well (both 64- and 32-bit). With smaller file systems (< 1T) there
> was a delay, but it was small, and it increased significantly as I
> increased the LV size.

Might I guess that this filesystem was formatted as ext3 and not as
ext4? In particular, is the "flex_bg" option missing from the Features
line in the "dumpe2fs -h /dev/XXX" output? This feature is enabled by
default when formatting as ext4, but not as ext3.

The flex_bg feature allocates the block bitmaps in large contiguous
chunks on disk so that they can be loaded quickly at mount and e2fsck
time. On a 16TB filesystem with a 10 ms seek time, in the worst case it
could take up to 20 minutes to load all of the bitmaps without
flex_bg...
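To check, something like the following should work (with /dev/XXX
standing in for your actual LV device node):

    dumpe2fs -h /dev/XXX | grep 'Filesystem features'

If "flex_bg" is missing from that line, the bitmaps are scattered
across the whole device, and each one can need its own seek.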
> I have run e2fsck with the -D option (before attempting a write),
> which made no difference. Also, FWIW, I am mounting with the default
> options; I've tried other options that were suggested to tweak ext4,
> but, again, to no effect. This LV is also not my system (root)
> partition - that is on a separate physical drive.
>
> Any ideas? Suggestions?

Unfortunately, flex_bg is a format-time option, so you would need a
full backup and restore to benefit from it on your filesystem.

If there is a delay between mounting and the first write, you could
prefetch the bitmaps with "dumpe2fs /dev/XXX > /dev/null" so that they
are all loaded before they are needed. Some people do this in a startup
script as a workaround for the initial write slowness.
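A minimal sketch of such a startup script (untested, and /dev/VG/LV is
just a placeholder for your actual LV device):

    #!/bin/sh
    # Prefetch the ext4 group bitmaps by letting dumpe2fs walk all of
    # the group descriptor and bitmap metadata; the output is discarded.
    # Run it in the background ("&") so boot is not delayed. It is safe
    # to run before or while the filesystem is mounted, since it only
    # reads from the device.
    dumpe2fs /dev/VG/LV > /dev/null 2>&1 &

The first large write after boot should then find the bitmaps already
in the buffer cache.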
Changing the allocation policy would not help in your case, since the
large file needs more blocks than can be satisfied by the early groups.
That is why you don't see a slowdown for small files.

In theory, it would be possible to modify resize2fs to co-locate the
bitmaps on disk to enable flex_bg, in the same manner as it currently
moves the inode table to add group descriptor blocks, but that would
need some non-trivial development.

Cheers, Andreas

From adilger at dilger.ca  Fri Apr 26 00:10:21 2013
From: adilger at dilger.ca (Andreas Dilger)
Date: Thu, 25 Apr 2013 18:10:21 -0600
Subject: (LONG) Delay when writing to ext4 LVM after boot
In-Reply-To:
References: <57F24075-09C7-4D65-A951-AF69CF8CA824@dilger.ca>
Message-ID: <536ADCB2-8421-415F-9527-69A9BCD8CC87@dilger.ca>

On 2013-04-25, at 9:06 AM, Ken Bass wrote:
> First, thanks. I think you answered my question. You get your gold
> star for the day :-).
>
> The LV fs IS formatted as ext4. There are many files of various
> lengths: at least 102,959 items, totalling 3.6 TB. Most are either
> large files (> 1G), videos and the like, or small files such as
> images or PDFs.
>
> FYI:
> [root at elmer ken]# time dumpe2fs -h /dev/mapper/VG_NAS-LV_NAS
> dumpe2fs 1.42.3 (14-May-2012)
> Filesystem volume name:
> Last mounted on:          /nas
> Filesystem UUID:          fe2db0bc-5e41-4c40-99f0-29e7771090b9
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal ext_attr resize_inode dir_index
>     filetype needs_recovery extent flex_bg sparse_super large_file
>     huge_file uninit_bg dir_nlink extra_isize
> Filesystem flags:         signed_directory_hash
> Filesystem OS type:       Linux
> Inode count:              366288896
> Block count:              1465133056
> Reserved block count:     73240735
> Free blocks:              571585703
> Free inodes:              366185906
> Block size:               4096
> Reserved GDT blocks:      674
> Blocks per group:         32768
> Fragments per group:      32768
> Flex block group size:    16

This flex_bg size is only 16 block groups (which is the default), so at
best it reduces the number of first-access seeks by a factor of 16. For
your block count this works out to about 45k groups, and about 2800
clusters of bitmaps; at roughly 10 ms per seek, reading those clusters
one at a time would account for something like 30 seconds by itself.
While e2fsck will load all of those bitmaps in a single pass, it is
possible that the kernel does not do this, fetching each one separately
and incurring a seek wait for each (if it is not already in the track
cache).

You definitely have a large number of free blocks, so the allocator
_should_ be able to find something quickly, but there currently isn't a
"fast path" for finding a large chunk of free space right after mount.
I suspect this could be worked around in some manner. We could
potentially cache some hints on disk and/or skip a lot of the bitmap
loads at startup time when the allocation is large.

> [root at elmer ken]#
> [root at elmer ken]# time dumpe2fs /dev/mapper/VG_NAS-LV_NAS > /dev/null
> dumpe2fs 1.42.3 (14-May-2012)
>
> real    1m16.634s
> user    1m16.174s
> sys     0m0.066s
> [root at elmer ken]#
>
> So, there's the LONG delay (1.25m+).
>
> As a s/w engineer in a previous lifetime, I would think there must be
> some metadata (on the disk) that keeps a coalesced mapping of free
> blocks. This would be analogous to a RAM heap management system.
> Granted, most file server systems don't get powered down or rebooted
> very often, but for very large storage (10G plus) this must be a very
> significant issue. Just my 2 cents worth.

By "large" you mean "10T plus"? :-)

> You mentioned doing this at startup. Could this be done in the
> background, so the boot wouldn't take a minute or more extra? And
> also, any particular place to add this code?

It can be done from any system startup file like /etc/rc.d/rc.local,
and can be run in the background either before or while the filesystem
is mounted, since it is only reading from the disk.

> On Wed, Apr 24, 2013 at 8:58 AM, Andreas Dilger wrote:
>> If there is a delay between mounting and the first write, you could
>> prefetch the bitmaps with "dumpe2fs /dev/XXX > /dev/null" so that
>> they are all loaded before they are needed. Some people do this in a
>> startup script as a workaround for the initial write slowness.
>>
>> Changing the allocation policy would not help in your case, since
>> the large file needs more blocks than can be satisfied by the early
>> groups. That is why you don't see a slowdown for small files.

Cheers, Andreas