From viji at fedoraproject.org  Sat Oct 17 06:52:48 2009
From: viji at fedoraproject.org (Viji V Nair)
Date: Sat, 17 Oct 2009 12:22:48 +0530
Subject: optimising filesystem for many small files
Message-ID: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>

Hi,

System : Fedora 11 x86_64
Current Filesystem: 150G ext4 (formatted with "-T small" option)
Number of files: 50 Million, 1 to 30K png images

We are generating these files using a python programme and getting very slow
IO performance. While generation there in only write, no read. After
generation there is heavy read and no write.

I am looking for best practices/recommendation to get a better performance.

Any suggestions of the above are greatly appreciated.

Viji
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20091017/5c84b756/attachment.htm>

From sandeen at redhat.com  Sat Oct 17 14:32:57 2009
From: sandeen at redhat.com (Eric Sandeen)
Date: Sat, 17 Oct 2009 09:32:57 -0500
Subject: optimising filesystem for many small files
In-Reply-To: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
Message-ID: <4AD9D599.3000306@redhat.com>

Viji V Nair wrote:
> Hi,
> 
> System : Fedora 11 x86_64
> Current Filesystem: 150G ext4 (formatted with "-T small" option)
> Number of files: 50 Million, 1 to 30K png images
> 
> We are generating these files using a python programme and getting very 
> slow IO performance. While generation there in only write, no read. 
> After generation there is heavy read and no write.
> 
> I am looking for best practices/recommendation to get a better performance.
> 
> Any suggestions of the above are greatly appreciated.
> 
> Viji
> 

I would start with using blktrace and/or seekwatcher to see what your IO 
patterns look like when you're populating the disk; I would guess that 
you're seeing IO scattered all over.

How you are placing the files in subdirectories will affect this quite a 
lot; sitting in 1 directory for a while, filling with images, before 
moving on to the next directory, will probably help.  Putting each new 
file in a new subdirectory will probably give very bad results.

-Eric



From viji at fedoraproject.org  Sat Oct 17 17:56:04 2009
From: viji at fedoraproject.org (Viji V Nair)
Date: Sat, 17 Oct 2009 23:26:04 +0530
Subject: optimising filesystem for many small files
In-Reply-To: <4AD9D599.3000306@redhat.com>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
Message-ID: <84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>

these files are not in a single directory, this is a pyramid
structure. There are total 15 pyramids and coming down from top to
bottom the sub directories and files  are multiplied by a factor of 4.

The IO is scattered all over!!!! and this is a single disk file system.

Since the python application is creating files, it is creating
multiple files to multiple sub directories at a time.

On Sat, Oct 17, 2009 at 8:02 PM, Eric Sandeen <sandeen at redhat.com> wrote:
> Viji V Nair wrote:
>>
>> Hi,
>>
>> System : Fedora 11 x86_64
>> Current Filesystem: 150G ext4 (formatted with "-T small" option)
>> Number of files: 50 Million, 1 to 30K png images
>>
>> We are generating these files using a python programme and getting very
>> slow IO performance. While generation there in only write, no read. After
>> generation there is heavy read and no write.
>>
>> I am looking for best practices/recommendation to get a better
>> performance.
>>
>> Any suggestions of the above are greatly appreciated.
>>
>> Viji
>>
>
> I would start with using blktrace and/or seekwatcher to see what your IO
> patterns look like when you're populating the disk; I would guess that
> you're seeing IO scattered all over.
>
> How you are placing the files in subdirectories will affect this quite a
> lot; sitting in 1 directory for a while, filling with images, before moving
> on to the next directory, will probably help. ?Putting each new file in a
> new subdirectory will probably give very bad results.
>
> -Eric
>



From kshelby at optonline.net  Sat Oct 17 20:35:45 2009
From: kshelby at optonline.net (Ken Shelby)
Date: Sat, 17 Oct 2009 16:35:45 -0400
Subject: optimising filesystem for many small files
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
Message-ID: <3DCC493E70044ACD92E8FBE3B1548EBC@toy2k4>

IMHO, software tuning will only yield incremental improvments.  I suggest 
that you throw more and better hardware at the problem.  And, as always, 
YMMV.

- Ken



----- Original Message ----- 
From: "Viji V Nair" <viji at fedoraproject.org>
To: "Eric Sandeen" <sandeen at redhat.com>
Cc: <linux-ext4 at vger.kernel.org>; <ext3-users at redhat.com>
Sent: Saturday, October 17, 2009 1:56 PM
Subject: Re: optimising filesystem for many small files


> these files are not in a single directory, this is a pyramid
> structure. There are total 15 pyramids and coming down from top to
> bottom the sub directories and files  are multiplied by a factor of 4.
>
> The IO is scattered all over!!!! and this is a single disk file system.
>
> Since the python application is creating files, it is creating
> multiple files to multiple sub directories at a time.
>
> On Sat, Oct 17, 2009 at 8:02 PM, Eric Sandeen <sandeen at redhat.com> wrote:
>> Viji V Nair wrote:
>>>
>>> Hi,
>>>
>>> System : Fedora 11 x86_64
>>> Current Filesystem: 150G ext4 (formatted with "-T small" option)
>>> Number of files: 50 Million, 1 to 30K png images
>>>
>>> We are generating these files using a python programme and getting very
>>> slow IO performance. While generation there in only write, no read. 
>>> After
>>> generation there is heavy read and no write.
>>>
>>> I am looking for best practices/recommendation to get a better
>>> performance.
>>>
>>> Any suggestions of the above are greatly appreciated.
>>>
>>> Viji
>>>
>>
>> I would start with using blktrace and/or seekwatcher to see what your IO
>> patterns look like when you're populating the disk; I would guess that
>> you're seeing IO scattered all over.
>>
>> How you are placing the files in subdirectories will affect this quite a
>> lot; sitting in 1 directory for a while, filling with images, before 
>> moving
>> on to the next directory, will probably help. Putting each new file in a
>> new subdirectory will probably give very bad results.
>>
>> -Eric
>>
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>
> __________ Information from ESET NOD32 Antivirus, version of virus 
> signature database 4518 (20091017) __________
>
> The message was checked by ESET NOD32 Antivirus.
>
> http://www.eset.com
>
>
> 



From tytso at mit.edu  Sat Oct 17 22:26:19 2009
From: tytso at mit.edu (Theodore Tso)
Date: Sat, 17 Oct 2009 18:26:19 -0400
Subject: optimising filesystem for many small files
In-Reply-To: <84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
Message-ID: <20091017222619.GA10074@mit.edu>

On Sat, Oct 17, 2009 at 11:26:04PM +0530, Viji V Nair wrote:
> these files are not in a single directory, this is a pyramid
> structure. There are total 15 pyramids and coming down from top to
> bottom the sub directories and files  are multiplied by a factor of 4.
> 
> The IO is scattered all over!!!! and this is a single disk file system.
> 
> Since the python application is creating files, it is creating
> multiple files to multiple sub directories at a time.

What is the application trying to do, at a high level?  Sometimes it's
not possible to optimize a filesystem against a badly designed
application.  :-(

It sounds like it is generating files distributed in subdirectories in
a completely random order.  How are the files going to be read
afterwards?  In the order they were created, or some other order
different from the order in which they were read?

With a sufficiently bad access patterns, there may not be a lot you
can do, other than (a) throw hardware at the problem, or (b) fix or
redesign the application to be more intelligent (if possible).

	     		       	    		    - Ted



From viji at fedoraproject.org  Sun Oct 18 09:31:46 2009
From: viji at fedoraproject.org (Viji V Nair)
Date: Sun, 18 Oct 2009 15:01:46 +0530
Subject: optimising filesystem for many small files
In-Reply-To: <20091017222619.GA10074@mit.edu>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
	<20091017222619.GA10074@mit.edu>
Message-ID: <84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>

On Sun, Oct 18, 2009 at 3:56 AM, Theodore Tso <tytso at mit.edu> wrote:
> On Sat, Oct 17, 2009 at 11:26:04PM +0530, Viji V Nair wrote:
>> these files are not in a single directory, this is a pyramid
>> structure. There are total 15 pyramids and coming down from top to
>> bottom the sub directories and files ?are multiplied by a factor of 4.
>>
>> The IO is scattered all over!!!! and this is a single disk file system.
>>
>> Since the python application is creating files, it is creating
>> multiple files to multiple sub directories at a time.
>
> What is the application trying to do, at a high level? ?Sometimes it's
> not possible to optimize a filesystem against a badly designed
> application. ?:-(

The application is reading the gis data from a data source and
plotting the map tiles (256x256, png images) for different zoom
levels. The tree output of the first zoom level is as follows

/tiles/00
`-- 000
    `-- 000
        |-- 000
        |   `-- 000
        |       `-- 000
        |           |-- 000.png
        |           `-- 001.png
        |-- 001
        |   `-- 000
        |       `-- 000
        |           |-- 000.png
        |           `-- 001.png
        `-- 002
            `-- 000
                `-- 000
                    |-- 000.png
                    `-- 001.png

in each zoom level the fourth level directories are multiplied by a
factor of four. Also the number of png images are multiplied by the
same number.
>
> It sounds like it is generating files distributed in subdirectories in
> a completely random order. ?How are the files going to be read
> afterwards? ?In the order they were created, or some other order
> different from the order in which they were read?

The application which we are using are modified versions of mapnik and
tilecache, these are single threaded so we are running 4 process at a
time. We can say only four images are created at a single point of
time. Some times a single image is taking around 20 sec to create. I
can see lots of system resources are free, memory, processors etc
(these are 4G, 2 x 5420 XEON)

I have checked the delay in the backend data source, it is on a 12Gbps
LAN and no delay at all.

These images are also read in the same manner.

>
> With a sufficiently bad access patterns, there may not be a lot you
> can do, other than (a) throw hardware at the problem, or (b) fix or
> redesign the application to be more intelligent (if possible).
>
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?- Ted
>

The file system is crated with "-i 1024 -b 1024" for larger inode
number, 50% of the total images are less than 10KB. I have disabled
access time and given a large value to the commit also. Do you have
any other recommendation of the file system creation?

Viji



From jburgess777 at googlemail.com  Sun Oct 18 11:25:10 2009
From: jburgess777 at googlemail.com (Jon Burgess)
Date: Sun, 18 Oct 2009 12:25:10 +0100
Subject: optimising filesystem for many small files
In-Reply-To: <84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
	<20091017222619.GA10074@mit.edu>
	<84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
Message-ID: <1255865110.27380.52.camel@localhost.localdomain>

On Sun, 2009-10-18 at 15:01 +0530, Viji V Nair wrote:
> The application which we are using are modified versions of mapnik and
> tilecache, these are single threaded so we are running 4 process at a
> time. 

If your tiles use the OpenStreetMap / Google style 900913 projection
then you could consider using mod_tile[1]. This renders and stores each
block of 8 x 8 PNG map tiles inside a single file on the disk. This
makes the storage and access much more efficient. It cuts down the
number of files on the disk by 64 and allows nearby tiles to be read
from a single file.

	Jon

1: http://wiki.openstreetmap.org/wiki/Mod_tile




From mnalis-ml at voyager.hr  Sun Oct 18 11:41:00 2009
From: mnalis-ml at voyager.hr (Matija Nalis)
Date: Sun, 18 Oct 2009 13:41:00 +0200
Subject: optimising filesystem for many small files
In-Reply-To: <84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
	<20091017222619.GA10074@mit.edu>
	<84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
Message-ID: <20091018114100.GA26721@eagle102.home.lan>

On Sun, Oct 18, 2009 at 03:01:46PM +0530, Viji V Nair wrote:
> The application which we are using are modified versions of mapnik and
> tilecache, these are single threaded so we are running 4 process at a

How does it scale if you reduce the number or processes - especially if you
run just one of those ? As this is just a single disk, 4 simultaneous
readers/writers would probably *totally* kill it with seeks.

I suspect it might even run faster with just 1 process then with 4 of
them...

> time. We can say only four images are created at a single point of
> time. Some times a single image is taking around 20 sec to create. I

is that 20 secs just the write time for an precomputed file of 10k ? 
Or does it also include reading and processing and writing ?

> can see lots of system resources are free, memory, processors etc
> (these are 4G, 2 x 5420 XEON)

I do not see how the "lots of memory" could be free, especially with such a
large number of inodes. dentry and inode cache alone should consume those
pretty fast as the number of files grow, not to mention (dirty and
otherwise) buffers...

You may want to tune following sysctls to allow more stuff to remain in
write-back cache (but then again, you will probably need more memory):

vm.vfs_cache_pressure
vm.dirty_writeback_centisecs
vm.dirty_expire_centisecs
vm.dirty_background_ratio
vm.dirty_ratio


> The file system is crated with "-i 1024 -b 1024" for larger inode
> number, 50% of the total images are less than 10KB. I have disabled
> access time and given a large value to the commit also. Do you have
> any other recommendation of the file system creation?

for ext3, larger journal on external journal device (if that is an option)
should probably help, as it would reduce some of the seeks which are most
probably slowing this down immensely.


If you can modify hardware setup, RAID10 (better with many smaller disks
than with fewer bigger ones) should help *very* much. Flash-disk-thingies of
appropriate size are even better option (as the seek issues are few orders
of magnitude smaller problem). Also probably more RAM (unless you full
dataset is much smaller than 2 GB, which I doubt). 

On the other hand, have you tried testing some other filesystems ? 
I've had much better performance with lots of small files of XFS (but that
was on big RAID5, so YMMV), for example.

-- 
Opinions above are GNU-copylefted.



From viji at fedoraproject.org  Sun Oct 18 12:51:53 2009
From: viji at fedoraproject.org (Viji V Nair)
Date: Sun, 18 Oct 2009 18:21:53 +0530
Subject: optimising filesystem for many small files
In-Reply-To: <1255865110.27380.52.camel@localhost.localdomain>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
	<20091017222619.GA10074@mit.edu>
	<84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
	<1255865110.27380.52.camel@localhost.localdomain>
Message-ID: <84c89ac10910180551j123e94a7ud5dce240619ab99@mail.gmail.com>

On Sun, Oct 18, 2009 at 4:55 PM, Jon Burgess <jburgess777 at googlemail.com> wrote:
> On Sun, 2009-10-18 at 15:01 +0530, Viji V Nair wrote:
>> The application which we are using are modified versions of mapnik and
>> tilecache, these are single threaded so we are running 4 process at a
>> time.
>
> If your tiles use the OpenStreetMap / Google style 900913 projection
> then you could consider using mod_tile[1]. This renders and stores each
> block of 8 x 8 PNG map tiles inside a single file on the disk. This
> makes the storage and access much more efficient. It cuts down the
> number of files on the disk by 64 and allows nearby tiles to be read
> from a single file.
>
> ? ? ? ?Jon
>
> 1: http://wiki.openstreetmap.org/wiki/Mod_tile

we are using our own data set, not google map or openstreet map. Since
we are using mapnik layer I will surely give a try with this.
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>



From viji at fedoraproject.org  Sun Oct 18 13:08:06 2009
From: viji at fedoraproject.org (Viji V Nair)
Date: Sun, 18 Oct 2009 18:38:06 +0530
Subject: Fwd: optimising filesystem for many small files
In-Reply-To: <20091018114100.GA26721@eagle102.home.lan>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
	<20091017222619.GA10074@mit.edu>
	<84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
	<20091018114100.GA26721@eagle102.home.lan>
Message-ID: <84c89ac10910180608v76caf7f5y4837ccaf6a66a594@mail.gmail.com>

---------- Forwarded message ----------
From: Matija Nalis <mnalis-ml at voyager.hr>
Date: Sun, Oct 18, 2009 at 5:11 PM
Subject: Re: optimising filesystem for many small files
To: Viji V Nair <viji at fedoraproject.org>
Cc: linux-ext4 at vger.kernel.org, ext3-users at redhat.com


On Sun, Oct 18, 2009 at 03:01:46PM +0530, Viji V Nair wrote:
> The application which we are using are modified versions of mapnik and
> tilecache, these are single threaded so we are running 4 process at a

How does it scale if you reduce the number or processes - especially if you
run just one of those ? As this is just a single disk, 4 simultaneous
readers/writers would probably *totally* kill it with seeks.

I suspect it might even run faster with just 1 process then with 4 of
them...

with one process it is giving me 6 seconds

> time. We can say only four images are created at a single point of
> time. Some times a single image is taking around 20 sec to create. I

is that 20 secs just the write time for an precomputed file of 10k ?
Or does it also include reading and processing and writing ?

this include processing and writing

> can see lots of system resources are free, memory, processors etc
> (these are 4G, 2 x 5420 XEON)

I do not see how the "lots of memory" could be free, especially with such a
large number of inodes. dentry and inode cache alone should consume those
pretty fast as the number of files grow, not to mention (dirty and
otherwise) buffers...

[root at test ~]# free
             total       used       free     shared    buffers     cached
Mem:       4011956    3100900     911056          0     550576    1663656
-/+ buffers/cache:     886668    3125288
Swap:      4095992          0    4095992

[root at test ~]# cat /proc/meminfo
MemTotal:        4011956 kB
MemFree:          907968 kB
Buffers:          550016 kB
Cached:          1668984 kB
SwapCached:            0 kB
Active:          1084492 kB
Inactive:        1154608 kB
Active(anon):       5100 kB
Inactive(anon):    15148 kB
Active(file):    1079392 kB
Inactive(file):  1139460 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       4095992 kB
SwapFree:        4095992 kB
Dirty:              7088 kB
Writeback:             0 kB
AnonPages:         19908 kB
Mapped:             6476 kB
Slab:             813968 kB
SReclaimable:     796868 kB
SUnreclaim:        17100 kB
PageTables:         4376 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6101968 kB
Committed_AS:      99748 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      290308 kB
VmallocChunk:   34359432003 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        8192 kB
DirectMap2M:     4182016 kB

You may want to tune following sysctls to allow more stuff to remain in
write-back cache (but then again, you will probably need more memory):

vm.vfs_cache_pressure
vm.dirty_writeback_centisecs
vm.dirty_expire_centisecs
vm.dirty_background_ratio
vm.dirty_ratio

I will give a try.


> The file system is crated with "-i 1024 -b 1024" for larger inode
> number, 50% of the total images are less than 10KB. I have disabled
> access time and given a large value to the commit also. Do you have
> any other recommendation of the file system creation?

for ext3, larger journal on external journal device (if that is an option)
should probably help, as it would reduce some of the seeks which are most
probably slowing this down immensely.


If you can modify hardware setup, RAID10 (better with many smaller disks
than with fewer bigger ones) should help *very* much. Flash-disk-thingies of
appropriate size are even better option (as the seek issues are few orders
of magnitude smaller problem). Also probably more RAM (unless you full
dataset is much smaller than 2 GB, which I doubt).

On the other hand, have you tried testing some other filesystems ?
I've had much better performance with lots of small files of XFS (but that
was on big RAID5, so YMMV), for example.

I have not tried XFS, but tried reiserfs. I could not see a large
difference when compared with mkfs.ext4 -T small. I could see that
reiser is giving better performance on overwrite, not on new writes.
some times we overwrite existing image with new ones.

Now the total files are 50Million, soon (with in an year) it will grow
to 1 Billion. I know that we should move ahead with the hardware
upgrades, also files system access is a large concern for us. There
images are accessed over the internet and expecting a 100 million
visits every month. For each user we need to transfer at least 3Mb of
data.
--
Opinions above are GNU-copylefted.



From viji at fedoraproject.org  Sun Oct 18 13:14:40 2009
From: viji at fedoraproject.org (Viji V Nair)
Date: Sun, 18 Oct 2009 18:44:40 +0530
Subject: optimising filesystem for many small files
In-Reply-To: <20091018114100.GA26721@eagle102.home.lan>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
	<20091017222619.GA10074@mit.edu>
	<84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
	<20091018114100.GA26721@eagle102.home.lan>
Message-ID: <84c89ac10910180614l5d2d476ehb91d210820761039@mail.gmail.com>

On Sun, Oct 18, 2009 at 5:11 PM, Matija Nalis <mnalis-ml at voyager.hr> wrote:
> On Sun, Oct 18, 2009 at 03:01:46PM +0530, Viji V Nair wrote:
>> The application which we are using are modified versions of mapnik and
>> tilecache, these are single threaded so we are running 4 process at a
>
> How does it scale if you reduce the number or processes - especially if you
> run just one of those ? As this is just a single disk, 4 simultaneous
> readers/writers would probably *totally* kill it with seeks.
>
> I suspect it might even run faster with just 1 process then with 4 of
> them...

with one process it is giving me 6 seconds

>
>> time. We can say only four images are created at a single point of
>> time. Some times a single image is taking around 20 sec to create. I
>
> is that 20 secs just the write time for an precomputed file of 10k ?
> Or does it also include reading and processing and writing ?

this include processing and writing

>
>> can see lots of system resources are free, memory, processors etc
>> (these are 4G, 2 x 5420 XEON)
>
> I do not see how the "lots of memory" could be free, especially with such a
> large number of inodes. dentry and inode cache alone should consume those
> pretty fast as the number of files grow, not to mention (dirty and
> otherwise) buffers...

[root test ~]# free
             total       used       free     shared    buffers     cached
Mem:       4011956    3100900     911056          0     550576    1663656
-/+ buffers/cache:     886668    3125288
Swap:      4095992          0    4095992

[root test ~]# cat /proc/meminfo
MemTotal:        4011956 kB
MemFree:          907968 kB
Buffers:          550016 kB
Cached:          1668984 kB
SwapCached:            0 kB
Active:          1084492 kB
Inactive:        1154608 kB
Active(anon):       5100 kB
Inactive(anon):    15148 kB
Active(file):    1079392 kB
Inactive(file):  1139460 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       4095992 kB
SwapFree:        4095992 kB
Dirty:              7088 kB
Writeback:             0 kB
AnonPages:         19908 kB
Mapped:             6476 kB
Slab:             813968 kB
SReclaimable:     796868 kB
SUnreclaim:        17100 kB
PageTables:         4376 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6101968 kB
Committed_AS:      99748 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      290308 kB
VmallocChunk:   34359432003 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        8192 kB
DirectMap2M:     4182016 kB


>
> You may want to tune following sysctls to allow more stuff to remain in
> write-back cache (but then again, you will probably need more memory):
>
> vm.vfs_cache_pressure
> vm.dirty_writeback_centisecs
> vm.dirty_expire_centisecs
> vm.dirty_background_ratio
> vm.dirty_ratio
>

I will give a try.

>
>> The file system is crated with "-i 1024 -b 1024" for larger inode
>> number, 50% of the total images are less than 10KB. I have disabled
>> access time and given a large value to the commit also. Do you have
>> any other recommendation of the file system creation?
>
> for ext3, larger journal on external journal device (if that is an option)
> should probably help, as it would reduce some of the seeks which are most
> probably slowing this down immensely.
>
>
> If you can modify hardware setup, RAID10 (better with many smaller disks
> than with fewer bigger ones) should help *very* much. Flash-disk-thingies of
> appropriate size are even better option (as the seek issues are few orders
> of magnitude smaller problem). Also probably more RAM (unless you full
> dataset is much smaller than 2 GB, which I doubt).
>
> On the other hand, have you tried testing some other filesystems ?
> I've had much better performance with lots of small files of XFS (but that
> was on big RAID5, so YMMV), for example.
>
> --
> Opinions above are GNU-copylefted.
>

I have not tried XFS, but tried reiserfs. I could not see a large
difference when compared with mkfs.ext4 -T small. I could see that
reiser is giving better performance on overwrite, not on new writes.
some times we overwrite existing image with new ones.

Now the total files are 50Million, soon (with in an year) it will grow
to 1 Billion. I know that we should move ahead with the hardware
upgrades, also files system access is a large concern for us. There
images are accessed over the internet and expecting a 100 million
visits every month. For each user we need to transfer at least 3Mb of
data.



From jburgess777 at googlemail.com  Sun Oct 18 15:07:37 2009
From: jburgess777 at googlemail.com (Jon Burgess)
Date: Sun, 18 Oct 2009 16:07:37 +0100
Subject: optimising filesystem for many small files
In-Reply-To: <84c89ac10910180614l5d2d476ehb91d210820761039@mail.gmail.com>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
	<20091017222619.GA10074@mit.edu>
	<84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
	<20091018114100.GA26721@eagle102.home.lan>
	<84c89ac10910180614l5d2d476ehb91d210820761039@mail.gmail.com>
Message-ID: <1255878457.27380.138.camel@localhost.localdomain>

On Sun, 2009-10-18 at 18:44 +0530, Viji V Nair wrote:
> On Sun, Oct 18, 2009 at 5:11 PM, Matija Nalis <mnalis-ml at voyager.hr> wrote:
> > On Sun, Oct 18, 2009 at 03:01:46PM +0530, Viji V Nair wrote:
> >> The application which we are using are modified versions of mapnik and
> >> tilecache, these are single threaded so we are running 4 process at a
> >
> > How does it scale if you reduce the number or processes - especially if you
> > run just one of those ? As this is just a single disk, 4 simultaneous
> > readers/writers would probably *totally* kill it with seeks.
> >
> > I suspect it might even run faster with just 1 process then with 4 of
> > them...
> 
> with one process it is giving me 6 seconds

That seems a little slow. Have you looked in optimising your mapnik
setup? The mapnik-users list or IRC channel is a good place to ask[1].

For comparison, the OpenStreetMap tile server typically renders a 8x8
block of 64 tiles in about 1 second, although the time varies greatly
depending on the amount of data within the tiles.

> >
> >> time. We can say only four images are created at a single point of
> >> time. Some times a single image is taking around 20 sec to create. I
> >
> > is that 20 secs just the write time for an precomputed file of 10k ?
> > Or does it also include reading and processing and writing ?
> 
> this include processing and writing
> 
> >
> >> can see lots of system resources are free, memory, processors etc
> >> (these are 4G, 2 x 5420 XEON)

4GB may be a little small. Have you checked whether the IO reading your
data sources is the bottleneck?

> > If you can modify hardware setup, RAID10 (better with many smaller disks
> > than with fewer bigger ones) should help *very* much. Flash-disk-thingies of
> > appropriate size are even better option (as the seek issues are few orders
> > of magnitude smaller problem). Also probably more RAM (unless you full
> > dataset is much smaller than 2 GB, which I doubt).
> >
> > On the other hand, have you tried testing some other filesystems ?
> > I've had much better performance with lots of small files of XFS (but that
> > was on big RAID5, so YMMV), for example.
> >
> > --
> > Opinions above are GNU-copylefted.
> >
> 
> I have not tried XFS, but tried reiserfs. I could not see a large
> difference when compared with mkfs.ext4 -T small. I could see that
> reiser is giving better performance on overwrite, not on new writes.
> some times we overwrite existing image with new ones.
> 
> Now the total files are 50Million, soon (with in an year) it will grow
> to 1 Billion. I know that we should move ahead with the hardware
> upgrades, also files system access is a large concern for us. There
> images are accessed over the internet and expecting a 100 million
> visits every month. For each user we need to transfer at least 3Mb of
> data.

Serving 3MB is about 1000 tiles. This is a total of 100M * 1000 = 1e11
tiles/month or about 40,000 requests per second. If every request needed
an IO from a hard disk managing 100 IOPs then you would need about 400
disks. Having a decent amount of RAM should dramatically cut the number
of request reaching the disks. Alternatively you might be able to do
this all with just a few SSDs. The Intel X25-E is rated at >35,000 IOPs
for random 4kB reads[2].

I can give you some performance numbers about the OSM server for
comparision: At last count the OSM tile server had 568M tiles cached
using about 500GB of disk space[3]. The hardware is described on the
wiki[4]. It regularly serves 500+ tiles per second @ 50Mbps[5]. This is
about 40 million HTTP requests per day and several TB of traffic per
month.

	Jon


1: http://trac.mapnik.org/
2: http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf
3: http://wiki.openstreetmap.org/wiki/Tile_Disk_Usage
4: http://wiki.openstreetmap.org/wiki/Servers/yevaud
5: http://munin.openstreetmap.org/openstreetmap/yevaud.openstreetmap.html





From sandeen at redhat.com  Sun Oct 18 15:34:19 2009
From: sandeen at redhat.com (Eric Sandeen)
Date: Sun, 18 Oct 2009 10:34:19 -0500
Subject: optimising filesystem for many small files
In-Reply-To: <84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>	
	<4AD9D599.3000306@redhat.com>	
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>	
	<20091017222619.GA10074@mit.edu>
	<84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
Message-ID: <4ADB357B.4030008@redhat.com>

Viji V Nair wrote:
> On Sun, Oct 18, 2009 at 3:56 AM, Theodore Tso <tytso at mit.edu> wrote:
>> On Sat, Oct 17, 2009 at 11:26:04PM +0530, Viji V Nair wrote:
>>> these files are not in a single directory, this is a pyramid
>>> structure. There are total 15 pyramids and coming down from top to
>>> bottom the sub directories and files  are multiplied by a factor of 4.
>>>
>>> The IO is scattered all over!!!! and this is a single disk file system.
>>>
>>> Since the python application is creating files, it is creating
>>> multiple files to multiple sub directories at a time.
>> What is the application trying to do, at a high level?  Sometimes it's
>> not possible to optimize a filesystem against a badly designed
>> application.  :-(
> 
> The application is reading the gis data from a data source and
> plotting the map tiles (256x256, png images) for different zoom
> levels. The tree output of the first zoom level is as follows
> 
> /tiles/00
> `-- 000
>     `-- 000
>         |-- 000
>         |   `-- 000
>         |       `-- 000
>         |           |-- 000.png
>         |           `-- 001.png
>         |-- 001
>         |   `-- 000
>         |       `-- 000
>         |           |-- 000.png
>         |           `-- 001.png
>         `-- 002
>             `-- 000
>                 `-- 000
>                     |-- 000.png
>                     `-- 001.png
> 
> in each zoom level the fourth level directories are multiplied by a
> factor of four. Also the number of png images are multiplied by the
> same number.
>> It sounds like it is generating files distributed in subdirectories in
>> a completely random order.  How are the files going to be read
>> afterwards?  In the order they were created, or some other order
>> different from the order in which they were read?
> 
> The application which we are using are modified versions of mapnik and
> tilecache, these are single threaded so we are running 4 process at a
> time. We can say only four images are created at a single point of
> time. Some times a single image is taking around 20 sec to create. I
> can see lots of system resources are free, memory, processors etc
> (these are 4G, 2 x 5420 XEON)
> 
> I have checked the delay in the backend data source, it is on a 12Gbps
> LAN and no delay at all.

The delays are almost certainly due to the drive heads seeking like mad 
as they attempt to write data all over the disk; most filesystems are 
designed so that files in subdirectories are kept together, and new 
subdirectories are placed at relatively distant locations to make room 
for the files they will contain.

In the past I've seen similar applications also slow down due to new 
inode searching heuristics in the inode allocator, but that was on ext3 
and ext4 is significantly different in that regard...

> These images are also read in the same manner.
> 
>> With a sufficiently bad access patterns, there may not be a lot you
>> can do, other than (a) throw hardware at the problem, or (b) fix or
>> redesign the application to be more intelligent (if possible).
>>
>>                                                    - Ted
>>
> 
> The file system is crated with "-i 1024 -b 1024" for larger inode
> number, 50% of the total images are less than 10KB. I have disabled
> access time and given a large value to the commit also. Do you have
> any other recommendation of the file system creation?

I think you'd do better to change, if possible, how the application behaves.

I probably don't know enough about the app but rather than:

/tiles/00
`-- 000
     `-- 000
         |-- 000
         |   `-- 000
         |       `-- 000
         |           |-- 000.png
         |           `-- 001.png

could it do:

/tiles/00/000000000000000000.png
/tiles/00/000000000000000001.png

...

for example?  (or something similar)

-Eric

> Viji



From viji at fedoraproject.org  Sun Oct 18 16:10:28 2009
From: viji at fedoraproject.org (Viji V Nair)
Date: Sun, 18 Oct 2009 21:40:28 +0530
Subject: optimising filesystem for many small files
In-Reply-To: <19163.8956.421374.35494@tree.ty.sabi.co.uk>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
	<20091017222619.GA10074@mit.edu>
	<84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
	<19163.8956.421374.35494@tree.ty.sabi.co.uk>
Message-ID: <84c89ac10910180910x4317b37fh986e9ce027fe78c3@mail.gmail.com>

On Sun, Oct 18, 2009 at 7:45 PM, Peter Grandi
<pg_ext3 at ext3.to.sabi.co.uk> wrote:
>>>>> Hi, System : Fedora 11 x86_64 Current Filesystem: 150G ext4
>>>>> (formatted with "-T small" option) Number of files: 50
>>>>> Million, 1 to 30K png images We are generating these files
>>>>> using a python programme and getting very slow IO
>>>>> performance. While generation there in only write, no
>>>>> read. After generation there is heavy read and no write.
>>>>> I am looking for best practices/recommendation to get a
>>>>> better performance. ?Any suggestions of the above are
>>>>> greatly appreciated.
>
> The first suggestion is to report issues in a less vague way.
>
> Perhaps you are actually getting very good performance, but it is
> not sufficient for the needs of your application; or perhaps you
> are really getting poor performance, and emoving the cause would
> make it sufficient for the needs of your application. But no
> information on what is the current and what is the desired
> performance is available, and that should have been the first
> thing stated.

There are two issues mainly.

1. I am generating 50 Million 256 x 256 png images using two
application, mapnik and tilecache. Both the application are open
source and the seeder  programme in tilecache which is used to
precache these tiles from mapnik and gis data source  is single
threaded. So I was running 4 processes and it was taking 20 sec to
create a file. Now I have reduced the number of process to one and I
am getting 6 sec per tile.

2. The goal is to achieve to generate a tile in less than 1 sec. The
backend gis data source is postgres+postgis, the application, mapnik,
is making only one query at a time to generate a single tile. The
postgres is a 50G DB running on 16GB dual xeon boxes.

>
>>>> [ ... ] these files are not in a single directory, this is a
>>>> pyramid structure. There are total 15 pyramids and coming down
>>>> from top to bottom the sub directories and files?are
>>>> multiplied by a factor of 4. The IO is scattered all over!!!!
>>>> [ ... ]
>
> Is that a surprise? First one creates a marvellously pessimized
> data storage scheme, and then "surprise!" IO is totally random
> (and it is likely to be somewhat random at the application level
> too).
>
>>> [ ... ] What is the application trying to do, at a high level?
>>> Sometimes it's not possible to optimize a filesystem against a
>>> badly designed application. ?:-( [ ... ]
>
>> The application is reading the gis data from a data source and
>> plotting the map tiles (256x256, png images) for different zoom
>> levels. The tree output of the first zoom level is as follows in
>> each zoom level the fourth level directories are multiplied by a
>> factor of four. Also the number of png images are multiplied by
>> the same number. Some times a single image is taking around 20
>> sec to create. [ ... ]
>
> Once upon a time in the Land of the Silly Fools some folks wanted
> to store many small records, and being silly fools they worried
> about complicated nonsense like locality of access, index fanout,
> compact representations, caching higher index tree levels, and
> studied indexed files and DBMSes; and others who wanted to store
> large images with various LODs studied ways to have compact,
> progressive representations of those images.
>
> As any really clever programmer and sysadm knows, all those silly
> fools wasted a lot of time because it is very easy indeed instead
> to just represent large small-record image collections as files
> scattered in a many-level directory tree, and LODs as files of
> varying size in subdirectories of that tree. :-)
>
> [ ... ]
>
>>> With a sufficiently bad access patterns, there may not be a lot
>>> you can do, other than (a) throw hardware at the problem, or
>>> (b) fix or redesign the application to be more intelligent (if
>>> possible).
>
> "if possible" here is a big understatement. :-)

>
>> The file system is crated with "-i 1024 -b 1024" for larger
>> inode number, 50% of the total images are less than 10KB.
>> I have disabled access time and given a large value to the
>> commit also.
>
> These are likely to be irrelevant or counteproductive, and do not
> address the two main issues, the acess pattern profile of the
> application and how stunningly pessimized the current setup is.
>
>> Do you have any other recommendation of the file system
>> creation?
>
> Get the application and the system redeveloped by some not-clever
> programmers and sysadms who come from the Land of the Silly Fools
> and thus have heard of indexed files and databases and LOD image
> representations and know why silly fools use them. :-)

I am trying to get some more hardware, SSD is not possible now. I am
tring to get SAS 15k disks with more spindles.
Now the image tiles are 50Million, with in an year it will become
1Billion, we will be receiving UGC/Satellite images as well, so with
in couple of years
the total image size will be close to 4TB :). So started thinking
about the scalability/performance issues....,

as suggested I will be searching for some silly fools to design and
deploy the same with me .......:)

>
> Since that recommendation is unlikely to happen (turkeys don't
> vote for Christmas...), the main alternative is use some kind of
> SLC SSD (e.g. recent Intel 160GB one) so as to minimize the impact
> of a breathtakingly pessimized design thanks to a storage device
> that can do very many more IOP/s than a hard disk. On a flash SSD
> I would suggest using 'ext2' (or NILFS2, and I wish that UDF were
> in a better state):
>
> ?http://www.storagesearch.com/ssd.html
> ?http://club.cdfreaks.com/f138/ssd-faq-297856/
> ?http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=3607&p=4
>
> Actually I would use an SSD even with an indexed file or a DBMS
> and a LOD friendly image representation, because while that will
> avoid the pessimized 15-layer index tree of directories, even in
> the best of cases the app looks like having extremly low locality
> of reference for data, and odds are that there will be 1-3 IOPs
> per image cess (while probably currently there are many more).
>
> The minor alternative is to use a file system like ReiserFS that
> uses index trees internally and handles particularly well file
> "tails", and also spread the excitingly pessimized IOP load across
> a RAID5 (this application seems one of the only 2 cases where a
> RAID5 makes sense), not a single disk. A nice set of low access
> time 2.5" SAS drives might be the best choice. But considering the
> cost of a flash 160GB SSD today, I'd go for a flash SSD drive (or
> a small RAID of those) and a suitable fs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>



From viji at fedoraproject.org  Sun Oct 18 16:29:12 2009
From: viji at fedoraproject.org (Viji V Nair)
Date: Sun, 18 Oct 2009 21:59:12 +0530
Subject: optimising filesystem for many small files
In-Reply-To: <1255878457.27380.138.camel@localhost.localdomain>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
	<20091017222619.GA10074@mit.edu>
	<84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
	<20091018114100.GA26721@eagle102.home.lan>
	<84c89ac10910180614l5d2d476ehb91d210820761039@mail.gmail.com>
	<1255878457.27380.138.camel@localhost.localdomain>
Message-ID: <84c89ac10910180929t2bebfd3eq26eb318475a24fd4@mail.gmail.com>

On Sun, Oct 18, 2009 at 8:37 PM, Jon Burgess <jburgess777 at googlemail.com> wrote:
> On Sun, 2009-10-18 at 18:44 +0530, Viji V Nair wrote:
>> On Sun, Oct 18, 2009 at 5:11 PM, Matija Nalis <mnalis-ml at voyager.hr> wrote:
>> > On Sun, Oct 18, 2009 at 03:01:46PM +0530, Viji V Nair wrote:
>> >> The application which we are using are modified versions of mapnik and
>> >> tilecache, these are single threaded so we are running 4 process at a
>> >
>> > How does it scale if you reduce the number or processes - especially if you
>> > run just one of those ? As this is just a single disk, 4 simultaneous
>> > readers/writers would probably *totally* kill it with seeks.
>> >
>> > I suspect it might even run faster with just 1 process then with 4 of
>> > them...
>>
>> with one process it is giving me 6 seconds
>
> That seems a little slow. Have you looked in optimising your mapnik
> setup? The mapnik-users list or IRC channel is a good place to ask[1].
>
> For comparison, the OpenStreetMap tile server typically renders a 8x8
> block of 64 tiles in about 1 second, although the time varies greatly
> depending on the amount of data within the tiles.
>
>> >
>> >> time. We can say only four images are created at a single point of
>> >> time. Some times a single image is taking around 20 sec to create. I
>> >
>> > is that 20 secs just the write time for an precomputed file of 10k ?
>> > Or does it also include reading and processing and writing ?
>>
>> this include processing and writing
>>
>> >
>> >> can see lots of system resources are free, memory, processors etc
>> >> (these are 4G, 2 x 5420 XEON)
>
> 4GB may be a little small. Have you checked whether the IO reading your
> data sources is the bottleneck?

I will be upgrading the RAM, but I didn't see any swap usage while
running this applications...
the data source is on a different machine, postgres+postgis. I have
checked the IO, looks fine. It is a 50G DB running on 16GB dual xeon
box

>
>> > If you can modify hardware setup, RAID10 (better with many smaller disks
>> > than with fewer bigger ones) should help *very* much. Flash-disk-thingies of
>> > appropriate size are even better option (as the seek issues are few orders
>> > of magnitude smaller problem). Also probably more RAM (unless you full
>> > dataset is much smaller than 2 GB, which I doubt).
>> >
>> > On the other hand, have you tried testing some other filesystems ?
>> > I've had much better performance with lots of small files of XFS (but that
>> > was on big RAID5, so YMMV), for example.
>> >
>> > --
>> > Opinions above are GNU-copylefted.
>> >
>>
>> I have not tried XFS, but tried reiserfs. I could not see a large
>> difference when compared with mkfs.ext4 -T small. I could see that
>> reiser is giving better performance on overwrite, not on new writes.
>> some times we overwrite existing image with new ones.
>>
>> Now the total files are 50Million, soon (with in an year) it will grow
>> to 1 Billion. I know that we should move ahead with the hardware
>> upgrades, also files system access is a large concern for us. There
>> images are accessed over the internet and expecting a 100 million
>> visits every month. For each user we need to transfer at least 3Mb of
>> data.
>
> Serving 3MB is about 1000 tiles. This is a total of 100M * 1000 = 1e11
> tiles/month or about 40,000 requests per second. If every request needed
> an IO from a hard disk managing 100 IOPs then you would need about 400
> disks. Having a decent amount of RAM should dramatically cut the number
> of request reaching the disks. Alternatively you might be able to do
> this all with just a few SSDs. The Intel X25-E is rated at >35,000 IOPs
> for random 4kB reads[2].
>
> I can give you some performance numbers about the OSM server for
> comparision: At last count the OSM tile server had 568M tiles cached
> using about 500GB of disk space[3]. The hardware is described on the
> wiki[4]. It regularly serves 500+ tiles per second @ 50Mbps[5]. This is
> about 40 million HTTP requests per day and several TB of traffic per
> month.
>
> ? ? ? ?Jon
>
>
> 1: http://trac.mapnik.org/
> 2: http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf
> 3: http://wiki.openstreetmap.org/wiki/Tile_Disk_Usage
> 4: http://wiki.openstreetmap.org/wiki/Servers/yevaud
> 5: http://munin.openstreetmap.org/openstreetmap/yevaud.openstreetmap.html
>
>
>

I have to give a try on mod_tile. Do you have any suggestion on using
nginx/varnish as a cahce layer?



From viji at fedoraproject.org  Sun Oct 18 16:33:42 2009
From: viji at fedoraproject.org (Viji V Nair)
Date: Sun, 18 Oct 2009 22:03:42 +0530
Subject: optimising filesystem for many small files
In-Reply-To: <4ADB357B.4030008@redhat.com>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
	<20091017222619.GA10074@mit.edu>
	<84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
	<4ADB357B.4030008@redhat.com>
Message-ID: <84c89ac10910180933p3ddb9947ye464a19ba29e4ccc@mail.gmail.com>

On Sun, Oct 18, 2009 at 9:04 PM, Eric Sandeen <sandeen at redhat.com> wrote:
> Viji V Nair wrote:
>>
>> On Sun, Oct 18, 2009 at 3:56 AM, Theodore Tso <tytso at mit.edu> wrote:
>>>
>>> On Sat, Oct 17, 2009 at 11:26:04PM +0530, Viji V Nair wrote:
>>>>
>>>> these files are not in a single directory, this is a pyramid
>>>> structure. There are total 15 pyramids and coming down from top to
>>>> bottom the sub directories and files ?are multiplied by a factor of 4.
>>>>
>>>> The IO is scattered all over!!!! and this is a single disk file system.
>>>>
>>>> Since the python application is creating files, it is creating
>>>> multiple files to multiple sub directories at a time.
>>>
>>> What is the application trying to do, at a high level? ?Sometimes it's
>>> not possible to optimize a filesystem against a badly designed
>>> application. ?:-(
>>
>> The application is reading the gis data from a data source and
>> plotting the map tiles (256x256, png images) for different zoom
>> levels. The tree output of the first zoom level is as follows
>>
>> /tiles/00
>> `-- 000
>> ? ?`-- 000
>> ? ? ? ?|-- 000
>> ? ? ? ?| ? `-- 000
>> ? ? ? ?| ? ? ? `-- 000
>> ? ? ? ?| ? ? ? ? ? |-- 000.png
>> ? ? ? ?| ? ? ? ? ? `-- 001.png
>> ? ? ? ?|-- 001
>> ? ? ? ?| ? `-- 000
>> ? ? ? ?| ? ? ? `-- 000
>> ? ? ? ?| ? ? ? ? ? |-- 000.png
>> ? ? ? ?| ? ? ? ? ? `-- 001.png
>> ? ? ? ?`-- 002
>> ? ? ? ? ? ?`-- 000
>> ? ? ? ? ? ? ? ?`-- 000
>> ? ? ? ? ? ? ? ? ? ?|-- 000.png
>> ? ? ? ? ? ? ? ? ? ?`-- 001.png
>>
>> in each zoom level the fourth level directories are multiplied by a
>> factor of four. Also the number of png images are multiplied by the
>> same number.
>>>
>>> It sounds like it is generating files distributed in subdirectories in
>>> a completely random order. ?How are the files going to be read
>>> afterwards? ?In the order they were created, or some other order
>>> different from the order in which they were read?
>>
>> The application which we are using are modified versions of mapnik and
>> tilecache, these are single threaded so we are running 4 process at a
>> time. We can say only four images are created at a single point of
>> time. Some times a single image is taking around 20 sec to create. I
>> can see lots of system resources are free, memory, processors etc
>> (these are 4G, 2 x 5420 XEON)
>>
>> I have checked the delay in the backend data source, it is on a 12Gbps
>> LAN and no delay at all.
>
> The delays are almost certainly due to the drive heads seeking like mad as
> they attempt to write data all over the disk; most filesystems are designed
> so that files in subdirectories are kept together, and new subdirectories
> are placed at relatively distant locations to make room for the files they
> will contain.
>
> In the past I've seen similar applications also slow down due to new inode
> searching heuristics in the inode allocator, but that was on ext3 and ext4
> is significantly different in that regard...
>
>> These images are also read in the same manner.
>>
>>> With a sufficiently bad access patterns, there may not be a lot you
>>> can do, other than (a) throw hardware at the problem, or (b) fix or
>>> redesign the application to be more intelligent (if possible).
>>>
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? - Ted
>>>
>>
>> The file system is crated with "-i 1024 -b 1024" for larger inode
>> number, 50% of the total images are less than 10KB. I have disabled
>> access time and given a large value to the commit also. Do you have
>> any other recommendation of the file system creation?
>
> I think you'd do better to change, if possible, how the application behaves.
>
> I probably don't know enough about the app but rather than:
>
> /tiles/00
> `-- 000
> ? ?`-- 000
> ? ? ? ?|-- 000
> ? ? ? ?| ? `-- 000
> ? ? ? ?| ? ? ? `-- 000
> ? ? ? ?| ? ? ? ? ? |-- 000.png
> ? ? ? ?| ? ? ? ? ? `-- 001.png
>
> could it do:
>
> /tiles/00/000000000000000000.png
> /tiles/00/000000000000000001.png
>
> ...
>
> for example? ?(or something similar)
>
> -Eric

The tilecache application is creating these directory structure, we
need to change it and our application for a new directory tree.

>
>> Viji
>
>



From jburgess777 at googlemail.com  Sun Oct 18 17:15:05 2009
From: jburgess777 at googlemail.com (Jon Burgess)
Date: Sun, 18 Oct 2009 18:15:05 +0100
Subject: optimising filesystem for many small files
In-Reply-To: <84c89ac10910180929t2bebfd3eq26eb318475a24fd4@mail.gmail.com>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
	<20091017222619.GA10074@mit.edu>
	<84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
	<20091018114100.GA26721@eagle102.home.lan>
	<84c89ac10910180614l5d2d476ehb91d210820761039@mail.gmail.com>
	<1255878457.27380.138.camel@localhost.localdomain>
	<84c89ac10910180929t2bebfd3eq26eb318475a24fd4@mail.gmail.com>
Message-ID: <1255886105.27380.158.camel@localhost.localdomain>

On Sun, 2009-10-18 at 21:59 +0530, Viji V Nair wrote:
> On Sun, Oct 18, 2009 at 8:37 PM, Jon Burgess <jburgess777 at googlemail.com> wrote:
> > On Sun, 2009-10-18 at 18:44 +0530, Viji V Nair wrote:
> >> >
> >> >> can see lots of system resources are free, memory, processors etc
> >> >> (these are 4G, 2 x 5420 XEON)
> >
> > 4GB may be a little small. Have you checked whether the IO reading your
> > data sources is the bottleneck?
> 
> I will be upgrading the RAM, but I didn't see any swap usage while
> running this applications...
> the data source is on a different machine, postgres+postgis. I have
> checked the IO, looks fine. It is a 50G DB running on 16GB dual xeon
> box

Going into swap is not the issue. If you have extra RAM available then
the OS will use this as a disk cache which means the DB will be able to
access indexes etc without needing to wait for the disk every time. 16GB
of RAM for a 50GB DB should be sufficient if the data is sensibly
indexed.

> I have to give a try on mod_tile. Do you have any suggestion on using
> nginx/varnish as a cahce layer?

There have been some tests using squid as a cache in front of mod_tile.
This worked reasonably well but did not give a big performance increase
because the server was already able to handle the load without an
additional cache. If you want to discuss this further then I'd suggest
continuing the conversation on the OSM or Mapnik lists.

	Jon




From pg_ext3 at ext3.for.sabi.co.uk  Sun Oct 18 14:15:24 2009
From: pg_ext3 at ext3.for.sabi.co.uk (Peter Grandi)
Date: Sun, 18 Oct 2009 15:15:24 +0100
Subject: optimising filesystem for many small files
In-Reply-To: <84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
	<20091017222619.GA10074@mit.edu>
	<84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
Message-ID: <19163.8956.421374.35494@tree.ty.sabi.co.uk>

>>>> Hi, System : Fedora 11 x86_64 Current Filesystem: 150G ext4
>>>> (formatted with "-T small" option) Number of files: 50
>>>> Million, 1 to 30K png images We are generating these files
>>>> using a python programme and getting very slow IO
>>>> performance. While generation there in only write, no
>>>> read. After generation there is heavy read and no write.
>>>> I am looking for best practices/recommendation to get a
>>>> better performance.  Any suggestions of the above are
>>>> greatly appreciated.

The first suggestion is to report issues in a less vague way.

Perhaps you are actually getting very good performance, but it is
not sufficient for the needs of your application; or perhaps you
are really getting poor performance, and emoving the cause would
make it sufficient for the needs of your application. But no
information on what is the current and what is the desired
performance is available, and that should have been the first
thing stated.

>>> [ ... ] these files are not in a single directory, this is a
>>> pyramid structure. There are total 15 pyramids and coming down
>>> from top to bottom the sub directories and files?are
>>> multiplied by a factor of 4. The IO is scattered all over!!!!
>>> [ ... ]

Is that a surprise? First one creates a marvellously pessimized
data storage scheme, and then "surprise!" IO is totally random
(and it is likely to be somewhat random at the application level
too).

>> [ ... ] What is the application trying to do, at a high level?
>> Sometimes it's not possible to optimize a filesystem against a
>> badly designed application. ?:-( [ ... ]

> The application is reading the gis data from a data source and
> plotting the map tiles (256x256, png images) for different zoom
> levels. The tree output of the first zoom level is as follows in
> each zoom level the fourth level directories are multiplied by a
> factor of four. Also the number of png images are multiplied by
> the same number. Some times a single image is taking around 20
> sec to create. [ ... ]

Once upon a time in the Land of the Silly Fools some folks wanted
to store many small records, and being silly fools they worried
about complicated nonsense like locality of access, index fanout,
compact representations, caching higher index tree levels, and
studied indexed files and DBMSes; and others who wanted to store
large images with various LODs studied ways to have compact,
progressive representations of those images.

As any really clever programmer and sysadm knows, all those silly
fools wasted a lot of time because it is very easy indeed instead
to just represent large small-record image collections as files
scattered in a many-level directory tree, and LODs as files of
varying size in subdirectories of that tree. :-)

[ ... ]

>> With a sufficiently bad access patterns, there may not be a lot
>> you can do, other than (a) throw hardware at the problem, or
>> (b) fix or redesign the application to be more intelligent (if
>> possible).

"if possible" here is a big understatement. :-)

> The file system is crated with "-i 1024 -b 1024" for larger
> inode number, 50% of the total images are less than 10KB.
> I have disabled access time and given a large value to the
> commit also.

These are likely to be irrelevant or counteproductive, and do not
address the two main issues, the acess pattern profile of the
application and how stunningly pessimized the current setup is.

> Do you have any other recommendation of the file system
> creation?

Get the application and the system redeveloped by some not-clever
programmers and sysadms who come from the Land of the Silly Fools
and thus have heard of indexed files and databases and LOD image
representations and know why silly fools use them. :-)

Since that recommendation is unlikely to happen (turkeys don't
vote for Christmas...), the main alternative is use some kind of
SLC SSD (e.g. recent Intel 160GB one) so as to minimize the impact
of a breathtakingly pessimized design thanks to a storage device
that can do very many more IOP/s than a hard disk. On a flash SSD
I would suggest using 'ext2' (or NILFS2, and I wish that UDF were
in a better state):

  http://www.storagesearch.com/ssd.html
  http://club.cdfreaks.com/f138/ssd-faq-297856/
  http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=3607&p=4

Actually I would use an SSD even with an indexed file or a DBMS
and a LOD friendly image representation, because while that will
avoid the pessimized 15-layer index tree of directories, even in
the best of cases the app looks like having extremly low locality
of reference for data, and odds are that there will be 1-3 IOPs
per image cess (while probably currently there are many more).

The minor alternative is to use a file system like ReiserFS that
uses index trees internally and handles particularly well file
"tails", and also spread the excitingly pessimized IOP load across
a RAID5 (this application seems one of the only 2 cases where a
RAID5 makes sense), not a single disk. A nice set of low access
time 2.5" SAS drives might be the best choice. But considering the
cost of a flash 160GB SSD today, I'd go for a flash SSD drive (or
a small RAID of those) and a suitable fs.



From darkonc at gmail.com  Mon Oct 19 07:23:30 2009
From: darkonc at gmail.com (Stephen Samuel (gmail))
Date: Mon, 19 Oct 2009 00:23:30 -0700
Subject: optimising filesystem for many small files
In-Reply-To: <84c89ac10910180608v76caf7f5y4837ccaf6a66a594@mail.gmail.com>
References: <84c89ac10910162352x5cdeca37icfbf0af2f2325d7c@mail.gmail.com>
	<4AD9D599.3000306@redhat.com>
	<84c89ac10910171056i773dfb93wc2e917a086dd8ef0@mail.gmail.com>
	<20091017222619.GA10074@mit.edu>
	<84c89ac10910180231p202fb5f1r2e192e9ac0b51509@mail.gmail.com>
	<20091018114100.GA26721@eagle102.home.lan>
	<84c89ac10910180608v76caf7f5y4837ccaf6a66a594@mail.gmail.com>
Message-ID: <6cd50f9f0910190023i5d719543n21862725c294aef3@mail.gmail.com>

On Sun, Oct 18, 2009 at 6:08 AM, Viji V Nair <viji at fedoraproject.org> wrote:

> From: Matija Nalis <mnalis-ml at voyager.hr>
>
> On Sun, Oct 18, 2009 at 03:01:46PM +0530, Viji V Nair wrote:
> > The application which we are using are modified versions of mapnik and
> > tilecache, these are single threaded so we are running 4 process at a
>
> How does it scale if you reduce the number or processes - especially if you
> run just one of those ? As this is just a single disk, 4 simultaneous
> readers/writers would probably *totally* kill it with seeks.
>
> I suspect it might even run faster with just 1 process then with 4 of
> them...
>
> with one process it is giving me 6 seconds
>

If it takes 6 seconds with one process and 20 seconds with 4 processes, then
this
pretty clearly points to problems with thrashing the heads.
(this presumes that the timing you're mentioning is time between  request
and
service with the same request patterns).

Others have suggested flash drives... This sounds like an idea.  On the
cheaper end, I'd suggest  lots of mirroring.. The more drives the merrier. a
4-way mirrir will probably give you a good deal of speedup.   If you can
find an old RAID box, try throwing in a dozen or so smaller drives (even 72
or 36GB SCSI drives)..

It sounds like the problem is clearly head seek, not transfer speeds, so
lots of old SCSI drives with a single (slower) connection will probably do
you more good than 4 demon-fast SATA drives.
If you have smaller drives, go to raid10. If you have larger drives, then go
to raid 1 and mirror them up the wazoo.  (I'm presuming that this is a
read-mostly application.  If you're doing lots of parallel writes, then raid
10 might still be a good idea, even with big drives).

If you already have a deep mirror and you later get  Flash drives, then I'd
say add the flash drives into the mix, rather than just replacing the RAID
with flash, again -- unless this isn't a read-mostly situation -- the more
drives the merrier.

Stephen Samuel http://www.bcgreen.com  Software, like love,
778-861-7641                              grows when you give it away
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20091019/b611b4c8/attachment.htm>