[Linux-cluster] GFS Feature Question

Gordan Bobic gordan at bobich.net
Mon Oct 8 17:45:06 UTC 2007


On Mon, 8 Oct 2007, Steven Whitehouse wrote:

>>>> I stumbled upon an old document from back in 2000 (before RedHat acquired
>>>> Sistina), and they were talking about a number of features for the "next
>>>> version", including shadowing/copy-on-write.
>>>>
>>>> The two features I am particularly interested in are:
>>>>
>>>> 1) Compression
>>>> I consider this to be important both for performance reasons and the fact
>>>> that no matter how cheap, disks will always be more expensive.
>>>> Performance-wise, at some point I/O becomes the bottleneck. Not
>>>> necessarily the disk I/O but network I/O of the SAN, especially when all
>>>> the nodes in the cluster are sharing the same SAN bandwidth. At that
>>>> point, reducing the data volume through compression becomes a performance
>>>> win. This point isn't all that difficult to reach even on a small cluster
>>>> on gigabit ethernet.
>>>>
>>> There are really two issues here rather than one:
>>> 1. Compression of data
>>> Has, as a prerequisite, "allocate on flush" as we would really need
>>> "compress on flush" in order to make this a viable option. Also we'd
>>> need hints as to what kind of data we are looking at in order to make it
>>> worthwhile. We'd also have to look at crypto too since you can't
>>> compress encrypted data; the compression must come first if it's
>>> required.
>>
>> Sure, but this is hardly a difficult problem. It could be based on any of:
>
> I would argue that it is a difficult problem for a number of reasons...

Not really. It's only CPU time. ;-)

>> 1) file extension, perhaps listed somewhere in the /etc directory, and
>> only read on boot-up, or even provided as a comma-separated list to the
>> module/kernel at load-time (a file would be nicer, though).
>
> Which requires not only loading the list into the kernel, but what
> happens if someone creates a link to foo.png called foo.txt or foo.ps?

If the chattr flag says to compress, we compress. The computer should not 
be overriding the operator. If a block gets bigger through compression 
(i.e. the data is already compressed or encrypted), then you just save 
that block uncompressed and flag it as such.
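
Just to show how little logic that actually needs, here's a rough 
userspace sketch of the per-block decision using zlib. The flag constant 
is made up for the example; a real implementation would use whatever 
per-block metadata the FS keeps:

#include <stdint.h>
#include <string.h>
#include <zlib.h>

#define BLOCK_SIZE           4096
#define BLK_FLAG_COMPRESSED  0x01   /* hypothetical per-block flag */

/*
 * Try to compress one block. If the result is not smaller than the
 * original, store the block raw and leave the flag clear, so reads
 * know not to decompress it. Returns the number of bytes to store.
 * 'out' must be at least BLOCK_SIZE bytes (compressBound(BLOCK_SIZE)
 * to give the compressor room to work).
 */
static size_t pack_block(const unsigned char *in, unsigned char *out,
                         size_t out_size, uint8_t *flags)
{
    uLongf clen = out_size;

    *flags = 0;
    if (compress2(out, &clen, in, BLOCK_SIZE, Z_BEST_SPEED) == Z_OK &&
        clen < BLOCK_SIZE) {
        *flags |= BLK_FLAG_COMPRESSED;
        return clen;
    }

    /* Already compressed/encrypted data: store it as-is. */
    memcpy(out, in, BLOCK_SIZE);
    return BLOCK_SIZE;
}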

>> 2) Completely transparently based on a similar heuristic to what Reiser4
>> uses. For each file, try to compress the first 64KB. If it yields a
>> reasonable result, compress the rest; otherwise, flag as uncompressed
>> and don't bother. The user could override this by the appropriate chattr
>> command.
>
> That assumes that you always have the "first 64k" available (in cache)
> and that it's not a hole in the file, for example.

I'm actually very much in favour of just leaving it up to the user. So 
what if 1% of the files end up being incompressible? If we are concerned 
about compression, we are clearly not concerned about every last CPU cycle 
that might end up getting wasted on the odd block of incompressible data. 
If the user puts the compress flag on the file (or directory), then the FS 
should try to compress, and if it doesn't manage to reduce the size, save 
the block uncompressed.

I don't really see a problem with that.
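
For what it's worth, the "sample the first 64K" heuristic from point 2 
above is also only a few lines. A rough userspace sketch with zlib - the 
10% threshold is an arbitrary number picked for the example, not anything 
Reiser4 actually uses:

#include <stdlib.h>
#include <zlib.h>

#define SAMPLE_LEN  (64 * 1024)

/*
 * Compress (up to) the first 64K of a file and report whether it
 * shrank enough to make compressing the rest worthwhile.
 */
static int sample_is_compressible(const unsigned char *buf, uLong len)
{
    unsigned char *out;
    uLongf out_len;
    int worth = 0;

    if (len > SAMPLE_LEN)
        len = SAMPLE_LEN;

    out_len = compressBound(len);
    out = malloc(out_len);
    if (out && compress2(out, &out_len, buf, len, Z_BEST_SPEED) == Z_OK)
        worth = out_len < len - len / 10;  /* want at least ~10% saving */

    free(out);
    return worth;
}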

>> 3) Leave it entirely up to the user - just inherit compression flag from
>> the parent directory. If the user says to compress, then don't question
>> it.
>>
> Ok, but that still leaves a number of problems to resolve: firstly we
> need to be able to ensure that the compression doesn't result in
> expansion of the data.

Inode flag to say whether the inode contains compressed or uncompressed 
data?

> Whatever system we use we'd have to be able to
> turn off compression in that case, otherwise block allocation would
> become almost impossible as we'd not be able to put a reasonable max
> bound on the number of blocks used.

Agreed. We'd need meta-data to tell which inodes are compressed and which 
aren't.
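
Something along these lines would probably do - to be clear, none of 
these names or bits exist in the real GFS2 on-disk format, this is just a 
sketch of the sort of metadata I mean:

#include <stdint.h>

/*
 * Hypothetical inode flags: "please compress" (inherited from the
 * parent directory, chattr-style) and "some blocks are stored
 * compressed". Neither exists in the real GFS2 dinode.
 */
#define EX_DIF_COMPRESS_HINT   0x00010000u
#define EX_DIF_HAS_COMPRESSED  0x00020000u

/*
 * Hypothetical per-block tag, so each 4K block can independently fall
 * back to raw storage when compression would expand it.
 */
struct ex_block_tag {
    uint32_t stored_len;   /* bytes actually occupied on disk */
    uint8_t  compressed;   /* 0 = raw, 1 = compressed */
    uint8_t  pad[3];
};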

> Also the compression isn't likely to be very good if it can't tune
> itself to the particular file content.

It's a heuristic. If it does the job for most people most of the time, 
it's good enough. Say I have a big email system with 1 TB of maildirs from 
1M users on it. It's email. Most email is text. So we set the compression 
flags on everything to on, and just get on with it. Sure, some of it will 
be encrypted, and thus incompressible, but so what if we still save 50% of 
the space?

>> 3) would be the simplest, and probably most useful. The only time
>> when a block should be left uncompressed is when compressing it makes it
>> get bigger.
>>
> The other thing that would have to be decided is the size of a "block"
> in this case. Too large and random access will be slow and cumbersome,
> too small and the benefit from compression will be less.

Not really. Just go with whatever the default block size is. Default on 
most file systems on Linux seems to be 4K (probably something to do with 
the page size on x86 ;-) ). This seems to yield reasonable results in 
e2compr (I think Reiser4 is a bit different, not sure), and NTFS uses 4K 
blocks by default, and the compression it achieves is certainly good 
enough to be useful.

> So although all of those problems are solvable, given time, I would
> still say that it is not an easy thing to do.

It's a problem that has been solved by various other FSes "well enough". 
There is no such thing as perfect compression. But a simple approach is 
good enough to be extremely useful for most people.

I am actually toying with the idea of home-brewing an iSCSI SAN that 
exports a sparse file as a partition, and has that file sit on top of a 
Reiser4 compressed file system. This would actually remove the need to 
have GFS compressed, as the underlying FS would be compressed. It's not 
particularly neat, but it solves the problem. And the caching that the SAN 
(or NAS, the principle is the same) does would help overcome the 
compression speed overhead.
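
The backing file side of that is trivial, for the record - something like 
this, with the path and size obviously made up, and the iSCSI target then 
exporting the file as a file-backed LUN:

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* A 500GB sparse file on the (compressed) Reiser4 volume; no
     * blocks are allocated until the initiators actually write. */
    int fd = open("/mnt/reiser4/lun0.img", O_RDWR | O_CREAT, 0600);

    if (fd < 0 || ftruncate(fd, 500LL * 1024 * 1024 * 1024) < 0)
        return 1;

    close(fd);
    return 0;
}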

>>> 2. Compression of metadata
>>> This might well be worth looking into. There is a considerable amount
>>> of redundancy in typical fs metadata, and we ought to be able to reduce
>>> the number of blocks we have to read/write in order to complete an
>>> operation in this way. Using extents for example could be considered a
>>> form of metadata compression. The main problem is that our "cache line"
>>> if you like in GFS(2) is one disk block, so that sharing between nodes
>>> is a problem (hence the one inode per block rule we have at the moment).
>>> We'd need to address the metadata migration issue first.
>>
>> I'm not sure I understand what the problem is here. How is caching a
>> problem any more than it would otherwise be - considering we have multiple
>> nodes doing r/w ops on the same FS?
>>
> Because the data structures are carefully designed to minimise the
> times for which two nodes will want to access the same block. Of course
> that still happens in some cases, but provided the nodes are not all
> working in the same directory (or if they are, then the workload is
> mostly readonly) then largely there is little contention.
>
> As soon as you start (for example) having multiple inodes in the same
> block, then the probability of sharing two items of data which are
> required by different nodes at the same time goes up.

Sure - so, you'd want to compress just before writing to disk. Although 
there could be performance benefit to having the cache compressed, too, if 
memory is short, especially with a fast decompression algorithm like LZO.
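
For reference, driving LZO from userspace via liblzo2 is only a handful 
of calls per 4K block - the in-kernel interface is different, so treat 
this purely as a sketch of how cheap the round trip is:

#include <lzo/lzo1x.h>

#define BLOCK_SIZE  4096

/* Work memory for the compressor (needs lzo_align_t alignment). */
static lzo_align_t wrkmem[(LZO1X_1_MEM_COMPRESS + sizeof(lzo_align_t) - 1)
                          / sizeof(lzo_align_t)];

/*
 * Compress one block; returns the compressed length, or 0 on error.
 * 'out' must have room for the worst case of BLOCK_SIZE +
 * BLOCK_SIZE / 16 + 64 + 3 bytes. lzo_init() must have been called.
 */
static lzo_uint lzo_pack(const unsigned char *in, unsigned char *out)
{
    lzo_uint out_len = 0;

    if (lzo1x_1_compress(in, BLOCK_SIZE, out, &out_len, wrkmem) != LZO_E_OK)
        return 0;
    return out_len;
}

/* Decompress back into a 4K block; returns 0 on success. */
static int lzo_unpack(const unsigned char *in, lzo_uint in_len,
                      unsigned char *out)
{
    lzo_uint out_len = BLOCK_SIZE;

    if (lzo1x_decompress_safe(in, in_len, out, &out_len, NULL) != LZO_E_OK
        || out_len != BLOCK_SIZE)
        return -1;
    return 0;
}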

>>> Neither of the above is likely to happen soon though as they both
>>> require on-disk format changes.
>>
>> Compatibility is already broken between GFS1 and GFS2. I don't see this as
>> an issue. The FS will get mounted with whatever parameters it was created
>> with - and a new FS can be created with compression enabled.
>>
> Mostly the data structures between GFS1 and GFS2 are the same. There are
> a few differences, but that's mainly down to the addition of the metadata
> file system (which has identical on-disk format to the main GFS2) and
> one or two fields in the inode (the common fields are at the same
> offsets). The format for journalled files has changed, but only to be
> the same as that for non-journalled files, so it's not a huge change
> really.

Maybe not - but slightly incompatible or completely incompatible is still 
incompatible. It's re-format the file system time. :-)

>>>> 2) Shadowing/Copy-On-Write File Versioning
>>>> Backups have 2 purposes - retrieving a file that was lost or corrupted
>>>> through user error, and files lost or corrupted through disk failure. High
>>>> levels of RAID alleviate the need for backup for the latter reason, but
>>>> they do nothing to alleviate user-error caused damage. At the same time
>>>> SANs can get big - I don't see hundreds of TB as an inconceivable size.
>>>> At this size, backups become an issue. Thus, a feature to provide file
>>>> versioning is important.
>>>>
>>>> In turn, 2) increases the volume of data, which increases the need for 1).
>>>>
>>>> Are either of these two features planned for GFS in the near future?
>>
>>> This also requires on-disk format changes,
>>
>> I don't remember implying that it wouldn't. But at the same time, why
>> would this be a problem? It's not like it means that people won't be able
>> to mount their GFS2 FS as they can now. And it's not like GFS2 works at
>> the moment, anyway (not with the latest packaged releases on any of the
>> spawns of RH (Fedora/CentOS, etc.))! :-p
>>
> For the moment we are trying to not make changes to the on-disk format,
> and in fact there haven't been any for a long time now. We have made
> some major steps forward in stability recently and those are due to roll
> into the distros fairly shortly now, so the last thing we want to do is
> to change things at this stage.

Sure, I understand that. I was suggesting this should be in the next 
stable version of GFS2. But it would be mighty nice to have versioning and 
compression in GFS3. :-)

> That's not to say that we won't come back and revisit the ideas later on
> though, but it's not top of our list right now.
>
>>> but I agree that it would be
>>> a nice thing to do. It's very much in my mind though as to what a
>>> suitable scheme would be. We do have an ever increasing patent minefield
>>> to walk through here too I suspect.
>>
>> I very much doubt it. There are several OSS non-cluster FSs that provide
>> copy-on-write file versioning, and this has been used since the days of
>> VMS - which was now long enough ago that patents would have long since
>> expired.
>>
>>> Potentially it would be possible to address both of the above
>>> suggestions (minus the metadata compression) by using a stacking
>>> filesystem. That would be potentially more flexible by introducing the
>>> features on all filesystems not just GFS(2),
>>
>> Can you explain what you mean by stackable? I would have thought that
>> having a stacked file system on top of GFS would break GFS' ability to
>> function correctly in a clustered environment (not to mention introduce
>> unnecessary overheads).
>>
>
> I'm thinking of filesystems like (for example) unionfs which pass
> certain operations through to the filesystem(s) underneath it. Depending
> on how this is implemented, it need not be particularly inefficient. It
> wouldn't affect how GFS2 works any more than it would affect any other
> filesystem, although if locking were required (for example) in the
> clustered case, then the higher level filesystem would be just as able
> to use the DLM (for example) as any other kernel module or userland
> application, so that shouldn't be a barrier.

Interesting idea. In the meantime, I think I'll just try compressing the 
raw storage medium as mentioned above. But the big feature would really be 
copy-on-write journalling. Removing the need for backups on a big file 
system would be extremely useful. Sadly, though, that wouldn't stack as 
nicely as the compression does.
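
The core of the copy-on-write versioning idea is simple enough to sketch, 
though - before a block is overwritten, the old contents get preserved 
somewhere a "give me yesterday's file" tool can find them. Rough 
userspace illustration only; the side log is invented for the example, 
and a real FS would do this in the allocator rather than by copying:

#define _FILE_OFFSET_BITS 64
#include <unistd.h>

#define BLOCK_SIZE  4096

/*
 * Overwrite block 'blkno' of 'fd', but first append the previous
 * contents of that block to 'logfd' so an older version of the file
 * can still be reconstructed. A real implementation would also record
 * (inode, blkno, timestamp) with each saved block; this only shows
 * the ordering of operations.
 */
static int cow_write_block(int fd, int logfd, off_t blkno,
                           const unsigned char *data)
{
    unsigned char old[BLOCK_SIZE];
    ssize_t got = pread(fd, old, BLOCK_SIZE, blkno * BLOCK_SIZE);

    /* Only blocks that existed before need preserving. */
    if (got > 0 && write(logfd, old, got) != got)
        return -1;

    if (pwrite(fd, data, BLOCK_SIZE, blkno * BLOCK_SIZE) != BLOCK_SIZE)
        return -1;
    return 0;
}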

Gordan



