[Linux-cluster] gfs2 v. zfs?

Wed Jan 26 10:19:27 UTC 2011

Hi,

On Tue, 2011-01-25 at 09:16 -0800, Wendy Cheng wrote:
> On Tue, Jan 25, 2011 at 2:01 AM, Steven Whitehouse <swhiteho at redhat.com> wrote:
> 
> >> On Mon, Jan 24, 2011 at 6:55 PM, Jankowski, Chris
> >> <Chris.Jankowski at hp.com> wrote:
> >> > A few comments, which might contrast uses of GFS2 and XFS in enterprise class production environments:
> >> >
> >> > 3.
> >> > GFS2 provides only tar(1) as a backup mechanism.
> >> > Unfortunately, tar(1) does not cope efficiently with sparse files,
> >> > which many applications create.
> >> > As an exercise create a 10 TB sparse file with just one byte of non-null data at the end.
> >> > Then try to back it up to disk using tar(1).
> >> > The tar image will be correctly created, but it will take many, many hours.
> >> > Dump(8) would do the job in a blink, but is not available for the GFS2 filesystem.
> >> > However, XFS does have an XFS-specific dump(8) command and will back up sparse files
> >> > efficiently.
> >> >
> > You don't need dump in order to do this (since dump reads directly from
> > the block device itself, that would be problematic on GFS/GFS2 anyway).
> > All that is required is a backup tool which supports the FIEMAP ioctl. I
> > don't know if that has made it into tar yet, I suspect probably not.
> >
> 
> If cluster snapshot is in the hands of another development team (that may
> not see it as a high priority), a GFS2 specific dump command could be
> a good alternative. The bottom line here is GFS2 is lacking a sensible
> (read as "easy to use") backup strategy that can significantly
> jeopardize its deployment.
> 
> Of course, this depends on ... someone has to be less stubborn and
> willing to move GFS2's inode number away from its physical disk block
> number. Cough !
> 
> -- Wendy
> 
I don't know of any reason why the inode number should be related to
backup. The reason why it was suggested that the inode number should be
independent of the physical block number was in order to allow
filesystem shrink without upsetting (for example) NFS, which assumes that
its filehandles are valid "forever".

The problem with doing that is that it adds an extra layer of
indirection (and one which had not been written in gfs2 at the point in
time we took that decision). That extra layer of indirection means more
overhead on every lookup of the inode. It would also be a contention
point in a distributed filesystem, since it would be global state.

The dump command directly accesses the filesystem via the block device
which is a problem for GFS2, since there is no guarantee (and in general
it will not be the case) that the information read via this method will match the
actual content of the filesystem. Unlike ext2/3 etc., GFS2 caches its
metadata in per-inode address spaces which are kept coherent using
glocks. In ext2/3 etc., the metadata is cached in the block device
address space which is why dump can work with them.

With GFS2 the only way to ensure that the block device was consistent
would be to umount the filesystem on all nodes. In that case it is no
problem to simply copy the block device using dd, for example. So dump
is not required.
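
For the sparse-file case raised earlier in the thread, a file-level tool
that goes through the mounted filesystem can skip holes without touching
the block device. The sketch below is only an illustration: it uses
lseek() with SEEK_DATA/SEEK_HOLE rather than the FIEMAP ioctl mentioned
above, and the function name copy_sparse is hypothetical. Whether holes
are actually reported depends on the kernel and the filesystem; where
they are not, the whole file is treated as data and the copy is still
correct, just slow.

```python
import errno
import os


def copy_sparse(src_path, dst_path, chunk=1 << 20):
    """Copy src to dst, reading only the regions reported as data.

    Returns the number of bytes actually read and written.
    """
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    copied = 0
    try:
        size = os.fstat(src_fd).st_size
        offset = 0
        while offset < size:
            try:
                # Find the next region containing data at or after offset.
                data_start = os.lseek(src_fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # only a hole remains up to end of file
                raise
            # Find where that data region ends (next hole or end of file).
            data_end = os.lseek(src_fd, data_start, os.SEEK_HOLE)
            pos = data_start
            while pos < data_end:
                buf = os.pread(src_fd, min(chunk, data_end - pos), pos)
                if not buf:
                    break  # file shrank underneath us
                written = os.pwrite(dst_fd, buf, pos)
                pos += written
                copied += written
            offset = data_end
        # Extend the destination to full length so a trailing hole is kept.
        os.ftruncate(dst_fd, size)
    finally:
        os.close(src_fd)
        os.close(dst_fd)
    return copied
```

On a filesystem that reports holes, copying a large file with one byte
of data at the end should read roughly one block rather than the whole
file.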

Ideally we want backup to be online (i.e. with the filesystem mounted),
and we also do not want it to disrupt the workload which the cluster was
designed for, so far as possible. So the best solution is to back up
files from the node which is most likely to be caching them. That also
means that the backup can proceed in parallel across the nodes, reducing
the time taken.

It does mean that a bit more thought has to go into it, since it may not
be immediately obvious what the working set of each node actually is.
Usually though, it is possible to make a reasonable approximation of it.
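
One way to picture such an approximation is a per-node backup plan built
from the directories each node mostly works in. The sketch below is
purely illustrative: the working_sets mapping, the node names and the
function plan_backup are assumptions made for the example, not anything
GFS2 provides. Each path is assigned to the node whose directory prefix
matches it most specifically, and anything unmatched falls to a default
node.

```python
from collections import defaultdict


def plan_backup(paths, working_sets, default_node):
    """Group paths by the node assumed most likely to be caching them.

    working_sets maps a node name to a list of directory prefixes
    (a hand-maintained approximation of that node's working set).
    """
    plan = defaultdict(list)
    for path in paths:
        best_node, best_len = default_node, -1
        for node, prefixes in working_sets.items():
            for prefix in prefixes:
                p = prefix.rstrip("/") + "/"
                # The longest matching prefix wins.
                if path.startswith(p) and len(p) > best_len:
                    best_node, best_len = node, len(p)
        plan[best_node].append(path)
    return dict(plan)


working_sets = {
    "node1": ["/gfs2/mail"],
    "node2": ["/gfs2/db", "/gfs2/mail/archive"],
}
paths = [
    "/gfs2/mail/a",
    "/gfs2/mail/archive/b",
    "/gfs2/db/c",
    "/gfs2/tmp/d",
]
plan = plan_backup(paths, working_sets, default_node="node3")
```

Each node would then back up only its own list, so the jobs can run in
parallel and mostly hit files that node already has cached.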

Steve.