Opinions on new Fedora Core 2 install with LVM 2 and snapshots?

Bryan J. Smith b.j.smith at ieee.org
Mon Jul 26 19:45:26 UTC 2004


[ Thank you very much for your response ]

"Bill Rugolsky Jr." wrote:  
> There are fundamental differences between what a NetApp filer is
> doing, and what LVM2 snapshots provide.

Yeah, it's hard to beat WAFL's well-integrated design.

> In particular, when using LVM2 snapshots, kcopyd has to constantly
> move blocks from your filesystem LV to the snapshot LV.  Device Mapper
> is much more sensible and efficient at this than LVM1,

So I don't even want to look at LVM1, good.

> but it is still non-trivial overhead, and ends up generating a lot
> of mixed read/write traffic.

That's what I figured.

> We are currently using NFS/Ext3/LVM2/MD on a 2.6.8-rc1 kernel as our
> backup NFS server,

That's going to be my usage: a backup NFS server behind a _real_ NetApp
filer.  It will serve Windows users more than UNIX clients, but I'll
still need some production NFS support.

> and initial testing with snapshots under load uncovered some
> performance problems that I need to track down.

How much memory (and what kind of I/O subsystem) does your system have?
I'm hoping to do this with 1GB of RAM, but it's not my primary NFS server.

> [Snapshots and mirroring were only recently added to the Device Mapper
> code in the Linus kernel tree.]

Yep, I saw that.  I also noticed the Red Hat 2.6.7 development kernels
now patch them in (or are those 2.6.8-rc based?).

> Either grab the most recent kernel from kernel.org, or an FC3 development
> kernel, and test extensively.

I can deal with performance issues.  If they get bad enough, I'll simply
skip snapshots for now and enable them later once the quirks are worked out.
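For reference, the snapshot workflow I have in mind is roughly this
(VG/LV names and sizes are hypothetical, just to illustrate the LVM2
commands):

```shell
# Create a 2GB copy-on-write snapshot of /dev/vg00/home.
# The snapshot only needs room for blocks that change while it exists.
lvcreate --snapshot --size 2G --name home-snap /dev/vg00/home

# Mount it read-only, back it up, then tear it down.
mount -o ro /dev/vg00/home-snap /mnt/snap
# ... run the backup against /mnt/snap ...
umount /mnt/snap
lvremove -f /dev/vg00/home-snap
```

Short-lived snapshots like this keep the kcopyd copy-on-write overhead
you mention to a minimum.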

> The NetApp WAFL filesystem encapsulates all meta-data in a tree structure,
> and uses persistent copy-on-write multi-rooted trees.  When writing, it
> places data wherever it is convenient (i.e., in the free space), and then
> adjusts block pointers up toward the root of the tree.  Every few seconds
> it checkpoints its state (i.e., takes a snapshot).

Yep.  WAFL doesn't separate volume management from the filesystem; it's
an "all-in-one" design, which is where much of its efficiency comes from.

> [The NetApp also uses NVRAM to hold state that hasn't been flushed to
> disk.] 

I've done something similar with 1GB PCI NVRAM boards, using one as an
off-device full-data Ext3 journal.  It makes NFS v3 sync performance far
better.

> When one wants to save a snapshot, the filesystem tags it and maintains
> its allocation data, instead of releasing stale blocks back into the free
> pool.

Right.

> Based on what I've read of Reiser4, the design should allow a similar
> level of functionality to be incorporated at some point.  Unfortunately,
> it is not done yet.

I've seen ReiserFS v4 promise a lot, but compatibility always seems to be
an issue.  I'll stick with XFS.

> To summarize: LVM2 will do what you want (modulo some tuning and
> perhaps bug fixes), but it is not a NetApp.

Yeah, it's not WAFL.  But if it works, that's what I want.  I'm only
concerned about data integrity, not performance, since it is my backup
NFS server.

> IIRC, XFS does not do data journaling.  So while it may be much
> faster than Ext3, you need to consider data integrity.

I use Ext3 in meta-data journaling mode (ordered writes), so I don't
see much difference.  I was just mentioning XFS in case it's considered
the better option, especially if SGI has a GPL'd solution for it
with LVM2 on Linux.  But I assume not.

> I haven't been following EVMS development, but you might want
> to look into the current state of affairs to find out if there
> is any functionality there that you need (e.g., badblock handling).

I _always_ use hardware RAID, so badblock handling is handled by
the intelligent controller.  In this case, it's going to be a 3Ware
Escalade 9000 series.

> LVM2 installs work fine.

Good.  That's my #1 issue.  I can add snapshots later if need be, or limit
their usage to select filesystems.

> Some things you might want to do:
> 1. Script some infrastructure to monitor snapshot space usage.

I do that anyway for disk usage, so not much there.
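For what it's worth, a minimal sketch of what I'd cron (the threshold
and the lvs invocation in the comment are my guesses; LVM2's lvs reports
snapshot allocation in its Snap% / snap_percent field):

```shell
#!/bin/sh
# Warn when a snapshot's copy-on-write area passes a threshold.
# check_snap_usage PERCENT THRESHOLD -> prints ALERT when over.
check_snap_usage() {
    # awk handles the floating-point comparison portably
    awk -v pct="$1" -v max="$2" 'BEGIN { if (pct + 0 > max + 0) print "ALERT" }'
}

# In the cron job, feed it the Snap% column from lvs, e.g.:
#   lvs --noheadings -o lv_name,snap_percent vg00 | while read lv pct; do
#       [ "$(check_snap_usage "$pct" 80)" = "ALERT" ] && \
#           echo "snapshot $lv at ${pct}%" | mail -s "snap warning" root
#   done
```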

> 2. Cron a job to snapshot and fsck the filesystem, so any
> filesystem problems are revealed early.

Why do I need to fsck the filesystem?
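I can at least see how it would work mechanically: check a consistent
point-in-time image without unmounting the live filesystem.  Something
like this, with hypothetical VG/LV names:

```shell
# fsck a snapshot instead of the live /dev/vg00/home.
lvcreate --snapshot --size 1G --name home-fsck /dev/vg00/home
# -f forces the check, -n answers "no" to every repair prompt,
# so the snapshot is examined but never modified.
e2fsck -f -n /dev/vg00/home-fsck; status=$?
lvremove -f /dev/vg00/home-fsck
exit $status
```

But is the point just to catch latent corruption early?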

> 3. If using Ext3 with data journaling, specify a large journal when
> creating the filesystem (e.g., mke2fs -j -J size=400 ...).

So you recommend Ext3 with full data journaling?

I used to do that back in the 2.2 days with the VA Linux kernel, and I
might again if I use a PCI NVRAM board.

But I've found Ext3 with ordered writes in 2.4 to be 100% reliable.
Is it not for LVM2/snapshots?

I would _not_ use Ext3 with writeback though, not worth the potential
data loss for small performance gain.

> 4. Tune the filesystem and VM variables: flush time, readahead, etc.

Is there a good reference for tuning those based on CPU, I/O, memory, etc.?
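These are the 2.6 knobs I'd expect to start with; the values below are
purely illustrative guesses for my hardware, not recommendations:

```shell
# Readahead on the block device (in 512-byte sectors); hardware RAID
# stripes often benefit from more than the default.
blockdev --setra 1024 /dev/sda

# 2.6 VM writeback knobs (values illustrative only):
sysctl -w vm.dirty_background_ratio=5    # start background flush earlier
sysctl -w vm.dirty_ratio=30              # cap dirty pages before writers block
sysctl -w vm.dirty_expire_centisecs=1000 # flush dirty data after ~10s
```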

> 5. Test whether an external journal in the form of an NVRAM card
> or additional disks would improve performance.  (You can try with
> a ramdisk for test purposes).

I'd love to throw such a board in the system, but that only adds cost.
I'm hoping Ext3 with ordered writes (meta-data journaling) and NFS v3
async operation will work fine.
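Testing the external-journal idea with a ramdisk, as you suggest, does
look cheap enough.  Roughly (device names hypothetical, and a ramdisk
journal is for benchmarking only, since it vanishes on power loss):

```shell
# Build an external journal device on a ramdisk (TEST ONLY: a
# volatile journal defeats the point of journaling on real data).
mke2fs -O journal_dev /dev/ram0

# Create the data filesystem pointing at the external journal.
mke2fs -j -J device=/dev/ram0 /dev/vg00/test
mount /dev/vg00/test /mnt/test   # then benchmark NFS writes here
```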

Do you see any issues?


-- 
     Linux Enthusiasts call me anti-Linux.
   Windows Enthusiasts call me anti-Microsoft.
 They both must be correct because I have over a
decade of experience with both in mission critical
environments, resulting in a bigotry dedicated to
 mitigating risk and focusing on technologies ...
           not products or vendors
--------------------------------------------------
Bryan J. Smith, E.I.            b.j.smith at ieee.org





