Opinions on new Fedora Core 2 install with LVM 2 and snapshots?

Bryan J. Smith b.j.smith at ieee.org
Mon Jul 26 22:41:52 UTC 2004


[ Thanx again for your help! ]

Bill Rugolsky Jr. wrote:  
> Well, that is also what we are doing.  We need on-site and off-site
> backup of our NetApp filer, and can do it with a Linux system for $2K
> apiece.

Exactly.  It's my old employer, whom I now consult for, but who doesn't
have the money for more than 10 engineers these days.  The NetApp is where
95% of the Linux and Solaris clients mount.  We actually didn't shell
out the extra $8K for the SMB service on the NetApp, because we want it
to be largely NFS.

The Linux server is where the other 5% of the UNIX client traffic goes,
so NFS access is minimal.  It largely handles small Windows client SMB
access.  In fact, we NFS mount the NetApp and serve SMB on its behalf
(because SMB usage is so limited -- again, it saved $8K since the
engineers use it so little).

> P4 2.8GHz, 1GB RAM, dual SATA 250GB in MD RAID1.

This will be a dual-P3, ServerWorks chipset, 1GB RAM, 1000Base-SX
Gigabit NIC (storage and NIC on different PCI channels), quad-SATA in
hardware RAID-5 (possibly RAID-0+1 instead, not sure yet).  The new
3Ware Escalade 9000 series products have up to 1GB of SDRAM (128MB
standard) in addition to the on-board SRAM, so I'm not too worried
about the RAID-5 performance.

> My personal system is a dual Opteron 246, 4GB RAM

Sounds like what I want to build.  I'm waiting for the IWill board that
attaches the nVidia "CK8-04 Pro" (nForce4?) 24-channel PCI-Express +
Legacy PC (LPC) HyperTransport tunnel to one CPU, and then an AMD8131
PCI-X 1.0 HyperTransport tunnel to the other.

Talk about I/O options and no bottlenecks!
AnandTech had a picture of it here:  
http://images.anandtech.com/reviews/motherboards/fall2004preview/iwill2a64.jpg  
It should be sub-$500, probably sub-$400.  Half of that is really
because of the AMD8131 plus PCI-X traces.  The "CK8-04 Pro" (nForce4?)
will be nVidia's next "commodity" chipset for Socket-939/940 mainboards.
So there's really no added cost there (which should be sub-$200 on its
own in a mainboard, without the AMD8131 plus PCI-X traces).

> 4x200GB SATA

3Ware controller or software RAID?

At $325 for a 4-channel 3Ware Escalade 9500S-4LP, I really see no reason
not to put one in when you have 4 drives.  You get a powerful 64-bit
ASIC, plus SRAM for zero-wait-state transfers/queuing, plus 128MB
(expandable up to 1GB) of NVRAM-backed SDRAM for buffered I/O (RAID-5
writes as well as general read buffering).

Heck, from a performance standpoint alone, it's probably worth upgrading
to 512MB or 1GB on the 3Ware Escalade 9000 _instead_ of buying a separate
PCI NVRAM board.  You get the NVRAM-backed buffer right there on the
controller.

> with each drive split into 3 equal partitions, for playing around with
> various MD configs.  I'm looking at tuning the whole NFS I/O path on the
> latter,

Let me know what you find.  I'm _avoiding_ MD.  I'm _only_ interested in
using LVM for snapshots.  I'd rather let the 3Ware 64-bit ASIC do all the
queuing, sector remapping and RAID -- including SRAM for the 0 wait state
operations and SDRAM for the buffered I/O (especially RAID-5 writes).

People buy layer-3 network switches for performance instead of using a
Linux PC as a router.  In my opinion, a 3Ware card versus software RAID
is the same trade-off.  Only with RAID-0 does the difference matter
little (see my April 2004 article in Sys Admin magazine for more on RAID
efficiency by level).

And 3Ware does a great job of releasing GPL drivers.  Since all the
"brains" of the RAID are on the card, the drivers are simple block-device
code.  And unlike the traditional intelligent ATA/SCSI RAID products that
use microcontrollers and multi-wait-state SDRAM, the 3Ware 9000 series
has both SRAM and SDRAM on-board for the best of both worlds.  That
especially suits ATA, which is non-blocking I/O for just about everything
except RAID-5 writes (which the 9000 series now buffers in SDRAM, unlike
the 5000-8000 before it).

> I want to experiment with various configs first, e.g., filesystem LV on
> one RAID1 PV, journal and/or snapshot LV on the other PV.

I just want snapshots, that's it.  I try to sell companies on a 1GB PCI
NVRAM board for a full-data Ext3 journal, but most don't go for it.
The 3Ware Escalade probably removes that issue now.
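For the record, here's roughly what that external-journal setup looks
like; a minimal sketch, assuming hypothetical device names (check
mke2fs(8) and mount(8) before trusting any of it):

    # Format the NVRAM card as an external journal device
    mke2fs -O journal_dev /dev/umem0

    # Create the data filesystem pointing at that journal
    mkfs.ext3 -J device=/dev/umem0 /dev/sda1

    # Mount with full data journaling, so NFS commits only wait on
    # the sequential journal write
    mount -t ext3 -o data=journal /dev/sda1 /export/data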

> RAID6 also.

What is RAID-6?  Is that 2 parity stripes so you can lose 2 disks?

>  Once GFS clustering stabilizes on 2.6, I suppose I'll start over with
> a cluster config ...

Oh, definitely with you there!  I'll start using iSCSI liberally once
GFS is in stock Fedora.

> That should be fine.
> I've been working from Arjan's Fedora test kernels, dropping the 4G/4G
> and turning off highmem completely.  I've also added kexec and a few
> other goodies.
> Arjan has been tracking the BitKeeper snapshots pretty closely.

I looked inside the Red Hat SRPM for the 2.6.7-1.494 kernel and it's
2.6.8-RC1 based.  I think I'll just stick with that.

One thing that I always fear is having the kernel and user-space tools
"out of sync."  So sticking with built RPMs (even if I have to rpmbuild
them from SRPMs myself) typically tells me what user-space dependencies
there are.
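For example, something along these lines (the exact version and arch are
just for illustration):

    # Rebuild the kernel from the Red Hat source RPM
    rpmbuild --rebuild kernel-2.6.7-1.494.src.rpm

    # Then ask the resulting binary RPM what user space it expects
    rpm -qp --requires /usr/src/redhat/RPMS/i686/kernel-2.6.7-1.494.i686.rpm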

I've seen people who choose Mandrake and ReiserFS run into that
repeatedly.  ReiserFS works fine until an off-line fsck is required --
then bam!  There goes their data, not so much because of ReiserFS itself
but because the off-line tools were out of sync with the kernel.  I know
Ext3 and XFS don't change internal structures the way ReiserFS does, but
I'm still always shy with kernels.

> Well, of course, we want to get them fixed, and bug reports are useful. :-)

I have a very similar setup here at my house, dual-CPU, older 3Ware card,
etc...  I'm going to use that as a "test server" for this setup first.

But for my company, I need production quality.  So if it's got bugs, I
ain't going to enable snapshots until they're ready.  I'm glad to hear
LVM2 and device mapper on their own are fine.

> FWIW, several commercial appliances apparently use XFS.

Its large-file performance is why it always found a home in video servers.

I also adopted XFS early on (February 2001), because it handled all sorts
of things on 2.4 that Ext3 struggled with for a while -- quotas, ACLs, etc.

> I feel no compelling need to abandon Ext3; in my experience, the
> filesystem and tools are extraordinarily robust,

Of course.  I trust Ext3 as well, because it hasn't changed much since
Ext2 of the mid-'90s.

But that's also true of XFS: it has remained largely unchanged since the
mid-'90s as well, having been ported over directly from Irix.  I trust it
too.  Its NFS support has also been excellent.

I'm shy to even try IBM's JFS, because it comes from OS/2 and not AIX.
JFS really lacked a _lot_ of traditional UNIX capabilities in its first
releases on Linux, unlike XFS.

I wondered why until the whole "Project Monterey" falling-out happened
between SCO and IBM, eventually resulting in the lawsuit that has gone
well beyond that.  It made sense once I looked at it from that angle.

> and performance has always been adequate for my purposes.

I didn't use XFS for performance.  I used it for features and maturity.
I'd rather stick with Ext3 because that's what Red Hat supports.

But if people feel XFS is better for LVM2, then I'd use it.  I didn't
see SGI releasing XFS for Fedora releases (unlike RHL before it), so I
figured there was little reason to go that direction.  But I had to
ask.

> If you want to do hardcore testing, you need to choose one of the
> several methods to switch off writes to the device at the block
> layer, and then loop randomly wrecking and recovering the filesystem
> and looking for corruption.  (See Andrew Morton's test tools in Jeff
> Garzik's gkernel.sourceforge.net repository.)

I might do that on my personal workstation, but not on my home server,
let alone a client's server.

I just want a reliable volume manager.  I'll enable snapshots when
everyone feels they are production-quality.  I was hoping LVM2 w/device
mapper was there now -- as long as that's _all_ I'm using it for,
snapshots and _nothing_ else.

I'll leave the redundancy features to an underlying, intelligent
storage controller.  It makes life simpler for a sysadmin.

> I like the 3ware controllers, but until their meta-data is supported
> by dmraid or the like, I'll pass.

Why?  Is there some bonus for dmraid?  I rather like the fact that the
OS has no idea what is underneath.

> Because every kernel has bugs, and hardware can be flakey.
> Corruption can occur irrespective of journaling.

Really?  I haven't run into this with Ext3 or XFS (other than the one
XFS 1.0 bug that took out my /var on one system).  Is LVM2 flaky?

I'd rather not run it if so.

> Well, here's the theory: when doing synchronous NFS commits, full
> data journaling only requires a sequential write to the journal;

Correct.  The logic is rather simple.

> the data gets written back to the filesystem asynchronously.  If it
> is on a separate spindle or in NVRAM, it is decoupled from both the
> read traffic and the asynchronous writeback.  With NFS, the latency
> of write acknowledgements typically affects throughput, so improving
> one improves the other.

Correct.  And I totally agree with you on preferring an NVRAM board for
Linux NFS servers.

But I'll probably just do NFS async with Ext3 ordered writes on systems
where I don't have an NVRAM board.  Most clients have balked at the idea
of adding another $1K to the system cost.
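In other words, something like this on the boxes without NVRAM; a sketch
only, with hypothetical paths and network range:

    # /etc/exports -- async acknowledges writes before they hit disk
    /export/data   192.168.1.0/24(rw,async)

    # /etc/fstab -- Ext3's default ordered-writes mode, spelled out
    /dev/sda1   /export/data   ext3   defaults,data=ordered   1 2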

> I haven't done much experimenting, but over the years folks have
> posted mixed results on ext3-users and nfs mail lists with various
> combinations of data journal mode and internal, external, or
> NVRAM journals.

I've never seen improved performance.  But I _do_ like the peace of
mind that I'm doing NFS v3 sync instead of async.

But if there aren't any issues with LVM2+DM+snapshots, then I'll just
use NFS async with Ext3 ordered writes.

> None that I'm aware of, but I know that you've been lurking on the
> nfs and ext3-users list for years -- search the archives. ;-p

Yeah, I need to do that.  Maybe I can help "beef up" the Linux NFS
HOWTO with some info.

So if you have _any_ info, I'd be willing to document it into one
guide.

> Seriously, there are quite a few performance discussions and tuning
> suggestions over the years involving Neil Brown, Tom McNeal, Chuck
> Lever and others mostly on the NFS side of things, Andrew Morton,
> Stephen Tweedie, and Andreas Dilger mostly on the Ext3/VM side.

I really want to write an expanded HOWTO on how to build a production
Linux NFS server with LVM2, Snapshots and Ext3 (possibly XFS as well).

I started one back in the late 2.2 days with the Brown+Trond NFS
patches, but then Seth updated the HOWTO and I just forgot about it.

> You should measure the difference between NFS async and sync operation.
> If things are working correctly, 2.6 sync should not be too shabby.

With an NVRAM board, I don't doubt it.  But without one, I think I'll
stick with NFS async.
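If I do get around to measuring it, it probably won't be anything fancier
than timing a big sequential write from a client against the export in
each mode (mount point and sizes here are just placeholders):

    # Time a large write against the sync export ...
    time dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024 && sync

    # ... then flip the export to async in /etc/exports on the server,
    # re-export with "exportfs -ra", and repeat the same dd to compare.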

Although the 128MB (up to 1GB) of NVRAM-backed SDRAM buffer on the 3Ware
Escalade 9000 series is sure to help (not to mention the 2-4MB of SRAM
for queuing -- damn, I love 3Ware's "storage switch" ASIC approach).

> As for CIFS, I have no clue.

I'm not too worried about SMB.  SMB access is rather limited.

The engineers use NFS, 95% of which goes to the NetApp, plus rsync.

> Now, I need to go take my own advice, when I find a few free hours ...

Hey, if you have any notes, I'm more than willing to put them into
a HOWTO.  Thanx dude!

-- Bryan

P.S.  Do I need to do anything beyond loading a LVM2 kernel with "device
mapper" to use "pvcreate" to do snapshots?


-- 
     Linux Enthusiasts call me anti-Linux.
   Windows Enthusiasts call me anti-Microsoft.
 They both must be correct because I have over a
decade of experience with both in mission critical
environments, resulting in a bigotry dedicated to
 mitigating risk and focusing on technologies ...
           not products or vendors
--------------------------------------------------
Bryan J. Smith, E.I.            b.j.smith at ieee.org





