[dm-devel] [PATCH RFC] dm snapshot: shared exception store

FUJITA Tomonori fujita.tomonori at lab.ntt.co.jp
Mon Aug 11 23:34:17 UTC 2008


On Mon, 11 Aug 2008 18:12:08 -0400 (EDT)
Mikulas Patocka <mpatocka at redhat.com> wrote:

> > > - drop that limitation on maximum 64 snapshots. If we are going to 
> > > redesign it, we should design it without such a limit, so that we wouldn't 
> > > have to redesign it again (why we need more than 64 --- for example to 
> > > take periodic snapshots every few minutes to record system activity). The 
> > > limit on number of snapshots can be dropped if we index b-tree nodes by a 
> > > key that contains chunk number and range of snapshot numbers where this 
> > > applies.
> > 
> > Unfortunately it's a limitation of the current b-tree format. As far
> > as I know, there is no existing code we can use that supports
> > unlimited, writable snapshots.
> 
> So use a different format --- we at Red Hat plan to redesign it too. One
> of the needed features is "rolling snapshots" --- i.e. you take a
> snapshot every 5 minutes or so and keep them around. The result is that
> you have a complete history of the system's activity.

I think that implementing a better format is far more difficult than
you think. For example, see the tux3 vs. HAMMER discussion between
Daniel Phillips and Matthew Dillon.

Unless Alasdair tells me that unlimited snapshots are a must, I
probably will not work on it. I'm focusing on integrating the snapshot
feature into dm cleanly.

Of course, I'm happy to use better snapshot code if it becomes
available.
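
For what it's worth, the kind of key Mikulas describes (a chunk number
plus the range of snapshot IDs it covers) might look roughly like the
following. This is purely an illustration, not an existing on-disk
format:

#include <stdint.h>

/* Illustrative only: one b-tree key covers an origin chunk for a whole
 * range of snapshot IDs, so the number of snapshots is unbounded. */
struct exception_key {
	uint64_t chunk;       /* origin chunk number */
	uint32_t snap_from;   /* first snapshot ID this exception covers */
	uint32_t snap_to;     /* last snapshot ID this exception covers */
} __attribute__((packed));

/* Keys sort by chunk first, then by the start of the snapshot range. */
static int exception_key_cmp(const struct exception_key *a,
			     const struct exception_key *b)
{
	if (a->chunk != b->chunk)
		return a->chunk < b->chunk ? -1 : 1;
	if (a->snap_from != b->snap_from)
		return a->snap_from < b->snap_from ? -1 : 1;
	return 0;
}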


> And this 64-snapshot limitation would not allow this. The problem if we 
> use this format is that we will spend a lot of time developing and 
> finalizing it --- and then a requirement for rolling snapshots comes --- 
> and we'll have to throw it away and start from scratch. So I'd rather do 
> b-tree without limitation on number of snapshots from the beginning.

The advantage of taking the snapshot code from Zumastor is that it has
been working for a while, so I don't expect that stabilizing it will
take much effort. The main issue here is how to integrate it into dm
nicely.

I think the version number in the superblock lets us handle better
snapshot formats later on.
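
To illustrate, something along these lines (the field names here are
made up for the example, not the actual Zumastor on-disk layout):

#include <stdint.h>

/* Illustrative sketch only, not the real superblock layout. */
struct shared_store_superblock {
	uint32_t magic;       /* identifies the shared exception store */
	uint32_t version;     /* bumped whenever the b-tree format changes */
	uint64_t btree_root;  /* chunk number of the b-tree root node */
	uint32_t chunk_size;  /* chunk size in 512-byte sectors */
} __attribute__((packed));

/* On load, a newer on-disk version than the driver understands can be
 * rejected cleanly, and an older one can be upgraded in place. */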


> Another good thing would be the ability to compress several consecutive 
> chunks into one b-tree entry. But I think with multiple snapshots, there 
> is no clean way to do it. Maybe design it without this possibility, and 
> then use some dirty hack to compress consecutive chunks in the most 
> common cases (such as when no one writes to the snapshots).
> 
> > > - do some cache for metadata, don't read the b-tree from the root node 
> > > from disk all the time.
> > 
> > The current code already does.
> 
> I see. That GFP_NOFS allocation shouldn't be there, because
> - it is not reliable
> - it can recurse back into block writing via swapper (use GFP_NOIO to 
> avoid that)
> 
> The correct solution would be to preallocate one or more buffers in the 
> target constructor. When running, get additional buffers with GFP_NOIO, 
> but if that fails, use the preallocated buffer. --- this way it can handle 
> temporary memory shortage without data corruption.
> 
> I'll write some generic code for that caching, I think it could be useful 
> even for other targets, so it'd be best to write it into main dm module.

I'm not sure that other dm targets need such a feature, but I'm happy
to use it if it is provided. Next time, I'll submit this caching
feature as a separate patch.
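
For reference, my understanding of the pattern Mikulas describes is
roughly the following (illustrative names, not actual dm code):

#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/* Illustrative sketch: one buffer is reserved in the target constructor
 * so that metadata I/O can always make progress even when a GFP_NOIO
 * allocation fails under memory pressure. */
struct chunk_buffer_pool {
	void *reserve;		/* kmalloc'd in the target constructor */
	bool reserve_in_use;
	spinlock_t lock;	/* spin_lock_init'd in the constructor */
	size_t size;
};

static void *cbp_alloc(struct chunk_buffer_pool *p)
{
	void *buf = kmalloc(p->size, GFP_NOIO);	/* not GFP_NOFS */

	if (buf)
		return buf;

	spin_lock(&p->lock);
	if (!p->reserve_in_use) {
		p->reserve_in_use = true;
		buf = p->reserve;
	}
	spin_unlock(&p->lock);
	return buf;	/* NULL only if the reserve is already in use */
}

static void cbp_free(struct chunk_buffer_pool *p, void *buf)
{
	if (buf == p->reserve) {
		spin_lock(&p->lock);
		p->reserve_in_use = false;
		spin_unlock(&p->lock);
	} else {
		kfree(buf);
	}
}

As far as I know, mempool_create()/mempool_alloc() already give this
preallocate-and-fall-back behaviour, so a generic helper could probably
be built on top of them.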


> > > - the b-tree is good structure, I'd create log-structured filesystem to 
> > > hold the b-tree. The advantage is that it will require less 
> > > synchronization overhead in clustering. Also, log-structured filesystem 
> > > will bring you crash recovery (with minimum coding overhead) and it has 
> > > very good write performance.
> > 
> > A log-structured filesystem is pretty complex. Even though we don't
> > need a complete log-structured filesystem, it's still too complex,
> > IMO.
> 
> I think it's not really harder than journaling. Maybe it's even easier, 
> because in journaling you have replay code that is very hard to test and 
> debug (ext3 had some replay bug even recently). In log-structured 
> filesystem there is no replay code, it is always consistent.
>
> (I obviously don't mean to develop the whole filesystem for that --- just 
> use the main idea that you write always forward into unallocated space)
> 
> + good for performance, majority of operations are writes
> + doesn't need cache-synchronization for cluster
> + can be simultaneously read by more cluster nodes and written by one 
> cluster node (all other formats require read:write exclusion)

A log-structured file system is much more difficult than journaling,
and it's not as good as it sounds.

If log-structured file systems were really that nice, we would have
tons of them by now. In reality, we don't. AFAIK, none of the
widely-used operating systems (Linux, *BSD, Solaris, Windows, etc.)
use a log-structured file system as their default file system.

 
> > Updating the b-tree on disk in a copy-on-write manner (as some of the
> > latest file systems do) is a possible option.
> 
> That is what I mean.

Then I don't think you are talking about a log-structured file
system. In general, a copy-on-write file system like ZFS is not
classified as a log-structured file system.


> When we modify a node, one possibility is to write 
> b-tree blocks back to the root to unallocated space. The other possibility 
> is to write just one block to new space and mark it in superblock as 
> "redirected" from the old location. When the array of redirected blocks 
> fills up, write all b-tree blocks up to the root and erase the array of 
> redirected blocks (this will improve performance because you don't have to 
> write the full path up to root on every block update).
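
(To make that second possibility concrete, the redirect array might look
something like this; purely an illustration, not a proposed on-disk
format:)

#include <stdint.h>

/* Illustrative only.  A small remap table kept in the superblock; when
 * it fills up, the whole path to the root is rewritten and the table is
 * cleared, amortising the cost of the copy-on-write updates. */
#define MAX_REDIRECTS 128

struct block_redirect {
	uint64_t old_block;
	uint64_t new_block;
};

/* Look a b-tree block up through the redirect table before reading it. */
static uint64_t resolve_block(const struct block_redirect *table,
			      uint32_t nr, uint64_t block)
{
	uint32_t i;

	for (i = 0; i < nr; i++)
		if (table[i].old_block == block)
			return table[i].new_block;
	return block;
}
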
> 
> Another question is where the superblock should be located. Just one 
> superblock at the beginning would be bad for disk seeks; maybe have a 
> superblock at each disk track (approximately --- we don't know where the 
> tracks are), use some sequence counter to tell which one is the newest, 
> and write to the one that is nearest to the data.
> 
> > Another option is using journaling as I wrote.
> > 
> > 
> > > - deleting the snapshot --- this needs to walk the whole b-tree --- it is 
> > > slow. Keeping another b-tree of chunks belonging to the given snapshot 
> > > would be overkill. I think the best solution would be to split the device 
> > > into large areas and use per-snapshot bitmap that says if the snapshot has 
> > > some exceptions allocated in the pertaining area (similar to the 
> > > dirty-bitmap of raid1). For short lived snapshots this will save walking 
> > > the b-tree. For long-lived snapshots there is no help to speed it up... 
> > > But delete performance is not that critical anyway because deleting can be 
> > > done asynchronously without user waiting for it.
> > 
> > Yeah, it would be nice to delete a snapshot really quickly but it's
> > not a must.
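
(For illustration, the per-area bitmap Mikulas describes could be as
simple as this; the names and the area size are made up:)

#include <stdint.h>

/* Illustrative sketch of the per-snapshot area bitmap: the device is
 * split into large areas, and a bit records whether this snapshot has
 * any exceptions in that area.  Not actual dm code. */
#define AREA_CHUNKS	(1 << 16)	/* chunks per area (example value) */

struct snapshot_area_map {
	uint64_t nr_areas;
	uint8_t *bits;			/* one bit per area */
};

static void mark_area(struct snapshot_area_map *m, uint64_t chunk)
{
	uint64_t area = chunk / AREA_CHUNKS;

	m->bits[area / 8] |= (uint8_t)(1 << (area % 8));
}

/* On delete, only areas with a set bit need their b-tree ranges walked;
 * a short-lived snapshot usually touches only a few areas. */
static int area_has_exceptions(const struct snapshot_area_map *m,
			       uint64_t area)
{
	return (m->bits[area / 8] >> (area % 8)) & 1;
}
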
> 
> Mikulas
> 
> --
> dm-devel mailing list
> dm-devel at redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel



