[Fedora-livecd-list] [PATCH] overlay/persistence second pass - for developer reference only

Tue Aug 21 21:29:29 UTC 2007

Jeremy Katz wrote:
> On Mon, 2007-08-20 at 05:23 -0500, Douglas McClendon wrote:
>> Attached is a revision to the persistence implementation that I posted a
>> couple weeks ago.  This is mainly for Jeremy, Tim, and anyone else who
>> is interested in working on this, or something similar.  I.e. at the
>> very least, it is worth a read to look at the issues I've dealt with,
>> and the several that are in comments as TODO.
> 
> Couple of little things just to make reviewing easier, but not huge
> problems
> * Might be good to keep the initscripts changes as patches rather than
> an orig and a modified version.  Will make things work even if other
> things in initscripts change and also makes it easier to know what's
> going on.  

Agreed, this was still just a second pass....

> * It's good to get into the habit of doing git commits for each separate
> change.  Then you can get a patch per change.  And that would avoid
> having the addidir/addsdir stuff being in the same changes

Actually, I was sort of making a combined point, that I was using 
addsdir as my method of including the modified initscripts.

Long term, some sort of elegant, flexible, permanent change to the 
initscripts is needed.  Medium term, I was planning on having just the 
patches, and actually copying the patches into the initrd and having the 
livecd boot sequence patch the init scripts.

Hopefully this illustrates why as a developer, having addsdir (or better 
named, --add-dir-to-system) functionality is very nice.  If you can show 
me a better workflow...

> 
> Now to get to the meat of things
> 
> index 0000000..8962720
> --- /dev/null
> +++ b/creator/etc_rc.d_init.d_functions
> 
> I suspect that Bill might have some reservations about the hard-coded
> overlayfs piece.  At the same time, it's all I can think of and it's not
> that out of line from other things in halt/rc.sysinit.

I agree with the reservations.  I'm open to suggestions.  For the 
moment, there is a lot of much uglier stuff to deal with first.

> 
> For the overlay info bit, we could potentially just stuff it
> in /sbin/halt.local for now I think.

I saw halt.local.  I don't think you noticed how brutally ugly what I 
was doing was.

The goal of that code after halt.local is to get the overlayfs cleanly 
unmounted.

The way I currently accomplish that is to YANK the snapshot overlay 
out of the root device.  The only thing that makes this even remotely 
palatable is the fact that the root device has been remounted read-only, 
which is the one thing that has happened between halt.local and this 
code (thus making halt.local not a workable place for this code).

Thinking about it, the way to make it less horrendously ugly would be 
to copy the binaries used from the rootfs (dmsetup, losetup, rm, mount) 
to a tmpfs first, since after the yanking, there is really no guarantee 
that any data read from the rootfs can be relied on.

Or at least those are my thoughts on the issue right now.
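
A minimal sketch of the staging idea above, not the code from the patch. 
The function name stage_tools and the mount point in the usage note are 
hypothetical.  It only copies the executables themselves; dynamically 
linked binaries would also need their shared libraries staged, which is 
not handled here.

```shell
# Copy each named tool into directory $1, resolving it through PATH.
# Assumption: the tools are real executables, not shell builtins.
# NOTE: shared libraries needed by the tools are NOT copied.
stage_tools() {
    local dest=$1 tool path
    shift
    mkdir -p "$dest" || return 1
    for tool in "$@"; do
        # command -v prints the full path of an external command
        path=$(command -v "$tool") || { echo "missing: $tool" >&2; return 1; }
        cp -p "$path" "$dest/" || return 1
    done
}
```

Before the root device is remounted read-only, this might be called as 
`stage_tools /mnt/halt-tools dmsetup losetup rm mount`, with a tmpfs 
already mounted on the (hypothetical) /mnt/halt-tools.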

> 
> 
> diff --git a/creator/findoverlay b/creator/findoverlay
> new file mode 100755
> index 0000000..e0674cc
> --- /dev/null
> +++ b/creator/findoverlay
> 
> This looks pretty good to me...
> 
> +# load filesystem modules that may be required for overlay
> +# TODO: only load these conditionally if vol_id detects a fs that needs them
> 
> Do they not get loaded automatically on the filesystem mount?  That at
> least used to work.

Probably.  These were liberal notes.  Though maybe the fuse ntfs doesn't 
work as nicely.  Not a big deal.

> 
> +# IMPORTANT TODO: while mount scanning find a way to determine if the 
> +#                 filesystem was not cleanly unmounted.  If so, IGNORE IT,
> +#                 as it may be part of a hibernated OS !!!!!!!
> 
> Maybe instead of using cleanly unmounted vs not as the key, we should
> look at swaps to see if they have the SWSUSP signature?  That's a pretty
> straight-forward thing to check, but I can't quite convince myself if
> it's as safe or not.

My worry about this is things like the *3* current hibernate 
implementations for linux.  That means that you have many possible 
signatures to check, and there is no way to predict signature changes in 
future versions of hibernation.

Another possible clincher is things like suspend2's (sorry, 'tux-on-ice' 
now) support for hibernation to files in the rootfs.  I.e. I used to, 
and intend in the future, set up my system with no swap partition at 
all, doing swapfiles, and suspend2-suspend-to-file.  (though I admit I'm 
currently getting some mileage out of F7's much improved suspend out of 
the box)
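
For reference, a sketch of what the swap-signature check suggested above 
might look like.  It reads the 10-byte signature at the end of the first 
page of a swap area.  The function names are hypothetical, the 4096-byte 
page size is an assumption, and the list of "suspended" signatures is 
illustrative and certainly incomplete, which is exactly the maintenance 
worry raised above.  It also says nothing about suspend-to-file setups.

```shell
# Print the 10-byte signature stored at the end of the first page.
# $1: device or file; $2: page size in bytes (assumed 4096 if omitted)
swap_signature() {
    local dev=$1 pagesize=${2:-4096}
    dd if="$dev" bs=1 skip=$((pagesize - 10)) count=10 2>/dev/null | tr -d '\0'
}

# Succeed if the signature matches one of a few known hibernation
# markers.  ASSUMPTION: this list is illustrative, not exhaustive.
looks_hibernated() {
    case "$(swap_signature "$@")" in
        S1SUSPEND*|S2SUSPEND*|ULSUSPEND*) return 0 ;;
        *) return 1 ;;
    esac
}
```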

> 
> +# CAVEAT: If the overlay file has a kin file with the suffix .inuse, this
> +#         is evidence that the overlay device was not unmounted cleanly.
> +#         In _this_ case, look at the filesystem(???) and determine whether
> +#         or not the most recent mount of the filesystem is more recent than
> +#         the inuse file.  *If and only if* NOT, then it is safe to assume
> +#         that the filesystem is not part of a hibernated OS, and rather was
> +#         most recently used as a persistence device that failed to be
> +#         shutdown cleanly, thus it is safe to fsck the overlayfs, and then 
> +#         fsck the overlay-rootfs
> 
> If we checked for the swsusp case instead, would we be able to skip this?

see above...

Also, I just noticed that dumpe2fs does get me cleanly vs uncleanly 
unmounted detection for ext2/3.  And vfat I almost don't care about.  I 
would like the same for ntfs, but as I'll mention again, I agree, ntfs 
support can be saved for the long term.
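
A rough sketch of how the dumpe2fs check could be wrapped, assuming the 
header format printed by `dumpe2fs -h`.  The function name is 
hypothetical.  It treats the filesystem as cleanly unmounted only if the 
"Filesystem state:" line says exactly "clean" and the feature list does 
not contain needs_recovery (my understanding is that ext3 flags a 
pending journal replay that way, but treat that as an assumption to 
verify).

```shell
# Reads `dumpe2fs -h` output on stdin.
# Succeeds only for state "clean" with no needs_recovery feature flag.
ext_cleanly_unmounted() {
    local header
    header=$(cat)
    # "not clean" and "clean with errors" both fail this match
    printf '%s\n' "$header" \
        | grep -q '^Filesystem state:[[:space:]]*clean$' || return 1
    # ASSUMPTION: a journal awaiting replay shows up as needs_recovery
    if printf '%s\n' "$header" | grep '^Filesystem features:' \
        | grep -qw needs_recovery; then
        return 1
    fi
    return 0
}
```

Usage would be along the lines of 
`dumpe2fs -h /dev/sdb1 2>/dev/null | ext_cleanly_unmounted`, where 
/dev/sdb1 is just a placeholder device.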

> 
> +# IMPORTANT TODO: since ext3 is such a pain (possible?) to mount readonly,
> +#                 and since similar issues may exist in other fs (ntfs???),
> +#                 I think it would be good to have a function called
> +#                 really_mount_readonly() which does a blockdev --setro, then
> +#                 does a devicemapper snapshot to ram, then does a mount of
> +#                 the snapshotted device, then checks for existence of 
> +#                 overlay and .inuse files.
> 
> If the blockdev is read-only, do we really need to snapshot it too? 

The point is that when the blockdev is read-only, you just can't mount 
it (I think).  I'm pretty sure I even tried mounting ro as ext2 and that 
failed.  But that seems so wrong, I wouldn't bet on it without trying 
first.

I'll do more experimentation and things will become clearer.
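
An untested outline of the really_mount_readonly() idea from the TODO, 
to make the sequence concrete.  Assumptions: the COW device is something 
like a loop device backed by a file in tmpfs, the snapshot is 
non-persistent (N) with a chunk size of 8 sectors, and the mapping name 
rmro_snap is made up.  Only the table-building helper can be exercised 
without root and real devices.

```shell
# Build a device-mapper table line for a non-persistent snapshot.
# $1: origin block device, $2: COW device, $3: origin size in
# 512-byte sectors.
snapshot_table() {
    echo "0 $3 snapshot $1 $2 N 8"
}

# Hypothetical outline, UNTESTED; needs root and real block devices.
# $1: device to inspect, $2: COW device in ram, $3: mount point
really_mount_readonly() {
    local dev=$1 cow=$2 mnt=$3 name=rmro_snap
    # forbid writes to the real device
    blockdev --setro "$dev" || return 1
    # writes (e.g. journal replay) land in the COW device instead
    snapshot_table "$dev" "$cow" "$(blockdev --getsz "$dev")" \
        | dmsetup create "$name" || return 1
    mount -n "/dev/mapper/$name" "$mnt"
}
```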

> 
> +# RELATED: Given the above function, if a persistence file is detected,
> +#          but the above inuse/recent-mount-stamp test fails, give
> +#          the user a 30-60 second timeout option to force an fsck and mount
> +#          of the uncleanly mounted overlayfs, defaulting to not using it.
> 
> Probably fair.  fsck in the initramfs might have fun around controlling
> terminals and sometimes wanting to drop to a shell, so needs some
> testing to make sure it's sane
> 
> +# TODO: All this multiple candidate code hasn't been tested recently (can't
> +#       remember if it ever really did work).  Though I have tested the 
> +#       typical auto case where one overlay is found and used.
> 
> Probably the most important one :)

Actually this was a relic.  As I mentioned in the mail, I actually had 
tested this.  And in fact, I learned, or relearned, a bit more about 
bash arrays, and the code doing this will look much cleaner soon.

> 
> +# TODO: verify that filesystems other than ext3 work.  I know this will 
> +#       probably mean some interesting special case code.
> +
> +# TODO: handle nfs/network(fuse-httpfs?) persistence devices.  This will
> +#       require the ability to set up the network here, which is probably
> +#       not trivial.
> 
> This is one of the reasons I want to get rid of mayflower and build up
> mkinitrd; mkinitrd already has all kinds of network setup code for
> nfs/iscsi root and then we could take advantage of that.  And fwiw, I
> spent a little bit of time getting a branch of mkinitrd started being
> able to do so, but then ran into a need for modprobe to do something
> more.  Will get back there eventually.  On the plus side, trying to make
> sure that we can do that switch without it mattering much for things
> like the overlay finding code (just have to do the little plug-in
> similar to the mayflower change)

Yeah, I had noticed the nfs root stuff, which is part of what made me 
think of network sorts of possibilities here.

> 
> +# TODO: handle fsck'ing the rootfs if need be correctly?  Or does the right
> +#       thing just happen.  I know that trying to use a persistence file
> +#       from something that got unmounted uncleanly, seems to cause problems
> +#       VERY quickly.  This may be a fatal flaw...  (or at least require
> +#       some work)
> 
> fsck of the combined fs should happen fine once we get into the normal
> userspace.  And the ro rootfs shouldn't need fsck'ing.  So I *think* we
> should be fine.  Only needing to then worry about the case of a
> persistence file from an uncleanly mounted filesystem.  Which maybe can
> be punted by saying you use ext3 (with journal, therefore no need for
> fsck usually) or vfat (unclean unmount is less disastrous)

As mentioned by the 'fatal flaw', my apprehension is based on seeing how 
_very quickly_ things seem to fall over dead when trying to use a 
persistence file that did not get cleanly shut down.  (while trying to 
access the fsck binary even...?)

More experimentation, again, will flesh this issue out.  Obviously if the 
whole system/mechanism cannot robustly deal with repeated yank-the-plug 
situations, then it isn't going to work for real users.

I think I can put together a much more testable-quality patch fairly soon.

I'm not entirely sure about merge-worthy within a week...  But we'll 
see.  And I guess I can see something safe enough to merge within a 
week, given the safe default code paths (i.e. not default to auto for f8t2)

> +    losetup /dev/loop119 /mnt/overlayfs/overlay
> +    echo "overlayfs_dev=tmpfs" > /mnt/overlayfs/overlay.inuse
> +    echo "overlayfs_fstype=tmpfs" >>  /mnt/overlayfs/overlay.inuse
> +    echo "overlayfs_path=/overlay" >>  /mnt/overlayfs/overlay.inuse
> +    echo "/mnt/overlayfs/overlay.inuse" > /overlay.info
> 
> Am I missing where this is used or is it just informational?

This isn't really necessary in the traditional tmpfs overlay case that 
you referenced here.  I did it mainly for consistency.  Also, as alluded 
to before, a userspace tool that could grow the overlay file online 
would use this, as this /overlay.info file becomes the .inuse file 
which is visible later.  (again, maybe unnecessary.  We'll see if I 
actually find a real use for it)
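
As an illustration of how such a userspace tool might consume these 
files, here is a small sketch based only on the echo lines quoted above: 
/overlay.info holds the path of the .inuse file, and the .inuse file 
holds key=value lines.  The function name overlay_setting is 
hypothetical.

```shell
# Print the value of one key from the .inuse file that the info file
# points to.
# $1: key (e.g. overlayfs_fstype); $2: info file (default /overlay.info)
overlay_setting() {
    local key=$1 info=${2:-/overlay.info} inuse
    inuse=$(cat "$info") || return 1
    # print whatever follows "key=" on the matching line
    sed -n "s/^${key}=//p" "$inuse"
}
```

For example, `overlay_setting overlayfs_fstype` would print tmpfs in the 
ram overlay case shown in the patch.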

> diff --git a/creator/mayflower b/creator/mayflower
> index c1c5258..29cc8ec 100755
> --- a/creator/mayflower
> +++ b/creator/mayflower
> @@ -268,6 +290,21 @@ for o in \`cat /proc/cmdline\` ; do
>      live_locale=*)
>          live_locale=\${o#live_locale=}
>          ;;
> +    #
> +    # dmc overlay: aesthetics, undecided about name persistence vs overlay
> 
> I actually kind of like overlay.  But yeah, aesthetics :)

I agree.  Persistence is perhaps a better description of the feature for 
end users.  But overlay has the dual benefits of being easier to type, 
and exposes a fairly appropriate amount of information about how it is 
implemented.

> 
> +    # dmc overlay: if non-ram overlay searching is desired, do it,
> +    #              otherwise, create overlay in ram as usual
> +    if [ "x\${overlay}" != "x" ]; then
> +        /sbin/findoverlay "\$overlay"
> +    else
> +        mkdir -p /mnt/overlayfs
> +        mount -n -t tmpfs -o mode=0755 none /mnt/overlayfs
> +        dd if=/dev/null of=/mnt/overlayfs/overlay bs=1024 count=1 seek=$((512*1024)) 2> /dev/null
> +        losetup /dev/loop119 /mnt/overlayfs/overlay
> +        echo "overlayfs_dev=tmpfs" > /mnt/overlayfs/overlay.inuse
> +        echo "overlayfs_fstype=tmpfs" >> /mnt/overlayfs/overlay.inuse
> +        echo "/mnt/overlayfs/overlay.inuse" > /overlay.info
> +    fi
> 
> This looks good; though as we had previously discussed, once this is
> working, we probably want auto to be the default and to be able to have
> overlay=off or overlay=ram or something to go back to the current mode.

Agreed.  But this may be a safe avenue if you really want to put code 
this immature in f8t2.
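
A sketch of how the default could be made switchable, using the same 
${o#...} parsing style as the mayflower hunk above.  The function name 
overlay_mode is hypothetical, and the values auto, off and ram are only 
the ones floated in this thread, not settled names.  The fallback is a 
parameter so that the safe choice (ram) can be the default now and auto 
can become the default later without touching the parser.

```shell
# Print the overlay mode requested on a kernel command line.
# $1: command line text; $2: mode to use when overlay= is absent
#     (ASSUMPTION: defaults to "ram", the current behaviour)
overlay_mode() {
    local o mode=${2:-ram}
    # word splitting of $1 is intentional: one word per boot option
    for o in $1; do
        case $o in
            overlay=*) mode=${o#overlay=} ;;
        esac
    done
    echo "$mode"
}
```

In the init script this might be called as 
`overlay_mode "$(cat /proc/cmdline)"`.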

> 
> So yeah, overall, this is looking pretty spiffily good to me and I'm
> leaning towards starting to get it merged in so that we can start
> getting real use of it

We'll see where I'm at in another 24-48 hours, cleaning up the most 
obviously ugly things and perhaps making a more testable patch.

-dmc