[Libguestfs] Fwd: Inspection of disk snapshots

Wed Mar 25 18:53:43 UTC 2015


On 24/03/15 13:32, Richard W.M. Jones wrote:
> On Tue, Mar 24, 2015 at 10:54:05AM +0200, NoxDaFox wrote:
>> I was sure I was doing something wrong, as I'm not yet fully familiar with
>> the QCOW2 snapshot feature and how it interacts with libguestfs.
>>
>> I'll try to explain better the scenario:
>>
>> I have several hosts running lots of VMs which are generated from a few base
>> images: say A, B, C are the base images (backing files) and A1, A2, A*, B1,
>> B2, B* are the clones on top of which the newly spawned VMs are running.
>> I need to collect the disk states of the A*, B*, C* machines and see what has
>> been written there. I don't care about the whole content, as the contents of
>> the base images A, B, C are well known to me; the only thing that matters is
>> the deltas of the new clones.
>>
>> One more piece in the puzzle is that the inspection does not happen on the
>> hosts running the VMs but on a dedicated server.
>>
>> My idea was to collect those "snapshots" (generic term, not the QEMU one)
>> from the hosts and send them to my inspection server. As A, B and C are
>> accessible from that server, the only thing I need is to rebase those
>> snapshots to correctly inspect them through libguestfs, and it proved to work
>> (I'm using readonly mode as I only care about reading the disks). I'm not
>> really interested in having a consistent point-in-time state of the disks, as
>> the operation is done several times a day, so I can cope with semi-consistent
>> data as it can be easily reconstructed.
>>
>> My real problem comes when I try to inspect the disk snapshot: libguestfs
>> will, of course, let me see the whole content of the disks, which means A +
>> A*. Apart from the waste of CPU time spent looking at files whose state I
>> already know (the ones contained in A), it generates a lot of noise. A
>> Linux base image with some libraries installed consists of 20K+ files;
>> installing something extra (an Apache server, for example) brings just a few
>> hundred new files, and those are the only ones I'm interested in.
>>
>> So my real question is: is there a way to distinguish the files contained
>> in the two different disk images (A and A1) or shall I think about a
>> totally different approach?
> Well we have a tool called virt-diff
> (http://libguestfs.org/virt-diff.1.html) which prints the differences
> between two disks.  It's quite commonly used to show the differences
> between an original base image and a snapshot taken some time later,
> so you can tell which files have been modified by the guest.
>
> Now virt-diff works by opening both disks, reading all of the metadata
> (or even the file content if you use the --checksum option), and then
> internally diffing it and presenting the result.
>
> Of course this means it's not especially fast, but it's the way that
> it has to work: The snapshot doesn't contain "files which changed", it
> contains underlying device blocks which changed.  It operates a whole
> layer or two below the filesystem.
If I guess right, then, libguestfs' visibility is limited to the FS
level. In that case my question makes very little sense.
>
> To do this from Python is not particularly hard, but you'll have to
> read the C and translate it.  The guts of the algorithm are in the
> recursive "visitor" mini-library:
>
> https://github.com/libguestfs/libguestfs/blob/master/diff/diff.c
> https://github.com/libguestfs/libguestfs/blob/master/cat/visit.h
> https://github.com/libguestfs/libguestfs/blob/master/cat/visit.c
These are exactly the code snippets I looked at to implement my
solution. My needs are slightly different: I just want the list of files
which differ, and I can trust only the checksum, not the metadata. I
also need to be able to download such files to the host. It was pretty
easy thanks to the documentation. It's surprisingly fast and, in the case of
huge deltas, most of the time is spent in the list comparison in
Python. My idea was just an optimisation, as it seems a bit silly to go
through the whole set of files when the differences are very small.
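A minimal sketch of the checksum-based comparison described above, not the actual implementation. It assumes the libguestfs Python bindings are installed and that inspection finds a single operating system in each image. The function names `collect_checksums` and `changed_paths` are hypothetical, and `sha256` is an arbitrary choice of checksum type.

```python
def collect_checksums(disk_path, csum_type="sha256"):
    """Return {path: checksum} for every regular file in a disk image.

    Assumption: the libguestfs Python bindings are available. The import
    is local so that changed_paths() below can be used without them.
    """
    import guestfs

    g = guestfs.GuestFS(python_return_dict=True)
    g.add_drive_opts(disk_path, readonly=1)  # never write to the image
    g.launch()
    try:
        roots = g.inspect_os()
        if not roots:
            raise RuntimeError("no operating system found in %s" % disk_path)
        # Mount shorter mountpoints first so "/" is mounted before "/boot".
        mountpoints = g.inspect_get_mountpoints(roots[0])
        for mountpoint, device in sorted(mountpoints.items(),
                                         key=lambda kv: len(kv[0])):
            try:
                g.mount_ro(device, mountpoint)
            except RuntimeError:
                pass  # skip filesystems that cannot be mounted
        checksums = {}
        for name in g.find("/"):  # names are relative to "/"
            path = "/" + name
            if g.is_file(path):  # skip directories, symlinks, devices
                checksums[path] = g.checksum(csum_type, path)
        return checksums
    finally:
        g.close()


def changed_paths(base, clone):
    """Paths in `clone` that are new or whose checksum differs from `base`."""
    return sorted(p for p, c in clone.items() if base.get(p) != c)
```

With hypothetical image names, `changed_paths(collect_checksums("A.qcow2"), collect_checksums("A1.qcow2"))` would give the files added or modified in the clone. Keeping the checksums in dictionaries makes the comparison a single pass over the clone's files, which may help when the deltas are large. Files deleted in the clone are not reported by this sketch.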
>
> There are alternatives -- perhaps parsing the qcow2 snapshot, and
> mapping disk blocks back to files -- but they won't be very easy to
> implement.  I wrote a highly experimental* tool called 'virt-bmap' that
> may be of interest:
>
> https://rwmj.wordpress.com/2014/11/23/mapping-files-to-disk/
> https://rwmj.wordpress.com/2014/11/24/mapping-files-to-disk-part-2/
I'll take a look at them, thanks!
>
> Rich.
>
> * = if it breaks, you get to keep all the pieces
>