The design of LVMetaD
=====================

Invocation and setup
--------------------

The daemon should be started automatically by the first LVM command issued on
the system, when needed. The usage of the daemon should be configurable in
lvm.conf, probably with its own section. Say

    lvmetad {
        enabled = 1                   # default
        autostart = 1                 # default
        socket = "/path/to/socket"    # defaults to /var/run/lvmetad or such
        scan_mode = "udev" or "full"  # defaults to full
        full_scan_expiry = 30         # seconds
    }

The enabled, autostart and socket bits should be self-explanatory. As for
scan_mode and full_scan_expiry, I will elaborate on those in a bit.

Library integration
-------------------

When a command needs to access metadata, it currently has to perform a scan of
the physical devices available in the system. This can be quite an expensive
operation, especially if many devices are attached to the system. In most
cases, LVM needs a complete image of the system's PVs to operate correctly, so
all devices need to be read, to at least determine the presence (and content)
of a PV label. Additional IO is done to obtain or write metadata areas, but
this is only marginally related and is addressed by Dave's metadata-balancing
work.

The existing scanning code has a cache layer, under lib/cache/lvmcache.[hc].
This layer keeps a textual copy of the metadata for a given volume group, in
format_text form, as a character string. We can plug the lvmetad interface in
at this level: lvmcache_get_vg, which is responsible for looking up metadata
in the local cache, can query lvmetad whenever the metadata is not available
locally. Under normal circumstances, when a VG is not cached yet, this
operation fails and prompts the caller to perform a scan. With lvmetad
enabled, this would never happen; the fall-through would only be activated
when lvmetad is disabled, in which case the local cache is populated as usual
through a locally executed scan.
Therefore, the existing stand-alone (i.e. no lvmetad) functionality of the
tools would not be compromised by adding lvmetad.

Scanning
--------

Certainly, the responsibility for scanning now shifts to lvmetad. It can (and
should) leverage the existing scanning infrastructure, by calling into the
existing lvmcache code. Of course, it needs to instruct lvmcache not to go
through lvmetad in this case, as that would be pointless. This should not be
hard to achieve.

In the most pessimistic case, we need to do a full rescan for each new command
that asks for metadata: we have to assume that devices could disappear or
appear out of nowhere. This is not much different from the current behaviour,
where each command does its own scanning.

A conservative improvement would be to add an expiry timeout, so that the
cache is considered valid for a while; this would make command bursts a lot
more efficient, avoiding lots of extra scanning. Something like 30 seconds
would presumably be quite reasonable, and it would of course be made
configurable, through full_scan_expiry. This setting bounds how much time can
pass between a device (dis)appearing in the system and LVM taking notice of
the fact. (Fortunately, with the transient status support in lvconvert
--repair, this won't prevent mirror repairs from happening as soon as a write
error occurs on the device.)

The above constitutes the "full" scanning strategy, as specified by scan_mode
in lvm.conf. It should work about the same as the current per-command scanning
in common situations, and better in some less common ones involving bursts of
commands (I understand this can happen with RHEV, for example).

The other strategy, "udev", would provide smarter and nearly optimal scanning
behaviour. However, it depends on the presence and reliability of udev and of
LVM-specific udev rules. Therefore, this option should only be enabled by
distributions that ship the required support, and disabled by default
upstream.
The strategy would need to make a few assumptions about the environment:

1) only LVM commands ever touch PV labels and VG metadata
2) when a device is added or removed, udev fires a rule to notify lvmetad

As for 1), this is something that needs to be documented and could be a bit
tricky. The catch is that udev cannot notify us about changes to the data on a
device, and we basically have no way to find out. If the admin overwrites the
PV label and VG metadata on a disk with dd (mkfs, mkswap, ...), they are out
of luck. I don't think this is a major roadblock; it ought to be a question of
education. Even in case 1), not all is lost -- a metadata write should fail in
these cases, and the new behaviour of active LVs on top of the vanished PV is
no different from the pre-existing one. No (new) catastrophic failures should
occur; at worst we can get a possibly surprising metadata write error
somewhere.

As for 2), we need a trap into lvmetad from the command line: one option would
be to extend the "pvscan" syntax, to allow something like

    $ pvscan --lvmetad /dev/foo
    $ pvscan --lvmetad --remove /dev/foo
    $ pvscan --lvmetad

The first case would instruct lvmetad to scan and cache labels and metadata
from /dev/foo, the second to forget about the existence of /dev/foo, and the
last to force a full rescan. Alternatively, vgscan (with or without --lvmetad)
could be used for the last one. (I am most inclined towards the plain vgscan
option: it should notify lvmetad if it is enabled and proceed normally if not.
That should mesh in reasonably with existing vgscan behaviour.)

Incremental scan
----------------

Some new issues arise with the "udev" scan mode. Namely, the devices of a
volume group will be appearing one by one. The behaviour in this case will be
very similar to the current behaviour when devices are missing: until *all*
its physical volumes have been discovered and announced by udev, the volume
group will be in a state with some of its devices flagged as MISSING_PV.
This means that the volume group will be, for most purposes, read-only until
it is complete, and LVs residing on yet-unknown PVs won't activate without
--partial. Under usual circumstances this is not a problem, and the current
code for dealing with MISSING_PVs should be adequate.

However, the code for reading volume groups from disks will need to be
adapted, since it currently does not work incrementally. Such support will
need to track metadata-less PVs that have been encountered so far, and to
provide a way to update an existing volume group. When the first PV with
metadata of a given VG is encountered, the VG is created in lvmetad (probably
in the form of a "struct volume_group") and is assigned any previously cached
metadata-less PVs it references. Any PVs that have not yet been encountered
are marked as MISSING_PV in the "struct volume_group". Upon scanning a new PV
that belongs to an already-known volume group, the PV is checked for
consistency with the already cached metadata (in case of a mismatch, the VG
needs to be recovered if possible, probably automatically by lvmetad) and is
subsequently unmarked MISSING_PV.

The most problematic aspect of this whole conception may be orphan PVs. At any
given point, a metadata-less PV may appear orphaned, if a PV of its VG
carrying metadata has not been scanned yet. Eventually, we will have to decide
that such a PV really is an orphan and enable its usage for creating or
extending VGs. In practice, the decision might be governed by a timeout or
assumed immediately -- the former is a little safer, the latter probably more
transparent. I am not very keen on using timeouts, and we can probably assume
that the admin won't blindly try to re-use devices in a way that would trip up
LVM in this respect. I would be in favour of simply assuming that
metadata-less PVs with no known referencing VGs are orphans -- after all, this
is the same situation as we have today.
The metadata balancing support may stress this a bit more than the usual
contemporary setups, though.

It may also be prudent to provide a command that blocks until a volume group
is complete, so that scripts can reliably activate/mount LVs and such. Of
course, some PVs may never appear, so a timeout is necessary. Again, this is
something the current tools do not handle, but it may become more important in
the future. It probably does not need to be implemented right away, though.

Cluster support
---------------

When working in a cluster, clvmd integration will be necessary: clvmd will
need to instruct lvmetad to re-read metadata as appropriate, due to writes on
remote hosts. Overall this is not hard, but the devil is in the details. I
would possibly disable lvmetad for clustered volume groups in the first phase,
and only proceed when the local mode is robust and well tested.

Protocol & co.
--------------

I expect a simple text-based protocol executed on top of a Unix domain socket
to be the communication interface for lvmetad. Ideally, the requests and
replies will be well-formed "config file" style strings, so that we can re-use
the existing parsing infrastructure.

Since we already have two daemons, I would probably look into factoring out
some common code for daemon-y things, like sockets, communication and maybe
logging, and re-using it in all the daemons (clvmd, dmeventd and lvmetad).
Some thread management code may be shareable as well, since we will likely use
threads in lvmetad too, to serve multiple clients at once.

Future extensions
-----------------

The above should basically cover the use of lvmetad as a cache-only daemon.
Writes are still executed locally, and a request to re-read a VG needs to be
issued to lvmetad after a VG write. This is fairly natural and, in my opinion,
reasonable. The lvmetad acts like a cache that holds metadata, no more, no
less.
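For illustration, such a re-read request and its reply might look something
like this in the "config file" style proposed above. Every key name here is a
guess for the sake of the example, not a settled protocol:

```
request {
    method = "vg_update"
    vgname = "vg0"
}

response {
    status = "OK"
}
```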
On top of this, there are a couple of things that could be worked on later,
once the basic design above is finished and implemented.

_Metadata writing_: We may want to support writing new metadata through
lvmetad. It is currently not clear whether there are advantages other than
avoiding a re-read of a VG that was just written, though.

_Locking_: Other than directing metadata writes through lvmetad, one could
conceivably also track VG/LV locking through the same channel. I don't have a
convincing use-case yet, though, so I am leaving this as future work.

_Clustering_: A deeper integration of lvmetad with clvmd might be possible and
maybe desirable. Since clvmd communicates over the network with other clvmd
instances, this could be extended to metadata exchange between lvmetad
instances, further cutting down scanning costs. This would combine well with
the write-through-lvmetad approach.