[Libosinfo] RFC: Splitting off database into separate package

Daniel P. Berrange berrange at redhat.com
Fri Jul 24 15:54:57 UTC 2015


On Fri, Jul 24, 2015 at 04:50:34PM +0200, Christophe Fergeau wrote:
> Hi,
> 
> On Wed, Jul 22, 2015 at 11:46:23AM +0100, Daniel P. Berrange wrote:
> >  - Is XML the format we want to use long term ?
> > 
> >    We already ditched XML for the PCI & USB ID databases, in favour of
> >    directly loading the native data sources because XML was just too
> >    damn slow to parse. I'm concerned that as we increase the size of
> >    the database we might find this becoming a more general problem.
> > 
> >    So should we do experiments to see if something like JSON or YAML
> >    is faster to load data from ?
> > 
> >    If we want to use a different format, should we do it exclusively
> >    or in parallel
> > 
> >    eg should we drop XML support if we switch to JSON, or should
> >    we keep XML support and automatically generate a JSON version
> >    of the database.
> 
> Currently we rely on intltool to handle translation of the database XML files,
> gettext seems to be able to handle javascript, I don't know if
> this can be used for json files as well. So maybe we'll have to keep the
> xml files as a way to manage translations.

Hmm, yes, that's a good point. Another option would be to write our
own tool to turn the .json files into .pot files, and to merge the
.po files back into the .json. Definitely something we need to keep
in mind though, including how we would even represent translations
in the .json files if we choose that format.
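
As a rough sketch of what such a tool might look like, assuming a
hypothetical JSON layout in which translatable fields are plain strings
under keys such as "name" and "vendor" (the key names and the function
names below are illustrative only, not an agreed format, and a real
.pot file would also need the usual header entry):

```python
import json

# Hypothetical set of keys whose string values need translating.
TRANSLATABLE_KEYS = {"name", "vendor"}


def collect_strings(node, found=None):
    """Walk a parsed JSON document and collect unique translatable strings."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key in TRANSLATABLE_KEYS and isinstance(value, str):
                if value not in found:
                    found.append(value)
            else:
                collect_strings(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_strings(item, found)
    return found


def escape(text):
    """Escape a string for use inside a double-quoted PO string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_pot(strings):
    """Render msgid/msgstr pairs using gettext template syntax (no header)."""
    entries = ['msgid "%s"\nmsgstr ""\n' % escape(s) for s in strings]
    return "\n".join(entries)


def merge_translations(node, catalog, key=None):
    """Return a copy of the document with translatable strings replaced.

    catalog maps msgid to msgstr; untranslated strings are left unchanged.
    """
    if isinstance(node, dict):
        return {k: merge_translations(v, catalog, k) for k, v in node.items()}
    if isinstance(node, list):
        return [merge_translations(item, catalog) for item in node]
    if key in TRANSLATABLE_KEYS and isinstance(node, str):
        return catalog.get(node) or node
    return node
```

This only covers one translated copy per language; how multiple
languages would live side by side in a single .json file is still the
open question.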

> >  - Should we restructure the database ?
> > 
> >    eg, we have a single data/oses/fedora.xml file that contains
> >    the data for every Fedora release. This is already 200kb in
> >    size and will grow forever. If we split up all the files
> >    so there is only ever one entity (os, hypervisor, device, etc)
> >    in each XML file, each file will be smaller in size. This would
> >    also let us potentially do database minimization. eg we could
> >    provide a download that contains /all/ OS, and another download
> >    that contains only non-end-of-life OS.
> 
> I was about to make the same comment as Zeeshan, GNOME has had issues in
> the past with data scattered among too many small files, in general this
> is solved by adding a cache file containing a concatenated version of
> all the files (possibly pre-parsed to some domain-specific format).

If we can avoid loading the entire database, and only load the subset
of files we want info on, we'd hopefully not have such problems. I
could see benefit in having some "index" file perhaps which says
which entity is defined in which file, as a way to avoid dictating
a filename/dirname convention.
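
To illustrate the "index" idea, here is a minimal sketch. It assumes,
purely for the sake of example, one JSON file per entity with a
top-level "id" key; build_index and load_entity are hypothetical names,
not existing libosinfo APIs:

```python
import json
import os


def build_index(db_dir):
    """Scan db_dir and map each entity's "id" to its path relative to db_dir."""
    index = {}
    for root, _dirs, files in os.walk(db_dir):
        for fname in sorted(files):
            # Skip non-JSON files and a previously written index file.
            if not fname.endswith(".json") or fname == "index.json":
                continue
            path = os.path.join(root, fname)
            with open(path, encoding="utf-8") as fh:
                entity = json.load(fh)
            index[entity["id"]] = os.path.relpath(path, db_dir)
    return index


def load_entity(db_dir, index, entity_id):
    """Open only the one file that defines entity_id."""
    with open(os.path.join(db_dir, index[entity_id]), encoding="utf-8") as fh:
        return json.load(fh)
```

The index would be generated once when the database is built, so a
client needs to read just the index plus the files it actually cares
about, whatever those files happen to be called.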

> >  - Should we formalize the specification so that we can officially
> >    support other library implementations
> > 
> >    While libosinfo is accessible from many languages via GObject
> >    introspection, some projects are still loath to consume python
> >    libraries backed by native code. eg openstack would really
> >    prefer to be able to just pip install a pure python impl.
> > 
> >    Currently libosinfo library includes some implicit business
> >    logic about how you load the database, and dealing with overrides
> >    from different files. eg if you have the same OS ID defined in
> >    multiple XML files which one "wins". Also which paths are supposed
> >    to be considered when loading files. In the future also possibly
> >    how to download live updates over the net. It also has logic about
> >    how you detect ISO images & install trees from the media data and
> >    how to generate kickstart files, etc, none of which is formally
> >    specified or documented.
> 
> This could be nice, but I guess this could come later (possibly at the
> same time as the database schema versioning if some database format
> changes are needed in order to accommodate these independent
> implementations)

Yep, this could be where a non-XML format like JSON is appealing. A
python program could just do  json.load(open("os/fedora20.json")) to
get a data structure with all the info (well, and also load the parent
OSes that are referenced via derives-from tags, etc)
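
A slightly fuller sketch of that, under some loudly stated assumptions:
one file per OS at os/<short-id>.json, a "derives-from" key naming the
parent's short ID, and child values simply overriding inherited ones.
None of this is a settled format, and load_os is a hypothetical helper:

```python
import json
import os


def load_os(db_dir, short_id, _seen=None):
    """Load os/<short_id>.json and merge in values inherited from its parent.

    Assumed layout: "derives-from" holds the parent's short ID, and keys
    set on the child take precedence over those of the parent.
    """
    seen = set() if _seen is None else _seen
    if short_id in seen:
        raise ValueError("derives-from cycle at %r" % short_id)
    seen.add(short_id)

    path = os.path.join(db_dir, "os", short_id + ".json")
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)

    parent_id = data.get("derives-from")
    if parent_id is None:
        return data

    # Start from the (recursively resolved) parent, then apply overrides.
    merged = dict(load_os(db_dir, parent_id, seen))
    merged.update(data)
    return merged
```

With that, load_os(db_dir, "fedora20") would return a single dict for
Fedora 20 without the caller having to know about the parent chain.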

> >  - How do we provide notifications when updates are available
> > 
> >    eg, we don't want 1000's of clients checking the libosinfo website
> >    daily to download a new database, if it hasn't changed since they
> >    last checked. Can we efficiently provide info about database updates
> >    so people can check and avoid downloading if it hasn't changed. I
> >    have thought about perhaps adding a DNS TXT record that records
> >    the SHA256 checksum of the database, so clients can do a simple
> >    DNS lookup to check for update availability. This is nice and scalable
> >    thanks to DNS server caching & TTLs, avoiding hitting the webserver
> >    most of the time.
> 
> 
> This also means more special magic to be implemented by libosinfo
> consumers, which is not necessarily an issue. If libosinfo is to
> download database updates more or less automatically, we'll need
> to make this downloading as safe as possible (https, gpg signature with
> a known key ?)

As the person currently paying bills for the server hosting libosinfo.org
I won't want us to set up libosinfo to automatically download updates :-P
Also many corporations would not want such automatic downloads taking
place, as it could result in silently changing the way their OS installs
work without their knowledge.  I would like to figure out a way we can
provide a tool that people can opt-in to using to pull down updates
when they need them though. We'd certainly need some form of security
here as you mention, and likely also a way to set up local mirrors internal
to an organization. This item is probably the last thing to worry about in my
big list of todos, more of a nice to have. Need to focus on just getting
the DB split out first.
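
For what it's worth, the client side of an opt-in check could be quite
small. A minimal sketch, assuming the published SHA256 checksum is
obtained by some caller-supplied function (a DNS TXT lookup, a file on
a local mirror, etc); update_available and fetch_published_checksum are
hypothetical names and no particular DNS library is implied:

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_available(local_archive, fetch_published_checksum):
    """Compare the local database archive against the published checksum.

    fetch_published_checksum is a caller-supplied callable returning the
    hex digest as a string; how it is fetched is deliberately left open.
    """
    published = fetch_published_checksum().strip().lower()
    return published != sha256_of_file(local_archive)
```

Note the checksum only says whether something changed. It does not
replace verifying a signature on whatever gets downloaded afterwards.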

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|



