Better repodata performance (was: redhat abe)

Sun Jan 30 00:46:33 UTC 2005

On Jan 29, 2005, seth vidal <skvidal at phy.duke.edu> wrote:

> How would it reduce bandwidth - you'd have to download and parse
> multiple entries and you'd STILL have to do just as much work on the
> repo-side b/c you'd have to check all the packages for changes.

The reduced bandwidth would be for the thousands of users who could
download a 1KiB file with the changes since the last time they checked
the repo, instead of downloading 4MiB with about 1KiB of new information.

Sure, createrepo would have to look at previous versions of the
repodata, see what changed since then (it could optionally use only
file timestamps and sizes to check that files haven't changed, instead
of having to read them entirely to compute checksums) and generate a
new, incremental repository format.
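
To illustrate the timestamp-and-size check, here is a minimal Python
sketch.  It is not createrepo code: scan_packages, diff_indexes and the
(size, mtime) tuple are names and layout I'm assuming for illustration.
It only shows that the set of changed packages can be found without
reading any package contents.

```python
import os
from typing import Dict, List, Tuple

Stat = Tuple[int, int]  # (size in bytes, mtime in whole seconds)


def scan_packages(pkg_dir: str) -> Dict[str, Stat]:
    """Map each .rpm file name in pkg_dir to its (size, mtime)."""
    index = {}
    for entry in os.scandir(pkg_dir):
        if entry.is_file() and entry.name.endswith(".rpm"):
            st = entry.stat()
            index[entry.name] = (st.st_size, int(st.st_mtime))
    return index


def diff_indexes(old: Dict[str, Stat],
                 new: Dict[str, Stat]) -> Tuple[List[str], List[str]]:
    """Return (added, removed) file names between two scans.

    A file whose size or mtime changed is reported as both removed
    and added, so that its metadata gets regenerated.
    """
    added = sorted(n for n in new if old.get(n) != new[n])
    removed = sorted(n for n in old if new.get(n) != old[n])
    return added, removed
```

Only the files listed in "added" would need to be opened to extract
metadata; everything else could be carried over from the previous run.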

What I'm thinking is that this incremental repodata tree would contain
the relative location of the original repodata tree, such that whoever
downloads the incremental repodata can get to the previous states, and
so on, by following the paths given.

So we could put in a counter-based repository history with the
following properties:

- after the first run of createrepo, repodata/repomd.xml points to
  repodata/0, the full baseline, which adds or removes nothing.

- after the second run of createrepo, repodata/repomd.xml points to
  repodata/1, with a repomd.xml that points to ../0, and primary.xml
  et al. files that add/remove packages relative to ../0

and so on.
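
The client side of the counter-based history could be sketched roughly
as follows.  This is an in-memory model under my own assumptions: the
State tuple, its fields and the update function are hypothetical, and
real repodata would be XML fetched over the network.  The point is just
that a client holding state N only needs the deltas after N.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class State(NamedTuple):
    base: Optional[int]    # counter of the previous state; None for repodata/0
    added: Dict[str, str]  # package name -> version-release (assumed payload)
    removed: Set[str]      # package names dropped relative to `base`


def update(history: Dict[int, State], latest: int,
           cached: Dict[str, str],
           cached_at: Optional[int]) -> Tuple[Dict[str, str], List[int]]:
    """Bring a client's package set from state cached_at up to latest.

    Follows the base pointers back from latest until it reaches the
    state the client already holds, then replays the deltas oldest
    first.  Pass cached_at=None for a client with no cache.
    Returns the new package set and the list of states fetched.
    """
    chain = []
    cur = latest
    while cur is not None and cur != cached_at:
        chain.append(cur)
        cur = history[cur].base
    if cur != cached_at:
        raise LookupError("cached state is not an ancestor of latest")
    packages = dict(cached)
    fetched = chain[::-1]
    for n in fetched:
        state = history[n]
        for name in state.removed:
            packages.pop(name, None)
        packages.update(state.added)
    return packages, fetched
```

A client with nothing cached walks all the way back to state 0; one
that checked yesterday fetches only the newest delta.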

Every now and then, one could consolidate the multiple repodata
subdirs into a single set of xml files.  You could even do this every
time, and have repomd.xml indicate that you can either get all the
data from this single set of files, or the incremental history from
this other file.
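
If repomd.xml offered both the consolidated files and the incremental
history, a client would have to pick one.  A minimal sketch of that
choice, assuming (my assumption, not something repomd.xml provides
today) that the sizes of the full set and of each delta are advertised;
choose_download and its parameters are hypothetical names:

```python
from typing import Dict, Optional, Tuple


def choose_download(full_size: int, delta_sizes: Dict[int, int],
                    cached_at: Optional[int],
                    latest: int) -> Tuple[str, int]:
    """Pick "full" or "incremental" and return it with the byte cost.

    delta_sizes maps state counter n to the size in bytes of the delta
    from state n-1 to state n.  cached_at is the counter of the state
    the client already holds, or None if it has no cache.
    """
    if cached_at is None or cached_at > latest:
        return "full", full_size
    needed = range(cached_at + 1, latest + 1)
    if any(n not in delta_sizes for n in needed):
        # Part of the history was consolidated away; fall back.
        return "full", full_size
    cost = sum(delta_sizes[n] for n in needed)
    if cost < full_size:
        return "incremental", cost
    return "full", full_size
```

A client that has fallen far behind would end up taking the
consolidated files, which is the case consolidation is meant to serve.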

This sort of indirection in repomd.xml has one interesting additional
side effect: if done properly, it would enable us to create composite
and/or filtered repositories.  Your composite repository would
reference a base repository (or a set thereof) in repomd.xml, as well
as package removals and additions, so as to filter out packages from
one repository that are, say, known to be incompatible, and to add
packages of your own.
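
A composite or filtered repository then amounts to the same delta
mechanism applied on top of one or more bases.  A minimal sketch, with
package sets modeled as plain dicts and all names (compose, bases,
removed, added) made up for illustration:

```python
from typing import Dict, Iterable, Set


def compose(bases: Iterable[Dict[str, str]], removed: Set[str],
            added: Dict[str, str]) -> Dict[str, str]:
    """Build a composite package set.

    bases:   package sets (name -> version-release) of the referenced
             repositories, merged in order; later ones win on clashes.
    removed: names filtered out, e.g. known-incompatible packages.
    added:   the composite repository's own packages.
    """
    packages: Dict[str, str] = {}
    for base in bases:
        packages.update(base)
    for name in removed:
        packages.pop(name, None)
    packages.update(added)
    return packages
```

How clashes between bases should really be resolved (here simply "last
one wins") is a policy decision this sketch does not try to settle.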

This would surely add a lot of complexity to the client side, but
cutting each user's daily download of rawhide/i386's primary.xml.gz
and filelists.xml.gz (totaling 4MiB) to a few KiB, multiplied by
however many users track rawhide, sounds like a pretty good idea to me.

-- 
Alexandre Oliva             http://www.ic.unicamp.br/~oliva/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}