Better repodata performance (was: redhat abe)

Sat Jan 29 22:47:45 UTC 2005

On Sat, Jan 29, 2005 at 05:07:00PM -0500, seth vidal wrote:
> > For N packages the balanced load is log_2 N bins. Adding M
> > packages touches only about log_2 M bins. The bins have a max size
> > of 2^i packages, where i goes from 0 to log_2 N - 1. And the good
> > news is that you touch only the bins with i up to about log_2 M,
> > i.e. the small ones.
> > 
> > The statistical net effect is that for M package additions to
> > arbitrary N you get log_2 M downloads of a total of 2M packages.
> > 
> > In relevant numbers:
> > 
> > o N~=4000, log_2 N~=12
> >   You have 12 bins.
> > o 10 security/bug fix updates, (statistically) only bins 0 to 4
> >   are changed amounting to 32 packages.  Clients download only 5
> >   files worth of 32 packages in size.
> > 
> > Compare with the current situation, where you need to get the
> > whole lot of N packages for each update.
> > 
> > For this to work you need to
> 
> let's be clear - for this to work YOU need to.
> [...]
> But far be it from me to halt the steady march of progress - when you
> get a chance to implement this stuff let me know.

Hey Seth, relax. This is just a suggested concept for improving
things. Someone may pick it up; I'm not forcing it on YOU. ;)

> Oh and once more - who is it gets the benefit from all this work?
> It sounds like it's mostly repo maintainers - not the users.

Did you miss the "User downloads 5 files in size of 32 package
metadata _in total_ vs 4000"? I.e. the user will typically download
less than 1% of what he's downloading now. It benefits the user base
(and perhaps mirror admins) far more than the repo creator.
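
To make the numbers concrete, here is a minimal sketch of the binning
idea. It assumes a binary-counter style merge (bin i is either empty
or holds exactly 2^i packages, and a full bin is merged upwards); the
function names and package names are made up for illustration and are
not part of createrepo or yum.

```python
def add_package(bins, pkg):
    """Add one package; return the set of bin indices that were rewritten.

    bins[i] is either empty or holds exactly 2**i package entries.
    """
    touched = set()
    carry = [pkg]
    i = 0
    while True:
        if i == len(bins):
            bins.append([])          # grow the list of bins on demand
        touched.add(i)
        if not bins[i]:
            bins[i] = carry          # free slot: the carry lands here
            return touched
        carry = bins[i] + carry      # bin is full: merge it into the carry
        bins[i] = []
        i += 1


def add_packages(bins, pkgs):
    """Add several packages; return every bin index that changed."""
    touched = set()
    for pkg in pkgs:
        touched |= add_package(bins, pkg)
    return touched


# Hypothetical repository: 4000 packages, then 10 updates.
bins = []
add_packages(bins, [f"pkg-{n}" for n in range(4000)])
touched = add_packages(bins, [f"update-{n}" for n in range(10)])

# Entries a client has to re-fetch: everything in the changed bins.
refetched = sum(len(bins[i]) for i in touched)
print(len(bins), sorted(touched), refetched)
```

In this particular run only the smallest bins are rewritten and the
client re-fetches a few entries instead of all 4010. A carry can
occasionally ripple into a larger bin, which is why the figures above
are statistical and not a hard bound.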

> > o introduce package cancelation (anti-packages ;)
> 
> fat chance.

Sorry, my slang is off, does this mean "no way", or "already in
development"? From the context of the rest I'd guess the first. ;)

> > o introduce multiple repodata components
> 
> which buys us not all that much other than complexity of debugging.

It buys you all the nice things already outlined.

> > o keep a manifest of the last state and feed the repo creation system
> >   with the differences (packages lost, packages gained).
> 
> And how do you feed the repo creation system this data? Where do you get
> it to begin with? The only way you know this information is if you
> already have it

But you do, this is about incremental updates to a repository, right?
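
A rough sketch of what I mean by feeding in the differences, assuming
the manifest is simply a mapping from file name to (size, mtime) as
saved on the previous run. The manifest layout and the function name
are hypothetical; nothing like this exists in createrepo today.

```python
def diff_manifest(old, new):
    """Return (lost, gained) file names between two manifests.

    Each manifest maps file name -> (size, mtime). A file whose size
    or mtime changed is reported as both lost and gained.
    """
    lost = sorted(name for name in old if new.get(name) != old[name])
    gained = sorted(name for name in new if old.get(name) != new[name])
    return lost, gained


# Hypothetical example: foo gets an update, bar is untouched.
old = {"foo-1.0-1.i386.rpm": (1000, 1), "bar-2.0-1.i386.rpm": (2000, 1)}
new = {"foo-1.0-2.i386.rpm": (1010, 2), "bar-2.0-1.i386.rpm": (2000, 1)}
lost, gained = diff_manifest(old, new)
print(lost, gained)
```

Only the lost and gained names would then need to be handed to the
repo creation step; untouched packages are never opened.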

> - the only way you have it is if you checked all the packages for
> what has changed. Are you beginning to see the loop here?

No.

> If someone wants to combine createrepo and yum-arch into one program so
> it makes both at the same time that's fine - it's about an hour or two
> worth of work,

That's a completely different topic.

> what you're describing above is considerably more, not to mention
> redesigning the depsolvers to deal with the new repository format.

It may even make it simpler, since you don't need to split it into more
important and less important data and have file dependencies computed
in two loops.
-- 
Axel.Thimm at ATrpms.net