Better repodata performance

seth vidal skvidal at phy.duke.edu
Sun Jan 30 21:14:52 UTC 2005


> I hope you're really not saying that, if I request to install package
> foo, that depends on bar, it will also download headers for baz, a
> totally unrelated package.  I can see that we'd need headers for foo
> and bar, but not for baz.  I thought the point of the xml files and
> the info on provides, filelists, etc, was precisely to enable the
> depsolver to avoid having to download the headers for every package.



Just so we don't go off into deeply uninformed space:

yum 2.0.X downloaded all the headers in the headers directory that it
did NOT have installed. It figured this out by reading header.info. This
file stored each package's nevra (name, epoch, version, release, arch) +
rpm location. So yum 2.0.X downloaded this file to see what new headers
it needed, downloaded them, then got on with the process at hand.
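
A minimal sketch of that 2.0.X flow, not yum's actual code. The
"nevra=location" line layout and the function names are assumptions
made for illustration; only the idea (read header.info, fetch the
headers you don't already have) comes from the description above.

```python
def parse_header_info(text):
    """Map each nevra to its rpm location.

    Assumes one 'nevra=location' entry per line (hypothetical layout).
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        nevra, location = line.split("=", 1)
        entries[nevra] = location
    return entries


def headers_to_fetch(header_info_text, already_have):
    """Return the nevras listed in header.info that we do not already have."""
    entries = parse_header_info(header_info_text)
    return sorted(nevra for nevra in entries if nevra not in already_have)
```

Here already_have would be the set of nevras that are installed or
whose .hdr is already on disk, so nothing is fetched twice.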




> I'm wondering if it would be possible for a depsolver to create a
> (smaller) .hdr file out of info in the .xml files, and feed that to
> rpmlib for transaction-verification purposes.  This would enable it to
> skip the download-header step before downloading the entire package.


Talk to Paul Nasrat - he was working on that a while ago but I think he
got stuck in some rabbit hole debugging something.



> Definitely.  But couldn't we perhaps do it by intelligently filtering
> information out of the rpm header and, say, generating a single
> archive containing all of the info needed for depsolving and for
> rpmlib's transaction verification?

you can't do that b/c file conflicts CAN NOT be calculated via rpm w/o
having the full header and/or all the file information present.



> I was expecting depsolving wouldn't require all the headers.  And from
> what I gather from your reply, it indeed doesn't.

it requires all the headers of the packages involved, yes.


> Let's consider two scenarios: 1) using up2date with yum-2.0 (headers/)
> repos (whoever claimed up2date supported rpmmd repodata/ misled me :-)
> and 2) using yum-2.1 (repodata/) repos.
> 
> 1) yum 2.0
> 
>   16MiB) initial download, distro's and empty updates' hdrs
> 
>    8MiB) daily (on average) downloads of header.info for updates,
>      downloaded by rhn-applet, considering an average size of almost
>      30KiB, for 40 weeks.  (both FC2 and FC3 updates for i386 have a
>      header.info this big right now)
> 
>   16MiB) .hdr files for updates, downloaded by the update installer.
>      Current FC2 i386 headers/ holds 9832KiB, whereas FC3 i386
>      headers/ holds 8528KiB, but that doesn't count superseded
>      updates, whose .hdr files are removed.  The assumption is that
>      each header is downloaded once.  16MiB is a guesstimate that I
>      believe to be inflated.  It doesn't take into account the
>      duplicate downloads of header.info for updates, under the
>      assumption that a web proxy would avoid downloading again what
>      rhn-applet has already downloaded.
> 
> ----
> 
>   40MiB) just in metadata over a period of 9 months, total
> 
> 2) yum 2.1
> 
>    2.7MiB) initial download, distro's and empty updates'
>      primary.xml.gz and filelists.xml.gz
> 
>   68MiB) daily (on average) downloads of primary.xml.gz, downloaded by
>      rhn-applet, considering an average size of 250KiB (FC2 updates'
>      is 240KiB, whereas FC3's is 257KiB, plus about 1KiB for
>      repomd.xml)
> 
>   16MiB) .hdr files for updates, downloaded by the update installer
>   (same as in case 1)
> 
>  192MiB) filelists.xml.gz for updates, downloaded twice a week on
>  average by the update installer, to solve filename dep.
> 
> ----
> 
>  278.7MiB) just in metadata over a period of 9 months, total
> 
> 
> Looks like a waste of at least 238.7 MiB per user per 9-month install.
> Sure, it's not a lot, only 26.5MiB a month, but it's almost 6 times as
> much data being transferred for the very same purpose.  How is that a
> win?  Multiply that by the number of users pounding on your mirrors
> and it adds up to hundreds of GiB a month.
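
For reference, the quoted totals can be re-added in a few lines. The
figures are exactly the ones quoted above (in MiB, over 9 months); the
variable names are illustrative only, and this says nothing about
whether the underlying download assumptions hold.

```python
# Per-item estimates in MiB, copied from the quoted message.
yum_2_0 = {"initial hdrs": 16, "header.info": 8, "update hdrs": 16}
yum_2_1 = {
    "initial xml": 2.7,
    "primary.xml.gz": 68,
    "update hdrs": 16,
    "filelists.xml.gz": 192,
}

total_2_0 = sum(yum_2_0.values())
total_2_1 = round(sum(yum_2_1.values()), 1)

extra = round(total_2_1 - total_2_0, 1)       # additional MiB under the 2.1 estimate
extra_per_month = round(extra / 9, 1)         # spread over the 9-month period
ratio = round(total_2_1 / total_2_0, 2)       # 2.1 total relative to 2.0 total

print(total_2_0, total_2_1, extra, extra_per_month, ratio)
```

On these numbers the 2.1 total is close to 7 times the 2.0 total; the
"almost 6 times" figure matches the extra amount alone.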



> Another factor is that you probably won't need filelists.xml.gz for
> every update.  Maybe I don't quite understand how often it is needed,
> but even if I have to download it only once a month, that's still
> 64MiB over 9 months, more than the 40MiB total metadata downloaded
> over 9 months by yum 2.0.

yum 2.1.x ONLY DOWNLOADS THE XML FILES WHEN IT NEEDS THEM.


go read the code and stop guessing.

it downloads repomd.xml every time - that's < 1K.
it downloads primary.xml.gz if the file has changed - that's typically
< 1M.

it downloads filelists.xml.gz only when there is a file dep that it
cannot resolve with primary.xml.gz.
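
A rough sketch of that decision, not the real yum code. It assumes
"changed" is detected by comparing a checksum for primary.xml.gz taken
from repomd.xml against the cached one; the function and parameter
names are hypothetical.

```python
def metadata_to_download(cached_primary_checksum,
                         repomd_primary_checksum,
                         unresolved_file_deps):
    """Return the metadata files one run would need to fetch."""
    downloads = ["repomd.xml"]  # always fetched, it is tiny

    # primary.xml.gz only when repomd.xml says it differs from our copy
    if cached_primary_checksum != repomd_primary_checksum:
        downloads.append("primary.xml.gz")

    # filelists.xml.gz only when some file dep is still unresolved
    # after looking in primary.xml.gz
    if unresolved_file_deps:
        downloads.append("filelists.xml.gz")

    return downloads
```

On a day with no repo changes and no odd file deps, the result is just
["repomd.xml"].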



> I don't know how yum 2.0 did it, but up2date surely won't even try to
> download a .hdr file if it already has it in /var/spool/up2date, so
> this is not an issue.

yum 2.0.x certainly DID NOT download a .hdr file it already had. Sheesh,
go read the code, stop making suppositions based on anecdotes.

> repodata helps the initial download, granted, but it loses terribly in
> the long run.

only as the number of file deps outside of /etc/* and *bin/* increases.

if you keep the file deps in those paths then repodata is a huge win.
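
To make the path rule concrete, a small sketch: it treats a file dep
as answerable from primary.xml.gz when it lives under /etc/ or in a
*bin/ directory, per the two patterns named above. The regexes and the
function name are assumptions, not what createrepo or yum literally
use.

```python
import re

# Assumed translation of "/etc/*" and "*bin/*" into regexes.
PRIMARY_FILE_PATTERNS = [re.compile(r"^/etc/"), re.compile(r"bin/")]


def needs_filelists(file_dep):
    """True if this file dep falls outside the paths assumed to be in primary."""
    return not any(p.search(file_dep) for p in PRIMARY_FILE_PATTERNS)


deps = ["/usr/bin/perl", "/etc/init.d/functions", "/usr/share/magic"]
outside = [d for d in deps if needs_filelists(d)]
```

Only the deps left in outside would force a filelists.xml.gz download.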

-sv




