Yum, Proxy Cache Safety, Storage Backend

James Antill james.antill at redhat.com
Thu Jan 24 12:41:47 UTC 2008


On Wed, 2008-01-23 at 21:41 -0500, Warren Togami wrote:
> I just had an in-depth discussion with Henrik Nordström of the Squid
> project about how HTTP mirrors and the yum tool itself could be improved
> to safely handle proxy caches.  He gave me lots of good advice about how
> HTTP mirrors can be configured for cache safety, Squid can be configured
> for yum metadata cache safety, and yum itself can be improved to be more
> robust in dealing with proxy caches.
> 
> (It turns out that Henrik is an avid Fedora user, and I might have
> convinced him to come onboard the Fedora Project to contribute another
> useful tool and become co-maintainer of his own package.  It would be an
> honor to have him onboard as a Fedora Developer. =)

 You might have had a small discussion on #yum then, as any of the
regulars there know the answers to all of your questions.

> Yum and Proxy Caches: Current Dangers
> =====================================
> Users may be using proxy servers in 3 (or more) ways:
> 
> 1) Many users today are behind a transparent proxy cache, either
> instituted by their ISP, school, or business network.
> 2) Other users might have Internet access *only* through a proxy server.
> 3) Other users might be using a reverse proxy server on their local
> network as a caching yum mirror.
> 
> There are two cases where yum has problems with proxy caches:
> 
> 1) A RPM package changes content without changing filename.  This
> usually happens only in instances where a package was pushed unsigned
> then was later signed.  A simple workaround within yum is discussed
> later in this mail.
> 
> 2) yum currently has problems with proxy caches due to common cases
> where metadata can become partially out of sync.  This happens because
> repomd.xml is grabbed often while other repodata files are grabbed less
> often.  repomd.xml is then checked for origin "freshness" more often.
> When repodata changes on the origin, repomd.xml is refreshed on the
> cache before other repodata files.  yum clients seeing the new
> repomd.xml but old primary.sqlite.bz2 error out.

 #2 is worked around as well as is possible in the upcoming 3.2.9, in
that yum will basically treat repomd.xml and the metadata it describes
as a single transaction. If you use mdpolicy=group:all this will always
work; the downside is that you'll need to download all of the metadata,
so that is not the default.
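
 To illustrate the "transaction" idea, here is a minimal sketch in
Python. It is not the code in yum 3.2.9; the names update_metadata,
sha1_hex and fetch are hypothetical, and the cache is modelled as a
plain dict. The point is only that the new repomd.xml and the files it
lists are accepted together or not at all.

```python
import hashlib


def sha1_hex(data):
    """Return the hex SHA-1 digest of a bytes object."""
    return hashlib.sha1(data).hexdigest()


def update_metadata(cache, new_repomd, fetch):
    """Accept a new repomd and its metadata files together, or not at all.

    cache      -- current state: {"repomd": {...}, "files": {...}}
    new_repomd -- hypothetical parsed repomd.xml: {filename: expected sha1}
    fetch      -- hypothetical callable mapping a filename to its bytes
    """
    staged = {}
    for name, expected in new_repomd.items():
        data = fetch(name)
        if sha1_hex(data) != expected:
            # A proxy served a stale file: discard everything staged so far
            # and keep using the old, self-consistent metadata.
            return cache
        staged[name] = data
    # Every file matched, so the new state replaces the old one in one step.
    return {"repomd": dict(new_repomd), "files": staged}
```

 If any one file fails its checksum, the caller keeps the previous
state, so it never ends up with a new repomd.xml next to an old
primary.sqlite.bz2.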

> Ideal Solution for #2 Partial Repodata Sync Problem
> ===================================================
> Henrik highly suggests using versioned repodata files as the ideal
> solution to this problem.  This way caches can serve repodata without
> fear of the sync problem, and also without querying the origin server
> upon every client download.  repomd.xml would contain changing filenames
> perhaps with timestamp or something in their filenames.
> 
> i.e.
> primary-1201140584.sqlite.bz2
> 
> This would be an elegant solution, but will it be possible for us to
> migrate to because older clients wouldn't be able to handle it?
> 
> I'm guessing not, so here are other less efficient but workable solutions.

 We've discussed this and think this is probably the best solution, but:

 1. Don't use timestamps, use the sha1 of the file, because then
multiple createrepo runs will always create the same filenames.

 2. This requires work inside yum, as at the moment yum doesn't do any
cleanup of its metadata downloads, so /var/cache/yum would grow without
bound (although "yum clean ..." will work).

...we can fix #2 for 3.2.9, so we could do this in Fedora 9 onwards.
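
 A rough sketch of both points, in Python. The "<sha1>-<name>" layout
and the function names are assumptions made for illustration, not a
format anyone has settled on here, and the cleanup helper only computes
which files could be removed rather than touching /var/cache/yum.

```python
import hashlib


def versioned_name(filename, data):
    """Name a metadata file after the SHA-1 of its content.

    The same content always produces the same name, so repeated
    createrepo runs over unchanged data would not change any filenames.
    """
    return "%s-%s" % (hashlib.sha1(data).hexdigest(), filename)


def stale_cache_files(cached, referenced):
    """Return cached filenames that the current repomd no longer lists."""
    return sorted(set(cached) - set(referenced))
```

 For example, versioned_name("primary.sqlite.bz2", data) changes only
when data changes, and stale_cache_files() gives the list that a
cleanup pass would need to delete after each metadata update.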

> "Cache-Control: max-age=0"
> ==========================
> This HTTP header directive can be either in the request or response.
> This instructs the proxy cache server to always query the origin HTTP
> server to check if the requested file has changed.  It compares the
> origin's reported Last-Modified or ETag to what Squid knows in its own
> cache.
> 
> This means that each and every request for repodata/* files will trigger
> a query to the origin server.  This is a relatively quick operation and
> an acceptable compromise if we cannot make repodata filenames versioned.

 This is a horrible hack, IMO, and I can pretty much guarantee that not
all of the mirrors will do this. If it were possible for us to control
all of the mirrors then we could just require them all to set up ETags
and use those ... but again, I think that's hoping for way too much.
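
 For reference, the revalidation being discussed looks roughly like the
toy model below. It is only a sketch in Python: the function names are
hypothetical, it does no network I/O, and it models neither Squid nor
any particular mirror. The header names (Cache-Control, If-None-Match,
If-Modified-Since) and the 304 status are standard HTTP.

```python
def revalidation_headers(cached_etag=None, cached_last_modified=None):
    """Build the headers for a request that forces a check with the origin."""
    headers = {"Cache-Control": "max-age=0"}
    if cached_etag:
        headers["If-None-Match"] = cached_etag
    if cached_last_modified:
        headers["If-Modified-Since"] = cached_last_modified
    return headers


def origin_status(current_etag, request_headers):
    """Toy origin server: 304 if the cached ETag is still current, else 200."""
    if request_headers.get("If-None-Match") == current_etag:
        return 304  # Not Modified: the cached copy can be served
    return 200      # the file changed (or no ETag was sent): send it again
```

 Every request costs a round trip to the origin, and the whole scheme
depends on each mirror reporting validators consistently.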

[...]
> Yum and "X-Cache: HIT"
> ======================
> If you use wget --server-response and a target file, you see the raw
> HTTP headers of that request.  If the file is already cached, you see a
> HTTP header like below:
> 
> X-Cache: HIT from proxyserver.example.com
> 
> Proposal:
> Improve yum with the following download logic:
> 
> IF (a downloaded repodata/* file doesn't match the repomd.xml checksum
>      OR a downloaded RPM doesn't match the expected checksum)
>     AND "X-Cache: HIT from" was in its HTTP header
> THEN download it again with URLGrabber option: http_headers =
> (('Pragma', 'no-cache'),)
> 
> This should solve the case where RPM files legitimately change contents
> without changing filenames, like RPM signing.  This also correctly does
> NOT trigger additional downloads upon other errors like corrupted files.

 You'd have to make this change inside URLgrabber itself, as by the time
yum could react to it URLgrabber would already have decided to remove
that mirror from its list and moved on.
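
 The decision itself is simple enough; a sketch in Python follows. It
does not call URLgrabber at all, and served_from_cache and
retry_headers are hypothetical names. The only thing borrowed from the
proposal is the http_headers value to use on the second attempt.

```python
# Header tuple from the proposal, passed on the retry to bypass the cache.
NO_CACHE_HEADERS = (("Pragma", "no-cache"),)


def served_from_cache(response_headers):
    """True if the response carries an 'X-Cache: HIT from ...' header."""
    for name, value in response_headers.items():
        if name.lower() == "x-cache" and value.strip().upper().startswith("HIT FROM"):
            return True
    return False


def retry_headers(checksum_ok, response_headers):
    """Return extra headers for a second download, or None for no retry.

    A retry is only worthwhile when the checksum failed AND a proxy
    cache answered; other failures are left to normal mirror handling.
    """
    if not checksum_ok and served_from_cache(response_headers):
        return NO_CACHE_HEADERS
    return None
```

 The check would have to run at the point where the checksum failure is
first seen, before the mirror is dropped, which is why it belongs in
the grabber rather than in yum.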

-- 
James Antill <james.antill at redhat.com>
Red Hat