Proposed F13 feature: drop separate updates repository

Ralf Corsepius rc040203 at
Sat Dec 5 04:20:41 UTC 2009

On 12/03/2009 07:22 AM, Jesse Keating wrote:
> On Thu, 2009-12-03 at 06:24 +0100, Ralf Corsepius wrote:
>>> People doing network installs can either add the updates repo to their
>>> kickstart, or check the box in the anaconda UI, so that the updates
>>> repos are considered at install time.  No download of duplicate data.
>> Yes, for people who are doing "full featured networked installs" w/
>> custom kickstart files. I've never met such a person.
> Really?  I meet people who use kickstart all the time.
Maybe internally at RH?

>  Any sizable
> deployment of Fedora/RHEL uses or should use kickstart.  And those that
> don't aren't afraid to check that little 'updates' box at the software
> selection screen.  You seemed to have ignored that part of my point.
No, I didn't. It's just that unless this "little check button" is the 
default, many users will ignore it. Or, as in my case: I normally 
"yum-upgrade" between distros and rarely use anaconda.

>>> In fact, having separate repos would likely cost less bandwidth.  If we
>>> only had one combined repo, there would be many duplicate packages,
>> Where? Unlike now, where you have each package twice (in Everything and
>> "updates"), you would have each package only once: In Everything.
> That assumes we purge anything but the latest version of a package,

> which as noted in other parts of this thread gets complicated with GPL
> compliance.
Not necessarily. E.g., it would be GPL-compliant to store "purged 
packages" outside of the "current" repos.

And whether "spins", the way they currently are implemented, are 
"good"/"feasible"/"reasonable" is a different question.

>> =>  An estimate for the increase in downloaded files' sizes you are
>> talking about is ca. a factor of 2, from 18.2M (current "updates")
>> to 32.8M+ (current "Everything" + "newly introduced packages").
>> Whether this increase in download size is "significant" is up to the
>> beholder. For me, it gets lost in the noise of accessing a "good" or a
>> "bad" connection to a mirror and in the time required to download
>> packages from mirrors.
> 33~ megs downloaded every single time an update is pushed is a
> significant hit for a fair number of people.
Yes, but ... some more figures:

The same figures as above for FC10:
=> Everything: 25.8M
=> updates: 18.5M

=> A rough estimate for sizes of repodata for a
"near EOL'ed" Fedora: 70% of the size of "Everything's repodata".

I.e., should this estimate hold for later Fedoras, Fedora 11 users are 
likely to see 70% of 33MB ≈ 23MB near the EOL of Fedora 11.
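For the record, a back-of-envelope check of where that 70% figure comes from (all sizes in MB are the ones quoted in this thread):

```python
# Repodata sizes quoted above, in megabytes.
everything_fc10 = 25.8   # FC10 "Everything" repodata
updates_fc10 = 18.5      # FC10 "updates" repodata near EOL

ratio = updates_fc10 / everything_fc10   # ~0.72, rounded to 70% above

everything_f12 = 32.8    # combined-repo estimate from earlier in the thread
print(round(ratio * everything_f12, 1))  # → 23.5
```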

> That was why it was
> important to make yum not re-fetch that repodata every time, and use a
> cached version of it.
Yes, the keys to minimizing bandwidth demands would be
* to shrink the size of the repodata files
* to shrink the size of "changes" to them.

Besides obvious solutions, such as using a different compression format
(e.g. xz instead of bz2) and minimizing their contents, one could ship 
the repodata files in "chunks".
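On the compression point, a quick toy comparison of bz2 vs. xz (LZMA) on repodata-like XML; the sample payload below is made up, real primary.xml files have more variety but are similarly redundant:

```python
import bz2
import lzma

# Highly repetitive XML, roughly resembling primary.xml package entries.
sample = (b'<package type="rpm"><name>foo</name>'
          b'<version epoch="0" ver="1.0" rel="1.fc12"/></package>\n') * 5000

bz2_size = len(bz2.compress(sample))
xz_size = len(lzma.compress(sample, preset=9))

# Both shrink the data massively; xz usually wins on this kind of
# redundant XML, which is why switching formats is worth considering.
print(bz2_size, xz_size)
```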

What I mean: in theory, one could
* update the repodata files incrementally by shipping some kind of 
"deltas";

* split the repodata files into several, e.g. sorted by "some other 
criterion", i.e. provide several sets of *-[primary,filelists,other] 
files. A "package push" would then only affect a subset of the files, 
not all of them. This is very similar to what (IIRC) Seth had proposed 
(split the repo into several repos, alphabetically), except that the 
"split" would happen inside the repodata and thus be transparent to users.
I am not sure how difficult this would be to implement.
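A toy sketch of the "split into chunks" idea: bucket package entries by the first letter of their name, so a push touching only "k*" packages invalidates only one chunk. The function name and layout here are invented for illustration, not createrepo's actual format:

```python
from collections import defaultdict

def split_primary(packages, buckets="abcdefghijklmnopqrstuvwxyz"):
    """Group package names into per-letter chunks ('#' as catch-all)."""
    chunks = defaultdict(list)
    for name in packages:
        first = name[0].lower()
        chunks[first if first in buckets else "#"].append(name)
    return dict(chunks)

chunks = split_primary(["kernel", "kmod", "yum", "0ad"])
print(sorted(chunks))  # → ['#', 'k', 'y']
```

An update to "kmod" would then require clients to re-fetch only the "k" chunk, while the other chunks stay byte-identical and cacheable.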

