Supporting EPEL Builds in Koji

Mike McLean mikem at redhat.com
Mon Oct 6 19:14:10 UTC 2008


Mike Bonnet wrote:
> On Fri, 2008-07-18 at 11:38 -0400, Mike McLean wrote:
>> Mike Bonnet wrote:
>>> On Thu, 2008-07-17 at 13:54 -0400, Mike McLean wrote:
>>>> If the remote_repo_url data is going to be inherited (and I tend to
>>>> think it should be), then I think it should be in a separate table. 
...
>>> I don't have any problem with this, though it does mean we'll need to
>>> duplicate quite a bit of the inheritance-walking code,
...
>> Walking inheritance is just a matter of determining the inheritance 
>> order and scanning data on the parent tags in sequence.
...
> Sorry, I was referring to walking tag_inheritance.  I'd rather have one
> place that walks the inheritance hierarchy and aggregates data from it,
> than two places that are doing almost the same thing.

We're talking about inherently different data. External repos to be 
merged in are quite different from builds in the system.

> Each tag has a set of builds associated with it.  We walk the
> inheritance hierarchy, aggregating the builds from each tag in the
> hierarchy into a flat list, and then pass that list to createrepo.  We
> would do essentially the same thing for external repos.  When walking
> the hierarchy, if a tag has an external repo associated with it, we
> would append that repo url to a flat list, and pass that list to
> mergerepo.  In both cases we're working with collections of packages
> that are associated with a tag, just in different formats.

Sure, we can do this with one call to readFullInheritance, and then scan 
both the build table and the external repo table in the resulting order.
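Roughly, the single-walk idea looks like this (a toy sketch; the function and data names below are illustrative stand-ins, not the actual hub API or schema):

```python
# Hypothetical sketch: one inheritance walk feeding both aggregations.
# The inheritance dict stands in for readFullInheritance output; the
# per-tag build and repo data are invented for illustration.

def repo_inputs(tag_id, inheritance, tag_builds, tag_ext_repos):
    """Return (builds, external_repos) in first-match-wins order."""
    builds = []
    ext_repos = []
    # the tag itself first, then its parents in inheritance order
    for t in [tag_id] + inheritance[tag_id]:
        builds.extend(tag_builds.get(t, []))
        ext_repos.extend(tag_ext_repos.get(t, []))
    return builds, ext_repos

# toy data: tag 1 inherits from tag 2, then tag 3
inheritance = {1: [2, 3]}
tag_builds = {1: ['foo-1.0-1'], 2: ['bar-2.0-1']}
tag_ext_repos = {3: ['http://example.com/f9-ga/$arch/']}

builds, repos = repo_inputs(1, inheritance, tag_builds, tag_ext_repos)
```

The point is that one traversal yields both lists; only the per-tag lookups differ.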

> In discussing this with Jesse, I think we want external repos to be
> inherited.  This is probably the easiest way to deal with having
> multiple external repos getting pulled in to a single buildroot, which
> is essential for Fedora (think F9 GA and F9 Updates).
> 
> The idea was that, by convention, we would have external-repo-only tags,
> each with a single external repo associated with it and no
> packages/builds.  These external-repo-only tags could then be
> inserted into the build hierarchy where appropriate.  An ordered list of
> external repos could then be constructed by performing the current
> depth-first search of the inheritance hierarchy.  The ordered list would
> then be passed to mergerepo, which would ensure that packages in repos
> earlier in the list supersede packages (by srpm name) in repos later in
> the list.  This would preserve the "first-match-wins" inheritance policy
> that Koji currently implements, and that admins expect.  For example:
> 
> dist-custom-build
>   ├─dist-custom
>   └─dist-f9-updates-external
>       └─dist-f9-ga-external
> 
> would result in mergerepo creating a single repo that would only contain
> packages from dist-f9-ga-external if they did not exist in the
> Koji-generated repo (dist-custom-build + dist-custom),
> dist-f9-updates-external, or the blacklist of blocked packages.  This is
> consistent with how Koji package inheritance currently works, and I
> think is the most intuitive approach.
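The "first-match-wins" policy described in the quote above amounts to something like this toy sketch (the repo contents are invented, and real repodata would be parsed rather than written as dicts):

```python
# Illustrative sketch of first-match-wins merging by srpm name: given
# repos in inheritance order, keep a package only if no earlier repo
# already provides a package built from the same srpm.

def merge_by_srpm(repos_in_order):
    """repos_in_order: list of {srpm_name: package} dicts, highest priority first."""
    merged = {}
    for repo in repos_in_order:
        for srpm, pkg in repo.items():
            if srpm not in merged:  # repos earlier in the list win
                merged[srpm] = pkg
    return merged

# made-up contents for the three repos in the example hierarchy
internal = {'kernel': 'kernel-2.6.27-1.custom'}
f9_updates = {'kernel': 'kernel-2.6.26-5.fc9', 'bash': 'bash-3.2-22.fc9'}
f9_ga = {'bash': 'bash-3.2-20.fc9', 'coreutils': 'coreutils-6.12-3.fc9'}

merged = merge_by_srpm([internal, f9_updates, f9_ga])
# kernel comes from the internal repo, bash from f9-updates, and
# coreutils falls through to f9-ga
```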

It is similar, but different in potentially confusing ways. External 
repos do not have build structure, so we can't really have the same sort 
of inheritance behavior with a combination of external repo tags and 
normal tags.

We order the external repos in inheritance order, but ultimately those 
repos are merged with the internal one in a way that does not honor 
inheritance as an admin might expect.

Using tags to represent external repos fails intuition because external 
repos are very much not like tags. When we get to supporting external 
koji systems, we can do something like this, but for external repos the 
"bolted-on" nature needs to be clear. This is why I'd prefer to have the 
data a little more removed.

>> I see all that, and I'm almost convinced. The flipside is that by 
>> default all the code will treat these external rpms the same as the 
>> local ones, which will not be correct for a number of cases. 
> 
> Personally I'd prefer adding a few special cases to the existing code,
> rather than maintain a whole heap of almost-but-not-quite-the-same code
> to manage external rpms.  I think that conceptually they're alike enough
> that the number of special cases will be minimal.

I think I'm ok with using the rpminfo table.

> I think that synthesizing builds for the sake of maintaining the
> not-null constraint is more pain than it's worth, and would make
> enforcing our nvr-uniqueness constraints (which we definitely want to do
> for local builds) more difficult.  Having locally-built rpms always
> associated with a build, and external rpms not, makes sense to me.

Ok, agreed.

>> Also, I'm thinking we need to have some sort of rpm_origin table so that 
>> all these references can be managed cleanly.
> 
> That sounds reasonable to me.  Note that we may end up with a lot of
> rows in this table, since we're allowing variable substitution in the
> external_repo_url (tag name and arch).  But I don't see that as a
> problem.

I'm thinking the only substitution we should support is arch. Anything 
else sort of constitutes a different repo.
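With arch as the only substitution, expansion is a one-liner (the $arch placeholder convention here is assumed for illustration):

```python
# Sketch of arch-only substitution in an external repo url.
def expand_repo_url(url, arch):
    return url.replace('$arch', arch)

url = 'http://example.com/releases/9/Everything/$arch/os/'
expanded = expand_repo_url(url, 'x86_64')
```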

If we use an origin table like this we can abstract out the arch. 
Something like:

create table external_repo (
	id SERIAL PRIMARY KEY,
	name TEXT UNIQUE NOT NULL
);

create table external_repo_config (
	external_repo_id INTEGER NOT NULL REFERENCES external_repo (id),
	url TEXT NOT NULL
	-- plus versioning fields
);

This way if upstream repo changes url scheme or moves to a different 
host, you can keep some notion of connectedness. External rpms would 
simply reference external_repo_id.

>> In the same vein, what happens when an external repo has an nvra+sigmd5 
>> matching a /local/ rpm?  Maybe it doesn't matter, though I guess 
>> technically we want to record the origin properly when it gets into a 
>> buildroot via external repo vs internal tag.
> 
> Right, we would record the origin as the remote repo it came from (by
> parsing the merged repodata and looking at the baseurl).
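For example, once the merged repodata has been parsed, mapping a package's baseurl back to an external repo could look roughly like this (the table contents and package locations here are invented):

```python
# Illustrative sketch: resolve a package's recorded baseurl to an
# external_repo id; no match means the rpm came from the internal repo.
external_repos = {
    'http://example.com/f9-updates/x86_64/': 2,
    'http://example.com/f9-ga/x86_64/': 3,
}

def origin_for(location):
    """Return the external_repo id whose baseurl prefixes location, else None."""
    for baseurl, repo_id in external_repos.items():
        if location.startswith(baseurl):
            return repo_id
    return None  # locally-built rpm
```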

So where do we draw the line between code that we add to koji and code 
that we add to createrepo (or some external merge-repo tool)?

>>> However, we will already be parsing the remote repodata, which contains
>>> information like the srpm name for each rpm, so we could do something
>>> more sophisticated here.
>> -snipsnip-
>> ...
>>> The repomerge tool seems like it solves the problem better, and would be
>>> more useful in general.
>> If we're going to have our fingers in the repodata, we'll probably want 
>> to have them in the merge too. Perhaps we can get createrepo and/or this 
>> repomerge tool usefully libified?
> 
> I was thinking we would probably just call out to the tool the way we do
> for createrepo, but I'm certainly not against using an API.  I'm a
> little concerned about memory usage when doing the create/mergerepo
> in-process, since we know python and mod_python have garbage-collection
> issues, but that may be a "cross the bridge when we come to it" problem.
> Seth, is it feasible to provide an API to mergerepo that we could use
> directly?

I don't think I even saw a reply from Seth on this. Where does the 
mergerepo code stand now?




More information about the Fedora-buildsys-list mailing list