[Pulp-list] RPM Indexes

Thu May 24 13:45:58 UTC 2012

Currently, the RPM type definition (along with SRPM and DRPM) is defined with:

Unit Key:
["name", "epoch", "version", "release", "arch", "filename", 
"checksumtype", "checksum"],

Search Indexes:
["name", "epoch", "version", "release", "arch", "filename", "checksum", 
"checksumtype"]

= Background =

It's probably a good time to explain a bit more about how type 
definitions work.

The unit key defines which fields together make a unit of a given type 
unique. It is used to create a unique index on that type's collection 
in mongo.

When a unique index is created on several fields, queries on any leading 
subset of those fields can use it too, so you effectively get some other 
indexes for free. Given a smaller example of ["name", "version", "arch"], 
you effectively get the following indexes:

name
name, version
name, version, arch

Not surprisingly, you don't get ["version", "arch"] as an index for free 
there, since it doesn't start with the first field in the key.
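
To make the "free" indexes concrete, here is a small sketch in plain 
Python. It doesn't touch mongo or any Pulp code; index_prefixes is a 
made-up helper that just lists the leading prefixes of a key.

```python
def index_prefixes(fields):
    """Return every leading prefix of a compound index's field list."""
    return [fields[:n] for n in range(1, len(fields) + 1)]


# The smaller example key from above.
unit_key = ["name", "version", "arch"]
prefixes = index_prefixes(unit_key)

for prefix in prefixes:
    print(", ".join(prefix))

# ["version", "arch"] skips the first field, so it is not a prefix.
print(["version", "arch"] in prefixes)
```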

As it is a unique index, we get mongo validation to prevent multiple 
inserts with the exact same key.

The search indexes field in a type definition lets a type define other 
indexes that should be created. Check out 
https://fedorahosted.org/pulp/wiki/GCUnitAssociationsStressTests for 
more information on how much proper indexes matter for searching 
capabilities.

The value in the search indexes field of a type def is a list of 
indexes. A single-field index is given as a string, and a compound index 
is given as a nested list of its fields. So given the example above, if 
we wanted to add an index on [version, arch], as well as an index on 
checksum by itself, we'd define search indexes as:

[
   ["version", "arch"],
   "checksum"
]
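
As a rough illustration of how a list like that could be read, the 
sketch below turns each entry into a list of fields and then into the 
(field, direction) pairs that pymongo's create_index accepts as keys. 
The function names are hypothetical and this is not how Pulp's own 
loader is written; it only shows the string vs. nested list distinction.

```python
def normalize_search_indexes(search_indexes):
    """Turn each entry into a list of field names.

    A bare string is a single-field index; a list is a compound index.
    """
    normalized = []
    for entry in search_indexes:
        if isinstance(entry, str):
            normalized.append([entry])
        else:
            normalized.append(list(entry))
    return normalized


def to_key_spec(fields, direction=1):
    """Build (field, direction) pairs; 1 means ascending."""
    return [(field, direction) for field in fields]


search_indexes = [
    ["version", "arch"],
    "checksum",
]

for fields in normalize_search_indexes(search_indexes):
    print(to_key_spec(fields))
```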

= RPMs =

With that said, I think we need to change the RPM type definition to 
remove filename from the unit key. It's not needed for uniqueness. That 
leaves the unique identifier for an RPM as: name, epoch, version, 
release, arch, checksumtype, checksum.

That one is going to require code changes to stop inserting filename 
into the unit_key so it's a bit trickier to make.

Any search index changes will automatically correct themselves on server 
start (for now at least; this may prove problematic on large data sets, 
but there's a note in the code on where to address that).

As for its search indexes, most of the currently defined ones are 
probably not necessary. A single-field index on "epoch" or "release" 
probably isn't going to be useful (I'm not sure what use listing all 
packages with an epoch of "2" would be, but correct me if I'm wrong).

We get a full NEVRA index for free from the unit key, which I think is 
an important one.

I think now is the time to figure out what we want these indexes to be, 
or at least discuss them. From talking with Prad, there is value in 
indexing filename, so I'd like to suggest the following search indexes:

[
   ["filename", "checksum"], # get filename for free
   ["name", "arch"],
   "arch",
   "version"
]
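
One way to sanity check a proposal like this is to list a few field 
combinations and see which ones line up with the leading fields of some 
index. The sketch below does that for the proposed unit key (without 
filename) plus the search indexes above. It is a simplification: 
is_served is a made-up helper, and it only treats a query as served 
when its fields are exactly the leading fields of an index, which is 
not the full story of how mongo picks indexes.

```python
# Proposed unit key (filename removed) and proposed search indexes.
UNIT_KEY = ["name", "epoch", "version", "release", "arch",
            "checksumtype", "checksum"]
SEARCH_INDEXES = [
    ["filename", "checksum"],
    ["name", "arch"],
    ["arch"],
    ["version"],
]
ALL_INDEXES = [UNIT_KEY] + SEARCH_INDEXES


def is_served(query_fields, indexes):
    """True if the query fields are exactly the leading fields of an index."""
    wanted = set(query_fields)
    return any(set(index[:len(wanted)]) == wanted for index in indexes)


candidates = [
    ["filename"],
    ["name", "epoch", "version", "release", "arch"],  # NEVRA
    ["name", "arch"],
    ["version", "arch"],
    ["epoch"],
]

for fields in candidates:
    print(fields, is_served(fields, ALL_INDEXES))
```

Adding a combination to candidates is a cheap way to see whether it 
would need an index of its own.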

Are there any other typical query combinations we should handle? Keep in 
mind these aren't exactly free from the database's point of view so we 
need to balance capabilities with the performance hit of adding more.

-- 
Jay Dobies
Freenode: jdob @ #pulp
http://pulpproject.org | http://blog.pulpproject.org