
[Pulp-list] RPM Indexes

Currently, the RPM type definition (along with SRPM and DRPM) is defined with:

Unit Key:
["name", "epoch", "version", "release", "arch", "filename", "checksumtype", "checksum"],

Search Indexes:
["name", "epoch", "version", "release", "arch", "filename", "checksum", "checksumtype"]

= Background =

It's probably a good time to explain a bit more about how type definitions work.

The unit key is meant to be the definition of which fields comprise uniqueness for a given type. It is used to create a unique index on that type's collection in mongo.

When a unique index is created, you also get some other indexes for free. Given a smaller example of ["name", "version", "arch"], you actually get the following indexes:

name
name, version
name, version, arch

Not surprisingly, you don't get ["version", "arch"] as an index for free there.
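
To make the "free" indexes concrete, here is a minimal Python sketch of the prefix rule: a compound index can serve queries on any leading prefix of its fields, but not on fields taken from the middle or end. The function names are made up for illustration and are not part of Pulp or mongo.

```python
def index_prefixes(fields):
    """Return every leading prefix of a compound index, shortest first."""
    return [tuple(fields[:n]) for n in range(1, len(fields) + 1)]


def covered_by_prefix(index_fields, query_fields):
    """True if the queried fields are exactly some leading prefix of the index.

    The order in which the query names its fields does not matter.
    """
    wanted = set(query_fields)
    return any(set(prefix) == wanted for prefix in index_prefixes(index_fields))


unit_key = ["name", "version", "arch"]

# Every prefix the unique index gives us without extra work.
print(index_prefixes(unit_key))

# name + version is a leading prefix; version + arch is not.
print(covered_by_prefix(unit_key, ["version", "name"]))
print(covered_by_prefix(unit_key, ["version", "arch"]))
```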

As it is a unique index, we get mongo validation to prevent multiple inserts with the exact same key.

The search indexes field in a type definition lets a type define additional indexes that should be created. Check out https://fedorahosted.org/pulp/wiki/GCUnitAssociationsStressTests for more information on how much proper indexes matter for search performance.

The value in the search indexes field of a type def is a list of indexes. A single-field index is just the field name; a compound index is written as a nested list of field names. So given the example above, if we wanted to add an index on [version, arch], as well as an index on checksum by itself, we'd define search indexes as:

  ["version", "arch"],
  "checksum"
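
As a rough illustration of how such a list could be turned into index specifications, the sketch below normalizes each entry into a list of (field, direction) pairs, which is the shape pymongo's create_index accepts for its keys argument. The helper name and the choice of ascending order are assumptions; this is not how Pulp's own code is known to do it.

```python
ASCENDING = 1  # same value as pymongo.ASCENDING; kept local so no driver is needed


def to_index_specs(search_indexes):
    """Normalize a search-indexes list into lists of (field, direction) pairs.

    A bare string is a single-field index; a nested list is a compound index.
    """
    specs = []
    for entry in search_indexes:
        fields = [entry] if isinstance(entry, str) else list(entry)
        specs.append([(field, ASCENDING) for field in fields])
    return specs


search_indexes = [["version", "arch"], "checksum"]

for spec in to_index_specs(search_indexes):
    print(spec)
```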

= RPMs =

With that said, I think we need to change the RPM type definition to remove filename from the unit key. It's not needed for uniqueness. That leaves the unique identifier for an RPM as: name, epoch, version, release, arch, checksumtype, checksum.

That change is going to require code changes to stop inserting filename into the unit_key, so it's a bit trickier to make.
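
The kind of code change involved might look something like the sketch below, which splits a package's attributes into the reduced unit key and everything else, so filename ends up as ordinary metadata. The function name and the sample values are invented for illustration and do not come from the Pulp code base.

```python
# Proposed unit key fields, with filename removed.
UNIT_KEY_FIELDS = ("name", "epoch", "version", "release", "arch",
                   "checksumtype", "checksum")


def split_unit(package):
    """Split package attributes into (unit_key, metadata)."""
    missing = [f for f in UNIT_KEY_FIELDS if f not in package]
    if missing:
        raise KeyError("missing unit key fields: %s" % ", ".join(missing))
    unit_key = {f: package[f] for f in UNIT_KEY_FIELDS}
    # Anything that is not part of the key, including filename, is metadata.
    metadata = {k: v for k, v in package.items() if k not in UNIT_KEY_FIELDS}
    return unit_key, metadata


# Made-up sample values, only to show the split.
package = {
    "name": "example", "epoch": "0", "version": "1.0", "release": "1",
    "arch": "noarch", "checksumtype": "sha256", "checksum": "abc123",
    "filename": "example-1.0-1.noarch.rpm",
}

unit_key, metadata = split_unit(package)
print(sorted(unit_key))
print(metadata)
```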

Any search index changes will automatically correct themselves on server start (for now at least; this may prove problematic on large data sets, but there's a note in the code on where to address that).

As for its search indexes, most of the currently defined ones are probably not necessary. A single-field index on "epoch" or "release" probably isn't going to be useful (I'm not sure what use listing all packages with an epoch of "2" would be, but correct me if I'm wrong).

We get a full NEVRA index for free from the unit key, which I think is an important one.

I think now is the time to figure out what we want these indexes to be, or at least discuss them. From talking with Prad, there is value in indexing filename, so I'd like to suggest the following search indexes:

  [filename, checksum], # get filename for free
  [name, arch],

Are there any other typical query combinations we should handle? Keep in mind these aren't exactly free from the database's point of view so we need to balance capabilities with the performance hit of adding more.
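
One way to evaluate candidate query combinations is to check them against the proposed indexes. The sketch below does that for the reduced unit key (without filename) plus the two suggested search indexes. It is deliberately strict: it only reports an index when the queried fields are exactly a leading prefix of it, whereas a real query planner may also make partial use of an index. The helper name is hypothetical.

```python
def serving_index(indexes, query_fields):
    """Return the first index whose leading prefix matches the queried fields."""
    wanted = set(query_fields)
    if not wanted:
        return None
    for index in indexes:
        if set(index[:len(wanted)]) == wanted:
            return index
    return None


unit_key = ["name", "epoch", "version", "release", "arch",
            "checksumtype", "checksum"]
indexes = [unit_key, ["filename", "checksum"], ["name", "arch"]]

queries = [
    ["name", "epoch", "version", "release", "arch"],  # full NEVRA
    ["filename"],
    ["name", "arch"],
    ["checksum"],
    ["epoch"],
]

for query in queries:
    print(query, "->", serving_index(indexes, query))
```

With these inputs, the NEVRA, filename, and name/arch queries each find an index, while a lookup on checksum or epoch alone does not, which is one way to spot combinations worth discussing.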

Jay Dobies
Freenode: jdob @ #pulp
http://pulpproject.org | http://blog.pulpproject.org
