[Pulp-list] RPM Indexes
Jay Dobies
jason.dobies at redhat.com
Thu May 24 13:45:58 UTC 2012
Currently, the RPM type definition (along with SRPM and DRPM) is defined with:
Unit Key:
["name", "epoch", "version", "release", "arch", "filename",
"checksumtype", "checksum"],
Search Indexes:
["name", "epoch", "version", "release", "arch", "filename", "checksum",
"checksumtype"]
= Background =
It's probably a good time to explain a bit more about how type
definitions work.
The unit key is meant to be the definition of which fields determine
uniqueness for a given type. This is used to create a unique index on
that type's collection in mongo.
When a unique index is created, you also get some other indexes for
free. Given a smaller example of ["name", "version", "arch"], you
actually get the following indexes:
name
name, version
name, version, arch
Not surprisingly, you don't get ["version", "arch"] as an index for free
there.
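To make the "free" indexes concrete, here is a minimal sketch in plain
Python (no mongo involved). It relies on the fact that a compound index
can serve queries on any leading prefix of its fields. The helper name
prefix_indexes is made up for illustration and is not part of Pulp.

```python
def prefix_indexes(fields):
    """Return every leading prefix of a compound index's field list."""
    return [fields[:i] for i in range(1, len(fields) + 1)]


# The smaller example unit key from above.
for prefix in prefix_indexes(["name", "version", "arch"]):
    print(", ".join(prefix))
```

Note that ["version", "arch"] never shows up in the result because it
does not start at the first field of the key.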
As it is a unique index, we get mongo validation to prevent multiple
inserts with the exact same key.
The search indexes field in a type definition lets a type define other
indexes that should be created. Check out
https://fedorahosted.org/pulp/wiki/GCUnitAssociationsStressTests for
more information on the importance of proper indexes on searching
capabilities.
The value in the search indexes field of a type def is a list of
indexes. Compound indexes can be created in there by adding them to a
list as well. So given the example above, if we wanted to add an index
on [version, arch], as well as an index on checksum by itself, we'd
define search indexes as:
[
["version", "arch"],
"checksum"
]
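As a rough sketch of how such a value could be read, the snippet below
turns each entry into a list of field names, so a bare string becomes a
single-field index and a nested list stays a compound index. The
function name normalize_search_indexes is hypothetical; this is not the
actual Pulp parsing code.

```python
def normalize_search_indexes(search_indexes):
    """Turn each search index entry into a list of field names."""
    normalized = []
    for entry in search_indexes:
        if isinstance(entry, str):
            # A bare string is an index on one field.
            normalized.append([entry])
        else:
            # A nested list is a compound index; keep the field order.
            normalized.append(list(entry))
    return normalized


search_indexes = [
    ["version", "arch"],
    "checksum",
]
print(normalize_search_indexes(search_indexes))
```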
= RPMs =
With that said, I think we need to change the RPM type definition to
remove filename from the unit key. It's not needed for uniqueness. That
leaves the unique identifier for an RPM: name, epoch, version, release,
arch, checksumtype, checksum.
The unit key change is going to require code changes to stop inserting
filename into the unit_key, so it's a bit trickier to make.
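A minimal sketch of what that code change amounts to, assuming the unit
key is built as a dict from a package's metadata. The names
UNIT_KEY_FIELDS and build_unit_key, and the sample metadata values, are
made up for illustration and do not come from the Pulp code base.

```python
# Proposed unit key fields: the current key minus "filename".
UNIT_KEY_FIELDS = (
    "name", "epoch", "version", "release", "arch",
    "checksumtype", "checksum",
)


def build_unit_key(metadata):
    """Pick only the unit key fields out of a package's metadata."""
    return {field: metadata[field] for field in UNIT_KEY_FIELDS}


# Made-up sample metadata, for illustration only.
metadata = {
    "name": "example", "epoch": "0", "version": "1.0", "release": "1",
    "arch": "noarch", "checksumtype": "sha256", "checksum": "abc123",
    "filename": "example-1.0-1.noarch.rpm",
}
unit_key = build_unit_key(metadata)
print(sorted(unit_key))
```

The filename is still available in the metadata; it just no longer
takes part in uniqueness.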
Any search index changes will automatically correct themselves on server
start (for now at least, this may prove problematic on large data sets
but there's a note in the code on where to address that).
As for its search indexes, most of the currently defined ones are
probably not necessary. A single index on "epoch" or "release" probably
isn't going to be useful (I'm not sure what use listing all packages
with an epoch of "2" would be, but correct me if I'm wrong).
We get a full NEVRA index for free from the unit key, which I think is
an important one.
I think now is the time to figure out what we want these indexes to be,
or at least discuss them. From talking with Prad, there is value in
indexing filename, so I'd like to suggest the following search indexes:
[
["filename", "checksum"], # a filename-only index comes for free
["name", "arch"],
"arch",
"version"
]
Are there any other typical query combinations we should handle? Keep in
mind these aren't exactly free from the database's point of view so we
need to balance capabilities with the performance hit of adding more.
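One way to sanity check a candidate query against the proposal is the
rough sketch below. It treats a query as covered when its fields are
exactly the leading fields of the unit key or of one of the suggested
search indexes, in any order. This is a simplification that assumes
plain equality matches and ignores everything else the query planner
might do. The names ALL_INDEXES and is_covered are hypothetical.

```python
# Proposed unit key (filename removed) plus the suggested search indexes.
UNIT_KEY = ["name", "epoch", "version", "release", "arch",
            "checksumtype", "checksum"]
SEARCH_INDEXES = [["filename", "checksum"], ["name", "arch"],
                  ["arch"], ["version"]]
ALL_INDEXES = [UNIT_KEY] + SEARCH_INDEXES


def is_covered(query_fields, indexes=ALL_INDEXES):
    """True if the queried fields match the leading fields of some index."""
    wanted = set(query_fields)
    return any(
        len(index) >= len(wanted) and set(index[:len(wanted)]) == wanted
        for index in indexes
    )


queries = [
    ["name", "epoch", "version", "release", "arch"],  # full NEVRA
    ["filename"],
    ["version", "arch"],
]
for fields in queries:
    print(fields, is_covered(fields))
```

Under those assumptions, the NEVRA and filename queries are covered
while ["version", "arch"] is not, which matches the smaller example in
the background section.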
--
Jay Dobies
Freenode: jdob @ #pulp
http://pulpproject.org | http://blog.pulpproject.org