[Pulp-list] Data model change: repository["packages"] to list of package ids only

Tue Feb 8 14:13:37 UTC 2011

----- Original Message -----
> On 01/31/2011 04:49 AM, John Matthews wrote:
> > I made changes to how we store packages under a repository document.
> > If there are no major objections I plan to check this in later
> > today.
> >
> > Pradeep and I have noticed a large performance issue when calling
> > "_get_existing_repo()" from repo.py. For rhel-i386-server-5 this
> > takes roughly 30 seconds to fetch information on 7k packages
> > resulting in a 10MB repository document being returned. For Fedora
> > 13 this is even larger and takes around ~90 seconds to fetch
> > somewhere around 20k packages.
> >
> > The issue is that we store a dictionary of "packages" under the
> > repository. The dictionary has a key of package id and a value of
> > the full package object. (Technically in mongo a reference to the
> > package object is stored, not the full object. When we fetch the
> > repository through pymongo the AutoReference SON Manipulator fetches
> > the contents of each package object). This results in large repos
> > being very expensive. Further pulp relies on "_get_existing_repo()"
> > in many places so this is a problem that will be seen often for
> > large repos.
> >
> > Over the weekend I changed how we store "packages": instead of a
> > dictionary of package objects, we now store only the package ids in
> > a list.
> >
> > "_get_existing_repo()" is much quicker as you can see:
> >   For rhel-i386-server-5<only package ids>:
> >    Time: .2 seconds versus ~30 seconds
> >    Size: 1.5MB versus 10MB
> >
> >   For fedora 13<only package ids>:
> >    Time: .3 seconds versus ~90 seconds
> >    Size: 2.5 MB versus 24MB
> >
> > The result of fetching a repository object now will only yield
> > "package ids" under "packages".
> > If we want to flesh out all of the package objects as the call was
> > previously doing, we can make a second call to the PackageAPI. This
> > is still much quicker than previous behavior.
> >   For rhel-i386-server-5<full package objects>:
> >    Time: ~3 seconds versus ~30 seconds
> >    Size: 10MB and 10MB
> >
> >   For fedora 13<full package objects>:
> >    Time: ~7 seconds versus ~90 seconds
> >    Size 24MB and 24MB
> >
> >
> > Developers need to be aware that repo["packages"] will only contain
> > package ids. It takes one extra call to flesh out the "packages"
> > into their full objects, so if that's needed it's easy and not as
> > expensive with the new approach.
> >
> > I've made most of the changes needed for this. If there are no major
> > objections I plan to check this in today.
> >
> 
> If we had python-pymongo-1.7 you could have fixed the above performance
> problem with a one line change in repo.py:
> 
> diff --git a/src/pulp/server/api/repo.py b/src/pulp/server/api/repo.py
> index 2a46d6a..7ffec34 100644
> --- a/src/pulp/server/api/repo.py
> +++ b/src/pulp/server/api/repo.py
> @@ -103,7 +103,8 @@ class RepoApi(BaseApi):
>          Protected helper function to look up a repository by id and raise a
>          PulpException if it is not found.
>          """
> -        repo = self.repository(id, fields)
> +        # Filter out the packages field because it is big
> +        repo = self.repository(id, fields={"packages": 0})
>          if repo is None:
>              raise PulpException("No Repo with id: %s found" % id)
>          return repo
> 
> http://dirolf.com/2010/06/17/pymongo-1.7-released.html
> 
> Don't be afraid to make large documents in Mongo and don't feel that you
> need to restructure everything to fix a performance problem.
> 
> Instead just make your queries filter out large subsets of a document
> unless they are needed.
> 

Hi Mike,

Thanks for sharing the info.  I'm in the process of upgrading to pymongo 1.9 right now, so pulp will be on it fairly soon.

I'm not sure if dropping the packages field would have been sufficient.  
Here are my biggest concerns:

1) We rely on a full fetch of the object prior to an update
2) The big performance hit I saw was related to the AutoReference manipulation and not the actual fetching of raw data. 
3) The actual usage of "packages" inside of API calls generally only required the package_id; the actual package contents weren't used.

#1 
If you look at the flow of synchronizing a repository you will see we change the repository object in several places and call repo update frequently.  Our current usage of mongo does not do partial document updates, so we need the full document available before each update.  Jeff raised this yesterday: we need to improve how we update documents in mongo, and we agreed this area needs to be addressed.
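
To illustrate what a partial document update looks like, here is a minimal sketch. The "$set" operator is mongo's own; the helper names, the repo fields ("last_sync" in particular), and apply_set_update (an in-memory stand-in for what the server does) are made up for illustration and are not pulp code.

```python
# Sketch only: shows the shape of a "$set" partial update.
# apply_set_update is a hypothetical in-memory stand-in for the server;
# with a real collection the update document would be handed to the
# driver's update call instead.

def make_set_update(changed_fields):
    """Build an update document that touches only the given fields."""
    return {"$set": dict(changed_fields)}


def apply_set_update(document, update):
    """Mimic "$set" on top-level fields; other fields are left alone."""
    result = dict(document)
    result.update(update.get("$set", {}))
    return result


# Hypothetical repository document; only the shape matters here.
stored_repo = {
    "_id": "example-repo",
    "name": "Example",
    "last_sync": None,
    "packages": ["pkg-1", "pkg-2"],
}

update_doc = make_set_update({"last_sync": "2011-02-08T14:13:37Z"})
updated_repo = apply_set_update(stored_repo, update_doc)
```

The point of the sketch is that the caller only needs the changed fields, not the full document, which is what would remove the need for a full fetch before every update.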

#2
Background:  We were storing an AutoReference to each package under the repository document "packages" field.  AutoReference is a driver-level abstraction: it's a pymongo concept that mongo itself knows nothing about.

The AutoReference manipulator doesn't seem to do a batch fetch of IDs.  I believe this is the main issue for the performance numbers we are seeing.  For a fedora repo with 20k packages it was taking roughly 90 seconds to fetch a repo and the packages, yet if you fetch the ids then flesh out all of those ids it takes just a few seconds to return the same data.  
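
A rough sketch of why per-reference fetching hurts, assuming (as I believe, but haven't confirmed) that each reference is resolved with its own query. FakeCollection is a hypothetical in-memory stand-in that just counts round trips; it is not pymongo and not pulp code.

```python
# Sketch only: compares one-query-per-package with a single batched query.
# FakeCollection is a hypothetical stand-in that counts round trips
# instead of talking to a mongo server.

class FakeCollection(object):
    def __init__(self, docs):
        self._docs = dict((d["_id"], d) for d in docs)
        self.round_trips = 0

    def find_one(self, doc_id):
        # One round trip per call, like resolving a single reference.
        self.round_trips += 1
        return self._docs.get(doc_id)

    def find_by_ids(self, ids):
        # One round trip for the whole batch. Against a real collection
        # this would be roughly find({"_id": {"$in": ids}}).
        self.round_trips += 1
        return [self._docs[i] for i in ids if i in self._docs]


def fetch_one_by_one(collection, package_ids):
    return [collection.find_one(pid) for pid in package_ids]


def fetch_in_batch(collection, package_ids):
    return collection.find_by_ids(package_ids)


packages = [{"_id": "pkg-%d" % n, "name": "package%d" % n} for n in range(1000)]
repo = {"id": "example-repo", "packages": [p["_id"] for p in packages]}

slow = FakeCollection(packages)
fast = FakeCollection(packages)
one_by_one = fetch_one_by_one(slow, repo["packages"])
batched = fetch_in_batch(fast, repo["packages"])
print("%d round trips vs %d" % (slow.round_trips, fast.round_trips))
```

Both paths return the same package documents; only the number of round trips differs, and that count grows with the size of the repo in the one-by-one case.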

The AutoReference usage would be very useful if we were fetching package data under the repository document and modifying it, but that's not how we have been using the data.  When we make changes to a package we do so through the package api.  I felt we were paying a large performance price for a feature we weren't using.  

#3
When looking at the code inside pulp I saw that most uses of the repo["packages"] data only examined ids and weren't concerned with the actual content.  Granted, there are some places where content is desired, but that wasn't the common case.  Very often we fetched the whole content and iterated over it, ignoring everything but the id.

One last thought: with ids stored under repo["packages"] instead of AutoReferences, I've been able to push more of the queries down to mongo and do less at the python level. I like that.
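
As an example of the kind of query that can be pushed down once "packages" is a plain list of ids: mongo matches a list field against a single value if the list contains it, so "which repos contain this package" becomes one query. This is only a sketch; the repo documents are invented and matches() is a hypothetical in-memory stand-in for the server-side matching.

```python
# Sketch only: an id-membership query against a list-of-ids field.
# matches() is a hypothetical stand-in for mongo's matching rules,
# covering just equality and "list contains value".

def repos_containing_query(package_id):
    """Query document: repos whose "packages" list contains package_id."""
    return {"packages": package_id}


def matches(document, query):
    for field, wanted in query.items():
        value = document.get(field)
        if isinstance(value, list):
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


# Invented repository documents using the new list-of-ids layout.
repos = [
    {"id": "repo-a", "packages": ["pkg-1", "pkg-2"]},
    {"id": "repo-b", "packages": ["pkg-2", "pkg-3"]},
    {"id": "repo-c", "packages": ["pkg-4"]},
]

query = repos_containing_query("pkg-2")
hits = [r["id"] for r in repos if matches(r, query)]
```

With the old dictionary of references, the same question meant fetching each repo and walking its packages in python.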

-John