[Pulp-list] Importer Sync APIs

Nick Coghlan ncoghlan at redhat.com
Mon Nov 21 23:33:39 UTC 2011


On 11/22/2011 07:43 AM, Jay Dobies wrote:
> http://blog.pulpproject.org/2011/11/21/importer-sync-apis/
>
> I know the week of Thanksgiving isn't the best time to ask for deep
> thought, but I'm asking anyway.

No Thanksgiving over here, so no holiday distractions for me :)

> I also know there are at least two other teams interested in writing
> plugins, so I'd like to get some feedback on how this will meet their
> needs.

1. The new sync log API looks pretty good. What I'll do is set up my 
sync commands to log to a file on disk (since some of them run in a 
different process), then when everything is done, read that file back 
and pass the contents along in the final report.

However, it would be nice to be able to store a "stats" mapping in 
addition to the raw log data.
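
Roughly what I have in mind (run_external_sync() and build_report() are 
placeholders for illustration, not real Pulp APIs):

    import os
    import tempfile

    def sync_repo(repo, conduit, config):
        # Write the sync log to a file on disk, since some of the
        # commands run in a different process.
        handle, log_path = tempfile.mkstemp(suffix='.log')
        os.close(handle)
        try:
            run_external_sync(repo, log_path)  # hypothetical helper
            log_file = open(log_path)
            try:
                log_text = log_file.read()
            finally:
                log_file.close()
        finally:
            os.remove(log_path)
        # The "stats" mapping is the addition I'm asking for here -
        # neither it nor build_report() is part of the current API.
        stats = {'new_units': 0, 'log_bytes': len(log_text)}
        return conduit.build_report(log_text, stats)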

2. I *think* the 'working directory' API is the 
'get_repo_storage_directory()' call on the conduit. However, I'm not 
entirely clear on that, nor on what the benefits are over using Python's 
own tempfile module (although that may be an artefact of the requirement 
for 2.4 compatibility in Pulp - with 2.5+, the combination of context 
managers, tempfile.mkdtemp() and shutil.rmtree() means that cleaning up 
temporary directories is a *lot* easier than it used to be).
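
For example, on 2.5+ this is all the scaffolding a plugin would need 
(generic Python, nothing Pulp-specific):

    from __future__ import with_statement  # only needed on 2.5
    import shutil
    import tempfile
    from contextlib import contextmanager

    @contextmanager
    def temp_working_dir():
        # mkdtemp + rmtree, wrapped so cleanup can't be forgotten
        path = tempfile.mkdtemp()
        try:
            yield path
        finally:
            shutil.rmtree(path, ignore_errors=True)

    # Usage:
    # with temp_working_dir() as workdir:
    #     ...download and unpack content here...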

3. The "request_unit_filename" and "add_or_update_content_unit" APIs 
seem oddly asymmetrical, and the "get_unit_keys_for_repo" naming 
includes quite a bit of redundancy.

To be consistent, flexible, and efficient, I suggest an API based 
around a "ContentUnitData" class (sketched in code below the list) 
with the following attributes:
   - type_id
   - unit_id (may be None when defining a new unit to be added to Pulp)
   - key_data
   - other_data
   - storage_path (may be None if no bits are stored for the content 
type - perhaps whether or not bits are stored should be part of the 
content type definition?)
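
As a sketch, the class is nothing more than a plain record (the shape 
of it, not proposed implementation code):

    class ContentUnitData(object):
        """Record describing a single content unit."""
        def __init__(self, type_id, unit_id, key_data,
                     other_data, storage_path):
            self.type_id = type_id            # content type identifier
            self.unit_id = unit_id            # None until Pulp saves it
            self.key_data = key_data          # mapping of unit key fields
            self.other_data = other_data      # mapping of other metadata
            self.storage_path = storage_path  # None if no bits stored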

The content management API itself could then look like:

- get_units() -> two level mapping {type_id: {unit_id: ContentUnitData}}
     Replacement for get_unit_keys_for_repo()
     Note that if you're concerned about exposing 'unit_id', the 
existing APIs already exposed it as the return value from 
'add_or_update_content_unit'.
     I think you're right to avoid exposing a "single lookup" API, at 
least initially - that's a performance problem waiting to happen.

- new_unit(type_id, key_data, other_data, relative_path) -> ContentUnitData
   Does *not* assign a unit ID (or touch the database at all)
   Does fill in absolute path in storage_path based on relative_path
   Replaces any use of "request_unit_filename"

- save_unit(ContentUnitData) -> ContentUnitData
   Assigns a unit ID to the unit and stores it in the database
   Associates the unit with the repo
   Batching will be tricky due to error handling if the save fails
   Replaces any use of 'add_or_update_content_unit' and 
'associate_content_unit'

- remove_unit(type_id, unit_id) -> bool
   True if removed, False if association retained
   Replaces any use of 'unassociate_content_unit'
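
Putting those together, the conduit surface I'm proposing would look 
something like this (signatures only - none of these names exist in 
Pulp today):

    class ImporterConduit(object):
        # Sketch of the proposed API - docstrings in place of bodies.

        def get_units(self):
            """Return {type_id: {unit_id: ContentUnitData}} for the repo."""

        def new_unit(self, type_id, key_data, other_data, relative_path):
            """Return a ContentUnitData with storage_path filled in.
            Does not assign a unit_id or touch the database."""

        def save_unit(self, unit):
            """Assign a unit_id, store the unit in the database,
            associate it with the repo, and return the updated unit."""

        def remove_unit(self, type_id, unit_id):
            """Return True if the unit was removed, False if another
            association kept it alive."""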

For the content unit lifecycle, I suggest adopting a reference counting 
model where the importer owns one set of references (controlled via 
save_unit/remove_unit on the importer conduit) and manual association 
owns a second set of references (which the importer conduit can't 
touch). A reference through either mechanism would then keep the content 
unit alive and associated with the repository (the repo should present a 
unified interface to other code, so client code doesn't need to care if 
it is an importer association or a manual association that is keeping 
the content unit alive).
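
A sketch of the bookkeeping I mean (the owner labels are illustrative):

    class UnitAssociation(object):
        """Tracks which mechanisms are keeping a unit in a repo."""
        def __init__(self):
            self.owners = set()  # e.g. 'importer', 'manual'

        def add_owner(self, owner):
            self.owners.add(owner)

        def remove_owner(self, owner):
            # The unit only drops out of the repo once *no* mechanism
            # holds a reference, so a manual association survives an
            # importer-side remove_unit() and vice versa.
            self.owners.discard(owner)
            return not self.owners  # True -> association can be dropped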

4. It's probably worth also discussing the scratchpad API that lets the 
importer store additional state between syncs. I like having this as
a convenience API (rather than having to store everything as explicit 
metadata on the repo), but for debugging purposes, it would probably be 
good to publish "_importer_scratchpad" as a metadata attribute on the 
repo that is accessible via REST.
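
Usage from an importer would be along these lines (get_scratchpad and 
set_scratchpad stand in for whatever the conduit ends up calling them):

    import time

    def sync_repo(repo, conduit, config):
        # Lightweight state that persists between sync runs without
        # becoming part of the repo's public metadata.
        state = conduit.get_scratchpad() or {}
        last_sync = state.get('last_sync', 0)
        fetch_content_newer_than(last_sync)  # hypothetical helper
        state['last_sync'] = time.time()
        conduit.set_scratchpad(state)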

> This is too big and ambitious for me to get right on my own.

Definitely headed in the right direction, but I think it's worth pushing 
the "structured data" approach even further. You've already started 
doing this on the configuration side of things, and I think it makes 
sense on the metadata side as well.

Cheers,
Nick.

-- 
Nick Coghlan
Red Hat Engineering Operations, Brisbane



