[Pulp-list] Importer Sync APIs

Nick Coghlan ncoghlan at redhat.com
Wed Nov 23 05:43:52 UTC 2011


On 11/23/2011 12:07 AM, Jay Dobies wrote:
>> 1. The new sync log API looks pretty good. What I'll do is set up my
>> sync commands to log to a file on disk (since some of them run in a different
>> process), then when everything is done, read that file and pass the
>> contents back in the final report.
>>
>> However, it would be nice to be able to store a "stats" mapping in
>> addition to the raw log data.
>
> Interesting. What sorts of things do you see yourself using it for?

rsync captures a few interesting stats that are useful for figuring out 
how effective it is at saving bandwidth relative to a straight download, 
so I'd just be capturing those and storing them for later analysis.
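
Concretely, I'd just run rsync with --stats and pull a few counters out 
of the output into a dict - something along these lines (only a sketch, 
and the helper name is made up):

     import re
     from subprocess import PIPE, Popen

     def collect_rsync_stats(source, dest):
         # Counters that show how much the delta transfer saved us
         fields = ("Literal data", "Matched data",
                   "Total bytes sent", "Total bytes received")
         proc = Popen(["rsync", "-a", "--stats", source, dest], stdout=PIPE)
         output = proc.communicate()[0]
         stats = {}
         for field in fields:
             match = re.search(r"%s:\s+([\d,]+)" % re.escape(field), output)
             if match:
                 stats[field] = int(match.group(1).replace(",", ""))
         return stats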

>
>> 2. I *think* the 'working directory' API is the
>> 'get_repo_storage_directory()' call on the conduit. However, I'm not
>> entirely clear on that, nor what the benefits are over using Python's
>> own tempfile module (although that may be an artefact of the requirement
>> for 2.4 compatibility in Pulp - with 2.5+, the combination of context
>> managers, tempfile.mkdtemp() and shutil.rmtree() means that cleaning up
>> temporary directories is a *lot* easier than it used to be)
>
> This one came out of the way we sync RPMs. I forget the exact details
> but when I spoke with the guys on our team, they said that it's easier
> on them if they could assemble the repo as part of the sync. The idea
> for the working directory over a temp directory is so we can leverage
> that state from sync to sync.
>
> To a lesser extent, this is also some paranoia on my part. Not that I
> can stop a plugin from writing to a temp directory, but I'd like to push
> a model where we can describe to a user where all Pulp related stuff is.
> If the plugins use the working directories which fall under the Pulp
> parent directory, it feels cleaner in the sense that running Pulp isn't
> throwing things all over the place.

I think it's kind of a given that random stuff can get written to /tmp 
on any system, but I do take your point. (In particular, /tmp may be on 
a relatively small partition, whereas Pulp can require that the working 
directories be stored on a partition with generous size allowances.)
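
For reference, the 2.5+ cleanup pattern I was alluding to is just 
something like:

     from __future__ import with_statement  # only needed on 2.5
     import shutil
     import tempfile
     from contextlib import contextmanager

     @contextmanager
     def scratch_dir():
         # Temporary directory that is removed even if the sync blows up
         path = tempfile.mkdtemp()
         try:
             yield path
         finally:
             shutil.rmtree(path)

so plugin code can do "with scratch_dir() as path:" and forget about the 
cleanup step entirely.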

> That said, I may be being overly paranoid about performance without a
> good reason to be in this area. Alternatively, we could return just a
> subset of data by default but give the plugin the option to request
> "full" unit data. But the first step to all of that is reducing the
> method down to get_units which has more room to grow than get_unit_keys
> ever did.

As Jason suggests, I think the way to go here is to start with a 
"get_units()" API that just returns a flat list of ContentUnitData 
instances, then decide later if additional convenience methods make sense.

For a case like mine, where I'm only storing one content type in each 
repo, turning the list into a dictionary is pretty easy:

     units = dict((unit.id, unit) for unit in conduit.get_units())

Filtering by content type wouldn't be much more difficult:

     units = {}
     for unit in conduit.get_units():
         units.setdefault(unit.type_id, {})[unit.unit_id] = unit

>> - new_unit(type_id, key_data, other_data, relative_path) ->
>> ContentUnitData
>> Does *not* assign a unit ID (or touch the database at all)
>> Does fill in absolute path in storage_path based on relative_path
>> Replaces any use of "request_unit_filename"
>
> So it's basically a factory method that populates generated fields?
> Interesting approach. I'm not a fan of the name "new_unit" since the
> connotation (in my head at least) is that it's doing some sort of
> saving, but that can be alleviated with halfway decent docs. It also
> makes for a really nice mapping of our APIs to CRUD.

Yeah, I don't like the name either, I just didn't have any better ideas.

"preinit_unit" might be better, since it has the right connotations of 
"we don't want to save this yet, but we need help from the Pulp server 
to initialise some of the fields"
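
Whatever the name, from the plugin side I'd expect the flow to look 
roughly like this (the field values and the fetch_into() helper are 
purely illustrative):

     unit = conduit.new_unit(
         type_id="iso",
         key_data={"name": "example.iso", "checksum": "abc123"},
         other_data={"size": 12345},
         relative_path="isos/example.iso",
     )
     # Nothing has been saved yet, but storage_path is already an
     # absolute path under Pulp's storage area, so the importer can
     # put the bits in place before committing anything.
     fetch_into(unit.storage_path)
     # Only now does the unit get persisted (and assigned an ID)
     conduit.save_unit(unit)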

>> For the content unit lifecycle, I suggest adopting a reference counting
>> model where the importer owns one set of references (controlled via
>> save_unit/remove_unit on the importer conduit) and manual association
>> owns a second set of references (which the importer conduit can't
>> touch). A reference through either mechanism would then keep the content
>> unit alive and associated with the repository (the repo should present a
>> unified interface to other code, so client code doesn't need to care if
>> it is an importer association or a manual association that is keeping
>> the content unit alive).
>
> Implementation-wise I think it'd be a little different than you explain,
> but conceptually I like the idea of a reference owner. That would go a
> long way towards eventually supporting multiple importers as well if we
> ever needed to go that route.

Yeah, I knew I was hand-waving a lot there - definitely just trying to 
get across the concept, since I don't know anywhere near enough about 
how associations work to offer advice on implementation details.
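
To make the hand-waving a little more concrete, the mental model I had 
was roughly this (nothing like a real implementation, just the 
bookkeeping idea):

     class UnitAssociation(object):
         # A unit stays associated with the repo for as long as at
         # least one owner still holds a reference to it.
         def __init__(self):
             self.owners = set()

         def add_reference(self, owner):
             # owner is e.g. "importer" or "manual"
             self.owners.add(owner)

         def remove_reference(self, owner):
             self.owners.discard(owner)
             # True means no references remain and the unit can be
             # unassociated from the repo
             return not self.owners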

> Currently, the scratchpad is accessible on the importer itself through
> REST:
>
> /v2/repositories/my-repo/importers/

Ah, OK. So long as it's accessible somewhere, I'm not overly worried 
about where.
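
(i.e. being able to do something like

     curl -k -u admin https://<pulp-server>/v2/repositories/my-repo/importers/

is all I really need, give or take the exact base URL)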

Cheers,
Nick.

-- 
Nick Coghlan
Red Hat Engineering Operations, Brisbane



