[Pulp-dev] Concerns about bulk_create and PostgreSQL

Tue Jan 8 19:49:41 UTC 2019

On 08.01.2019 17:16, Jeff Ortel wrote:
>
>
> On 1/3/19 1:28 PM, Simon Baatz wrote:
>> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
>>>     I don't think that using integer ids with bulk_create and supporting
>>>     mysql/mariadb are necessarily mutually exclusive. I think there might
>>>     be a way to find the records created using bulk_create if we know the
>>>     natural key. It might be more performant than using UUIDs as well.
>> This assumes that there is a natural key.  For content types with no
>> digest information in the meta data, there may be a natural key
>> for content within a repo version only, but no natural key for the
>> overall content.  (If we want to support non-immediate modes for such
>> content.  In immediate mode, a digest can be computed from the
>> associated artifact(s)).
>
> Can you give some examples of Content without a natural key?

For example, the metadata obtained for Cookbooks is "version" and
"name" (the same seems to apply to Ruby Gems). With the immediate sync
policy, we can add a digest to each content unit, since we know the
digest of the associated artifact. Thus, the natural key is "version",
"name", and "digest".
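
As a rough sketch of how the digest completes the key in immediate mode
(the CookbookKey type, the key_after_download helper, and the choice of
SHA-256 are my assumptions for illustration, not existing Pulp code):

```python
import hashlib
from typing import NamedTuple, Optional


class CookbookKey(NamedTuple):
    """Hypothetical natural key for a cookbook content unit."""
    name: str
    version: str
    digest: Optional[str]  # None until the artifact has been downloaded


def key_after_download(name, version, artifact_bytes):
    # Assumption: SHA-256 serves as the digest; any stable hash would do.
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    return CookbookKey(name, version, digest)


key = key_after_download("apache2", "5.0.1", b"example tarball contents")
```

Two units with the same name and version but different artifact bytes
end up with different keys, which is what makes the key usable globally.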

In "non-immediate mode", we only have "version" and "name" to work with
during sync. Now, there is a trade-off (I think) and we have the
following possibilities:

1. Just pretend that "version" and "name" are unique. We have a natural
key, but it may lead to the cross-repo effects that I described a while
ago on the list.
2. Use "version" and "name" as natural key within a repo version, but
not globally. In this scenario, it may turn out that two content units
are actually the same after downloading.

I prefer option 2: content sharing is not perfect, but as a user, I
don't have to fear that repositories mix up content that happens to have
the same name and version.
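
The difference between the two options can be sketched roughly like
this (Unit, global_key, and repo_version_key are hypothetical names I
made up for illustration; the repo version ids are arbitrary integers):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Unit:
    name: str
    version: str
    digest: Optional[str] = None  # unknown during a non-immediate sync


def global_key(unit):
    # Option 1: pretend (name, version) is unique across all repositories.
    return (unit.name, unit.version)


def repo_version_key(repo_version_id, unit):
    # Option 2: (name, version) is unique only within one repo version.
    return (repo_version_id, unit.name, unit.version)


a = Unit("apache2", "5.0.1")  # synced from upstream A
b = Unit("apache2", "5.0.1")  # synced from upstream B, maybe different content

collide_globally = global_key(a) == global_key(b)
collide_per_repo = repo_version_key(1, a) == repo_version_key(2, b)
```

With option 1 the two units collapse into one even if their artifacts
differ; with option 2 they stay separate until a digest is available.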

There is also an extension of option 2, which allows content sharing
during sync for immediate mode. Define a "pseudo" natural key at the
global content level: "version", "name" and "digest". "digest" may be null. Two
content units are considered the same if they match in all three
attributes and these attributes are not null. But even in immediate
mode, the artifact will not be downloaded if "name" and "version" are
already present in the repository version the sync is based on. A
pipeline for this could look like:

    def pipeline_stages(self, new_version):
        pipeline = [
            self.first_stage,
            QueryExistingContentUnits(new_version=new_version),
            ExistingContentNeedsNoArtifacts(),
        ]
        if self.download_artifacts:
            pipeline.extend([
                ArtifactDownloader(),
                ArtifactSaver(),
                UpdateContentWithDownloadResult(),
                QueryExistingContentUnits(),
            ])
        pipeline.append(ContentUnitSaver())
        return pipeline

QueryExistingContentUnits(new_version=new_version) associates based on
the "repo version key", while QueryExistingContentUnits() associates
globally based on the "pseudo natural key" (the digest must be set for a
match to occur at all).
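
The two lookups could behave roughly as follows. This is only a sketch
with in-memory dicts standing in for database queries; Unit,
find_in_repo_version, and find_globally are hypothetical names, not the
actual stage implementations:

```python
from typing import NamedTuple, Optional


class Unit(NamedTuple):
    name: str
    version: str
    digest: Optional[str] = None


def find_in_repo_version(unit, repo_units):
    """Repo version key: match on (name, version) only."""
    return repo_units.get((unit.name, unit.version))


def find_globally(unit, global_units):
    """Pseudo natural key: a null digest never matches anything."""
    if unit.digest is None:
        return None
    return global_units.get((unit.name, unit.version, unit.digest))


# Stand-ins for the base repo version and the global content table.
repo_units = {("apache2", "5.0.1"): Unit("apache2", "5.0.1")}
global_units = {("nginx", "2.0.0", "abc123"): Unit("nginx", "2.0.0", "abc123")}
```

A unit found by the first lookup needs no download; a unit that only
gets its digest after download can still be deduplicated by the second.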