[Pulp-dev] Concerns about bulk_create and PostgreSQL

Wed Jan 9 18:14:05 UTC 2019

On Wed, Jan 09, 2019 at 08:46:18AM -0500, David Davis wrote:
>    The Rubygems api includes sha as part of the metadata for a gem.
>    Couldn't you use that as part of the natural key?
>    I'm surprised that Chef's supermarket API doesn't include this as well.
>    Maybe we could open a feature request?

W.r.t. the Supermarket API, I haven't found a digest anywhere (neither
in the documentation nor in the actual responses).  We could open a
feature request, but the official Supermarket is not the only upstream:
there are also Chef Server, Minimart, the (deprecated) berkshelf API
server and JFrog Artifactory.  I would rather not depend on a
cutting-edge feature of the Supermarket API that other upstream servers
would have to incorporate before they are usable with pulp_cookbook.

Moreover, the current metadata is provided as one big chunk of data
(a single request to the 'universe' endpoint).  A digest would probably
be added to the response of the GET for detailed information about a
specific cookbook version (i.e. per name and version), not to
'universe'.  I plan to enrich metadata using this endpoint at some
point, but I want to make it optional: for Chef Supermarket, this would
cause thousands of requests during an initial repo sync, even when
using the on_demand policy.
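
To illustrate where the request count comes from, here is a minimal
sketch.  It assumes the 'universe' response is a JSON object mapping
cookbook names to versions to per-version metadata; the sample data,
the example.com URLs and the helper name count_detail_requests are made
up for illustration and are not part of pulp_cookbook.

```python
import json

# Hypothetical, heavily abbreviated 'universe' response:
# cookbook name -> version -> metadata (no digest available here).
UNIVERSE_JSON = """
{
  "apache2": {
    "5.0.1": {"download_url": "https://example.com/apache2/5.0.1", "dependencies": {}},
    "5.2.0": {"download_url": "https://example.com/apache2/5.2.0", "dependencies": {}}
  },
  "nginx": {
    "8.1.6": {"download_url": "https://example.com/nginx/8.1.6", "dependencies": {}}
  }
}
"""


def count_detail_requests(universe):
    """Return how many per-version GETs would be needed to enrich metadata.

    One extra request is needed for every (name, version) pair.
    """
    return sum(len(versions) for versions in universe.values())


universe = json.loads(UNIVERSE_JSON)
extra_requests = count_detail_requests(universe)
```

The 'universe' data itself costs a single request; the enrichment cost
grows with the total number of cookbook versions, which is why I would
like to keep it optional.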

>    David
> 
>    On Tue, Jan 8, 2019 at 2:50 PM Simon Baatz <[1]gmbnomis at gmail.com>
>    wrote:
> 
>      On 08.01.2019 17:16, Jeff Ortel wrote:
>      >
>      >
>      > On 1/3/19 1:28 PM, Simon Baatz wrote:
>      >> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
>      >>>     I don't think that using integer ids with bulk_create and
>      >>>     supporting mysql/mariadb are necessarily mutually exclusive.
>      >>>     I think there might be a way to find the records created
>      >>>     using bulk_create if we know the natural key. It might be
>      >>>     more performant than using UUIDs as well.
>      >> This assumes that there is a natural key.  For content types
>      >> with no digest information in the meta data, there may be a
>      >> natural key for content within a repo version only, but no
>      >> natural key for the overall content.  (If we want to support
>      >> non-immediate modes for such content.  In immediate mode, a
>      >> digest can be computed from the associated artifact(s)).
>      >
>      > Can you give some examples of Content without a natural key?
>      For example, the meta-data obtained for Cookbooks is "version" and
>      "name" (the same seems to apply to Ruby Gems). With immediate sync
>      policy, we can add a digest to each content unit as we know the
>      digest of the associated artifact. Thus, the natural key is
>      "version", "name", and "digest"
>      In "non-immediate mode", we only have "version" and "name" to work
>      with during sync. Now, there is a trade-off (I think) and we have
>      the following possibilities:
>      1. Just pretend that "version" and "name" are unique. We have a
>         natural key, but it may lead to the cross-repo effects that I
>         described a while ago on the list.
>      2. Use "version" and "name" as natural key within a repo version,
>         but not globally. In this scenario, it may turn out that two
>         content units are actually the same after downloading.
>      I prefer option 2: Content sharing is not perfect, but as a user, I
>      don't have to fear that repositories mix-up content that happens to
>      have the same name and version.
>      There is also an extension of 2., which allows content sharing
>      during sync for immediate mode. Define a "pseudo" natural key on
>      global content level: "version", "name" and "digest". "digest" may
>      be null. Two content units are considered the same if they match in
>      all three attributes and these attributes are not null. But even in
>      immediate mode, the artifact will not be downloaded if "name" and
>      "version" are already present in the repository version the sync is
>      based on. A pipeline for this could look like:
>          def pipeline_stages(self, new_version):
>              pipeline = [
>                  self.first_stage,
>                  QueryExistingContentUnits(new_version=new_version),
>                  ExistingContentNeedsNoArtifacts()
>              ]
>              if self.download_artifacts:
>                  pipeline.extend([ArtifactDownloader(), ArtifactSaver(),
>                                   UpdateContentWithDownloadResult(),
>                                   QueryExistingContentUnits()])
>              pipeline.extend([ContentUnitSaver()])
>              return pipeline
>      QueryExistingContentUnits(new_version=new_version) associates based
>      on the "repo version key",
>      QueryExistingContentUnits() associates globally based on the "pseudo
>      natural key" (digest must be set to match at all)
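
The two matching rules quoted above can be sketched as follows.  This
is only an illustration of the comparison logic under my reading of the
rules, not the actual stage implementation; the names CookbookKey,
same_in_repo_version and same_globally are hypothetical.

```python
from typing import NamedTuple, Optional


class CookbookKey(NamedTuple):
    """Hypothetical key attributes of a cookbook content unit."""
    name: str
    version: str
    digest: Optional[str]  # None until the artifact has been downloaded


def same_in_repo_version(a: CookbookKey, b: CookbookKey) -> bool:
    """'Repo version key': name and version only, valid within one repo version."""
    return (a.name, a.version) == (b.name, b.version)


def same_globally(a: CookbookKey, b: CookbookKey) -> bool:
    """'Pseudo natural key': all three attributes must be set and equal.

    A missing digest never matches, not even another missing digest.
    """
    if any(value is None for value in a) or any(value is None for value in b):
        return False
    return a == b


lazy = CookbookKey("apache2", "5.2.0", None)
downloaded = CookbookKey("apache2", "5.2.0", "abc123")
other_upstream = CookbookKey("apache2", "5.2.0", "def456")
```

With these definitions, lazy and downloaded match within a repo version
but not globally, and downloaded and other_upstream stay distinct
content units although name and version are the same.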
>      _______________________________________________
>      Pulp-dev mailing list
>      [2]Pulp-dev at redhat.com
>      [3]https://www.redhat.com/mailman/listinfo/pulp-dev
> 
> References
> 
>    1. mailto:gmbnomis at gmail.com
>    2. mailto:Pulp-dev at redhat.com
>    3. https://www.redhat.com/mailman/listinfo/pulp-dev