[Pulp-dev] pulp 3 upload API validation

Tue Jul 11 23:38:13 UTC 2017

On Tue, Jul 11, 2017 at 2:23 PM, Jeff Ortel <jortel at redhat.com> wrote:

>
>
> On 07/11/2017 12:27 PM, Dennis Kliban wrote:
> > On Tue, Jul 11, 2017 at 1:20 PM, Brian Bouterse <bbouters at redhat.com
> <mailto:bbouters at redhat.com>> wrote:
> >
> >     We should not raise a validation exception because due to semver we
> can't stop raising that exception
> >     later. Specifically if we want to ever allow double checksum being
> specified in the future.
> >
> >     For the MVP, I think the choices are: only respect sha256 and ignore
> the rest OR do as many standard
> >     validators as they specify. I'm ok with either of these with a
> slight preference for validating as many of
> >     them as are specified would be the best. Since we're already
> handling all of the data at save time,
> >     pushing it through additional digest validators will cost a bit of
> cpu but not much additional I/O. Also
> >     if the user asked us to do that validation then having it demand
> some additional resources is OK. Also
> >     having feature parity with the changesets is good. The changesets
> can validate all standard digest
> >     validators so having uploads do the same would be consistent.
> >
> >
> > I am reversing what I had previously said, and I agree that we should
> not raise an exception if a user
> > provides more than one checksum at upload time. My current
> implementation checks every checksum specified by
> > the user.
> >
> >
>
> I don't think I understand the purpose of an upload user specifying more
> than 1 algorithm/digest for the
> platform to validate the integrity of an uploaded file.  Why would the
> user specify an algorithm/digest they
> don't trust?
>

There isn't a good reason for a user to specify multiple digests especially
since sha512 is a significantly stronger statistical claim than others we
support. If a user did want to do it (even because they could) I don't
think we should put in effort to stop them. I was thinking if they ask us
to validate 3 types, we can just do that blindly and reject if any of them
fail. Also we should be pre-computing all digests.

I think both sync and upload should pre-compute all the checksums on each
the Artifact fields [0]. This was a feature we had to add later for
pulp_rpm for Pulp2. We've had use cases where plugin code wants to publish
different publishes with different checksums, especially in cases where
repositories are consumed by EL5 systems which don't support sha256 which
was the default checksum. With that in mind, I believe both Artifact upload
and ChangeSets should be pre-computing all checksums automatically.
@jortel, do the ChangeSets do that currently?

[0]:
https://github.com/pulp/pulp/blob/3.0-dev/platform/pulpcore/app/models/content.py#L104-L109

>
> >
> >
> >     On Mon, Jul 10, 2017 at 5:11 PM, Jeff Ortel <jortel at redhat.com
> <mailto:jortel at redhat.com>> wrote:
> >
> >
> >
> >         On 07/10/2017 02:36 PM, Dennis Kliban wrote:
> >         > On Mon, Jul 10, 2017 at 3:26 PM, Michael Hrivnak <
> mhrivnak at redhat.com <mailto:mhrivnak at redhat.com>
> >         <mailto:mhrivnak at redhat.com <mailto:mhrivnak at redhat.com>>>
> wrote:
> >         >
> >         >
> >         >
> >         >     On Mon, Jul 10, 2017 at 3:06 PM, Dennis Kliban <
> dkliban at redhat.com <mailto:dkliban at redhat.com> <mailto:dkliban at redhat.com
> >         <mailto:dkliban at redhat.com>>> wrote:
> >         >
> >         >         The upload API for Artifacts is going to allow users
> to specify the artifact size and a digest. The
> >         >         Artifact model currently supports  'md5', 'sha1',
> 'sha224', 'sha256', 'sha384', and 'sha512' digests.
> >         >
> >         >         Do we want to let users specify more than one digest
> per upload? e.g. md5 and sha256?
> >         >
> >         >
> >         >     There may be no harm in this, but it would add complexity
> to the verification and not add much value. I'd
> >         >     stick with just one unless there's a compelling reason for
> multiple.
> >         >
> >         >
> >         > I agree. The API is going to raise a validation exception when
> more than 1 digest is provided.
> >
> >         +1
> >
> >         >
> >         >
> >         >
> >         >
> >         >
> >         >         Do we want to store all 6 digests for each Artifact?
> >         >
> >         >
> >         >     The expensive part of calculating the digests is reading
> the file. As long as you're already reading the
> >         >     entire file, which we will during verification, you may as
> well stuff the bits through multiple hashers
> >         >     (digesters?) and get all the digests. Pulp 2 has a
> function that does this:
> >         >
> >         >     https://github.com/pulp/pulp/
> blob/2.13-release/server/pulp/server/util.py#L327-L353
> >         <https://github.com/pulp/pulp/blob/2.13-release/server/pulp/
> server/util.py#L327-L353>
> >         >     <https://github.com/pulp/pulp/
> blob/2.13-release/server/pulp/server/util.py#L327-L353
> >         <https://github.com/pulp/pulp/blob/2.13-release/server/pulp/
> server/util.py#L327-L353>>
> >         >
> >         >     But we can't always guarantee that we'll have all the
> checksums available, for at least two reasons. 1) If
> >         >     in the future if we want to use yet another algorithm, we
> probably won't want to run a migration that
> >         >     re-reads every file and calculates the additional digest.
> 2) For on-demand content, we don't have it
> >         >     locally, so we can't calculate any additional checksums
> until it gets fetched.
> >         >
> >         >     So this may be one of those times where we use a
> good-ole-fashioned getter method that returns the
> >         >     requested digest if it's on the artifact, calculates it if
> not, or raises an exception if the value isn't
> >         >     available and can't be calculated.
> >         >
> >         >
> >         > For uploaded Artifacts, all of the digests will be calculated
> as the file is being processed during the
> >         > upload. So I don't think calculating all of them should incur
> significantly more cost than just one. The code
> >         > snippet from Pulp 2 looks similar to what I am doing.
> >
> >         This functionality should be a method on the Artifact and not a
> util function somewhere.
> >
> >         >
> >         > I haven't given much thought to the getter, but your idea
> sounds fine to me.
> >         > Thanks,
> >         > Dennis
> >         >
> >         >
> >         >
> >         >
> >         >     --
> >         >
> >         >     Michael Hrivnak
> >         >
> >         >     Principal Software Engineer, RHCE
> >         >
> >         >     Red Hat
> >         >
> >         >
> >         >
> >         >
> >         > _______________________________________________
> >         > Pulp-dev mailing list
> >         > Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
> >         > https://www.redhat.com/mailman/listinfo/pulp-dev <
> https://www.redhat.com/mailman/listinfo/pulp-dev>
> >         >
> >
> >
> >         _______________________________________________
> >         Pulp-dev mailing list
> >         Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
> >         https://www.redhat.com/mailman/listinfo/pulp-dev <
> https://www.redhat.com/mailman/listinfo/pulp-dev>
> >
> >
> >
> >     _______________________________________________
> >     Pulp-dev mailing list
> >     Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
> >     https://www.redhat.com/mailman/listinfo/pulp-dev <
> https://www.redhat.com/mailman/listinfo/pulp-dev>
> >
> >
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20170711/41d81371/attachment.htm>