[Pulp-dev] 3.0 Validation update

Wed Apr 12 14:11:43 UTC 2017

On 04/11/2017 04:08 PM, Michael Hrivnak wrote:
> I like thinking about this as business logic. Data may be valid, but it may not be usable in a particular context.
> 
> To help figure out where such logic should live, it may help to think about where the check is most important.
> I've described that time as "at the time of use" earlier in this discussion (maybe just on IRC). With sync as
> an example, a workflow will load an importer's config from the database, check it for problems, and then
> immediately use the values it just inspected. This is the critical moment where it must gracefully handle
> unusable data. This check ensures correct behavior and avoids an unhandled exception or crash.

+1

> 
> We can and should also check for problems at earlier opportunities, such as at the time a user tries to queue
> a sync. This improves the user experience, but it is not required for correct and safe operation.

The data can change between the time when the task is queued and when it is executed.  This adds complexity
and the value added is not deterministic.  -1 for a check here.

> 
> Given that, I think it makes sense to put the check close to the data. A method on the model seems reasonable.
> In terms of polluting the model with business logic, it isn't that different from defining a custom query set
> on the model, which django encourages.

Agreed, but only for data integrity. Each field should be validated.  And, the record (cross field) should be
validated.  We don't want to store things in the DB that are invalid.  Seems like the django ORM already
provides for this, right?  But, the model should not have methods to validate itself for a particular
operation (usage).  For example, ensure_syncable() that validates the feed URL would not be appropriate.  This
should be checked at run-time by the object doing the sync.  Perhaps a double dispatch model in
plugin.Importer could do:

def synchronize():
    # do validation and preparation.
    return self._synchronize()

def _synchronize():
    # overridden by plugins.
    raise NotImplementedError()

> 
> As a slight tangent, some applications take this sort of checking even further. An admirable approach in REST
> API design, which may not be a good idea for us at this time but is interesting to note, is to make a behavior
> such as "sync" only available via a link accessed via a known name in an object's representation. That's a
> mouthful, so here's an example:
> 
> {
>   "id": "foo",
>   "feed": "http://cdn.redhat.com/stuff/",
>   "_links":
>   {
>     "self": "http://here.com/importers/foo",
>     "sync": "http://here.com/importers/foo/sync"
>   }
> }
> 
> Consider that the link for starting a sync is not part of the published API, except that it must be obtained
> from this representation. There are two advantages here.
> 
> The main advantage I'm pointing out is that when the server creates this representation of an Importer, it
> would only include the "sync" link if the current state of the object would allow for a sync. If there were no
> feed, there would be no sync link, and thus the client would be unable to even try starting one. So this is a
> third opportunity to check whether the object's state is suitable for a sync. It even allows the client to
> show or hide a "sync" button without having to re-implement the business logic that's already present on the
> server side. Neat, huh?

Interesting.

However, the duration of how long the 'sync' link is valid is not deterministic.  1 ms after the link is
included in the returned document, a DB commit could set it to NULL.  The link could be invalid by the time
the user even receives the REST reply.

> 
> Another advantage to this kind of approach is a smaller API surface area. We could theoretically change the
> sync URL schema at any time. We could even move it to a new, separate service. We'd still need to document how
> to use it, but it's actual location can change. In practice I don't think this aspect is all that valuable
> unless you are 100% bought in to this design. But it's fun to think about.
> 
> And to re-state, this kind of thing may not be worth our time to actually do right now, and I'm not proposing
> it. I don't know to what extent DRF would make this easy. But I wanted to bring it up for interest's sake as
> yet another place in the workflow, even closer to the end user than the other two we've discussed, where
> applications have an opportunity to utilize context checking of data.
> 
> On Tue, Apr 11, 2017 at 3:49 PM, Austin Macdonald <amacdona at redhat.com <mailto:amacdona at redhat.com>> wrote:
> 
>     Where should business logic live? As an example, I want to consider the sync task [0] and the need to
>     ensure that an importer is syncable. For now, let's say that an importer is syncable if it has a feed_url. 
> 
>     Since the call to sync an importer comes from the API, the earliest we can check that the configuration is
>     syncable is in a not-yet-written SyncView. Checking syncability here is ideal because it allows us to
>     immediately return a meaningful http response instead of happily returning information about a Task that
>     we already know is doomed to fail. 
> 
>     Performing the check in the API layer is not enough. We have discussed edge cases that lead to an
>     importer's feed_url being removed while the sync task is in the WAITING state. To make an assertion to the
>     plugin that the feed_url exists, we have to check syncability again when the Task moves into an active state.
> 
>     My thinking is that we should put this business logic on the model.
> 
>     Admittedly, it is not a clean fit with the separation of concerns philosophy but we have already violated
>     the concept by putting the sync method on the model. If sync is on the model, it seems like
>     ensure_syncable should be too. 
> 
>     If we write the platform API layer to use this kind of business logic, then the plugins can add double
>     checking business logic without modifying the API and Task layers.
> 
> 
>     [0]: https://pulp.plan.io/issues/2399 <https://pulp.plan.io/issues/2399>
> 
>     On Fri, Apr 7, 2017 at 2:14 PM, Sean Myers <sean.myers at redhat.com <mailto:sean.myers at redhat.com>> wrote:
> 
>         On 04/07/2017 12:08 PM, Brian Bouterse wrote:
>         > == questions ==
>         > * Where should ^ terms be documented?
> 
>         I'm not really sure, but recommend the wiki as a good starting point for
>         putting information that we should probably "officially" document *somewhere*,
>         but at the moment we aren't quite sure where.
> 
>         https://pulp.plan.io/projects/pulp/wiki/Pulp_3_Developer_Notes
>         <https://pulp.plan.io/projects/pulp/wiki/Pulp_3_Developer_Notes>
> 
>         > * Take the case of a sync which has overrides provided? This isn't in the
>         > MVP, but in the future it could be. In that case, does the serializer
>         > associated with the importer validate the database data with the overrides
>         > "added" on top of it?
> 
>         My question here is "validate against what?". It makes sense to validate
>         against database data, but as long as the overrides aren't themselves stored
>         in the database, what does this really stop?
> 
>         For example, what prevents two simultaneous syncs of repos using overrides
>         that would trigger a constraint violation if they were saved, but don't do
>         this because we don't save the overrides?
> 
>         > * For plugin writers writing a serializer for a subclassed Importer, do
>         > they also need to express validations for fields defined on the base
>         > Importer?
> 
>         It depends on the validation. If it's just validating an individual field,
>         no. If it's validation of a combination of values in multiple fields, and
>         one of those fields in this case was defined in a subclass, the subclass
>         will need to add the proper validation support.
> 
>         > * The database still rejects data that doesn't adhere to the data layer
>         > definition right? That occurs even without the DRF serializer correct?
> 
>         Again, this depends. For example, attempting to store an int in a charfield
>         will work, because Django will coerce that int to string on save. Attempting
>         to store a string in an IntegerField will fail, though, because Django is
>         not able to coerce str to int prior to saving. Generally, though, your
>         understanding is correct. Anything that the database can't handle will be
>         rejected.
> 
>         > * In cases where data is created in the backend, do we need to validate
>         > that as a general practice? If we do, do we call the DRF serializer
>         > regularly in the backend or just let the database reject "bad" data at the
>         > db level?
> 
>         As a general practice, I don't think so. Specifically, though, when we're
>         passing data around, like when a bit of platform code is taking incoming
>         plugin data and passing it into some standard workflow that platform provides
>         (like running sync on an importer, say) I think it's going to be a good idea
>         and an all-around gentlemenly thing to do to validate that data in some way
>         that appropriate to the process/workflow being invoked.
> 
>         I'm concerned about finding the balance between making things user-friendly
>         for plugin writers and having our checking code that provides that user-
>         friendly-ness itself be difficult to maintain and end up being pulp-developer-
>         unfriendly.
> 
> 
>         _______________________________________________
>         Pulp-dev mailing list
>         Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>         https://www.redhat.com/mailman/listinfo/pulp-dev <https://www.redhat.com/mailman/listinfo/pulp-dev>
> 
> 
> 
>     _______________________________________________
>     Pulp-dev mailing list
>     Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>     https://www.redhat.com/mailman/listinfo/pulp-dev <https://www.redhat.com/mailman/listinfo/pulp-dev>
> 
> 
> 
> 
> -- 
> 
> Michael Hrivnak
> 
> Principal Software Engineer, RHCE 
> 
> Red Hat
> 
> 
> 
> _______________________________________________
> Pulp-dev mailing list
> Pulp-dev at redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 847 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20170412/6ae4f1bf/attachment.sig>