[katello-devel] Organization deletion bug, orchestration, testing

Thu Dec 13 14:49:08 UTC 2012

Guys,

I have been working on a nasty bug for two days now. When you delete an
organization, its Red Hat provider (created by default with the name
"Red Hat") is not deleted. If you have imported any manifests there or
synced content, that is not deleted either. But the organization itself
gets deleted from our database. And there is more: orchestration is
somehow broken, so things are not deleted in the backend engines either.
Even if I fix the provider deletion, deletion orchestration just does
not work, mainly because organization deletion has not been working for
a while.

I think this is a typical error that shows a major weak point of our
orchestration code: it is tightly coupled with models. Organization
deletion was refactored to be a background job, because it can take a
long time to delete.

The implementation is a bit hacky: each organization has a task_id flag,
and when it is set to a non-nil value, the organization is hidden with a
default_scope. That means once the background task starts, the
organization is immediately invisible to both UI and CLI.

Deletion should be a separate process (a method or something similar in
our business logic code) that would be started either directly or via
delayed_jobs. When quality engineers tested Katello as a "black box",
everything looked good.

But inside there was a background job that (due to our bug) deleted only
the organization from the Katello database, leaving all providers,
products, repositories and other data there. Basically it deleted one
record from our database and then stopped.

We have our unit tests that are able to reveal errors in units, and QA
have their system tests which test Katello as a project. But we are
missing one important thing: integration tests. Something similar to
the PulpV2 VCR test suite, which is able to verify that all required
HTTP REST calls were made. The de facto standard in enterprise
integration is the very similar approach of "recording" interactions
between systems and then making stubs and comparing against results. By
the way, I have been trained on software called Green Hat (it's
proprietary, but a funny name, right?).
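
To make the "recording" idea concrete, here is a minimal sketch in plain
Ruby. It is not how the PulpV2 VCR suite works; RecordingProxy,
FakeBackend, delete_organization and the URL paths are all made up for
illustration. A proxy records every call sent to a backend client, and
the test then compares the recorded calls with the expected ones.

```ruby
# Hypothetical sketch: none of these names or paths exist in Katello.

# Wraps any backend client and records every call made through it.
class RecordingProxy
  attr_reader :calls

  def initialize(target)
    @target = target
    @calls = []
  end

  def method_missing(name, *args, &block)
    return super unless @target.respond_to?(name)
    @calls << [name, *args]
    @target.public_send(name, *args, &block)
  end

  def respond_to_missing?(name, include_private = false)
    @target.respond_to?(name, include_private) || super
  end
end

# Stand-in for a real backend; a real suite would replay recorded
# HTTP responses instead of returning a fixed status.
class FakeBackend
  def delete(_path)
    200
  end
end

# Simplified stand-in for the code under test.
def delete_organization(label, pulp:, candlepin:)
  pulp.delete("/repositories/#{label}")
  candlepin.delete("/owners/#{label}")
end

pulp = RecordingProxy.new(FakeBackend.new)
candlepin = RecordingProxy.new(FakeBackend.new)
delete_organization("acme", pulp: pulp, candlepin: candlepin)

# The integration test compares recorded calls with the expected ones.
expected_pulp = [[:delete, "/repositories/acme"]]
expected_candlepin = [[:delete, "/owners/acme"]]
```

With such a check in place, a deletion that silently skips one backend
would fail the comparison instead of passing unnoticed.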

User story:

As a dev, I want a decent integration test suite for all backend engines

The bug also points at our orchestration: because it is tightly coupled
with models, we designed the deletion orchestration in that hacky way.
The proper and logical approach is to start a process (or at least a
bit of Ruby code) that has a procedural structure and does all the
necessary things in simple steps, like one function or several
sub-function calls. In the EI world, these are processes and
sub-processes.
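
As a rough illustration of what I mean by a procedural structure, here
is a sketch. The class name, the step names and the log are
hypothetical; the real steps would call Pulp, Candlepin and our
database.

```ruby
# Hypothetical sketch: step names and the log are illustrative only.
class OrganizationDeletion
  attr_reader :log

  def initialize(org_label)
    @org_label = org_label
    @log = []
  end

  # The whole process reads top to bottom, one sub-process per line.
  def run
    delete_repositories
    delete_products
    delete_providers
    delete_backend_owner
    delete_organization_record
    self
  end

  private

  # Each sub-process would talk to one backend; here it only logs.
  def delete_repositories
    @log << "repositories of #{@org_label} deleted"
  end

  def delete_products
    @log << "products of #{@org_label} deleted"
  end

  def delete_providers
    @log << "providers of #{@org_label} deleted"
  end

  def delete_backend_owner
    @log << "backend owner #{@org_label} deleted"
  end

  def delete_organization_record
    @log << "organization #{@org_label} deleted"
  end
end

deletion = OrganizationDeletion.new("acme").run
```

The point is that the order of operations is visible in one place
instead of being spread over model callbacks.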

But since our orchestration is hooked into Katello database, we tend to
rely on it for things that should definitely not be written as updates
or deletes in our database.

This example also shows how important the ability to write some
orchestration in a one-way messaging pattern is (Katello integration is
nothing more than message handling between backend systems). For
example, deletion is a typical one-way process that should either finish
or be suspended until someone investigates what is wrong and resumes or
cancels it. This makes recovery much easier. This is not my invention,
but a standard approach for most integration projects.
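
A sketch of the finish-or-suspend behaviour, under the assumption that
a process is just an ordered list of named steps. OneWayProcess and the
step names are hypothetical, and a real implementation would have to
persist the state somewhere.

```ruby
# Hypothetical sketch of a one-way process that suspends on failure.
class OneWayProcess
  attr_reader :state, :completed, :error

  # steps: array of [name, callable] pairs, executed in order
  def initialize(steps)
    @steps = steps
    @completed = []
    @state = :pending
    @error = nil
  end

  # Runs the remaining steps. On any error the process is suspended
  # and remembers which steps already finished.
  def run
    return @state if [:finished, :cancelled].include?(@state)
    @state = :running
    @error = nil
    @steps.drop(@completed.size).each do |name, action|
      begin
        action.call
        @completed << name
      rescue StandardError => e
        @error = e
        @state = :suspended
        return @state
      end
    end
    @state = :finished
  end

  # Resuming simply continues with the first unfinished step.
  alias resume run

  def cancel
    @state = :cancelled if @state == :suspended
    @state
  end
end

pulp_up = false
process = OneWayProcess.new([
  [:candlepin, -> { :ok }],
  [:pulp, -> { raise "Pulp is down" unless pulp_up }],
  [:database, -> { :ok }]
])

first_state = process.run            # suspends at the :pulp step
done_before = process.completed.dup  # only :candlepin has finished
pulp_up = true                       # someone fixed the backend
final_state = process.resume         # continues from :pulp
```

Because finished steps are remembered, resuming does not repeat the
Candlepin call, and nothing is removed from our database before the
backends have been cleaned up.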

In short, Katello orchestration should be a separate component/service
with independent parties: Katello, Candlepin, Pulp, Foreman. And it
should be able to work online or as a background service, allowing
request-reply or one-way MEPs (message exchange patterns).

There are existing solutions like JBoss Drools, Apache Camel, Apache
ServiceMix, all Java based. As I don't see it as feasible to integrate
with those, I need to insist on adding tasks that would change the way
orchestration works today.

Recovery from data inconsistency bugs is _very_ expensive. Actually
there are enterprises that have offerings solely dedicated to this
topic.

User stories:

As a dev, I want to detach orchestration from models
As a dev, I want to have clean and consistent orchestration code
Design out: Process-like orchestration with various MEPs

We are not done yet! There is more. We have lots of database hooks and
validations. For this particular deletion, there are before_delete hooks
and in those it is not sufficient to return false if there is a problem
(validation issue or general error). We must raise an exception,
otherwise Rails will not roll back the whole transaction.
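
The difference can be shown with a toy model. This is not ActiveRecord:
ToyDatabase and both hooks are made up, and a plain hash plays the
database. The toy only mirrors the fact that a transaction block is
rolled back when an exception leaves it.

```ruby
# Toy model, not ActiveRecord: a hash plays the database.
class ToyDatabase
  attr_reader :rows

  def initialize(rows)
    @rows = rows
  end

  # Restores the snapshot only when the block raises.
  def transaction
    snapshot = @rows.dup
    yield
    true
  rescue StandardError
    @rows = snapshot
    false
  end
end

# Hypothetical hooks that both signal "provider deletion failed".
def provider_hook_returning_false
  false
end

def provider_hook_raising
  raise "provider could not be deleted"
end

db1 = ToyDatabase.new({ org: "acme", provider: "Red Hat" })
db1.transaction do
  db1.rows.delete(:org)
  provider_hook_returning_false   # failure is silently ignored
end

db2 = ToyDatabase.new({ org: "acme", provider: "Red Hat" })
db2.transaction do
  db2.rows.delete(:org)
  provider_hook_raising           # failure aborts the transaction
end
```

In the first case the organization is gone while the provider stays,
which is the kind of inconsistency described at the top of this mail.
In the second case both rows are still there.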

User stories:

As a dev, I want all callbacks reviewed so that they raise errors when
the transaction should be rolled back

I will only fix this particular BZ (the org deletion case). I will also
continue investigating what is wrong and what data was or was not
deleted in each backend engine, and prepare some kind of migration
script that will correct the data inconsistencies. Everything else
should go onto our backlog as the user stories above.

LZ

-- 
Later,

 Lukas "lzap" Zapletal
 #katello #systemengine