[Pulp-dev] Issue #2619

Fri Apr 27 21:33:27 UTC 2018

Folks,

I'd like to poll the channel for feedback about the current
implementation and possible alternatives to it.
Issue #2619 TL;DR: report discrepancies between information kept in
Mongo and the state of (published) data kept on the disk[1]

Recent reviews suggest basing the implementation on relational data
kept in SQLite:
  - collect traits from a filesystem walk (checksums, sizes, link targets and paths)
  - store these traits in separate tables
  - dump Mongo into unit, distributor (configuration) and repository tables
  - query the relational data to infer any discrepancies, e.g. broken
symlinks, wrong sizes or checksums
  - reuse the database for generating subsequent reports
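
To make the relational idea above concrete, here is a minimal sketch
using Python's built-in sqlite3 module. The table names (units,
fs_traits), their columns and the sample rows are assumptions made up
for illustration; they are not taken from the reviews or from any
existing Pulp schema.

```python
import sqlite3


def find_discrepancies(conn):
    """Return (path, problem) rows where Mongo-derived data and disk traits disagree."""
    query = """
        SELECT u.path,
               CASE
                   WHEN f.path IS NULL THEN 'missing'
                   WHEN f.size != u.size THEN 'wrong size'
                   WHEN f.checksum != u.checksum THEN 'wrong checksum'
               END AS problem
        FROM units AS u
        LEFT JOIN fs_traits AS f ON f.path = u.path
        WHERE f.path IS NULL
           OR f.size != u.size
           OR f.checksum != u.checksum
        ORDER BY u.path
    """
    return conn.execute(query).fetchall()


# Hypothetical schema: one table dumped from Mongo, one filled by the walk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE units (path TEXT PRIMARY KEY, size INTEGER, checksum TEXT);
    CREATE TABLE fs_traits (path TEXT PRIMARY KEY, size INTEGER, checksum TEXT);
""")
# Made-up sample data: what Mongo claims ...
conn.executemany("INSERT INTO units VALUES (?, ?, ?)", [
    ("a.rpm", 10, "aa"), ("b.rpm", 20, "bb"),
    ("c.rpm", 30, "cc"), ("d.rpm", 40, "dd"),
])
# ... versus what the filesystem walk found (d.rpm is absent).
conn.executemany("INSERT INTO fs_traits VALUES (?, ?, ?)", [
    ("a.rpm", 10, "aa"), ("b.rpm", 21, "bb"), ("c.rpm", 30, "xx"),
])

for path, problem in find_discrepancies(conn):
    print(path, problem)
```

Once both tables are populated, further reports would only need new
queries against the same database rather than another filesystem walk.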

Current approach[2] TL;DR:
 - assemble a validation scenario based on CLI arguments, e.g.:
    --check existence --check broken_rpm_symlinks --check size --check checksum
 - one by one, match the applicable content units from Mongo against
the validation scenario and filesystem traits
 - optionally skip checks that would fail for a unit anyway, e.g. the
checksum check after an invalid size
 - generate a report as a flat list of results per unit and
repository, in JSON format
 - perform subsequent queries over the generated JSON report, e.g.
       jq '[.report[].repository] | unique' < report.json
   to get a list of affected repositories
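
As a rough illustration of the flow listed above (not the code from
the pull requests), the sketch below runs an ordered scenario of checks
per unit, skips the remaining checks after the first failure, and
builds a flat report. The unit dictionary keys, the CHECKS registry,
the report entry layout and the use of SHA-256 are all assumptions made
for this example.

```python
import hashlib
import os


def check_existence(unit):
    return os.path.exists(unit["path"])


def check_size(unit):
    return os.path.getsize(unit["path"]) == unit["size"]


def check_checksum(unit):
    # SHA-256 is an assumption; real units may use other checksum types.
    with open(unit["path"], "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == unit["checksum"]


# Hypothetical registry mapping --check names to check functions.
CHECKS = {
    "existence": check_existence,
    "size": check_size,
    "checksum": check_checksum,
}


def validate(units, scenario):
    """Run the checks named in `scenario`, in order, against each unit."""
    report = []
    for unit in units:
        for name in scenario:
            if not CHECKS[name](unit):
                report.append({
                    "unit": unit["path"],
                    "repository": unit["repository"],
                    "check": name,
                })
                # Later checks would fail too (e.g. checksum after a
                # wrong size), so skip them for this unit.
                break
    return report


def affected_repositories(report):
    """Python counterpart of: jq '[.report[].repository] | unique'."""
    return sorted({entry["repository"] for entry in report})
```

Here the scenario order matters: listing "existence" before "size"
before "checksum" is what lets a cheap failing check short-circuit the
expensive ones.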

There are some caveats with the current approach, such as:
 - in some cases, traits are first loaded into memory from a
filesystem walk, e.g. symlink targets
 - some repeated Mongo queries are cached as well, e.g. distributors and
repositories
 - detecting broken symlinks still gives false negatives in corner cases
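
For reference, a naive way to collect broken symlinks during a
filesystem walk is sketched below. This is an assumed baseline, not
the detection logic from the pull requests, and it makes no attempt to
handle the corner cases mentioned above; the function name and the
root argument are hypothetical.

```python
import os


def broken_symlinks(root):
    """Return sorted paths of symlinks under `root` whose targets do not resolve."""
    broken = []
    # os.walk does not follow directory symlinks by default, but it
    # still lists them, so both name lists have to be inspected.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it
            # and returns False when the target is missing.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

A check like this only proves that a link resolves to something; it
says nothing about whether the target is the unit the link is supposed
to point at.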

Cheers,
milan

[1] https://pulp.plan.io/issues/2619
[2] https://github.com/pulp/pulp_rpm/pull/1104
https://github.com/pulp/pulp/pull/3465