[Pulp-dev] How docker type repositories are getting synced in pulp3 w.r.t concurrency ?

Ina Panova ipanova at redhat.com
Tue Mar 30 12:07:35 UTC 2021


--------
Regards,

Ina Panova
Senior Software Engineer| Pulp| Red Hat Inc.

"Do not go where the path may lead,
 go instead where there is no path and leave a trail."


On Tue, Mar 30, 2021 at 10:12 AM Sayan Das <saydas at redhat.com> wrote:

> Hello Matthias,
>
> Thanks for your response on this one.
>
> By this,
> ~~
> Since those stages use python async and asyncio this means, there will be
> 5 parallel downloads (as long as enough requests flow by that stage). Once
> an artifact is downloaded, the next stage will transfer it to the final
> storage location (may be a cloud storage), and so on.
> ~~
>
> Should I assume that, once the 5 parallel downloads have completed inside
> /var/lib/pulp/tmp, they will immediately be transferred to their actual
> location, and only then will the next batch of downloads start?
>
> This question is based on our experience with pulp 2, where a 50+ GB
> openshift repo was being synced while /var/cache/pulp had only 25 GB.
> During the content download phase the filesystem filled up and the task
> was eventually canceled with a disk-space error. This happened because
> pulp 2 downloaded the data in batches of 5 but never moved it to its
> destination until the entire repository had been downloaded into the
> pulp cache. This was only noticed with docker\ISO\file type repos but NOT
> with yum\rpm type repos.
>

I can give some background on why Pulp 2 behaved this way. Docker repos
are composed of manifests and blobs. While an rpm is usable on its own, a
docker sync that fails somewhere in the middle would leave the
user/customer with a corrupted and unusable mirror: if one blob is
missing, one cannot pull the image and instantiate a container from it.
This overprotective behaviour means, on the one hand, that the pulp cache
directory requires quite some space; on the other hand, it ensures the
docker repo is not corrupted and contains either all of the content or
none.
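
To put rough numbers on that trade-off, here is a small sketch in plain
Python (it is not Pulp code). It assumes the 200 units from Sayan's example
are all the same size, 0.25 GB each, which is a made-up figure chosen only
so that the total comes to 50 GB. The function name peak_cache_gb is
hypothetical as well.

```python
def peak_cache_gb(unit_sizes_gb, batch_size, move_at_end):
    """Return the peak size (in GB) of the download directory.

    Units are downloaded in batches of `batch_size`. With move_at_end=True
    nothing leaves the download directory until every unit is there (the
    Pulp 2 behaviour described above). With move_at_end=False each batch
    is moved out as soon as it has been downloaded.
    """
    cached = 0.0
    peak = 0.0
    for start in range(0, len(unit_sizes_gb), batch_size):
        batch = unit_sizes_gb[start:start + batch_size]
        cached += sum(batch)          # batch lands in the download dir
        peak = max(peak, cached)
        if not move_at_end:
            cached -= sum(batch)      # batch is moved to final storage
    return peak


# Assumption: 200 equally sized units of 0.25 GB each (50 GB in total).
units = [0.25] * 200

all_or_nothing = peak_cache_gb(units, batch_size=5, move_at_end=True)
move_per_batch = peak_cache_gb(units, batch_size=5, move_at_end=False)
print(all_or_nothing, move_per_batch)
```

Under these assumptions the all-or-nothing approach needs room for the
whole repository in the cache, which is why a 25 GB /var/cache/pulp could
not hold a 50+ GB repo, while moving each batch right away only needs room
for one batch at a time.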

With Pulp 3 we should probably move the content to the storage as it
becomes available (https://pulp.plan.io/issues/8295). However, in case of a
sync failure it will then be up to the user to re-trigger the repo sync
and make sure it completes successfully.
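
As an illustration of the pipeline idea Matthias describes in the quoted
message below, here is a minimal asyncio sketch. It is not pulpcore's
actual stages API: every name in it (download_stage, store_stage,
run_pipeline) is made up, the download is simulated with a sleep, and
"storing" just appends to a list. It only shows the general shape, i.e. at
most 5 downloads in flight while a second stage picks up each finished
item as soon as it arrives.

```python
import asyncio


async def download_stage(units, out_q, stats, concurrency=5):
    """Download units with at most `concurrency` downloads in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def fetch(unit):
        async with sem:
            stats["in_flight"] += 1
            stats["peak"] = max(stats["peak"], stats["in_flight"])
            await asyncio.sleep(0.001)   # stand-in for the real download
            stats["in_flight"] -= 1
            await out_q.put(unit)        # hand over to the next stage

    await asyncio.gather(*(fetch(u) for u in units))
    await out_q.put(None)                # sentinel: no more units


async def store_stage(in_q, stored):
    """Move each downloaded unit to 'final storage' as soon as it arrives."""
    while True:
        unit = await in_q.get()
        if unit is None:
            break
        stored.append(unit)              # stand-in for the real move


async def run_pipeline(units, concurrency=5):
    queue = asyncio.Queue(maxsize=concurrency)
    stored = []
    stats = {"in_flight": 0, "peak": 0}
    # Both stages run concurrently; the queue connects them.
    await asyncio.gather(
        download_stage(units, queue, stats, concurrency),
        store_stage(queue, stored),
    )
    return stored, stats["peak"]


units = [f"blob-{i}" for i in range(20)]  # hypothetical unit names
stored, peak = asyncio.run(run_pipeline(units))
print(len(stored), "units stored; peak parallel downloads:", peak)
```

The bounded queue between the two stages is a design choice of this
sketch: it keeps the download stage from running far ahead of the stage
that moves content out of the temporary directory.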


>
>
> Thanks & Regards,
>
> Sayan das
>
> *T*echnical *S*upport *E*ngineer, RHCE
>
> Red Hat India
> <https://www.redhat.com/>
>
> Red Hat India Pvt. Ltd, Level-5, Tower-10, Cyber City
>
> Magarpatta City Hadapsar, Pune-411013, Maharashtra, India.
>
> saydas at redhat.com    M: +91-7890892756     IRC: Sayan
> <https://red.ht/sig>
>
>
> On Tue, Mar 30, 2021 at 1:25 PM Matthias Dellweg <mdellweg at redhat.com>
> wrote:
>
>> I am not quite sure I understand the question correctly, but I'll try to
>> give my view of it.
>> Pulp 3 has a special asynchronous sync pipeline. That means that when
>> syncing a remote repository (regardless of its type) there is a pipeline
>> of so-called stages. The first stage is supposed to fetch metadata,
>> enumerate content units (blobs, manifests, rpms, files, ...) and pass
>> them into the pipeline. The other stages, which run in parallel, each
>> perform one of: downloading artifacts, saving them, assembling content
>> units, saving them, and adding them to the new repository version.
>> Since those stages use python async and asyncio this means, there will be
>> 5 parallel downloads (as long as enough requests flow by that stage). Once
>> an artifact is downloaded, the next stage will transfer it to the final
>> storage location (may be a cloud storage), and so on. For performance
>> reasons however, some stages (doing database saves) will batch their work
>> into large batches (>= 100).
>> In short: It's different.
>> I hope this explains (high level) what's going on there.
>> Feel free to ask for more detail.
>>
>> On Mon, Mar 29, 2021 at 4:48 PM Sayan Das <saydas at redhat.com> wrote:
>>
>>> Hello Everyone,
>>>
>>> I am not sure if my previous email was successfully delivered or not and
>>> hence I am re-sending it.
>>>
>>> I hope someone will be able to help me with some clarification there.
>>>
>>>
>>> Thanks & Regards,
>>>
>>> Sayan das
>>>
>>>
>>> On Sat, Mar 27, 2021 at 12:17 AM Sayan Das <saydas at redhat.com> wrote:
>>>
>>>> Hello All,
>>>>
>>>> I hope this email finds you all well.
>>>>
>>>> My name is Sayan and I work as a support engineer for the Red Hat
>>>> Satellite 6 product. During a recent interaction with my colleague Ian
>>>> Ballou, we came across a pulp2-vs-pulp3 question that we would like
>>>> clarified, and it was suggested that pulp-dev would be a great place
>>>> to ask.
>>>>
>>>> Please allow me to explain the pulp 2 behavior.
>>>>
>>>> Some parameters to consider:
>>>>
>>>> Repo Type: Docker or Openshift repo [ Assuming it has 200 units to get
>>>> downloaded ]
>>>> Download Dir: /var/cache/pulp
>>>> Data Dir: /var/lib/pulp/content/units/
>>>> Download concurrency: 5
>>>>
>>>> Now,
>>>>    * Sync Started for the repo.
>>>>    * Pulp downloaded 5 units into the "Download Dir" but never moved
>>>> them to the "Data Dir".
>>>>    * Once those first 5 units were downloaded, Pulp downloaded the next
>>>> 5 units, and the same cycle kept repeating until all 200 units had
>>>> been downloaded.
>>>>    * Only when all 200 units had been downloaded was the entire content
>>>> moved from the "Download Dir" to the respective location inside the
>>>> "Data Dir".
>>>>
>>>>
>>>> For pulp 3,
>>>>
>>>> Download Dir: /var/lib/pulp/tmp
>>>> Data Dir: /var/lib/pulp/media
>>>> Download concurrency: 5 [ I heard it's 10 but let's assume it's 5 for
>>>> now ]
>>>>
>>>>
>>>> So the question is: will pulp 3 behave the same as pulp 2, i.e.
>>>> download the entire repository into the "Download Dir" in batches of 5
>>>> units and then move the entire repository to the "Data Dir"? Or is the
>>>> behavior different, i.e. after downloading 5 units into the "Download
>>>> Dir" the content is moved to the "Data Dir" and then the next 5 units
>>>> are downloaded?
>>>>
>>>> Please note, I have specifically mentioned that the repo is a
>>>> Docker\Openshift type repo as we are concerned about only Docker\ISO\File
>>>> type repos at this moment.
>>>>
>>>> Any clarification that can be provided on this will be really
>>>> appreciated.
>>>>
>>>>
>>>>
>>>>
>>>> Thanks & Regards,
>>>>
>>>> Sayan das
>>>>
>>> _______________________________________________
>>> Pulp-dev mailing list
>>> Pulp-dev at redhat.com
>>> https://listman.redhat.com/mailman/listinfo/pulp-dev
>>>