[Pulp-list] Need help/advice with import tasks intermittently causing a time-out condition
Brian Bouterse
bbouters at redhat.com
Wed Dec 6 19:42:16 UTC 2017
That actually sounds normal if work is being dispatched slowly into Pulp.
If you expect two workers, and the /status/ API shows two workers, then it
should be healthy. I also wrote a reply to the YouTube question about this:
https://www.youtube.com/watch?v=PpinNWOpksA&lc=UgyHs_RFkeLbU6L9HeR4AaABAg.8_qLVyV5tza8_qMzDLvKrK
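To script that check, you can parse the /status/ response rather than eyeball it. A minimal sketch, assuming the Pulp 2 response shape (live workers listed under "known_workers", each with an "_id"); the worker names below are hypothetical examples, substitute your own:

```python
import json

# Hypothetical worker names - substitute the ones your deployment uses.
EXPECTED = {
    "resource_manager",
    "reserved_resource_worker-1@worker1",
    "reserved_resource_worker-2@worker2",
}

def missing_workers(status_json, expected=EXPECTED):
    """Return the expected worker names absent from a /status/ body.

    `status_json` is the JSON text returned by GET /pulp/api/v2/status/;
    Pulp 2 reports each live worker under "known_workers" with an "_id".
    """
    known = {w["_id"] for w in json.loads(status_json).get("known_workers", [])}
    return sorted(expected - known)

# Example response where worker-2 has dropped off:
sample = json.dumps({"known_workers": [
    {"_id": "resource_manager"},
    {"_id": "reserved_resource_worker-1@worker1"},
]})
print(missing_workers(sample))  # ['reserved_resource_worker-2@worker2']
```

An empty list means every expected process has checked in, which matches what pulp-admin status reports as "discovered".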
On Wed, Dec 6, 2017 at 2:31 PM, Deej Howard <Deej.Howard at neulion.com> wrote:
> I used the qpid-stat -q utility on my installation, and
> I saw something that confused me. I would have expected the
> resource_manager queue to have more message traffic as compared to my
> workers, but this is not the case, and in fact one of my two workers seems
> to have no message traffic at all. I suspect this indicates some sort of
> misconfiguration somewhere, does that sound correct?
>
>
>
> [root at 7d53bac13e28 /]# qpid-stat -q
>
> Queues
>   queue                                              dur  autoDel  msg  msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
>   ==========================================================================================================================
>   ...extra output omitted for brevity...
>   celery                                             Y             0    206    206     0      171k     171k      2     2
>   celeryev.911e1280-9618-40bb-a54f-813db11d4d3e           Y        0    96.9k  96.9k   0      78.3m    78.3m     1     2
>   pulp.task                                          Y             0    0      0       0      0        0         3     1
>   reserved_resource_worker-1 at worker1.celery.pidbox      Y        0    0      0       0      0        0         1     2
>   reserved_resource_worker-1 at worker1.dq             Y    Y        0    0      0       0      0        0         1     2
>   reserved_resource_worker-2 at worker2.celery.pidbox      Y        0    0      0       0      0        0         1     2
>   reserved_resource_worker-2 at worker2.dq             Y    Y        0    1.07k  1.07k   0      1.21m    1.21m    1     2
>   resource_manager                                   Y             0    533    533     0      820k     820k     1     2
>   resource_manager at resource_manager.celery.pidbox        Y       0    0      0       0      0        0         1     2
>   resource_manager at resource_manager.dq              Y    Y        0    0      0       0      0        0         1     2
>
>
> The pulp-admin status output definitely shows both workers and the
> resource_manager as being “discovered”, so what gives?
>
>
>
> *From:* Deej Howard [mailto:Deej.Howard at neulion.com]
> *Sent:* Tuesday, December 05, 2017 6:42 PM
> *To:* 'Dennis Kliban' <dkliban at redhat.com>
> *Cc:* 'pulp-list' <pulp-list at redhat.com>
> *Subject:* RE: [Pulp-list] Need help/advice with import tasks
> intermittently causing a time-out condition
>
>
>
> That video was very useful, Dennis – thanx for passing it
> on!
>
>
>
> It sounds like the solution to the problem I’m seeing lies
> with the client-side operations, based on the repo reservation methodology
> that is in place. It would really be useful if there were some sort of API
> call that could be made so the client code could decide if the operation
> were just hung due to network issues (and abort or otherwise handle that
> state), or if there is an active repo reservation in place that is waiting
> to clear before the operation can proceed. I can also appreciate that this
> has at least the potential of changing dynamically from the viewpoint of a
> client’s operations (because the repo reservation can be put on/taken off
> for other tasks that are already in the queue), and it would be good for
> the client to be able to determine that its task is progressing (or not) as
> far as getting assigned/executed. Sounds like I need to dig deeper into
> what I can accomplish with the REST API to get a better idea of the exact
> status of the import operation, and base decisions on that status rather
> than just “30 attempts every 2 seconds”.
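[Editorial note: the task documents Pulp 2 returns from GET /pulp/api/v2/tasks/<task_id>/ carry a "state" field, which is enough to tell "still queued behind a reservation" apart from "actually running". A minimal triage sketch; the helper name is invented, and it assumes the standard Pulp 2 state values:]

```python
# Terminal states a Pulp 2 task can end in.
TERMINAL = {"finished", "error", "canceled"}

def triage(task_doc):
    """Classify a Pulp 2 task document (hypothetical helper).

    `task_doc` is the parsed JSON from GET /pulp/api/v2/tasks/<task_id>/.
    A long-lived "waiting" state usually means an earlier task still holds
    the repo reservation; "running" means the import is actually underway.
    """
    state = task_doc.get("state")
    if state in TERMINAL:
        return "done:" + state
    if state == "waiting":
        return "queued (reservation likely held by an earlier task)"
    if state == "running":
        return "in progress"
    return "unknown state: " + repr(state)

print(triage({"state": "waiting"}))
print(triage({"state": "finished"}))  # done:finished
```

A client seeing "waiting" can reasonably keep polling, while a vanished task or an unreachable API points at a network or broker problem instead.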
>
> If nothing else, I now have a better understanding and
> some additional troubleshooting tools to track down exactly what is (and is
> not) going on!
>
>
>
> *From:* Dennis Kliban [mailto:dkliban at redhat.com <dkliban at redhat.com>]
> *Sent:* Tuesday, December 05, 2017 1:07 PM
> *To:* Deej Howard <Deej.Howard at neulion.com>
> *Cc:* pulp-list <pulp-list at redhat.com>
> *Subject:* Re: [Pulp-list] Need help/advice with import tasks
> intermittently causing a time-out condition
>
>
>
> The tasking system in Pulp locks a repository during an import of a
> content unit. If clients are uploading content to the same repository, the
> import operation has to wait for any previous imports to the same repo to
> complete. It's possible that you are not waiting long enough. Unfortunately,
> this portion of Pulp is not well documented; however, there is a 40-minute
> video[0] on YouTube that provides insight into how the tasking system works
> and how to troubleshoot it.
>
> [0] https://youtu.be/PpinNWOpksA
>
>
>
> On Tue, Dec 5, 2017 at 12:43 PM, Deej Howard <Deej.Howard at neulion.com>
> wrote:
>
> Hi, I’m hoping someone can help me solve a strange problem
> I’m having with my Pulp installation, or at least give me a good idea where
> I should look further to get it solved. The most irritating aspect of the
> problem is that it doesn’t reliably reproduce.
>
> The failure condition is realized when a client is adding
> a new artifact. In all cases, the client is able to successfully “upload”
> the artifact to Pulp (successful according to the response from the Pulp
> server). The problem comes in at the next step where the client directs
> Pulp to “import” the uploaded artifact, and then awaits a successful task
> result before proceeding. This is set up within a loop; up to 30 queries
> for a successful response to the import task are made, with a 2-second
> interval between queries. If the import doesn’t succeed within those
> constraints, the operation is treated as having timed out, and further
> actions with that artifact (specifically, a publish operation) are
> abandoned. Many times that algorithm works with no problem at all, but far
> too often, that successful response is not received within the 30
> iterations. It surprises me that there would be a failure at this point,
> actually – I wouldn’t expect an “import” operation to be very complicated
> or take a lot of time (but I’m certainly not intimate with the details of
> Pulp implementation either). Is it just a case that my expectations of the
> “import” operation are unreasonable, and I should relax the loop parameters
> to allow more attempts/more time between attempts for this to succeed? As
> I’ve mentioned, this doesn’t always fail, I’d even go so far as to claim
> that it succeeds “most of the time”, but I need more consistency than that
> for this to be deemed production-worthy.
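[Editorial note: the loop described above boils down to something like the sketch below. This is an illustration only; fetch_state stands in for whatever call the client makes to read the task's "state" field from the tasks API:]

```python
import time

def wait_for_task(fetch_state, attempts=30, interval=2.0):
    """Poll until the task reaches a terminal state or the budget runs out.

    `fetch_state` is any zero-argument callable returning the task's
    current state string (e.g. wrapping GET /pulp/api/v2/tasks/<id>/).
    """
    for _ in range(attempts):
        state = fetch_state()
        if state in ("finished", "error", "canceled"):
            return state
        time.sleep(interval)
    return "timed-out"  # budget exhausted; the task may still be queued

# Simulated task that waits two polls, runs one, then finishes.
states = iter(["waiting", "waiting", "running", "finished"])
print(wait_for_task(lambda: next(states), interval=0.0))  # finished
```

Raising the attempt count (or backing off the interval) only helps if the task is genuinely queued behind a reservation rather than lost, which is why inspecting the reported state is worth doing before tuning the loop parameters.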
>
> I’ve tried monitoring operations using pulp-admin to make
> sure that tasks are being managed properly (they seem to be, but I’m not
> yet any sort of Pulp expert), and I’ve also monitored the Apache mod_status
> output to see if there is anything obvious (there’s not, but I’m no Apache
> expert either). I’ve also found nothing obvious in any Pulp log output.
> I’d be deeply grateful if anyone can offer any sort of wisdom, help or
> advice on this issue, I’m at the point where I’m not sure where to look
> next to get this resolved. I’d seriously hate to have to abandon Pulp
> because I can’t get it to perform consistently and reliably (not only
> because of the amount of work this would represent, but because I like
> working with Pulp and appreciate what it has to offer).
>
>
>
> I have managed to put together a test case that seems to
> reliably demonstrate the problem – sort of. This test case uses 16 clients
> running in parallel, each of which has from 1-10 artifacts to upload (most
> clients have only 5). When I say that it “sort of” demonstrates the
> problem, the most recent run failed on 5 of those clients (all with the
> condition mentioned above), while the previous run failed on 8, and the one
> before that on 9, with no consistency of which client will fail to upload
> which artifact.
>
>
>
> Other observations:
>
> - Failure conditions don’t seem to have anything to do with the
> client’s platform, geographical location, or be attached to a specific
> client.
> - One failure on a client doesn’t imply the next attempt from that
> same client will also fail; in fact, more often than not it doesn’t.
> - Failure conditions don’t seem to have anything to do with the
> artifact being uploaded.
> - There is no consistency around which artifact fails to upload (it’s
> not always the first artifact from a client, or the third, etc.)
>
>
>
> Environment Details
>
> - Pulp 2.14.3 using Docker containers based on CentOS 7: one
> Apache/Pulp API container, one Qpid message broker container, one Mongo DB
> container, one Celery worker management container, one resource
> manager/task assignment container, and two Pulp worker containers. All
> containers are running within a single Docker host, dedicated to only
> Pulp-related operations. The diagram at
> http://docs.pulpproject.org/en/2.14/user-guide/scaling.html was used
> as a guide for this setup.
> - Ubuntu/Mac/Windows-based clients are using a Java application plugin
> to do artifact uploads. Clients are dispersed across multiple geographical
> sites, including the same site where the Pulp server resides.
> - Artifacts are company-proprietary (configured as a Pulp plugin), but
> essentially are a single ZIP file with attached metadata for tracking and
> management purposes.
>
>
> _______________________________________________
> Pulp-list mailing list
> Pulp-list at redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-list
>