[Pulp-list] Need help/advice with import tasks intermittently causing a time-out condition

Brian Bouterse bbouters at redhat.com
Wed Dec 6 22:14:00 UTC 2017


Both sync and import operations affect the content associated with a
repository, so those are serialized. Consider the case of importing and
publishing content at the same time. It would be unclear if the
concurrently imported content should be included in the publish or not.

If you list the task status output (especially with the -vvv option) I
think you can see the resource lock names used. If not there, then they
show up in the reserved_resources collection in mongo while tasks are
running. Pulp serializes any two tasks that have the same resource lock
name. That should allow you to explore Pulp's locking behavior based on
the lock names that different operations use.
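
For example, here is a rough sketch of peeking at those locks from mongo
(untested; the pulp_database name and the reserved_resources field names are
what I recall from Pulp 2.x, so double-check them against your version):

    from pymongo import MongoClient

    # Connect to the mongo instance Pulp uses (adjust host/port for your setup).
    db = MongoClient('localhost', 27017)['pulp_database']

    # While tasks are running, each document ties a waiting/running task to
    # the resource lock name it reserved.
    for doc in db['reserved_resources'].find():
        print(doc.get('_id'), doc.get('worker_name'), doc.get('resource_id'))

Any two tasks whose documents carry the same resource_id run one after the
other, which is the serialization described above.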

More questions are welcome!

On Wed, Dec 6, 2017 at 4:42 PM, Deej Howard <Deej.Howard at neulion.com> wrote:

>                 Given the test case I mention below, I would expect there
> to be situations where both workers could/should be in play, and perhaps
> they are if there are no messages associated with “upload” operations – is
> that possible?  For example, with 16 parallel clients in operation, it
> seems likely that there is at least one “import” operation in progress
> (with an associated repo reservation) when another client attempts to
> “upload” an artifact (then “import”, then “publish”, as the cycle goes).
>
>
>
>                 Oh, another perhaps related question, the video talked
> about “sync” operations – is that the same as “import”?
>
>
>
> *From:* Brian Bouterse [mailto:bbouters at redhat.com]
> *Sent:* Wednesday, December 06, 2017 12:42 PM
>
> *To:* Deej Howard <Deej.Howard at neulion.com>
> *Cc:* pulp-list <pulp-list at redhat.com>
> *Subject:* Re: [Pulp-list] Need help/advice with import tasks
> intermittently causing a time-out condition
>
>
>
> That actually sounds normal if work is being dispatched slowly into Pulp.
> If you expect two workers, and the /status/ API shows two workers, then it
> should be healthy. I also wrote a bit about this on the YouTube question:
> https://www.youtube.com/watch?v=PpinNWOpksA&lc=UgyHs_RFkeLbU6L9HeR4AaABAg.8_qLVyV5tza8_qMzDLvKrK
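>
> A quick way to double-check is to hit /status/ directly. A minimal sketch
> (Python, untested; the host, verify=False, and the "known_workers" key are
> assumptions from my memory of the Pulp 2 REST API, so adjust for your
> install):
>
>     import requests
>
>     # Ask Pulp which workers it currently considers alive.
>     resp = requests.get('https://localhost/pulp/api/v2/status/', verify=False)
>     resp.raise_for_status()
>     workers = [w.get('_id') for w in resp.json().get('known_workers', [])]
>     print(workers)  # expect both reserved_resource_worker-* entries and the resource_manager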
>
>
>
> On Wed, Dec 6, 2017 at 2:31 PM, Deej Howard <Deej.Howard at neulion.com>
> wrote:
>
>                 I used the qpid-stat -q utility on my installation, and I
> saw something that confused me. I would have expected the resource_manager
> queue to have more message traffic than my workers, but this is not the
> case; in fact, one of my two workers seems to have no message traffic at
> all. I suspect this indicates some sort of misconfiguration somewhere; does
> that sound correct?
>
>
>
> [root at 7d53bac13e28 /]# qpid-stat -q
>
> Queues
>
>   queue                                              dur  autoDel  excl  msg  msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
>   ================================================================================================================================
> …extra output omitted for brevity…
>   celery                                             Y                   0    206    206     0      171k     171k      2     2
>   celeryev.911e1280-9618-40bb-a54f-813db11d4d3e           Y              0    96.9k  96.9k   0      78.3m    78.3m     1     2
>   pulp.task                                          Y                   0    0      0       0      0        0         3     1
>   reserved_resource_worker-1@worker1.celery.pidbox        Y              0    0      0       0      0        0         1     2
>   reserved_resource_worker-1@worker1.dq              Y    Y              0    0      0       0      0        0         1     2
>   reserved_resource_worker-2@worker2.celery.pidbox        Y              0    0      0       0      0        0         1     2
>   reserved_resource_worker-2@worker2.dq              Y    Y              0    1.07k  1.07k   0      1.21m    1.21m     1     2
>   resource_manager                                   Y                   0    533    533     0      820k     820k      1     2
>   resource_manager@resource_manager.celery.pidbox         Y              0    0      0       0      0        0         1     2
>   resource_manager@resource_manager.dq               Y    Y              0    0      0       0      0        0         1     2
>
>
>
> The pulp-admin status output definitely shows both workers and the
> resource_manager as being “discovered”, so what gives?
>
>
>
> *From:* Deej Howard [mailto:Deej.Howard at neulion.com]
> *Sent:* Tuesday, December 05, 2017 6:42 PM
> *To:* 'Dennis Kliban' <dkliban at redhat.com>
> *Cc:* 'pulp-list' <pulp-list at redhat.com>
> *Subject:* RE: [Pulp-list] Need help/advice with import tasks
> intermittently causing a time-out condition
>
>
>
>                 That video was very useful, Dennis – thanx for passing it
> on!
>
>
>
>                 It sounds like the solution to the problem I’m seeing lies
> with the client-side operations, based on the repo reservation methodology
> that is in place.  It would really be useful if there were some sort of API
> call that could be made so the client code could decide if the operation
> were just hung due to network issues (and abort or otherwise handle that
> state), or if there is an active repo reservation in place that is waiting
> to clear before the operation can proceed.  I can also appreciate that this
> has at least the potential of changing dynamically from the viewpoint of a
> client’s operations (because the repo reservation can be put on/taken off
> for other tasks that are already in the queue), and it would be good for
> the client to be able to determine that its task is progressing (or not) as
> far as getting assigned/executed.  Sounds like I need to dig deeper into
> what I can accomplish with the API (REST) to get a better idea of the exact
> status of the import operation, and base decisions more on that status
> rather than just “30 attempts every 2 seconds”.
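>
>                 Something along these lines is what I have in mind (Python
> just for illustration since our real client is a Java plugin; the host,
> credentials, and field names are my guesses from the Pulp docs):
>
>     import time
>     import requests
>
>     def wait_for_import(task_href, attempts=30, interval=2):
>         """Poll the import task and distinguish terminal states from
>         'still queued behind a repo reservation'."""
>         for _ in range(attempts):
>             resp = requests.get('https://pulp.example.com' + task_href,
>                                 auth=('admin', 'admin'), verify=False)
>             resp.raise_for_status()
>             task = resp.json()
>             if task['state'] in ('finished', 'error', 'canceled'):
>                 return task
>             # A 'waiting' state would mean another task holds the repo
>             # reservation -- not hung, just not scheduled yet.
>             time.sleep(interval)
>         return None  # caller decides whether to keep waiting or abort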
>
>                 If nothing else, I now have a better understanding and
> some additional troubleshooting tools to track down exactly what is (and is
> not) going on!
>
>
>
> *From:* Dennis Kliban [mailto:dkliban at redhat.com]
> *Sent:* Tuesday, December 05, 2017 1:07 PM
> *To:* Deej Howard <Deej.Howard at neulion.com>
> *Cc:* pulp-list <pulp-list at redhat.com>
> *Subject:* Re: [Pulp-list] Need help/advice with import tasks
> intermittently causing a time-out condition
>
>
>
> The tasking system in Pulp locks a repository during an import of a
> content unit. If clients are uploading content to the same repository, the
> import operation has to wait for any previous imports to the same repo to
> complete. It's possible that you are not waiting long enough. Unfortunately
> this portion of Pulp is not well documented; however, there is a 40-minute
> video[0] on YouTube that provides insight into how the tasking system works
> and how to troubleshoot it.
>
> [0] https://youtu.be/PpinNWOpksA
>
>
>
> On Tue, Dec 5, 2017 at 12:43 PM, Deej Howard <Deej.Howard at neulion.com>
> wrote:
>
>                 Hi, I’m hoping someone can help me solve a strange problem
> I’m having with my Pulp installation, or at least give me a good idea where
> I should look further to get it solved.  The most irritating aspect of the
> problem is that it doesn’t reliably reproduce.
>
>                 The failure condition is realized when a client is adding
> a new artifact.  In all cases, the client is able to successfully “upload”
> the artifact to Pulp (successful according to the response from the Pulp
> server).  The problem comes in at the next step where the client directs
> Pulp to “import” the uploaded artifact, and then awaits a successful task
> result before proceeding.  This is set up within a loop;  up to 30 queries
> for a successful response to the import task are made, with a 2-second
> interval between queries.  If the import doesn’t succeed within those
> constraints, the operation is treated as having timed-out, and further
> actions with that artifact (specifically, a publish operation) are
> abandoned. Many times that algorithm works with no problem at all, but far
> too often, that successful response is not received within the 30
> iterations.  It surprises me that there would be a failure at this point,
> actually – I wouldn’t expect an “import” operation to be very complicated
> or take a lot of time (but I’m certainly not intimate with the details of
> Pulp implementation either).  Is it just a case that my expectations of the
> “import” operation are unreasonable, and I should relax the loop parameters
> to allow more attempts/more time between attempts for this to succeed?  As
> I’ve mentioned, this doesn’t always fail; I’d even go so far as to claim
> that it succeeds “most of the time”, but I need more consistency than that
> for this to be deemed production-worthy.
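>
>                 (For reference, the client logic is essentially the
> following, reconstructed here as Python even though the real client is a
> Java plugin; the host and credentials are placeholders.)
>
>     import time
>     import requests
>
>     PULP = 'https://pulp.example.com'  # placeholder
>
>     def await_import(task_href, attempts=30, interval=2):
>         # Query up to 30 times, 2 seconds apart; anything short of a
>         # successful finish is treated as a time-out and the publish step
>         # for that artifact is abandoned.
>         for _ in range(attempts):
>             state = requests.get(PULP + task_href, auth=('admin', 'admin'),
>                                  verify=False).json().get('state')
>             if state == 'finished':
>                 return True
>             time.sleep(interval)
>         return False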
>
>                 I’ve tried monitoring operations using pulp-admin to make
> sure that tasks are being managed properly (they seem to be, but I’m not
> yet any sort of Pulp expert), and I’ve also monitored the Apache mod_status
> output to see if there is anything obvious (there’s not, but I’m no Apache
> expert either).  I’ve also found nothing obvious in any Pulp log output.
> I’d be deeply grateful if anyone can offer any sort of wisdom, help or
> advice on this issue, I’m at the point where I’m not sure where to look
> next to get this resolved.  I’d seriously hate to have to abandon Pulp
> because I can’t get it to perform consistently and reliably (not only
> because of the amount of work this would represent, but because I like
> working with Pulp and appreciate what it has to offer).
>
>
>
>                 I have managed to put together a test case that seems to
> reliably demonstrate the problem – sort of.  This test case uses 16 clients
> running in parallel, each of which has from 1-10 artifacts to upload (most
> clients have only 5).  When I say that it “sort of” demonstrates the
> problem, I mean that the most recent run failed on 5 of those clients (all
> with the condition mentioned above), while the previous run failed on 8,
> and the one before that on 9, with no consistency as to which client will
> fail to upload which artifact.
>
>
>
> Other observations:
>
>    - Failure conditions don’t seem to have anything to do with the
>    client’s platform or geographical location, nor are they tied to a
>    specific client.
>    - One failure on a client doesn’t imply the next attempt from that
>    same client will also fail, in fact, more often than not it doesn’t.
>    - Failure conditions don’t seem to have anything to do with the
>    artifact being uploaded.
>    - There is no consistency around which artifact fails to upload (it’s
>    not always the first artifact from a client, or the third, etc.)
>
>
>
> Environment Details
>
>    - Pulp 2.14.3 using Docker containers based on CentOS 7: one Apache/Pulp
>    API container, one Qpid message broker container, one MongoDB container,
>    one Celery worker management container, one resource manager/task
>    assignment container, and two Pulp worker containers.  All containers run
>    on a single Docker host dedicated to Pulp-related operations.  The
>    diagram at http://docs.pulpproject.org/en/2.14/user-guide/scaling.html
>    was used as a guide for this setup.
>    - Ubuntu/Mac/Windows-based clients are using a Java application plugin
>    to do artifact uploads.  Clients are dispersed across multiple geographical
>    sites, including the same site where the Pulp server resides.
>    - Artifacts are company-proprietary (configured as a Pulp plugin), but
>    essentially are a single ZIP file with attached metadata for tracking and
>    management purposes.
>
>