[Pulp-list] Need help/advice with import tasks intermittently causing a time-out condition

Deej Howard Deej.Howard at neulion.com
Wed Dec 6 22:38:22 UTC 2017


                Yes, that all makes sense – you went over that pretty
thoroughly in your video.  So I can see that if I were only doing “sync”,
“import”, and/or “publish” operations (all against the same repo), it would
make sense to me that those might all end up going to the same worker,
since they have to be serial operations.  But “upload” doesn’t need to be
serialized, right?  That should mean the other worker could (and should)
be pressed into service for “upload” work if/when the first one was tied
up with its serial operations.  And if that were the case, I would expect
to see some message traffic on that other worker (per qpid-stat output) –
UNLESS the “upload” operation doesn’t use messaging at all.  That’s
entirely possible – I’ve never seen any “upload” tasks listed using
pulp-admin, anyway… is that what is going on?



*From:* Brian Bouterse [mailto:bbouters at redhat.com]
*Sent:* Wednesday, December 06, 2017 3:14 PM
*To:* Deej Howard <Deej.Howard at neulion.com>
*Cc:* pulp-list <pulp-list at redhat.com>
*Subject:* Re: [Pulp-list] Need help/advice with import tasks
intermittently causing a time-out condition



Both sync and import operations affect the content associated with a
repository, so those are serialized. Consider the case of importing and
publishing content at the same time: it would be unclear whether the
concurrently imported content should be included in the publish or not.

If you list the task status output (especially with the -vvv option), I
think you can see the resource lock names used. If not there, you can find
them in the reserved_resources collection in mongo while tasks are running.
Pulp serializes two tasks with the same resource lock name, so that should
let you explore Pulp's locking behavior based on the lock names used by
different operations.
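
For example, something like this (a minimal sketch, assuming the default
"pulp_database" database name and a mongod reachable on localhost) dumps the
active locks while tasks are running:

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    db = client["pulp_database"]

    # Each document ties a running task to the resource lock name it holds;
    # two tasks with the same lock name run one after the other.
    for lock in db["reserved_resources"].find():
        print(lock)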

More questions are welcome!



On Wed, Dec 6, 2017 at 4:42 PM, Deej Howard <Deej.Howard at neulion.com> wrote:

                Given the test case I mention below, I would expect there
to be situations where both workers could/should be in play, and perhaps
they are if there are no messages associated with “upload” operations – is
that possible?  For example, with 16 parallel clients in operation, it
seems likely that there is at least one “import” operation in progress
(with an associated repo reservation) when another client attempts to
“upload” an artifact (then “import”, then “publish”, as the cycle goes).



                Oh, another perhaps-related question: the video talked
about “sync” operations – is that the same as “import”?



*From:* Brian Bouterse [mailto:bbouters at redhat.com]
*Sent:* Wednesday, December 06, 2017 12:42 PM


*To:* Deej Howard <Deej.Howard at neulion.com>
*Cc:* pulp-list <pulp-list at redhat.com>
*Subject:* Re: [Pulp-list] Need help/advice with import tasks
intermittently causing a time-out condition



That actually sounds normal if work is being dispatched slowly into Pulp.
If you expect two workers, and the /status/ API shows two workers, then it
should be healthy. I also wrote a bit about this in reply to the YouTube
question:
https://www.youtube.com/watch?v=PpinNWOpksA&lc=UgyHs_RFkeLbU6L9HeR4AaABAg.8_qLVyV5tza8_qMzDLvKrK
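
If it helps, a quick way to see exactly what /status/ reports (a rough
sketch – the host and TLS settings are assumptions to adjust for your
deployment):

    import json

    import requests

    # Both workers and the resource_manager should appear in the status
    # output while they are alive.
    resp = requests.get("https://localhost/pulp/api/v2/status/", verify=False)
    print(json.dumps(resp.json(), indent=2))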



On Wed, Dec 6, 2017 at 2:31 PM, Deej Howard <Deej.Howard at neulion.com> wrote:

                I used the qpid-stat -q utility on my installation, and I
saw something that confused me. I would have expected the resource_manager
queue to have more message traffic compared to my workers, but this is not
the case; in fact, one of my two workers seems to have no message traffic
at all. I suspect this indicates some sort of misconfiguration somewhere –
does that sound correct?



[root@7d53bac13e28 /]# qpid-stat -q

Queues

  queue                                             dur  autoDel  excl  msg  msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  ===============================================================================================================================

…extra output omitted for brevity…

  celery                                            Y                   0    206    206     0      171k     171k      2     2
  celeryev.911e1280-9618-40bb-a54f-813db11d4d3e          Y              0    96.9k  96.9k   0      78.3m    78.3m     1     2
  pulp.task                                         Y                   0    0      0       0      0        0         3     1
  reserved_resource_worker-1@worker1.celery.pidbox       Y              0    0      0       0      0        0         1     2
  reserved_resource_worker-1@worker1.dq             Y    Y              0    0      0       0      0        0         1     2
  reserved_resource_worker-2@worker2.celery.pidbox       Y              0    0      0       0      0        0         1     2
  reserved_resource_worker-2@worker2.dq             Y    Y              0    1.07k  1.07k   0      1.21m    1.21m     1     2
  resource_manager                                  Y                   0    533    533     0      820k     820k      1     2
  resource_manager@resource_manager.celery.pidbox        Y              0    0      0       0      0        0         1     2
  resource_manager@resource_manager.dq              Y    Y              0    0      0       0      0        0         1     2



The pulp-admin status output definitely shows both workers and the
resource_manager as being “discovered”, so what gives?



*From:* Deej Howard [mailto:Deej.Howard at neulion.com]
*Sent:* Tuesday, December 05, 2017 6:42 PM
*To:* 'Dennis Kliban' <dkliban at redhat.com>
*Cc:* 'pulp-list' <pulp-list at redhat.com>
*Subject:* RE: [Pulp-list] Need help/advice with import tasks
intermittently causing a time-out condition



                That video was very useful, Dennis – thanks for passing it
on!



                It sounds like the solution to the problem I’m seeing lies
with the client-side operations, given the repo reservation methodology that
is in place.  It would be really useful if there were some sort of API call
the client code could make to decide whether the operation is just hung due
to network issues (and abort or otherwise handle that state), or whether an
active repo reservation is waiting to clear before the operation can
proceed.  I can also appreciate that this can change dynamically from the
viewpoint of a client’s operations (because the repo reservation can be put
on/taken off for other tasks already in the queue), and it would be good for
the client to be able to determine whether its task is progressing (or not)
as far as getting assigned/executed.  Sounds like I need to dig deeper into
what I can accomplish with the API (or REST) to get a better idea of the
exact status of the import operation, and base decisions on that status
rather than just “30 attempts every 2 seconds”.
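
Something along these lines is what I have in mind – a rough sketch, not
our actual client code; it assumes Pulp 2’s task API (GET
/pulp/api/v2/tasks/<task_id>/), and the state names would need checking
against the docs:

    import time

    import requests

    def wait_for_import(base_url, task_id, auth, timeout=600):
        # Poll until the task reaches a terminal state, instead of giving
        # up after a fixed number of attempts.
        deadline = time.time() + timeout
        while time.time() < deadline:
            task = requests.get(
                "%s/pulp/api/v2/tasks/%s/" % (base_url, task_id),
                auth=auth, verify=False).json()
            state = task.get("state")
            if state == "finished":
                return task
            if state in ("error", "canceled"):
                raise RuntimeError("task ended in state %r" % state)
            # "waiting" most likely means another task holds the repo
            # reservation -- keep polling rather than treating it as a hang.
            time.sleep(2)
        raise RuntimeError("task did not finish within %s seconds" % timeout)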

                If nothing else, I now have a better understanding and some
additional troubleshooting tools to track down exactly what is (and is not)
going on!



*From:* Dennis Kliban [mailto:dkliban at redhat.com]
*Sent:* Tuesday, December 05, 2017 1:07 PM
*To:* Deej Howard <Deej.Howard at neulion.com>
*Cc:* pulp-list <pulp-list at redhat.com>
*Subject:* Re: [Pulp-list] Need help/advice with import tasks
intermittently causing a time-out condition



The tasking system in Pulp locks a repository during an import of a content
unit. If clients are uploading content to the same repository, each import
operation has to wait for any previous imports to the same repo to
complete. It's possible that you are simply not waiting long enough.
Unfortunately, this portion of Pulp is not well documented; however, there
is a 40-minute video[0] on YouTube that provides insight into how the
tasking system works and how to troubleshoot it.

[0] https://youtu.be/PpinNWOpksA



On Tue, Dec 5, 2017 at 12:43 PM, Deej Howard <Deej.Howard at neulion.com>
wrote:

                Hi, I’m hoping someone can help me solve a strange problem
I’m having with my Pulp installation, or at least give me a good idea where
I should look further to get it solved.  The most irritating aspect of the
problem is that it doesn’t reliably reproduce.

                The failure condition is realized when a client is adding a
new artifact.  In all cases, the client is able to successfully “upload”
the artifact to Pulp (successful according to the response from the Pulp
server).  The problem comes in at the next step, where the client directs
Pulp to “import” the uploaded artifact and then awaits a successful task
result before proceeding.  This is set up within a loop: up to 30 queries
for a successful response to the import task are made, with a 2-second
interval between queries.  If the import doesn’t succeed within those
constraints, the operation is treated as having timed out, and further
actions with that artifact (specifically, a publish operation) are
abandoned.  Many times that algorithm works with no problem at all, but far
too often the successful response is not received within the 30
iterations.  It actually surprises me that there would be a failure at this
point – I wouldn’t expect an “import” operation to be very complicated or
to take a lot of time (but I’m certainly not intimate with the details of
the Pulp implementation either).  Is it simply that my expectations of the
“import” operation are unreasonable, and I should relax the loop parameters
to allow more attempts/more time between attempts?  As I’ve mentioned, this
doesn’t always fail; I’d even go so far as to claim that it succeeds “most
of the time”, but I need more consistency than that for this to be deemed
production-worthy.
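
In rough pseudo-Python, the wait described above looks like this
(illustrative only – the host and function names are stand-ins, not our
actual plugin code):

    import time

    import requests

    def await_import(base_url, task_href, auth):
        # Up to 30 queries, 2 seconds apart; anything beyond that is
        # treated as a time-out and the publish step is abandoned.
        for _ in range(30):
            task = requests.get(base_url + task_href,
                                auth=auth, verify=False).json()
            if task.get("state") == "finished":
                return True
            time.sleep(2)
        return False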

                I’ve tried monitoring operations using pulp-admin to make
sure that tasks are being managed properly (they seem to be, but I’m not
yet any sort of Pulp expert), and I’ve also monitored the Apache mod_status
output to see if there is anything obvious (there’s not, but I’m no Apache
expert either).  I’ve also found nothing obvious in any Pulp log output.
I’d be deeply grateful if anyone can offer any sort of wisdom, help, or
advice on this issue; I’m at the point where I’m not sure where to look
next to get this resolved.  I’d seriously hate to have to abandon Pulp
because I can’t get it to perform consistently and reliably (not only
because of the amount of work this would represent, but because I like
working with Pulp and appreciate what it has to offer).



                I have managed to put together a test case that seems to
reliably demonstrate the problem – sort of.  This test case uses 16 clients
running in parallel, each of which has from 1 to 10 artifacts to upload
(most clients have only 5).  When I say that it “sort of” demonstrates the
problem: the most recent run failed on 5 of those clients (all with the
condition mentioned above), while the previous run failed on 8, and the one
before that on 9, with no consistency in which client fails to upload which
artifact.



Other observations:

   - Failure conditions don’t seem to have anything to do with the client’s
   platform or geographical location, nor are they tied to a specific client.
   - One failure on a client doesn’t imply the next attempt from that same
   client will also fail; in fact, more often than not it doesn’t.
   - Failure conditions don’t seem to have anything to do with the artifact
   being uploaded.
   - There is no consistency in which artifact fails to upload (it’s not
   always the first artifact from a client, or the third, etc.).



Environment Details

   - Pulp 2.14.3 using Docker containers based on CentOS 7: one Apache/Pulp
   API container, one Qpid message broker container, one MongoDB container,
   one Celery worker management container, one resource manager/task
   assignment container, and two Pulp worker containers.  All containers are
   running within a single Docker host dedicated to Pulp-related operations.
   The diagram at
   http://docs.pulpproject.org/en/2.14/user-guide/scaling.html was used as
   a guide for this setup.
   - Ubuntu/Mac/Windows-based clients use a Java application plugin to do
   artifact uploads.  Clients are dispersed across multiple geographical
   sites, including the site where the Pulp server resides.
   - Artifacts are company-proprietary (configured as a Pulp plugin), but
   essentially are a single ZIP file with attached metadata for tracking and
   management purposes.


_______________________________________________
Pulp-list mailing list
Pulp-list at redhat.com
https://www.redhat.com/mailman/listinfo/pulp-list