[Pulp-list] QPID Issue Summary
Jeff Ortel
jortel at redhat.com
Mon Mar 28 14:18:28 UTC 2011
All,
As you all know, we have seen some issues with QPID over the past few months. I'd like to
summarize them and let everyone know where we stand. The primary symptom is that goferd
starts consuming ~100% CPU. I've also seen cases where httpd starts consuming ~60% CPU as
well. During my investigation, I've determined that there are two (2) variations of this.
See attached pulp.log for error messages.
SUMMARY:
#1. python-qpid starts spewing error messages in the log which do not refer to a problem
with the broker. They instead suggest (to me) a bug in the python driver or a state
mismatch between python-qpid and the broker. Currently, I cannot reproduce this on
demand. I've heard of this on both i386 and x86_64.
#2. python-qpid starts spewing error messages in the log which refer to a problem with the
broker (and/or contained queues). Basically, we're getting a "ConnectionError: Enqueue
capacity threshold exceeded on queue" on a temporary (durable) queue associated with a
topic subscription. Using qpid-stat, I determined that the queue had only 11 messages
queued. Then, using qpid-stat, I purged the queue. But we still have the same problem,
which seems odd since the queue appears to be empty. I verified that even with 11
messages queued, the qpid consumer was getting an "Empty" exception trying to fetch
messages. None of this makes sense. See (attached) pulp.log. The qpid development team
thought the cause of this issue was that the root filesystem was full. However, we're
seeing this again on machines whose filesystems are not nearly full. A workaround is to
uninstall qpid-cpp-server-store. After doing this, the issue seems to be gone. Note that
qpid-cpp-server-store is required by pulp, so in order to do this you need to either run
from dev-env or hack & respin the rpm without the dep. I've heard of this only on x86_64.
NEXT STEPS:
=== #1 ===
A. I've sent an urgent email to Rafael Schloming, who wrote the python-qpid driver, asking
him to take a look at the stack traces and try to shed some light on this.
B. Considering that the driver could be getting corrupted by concurrency issues, I updated
the gofer stress test over the weekend to hammer gofer/qpid using large numbers of
threads. Each thread creates agent proxies and invokes a lot of methods. Inspection of
python-qpid code clearly shows that it /should/ be reentrant. All tests ran clean.
C. Replaced the goferd code profiler (cProfile) with YAPPI [3] in an attempt to get better
profiling information beyond the main thread. cProfile & Profile suck for multi-threaded
applications and you have to figure this out the hard way. Any other suggestions
appreciated. Today, I'll reproduce condition #2 and at least try to see which
thread/function is eating the CPU.
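For anyone who wants to poke at this without installing a profiler, here is a rough
stdlib-only sketch of the "which thread/function is busy" idea. It is not what goferd
does and it is not YAPPI; it just samples sys._current_frames() at a fixed interval and
counts which function each thread was in. The names sample_threads, spin and idle are
made up for illustration, and the sketch assumes a current Python 3. Note that it shows
where each thread is sitting, not CPU time directly: a blocked thread will show up in a
wait call, while a runaway loop shows up in the looping function.

```python
import collections
import sys
import threading
import time


def sample_threads(duration=0.3, interval=0.01):
    """Count (thread name, function name) pairs seen while sampling all threads."""
    hits = collections.Counter()
    me = threading.get_ident()
    deadline = time.time() + duration
    while time.time() < deadline:
        # Map thread idents to readable names for this sample.
        names = {t.ident: t.name for t in threading.enumerate()}
        for ident, frame in sys._current_frames().items():
            if ident == me:
                continue  # do not count the sampling thread itself
            hits[(names.get(ident, str(ident)), frame.f_code.co_name)] += 1
        time.sleep(interval)
    return hits


def spin(state):
    # Hypothetical stand-in for a thread stuck in a tight loop.
    while state["run"]:
        pass


def idle(done):
    # Hypothetical stand-in for a well-behaved thread blocked on an event.
    done.wait()


state = {"run": True}
done = threading.Event()
workers = [
    threading.Thread(target=spin, args=(state,), name="spinner"),
    threading.Thread(target=idle, args=(done,), name="idler"),
]
for w in workers:
    w.start()

hits = sample_threads()

# Let both threads finish before reporting.
state["run"] = False
done.set()
for w in workers:
    w.join()

for (thread, func), count in hits.most_common():
    print("%-10s %-20s %d" % (thread, func, count))
```

The thread stuck in the loop should be reported in spin, and the blocked thread in a
wait call inside the threading module.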
=== #2 ===
A. So far, I only know of this happening on Preethi's and Sayli's machines. I want to
recreate it on another box.
B. Re-engage QPID development team to help resolve.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=688662
[2] https://bugzilla.redhat.com/show_bug.cgi?id=689573
[3] http://code.google.com/p/yappi/
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pulp.log
URL: <http://listman.redhat.com/archives/pulp-list/attachments/20110328/55a8fa0d/attachment.log>