[Pulp-list] QPID Issue Summary

Jeff Ortel jortel at redhat.com
Mon Mar 28 14:18:28 UTC 2011


All,

As you all know, we have seen some issues with QPID over the past few months.  I'd like to
summarize them and let everyone know where we stand.  The primary symptom is that goferd
starts consuming ~100% CPU.  I've also seen cases where httpd starts consuming ~60% CPU as
well.  During my investigation, I've determined that there are two variations of this.

See attached pulp.log for error messages.

SUMMARY:

#1. python-qpid starts spewing error messages in the log which do not refer to a problem
with the broker.  They instead suggest (to me) either a bug in the python driver or a state
mismatch between python-qpid and the broker.  Currently, I cannot reproduce this on
demand.  I've heard of this on both i386 and x86_64.

#2. python-qpid starts spewing error messages in the log which refer to a problem with the
broker (and/or contained queues).  Basically, we're getting a "ConnectionError: Enqueue
capacity threshold exceeded on queue" on a temporary (durable) queue associated with a
topic subscription.  Using qpid-stat, I determined that the queue had only 11 messages
queued.  Then, using qpid-stat, I purged the queue, but we still see the same problem,
which seems odd since the queue appears to be empty.  I verified that even with 11
messages queued, the qpid consumer was getting an "Empty" exception when trying to fetch
messages.  None of this makes sense.  See the attached pulp.log.  The qpid development
team thought the cause of this issue was that the root filesystem was full.  However,
we're seeing it again on machines whose filesystems are not nearly full.  A workaround is
to uninstall qpid-cpp-server-store; after doing this, the issue seems to be gone.  Note
that qpid-cpp-server-store is required by pulp, so in order to do this you need to either
run from a dev-env or hack & respin the rpm without the dependency.  I've heard of this
only on x86_64.
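
For reference, below is a minimal sketch of the consumer-side fetch loop that exhibits the
symptom described above: the broker reports messages on the subscription queue, yet
fetch() keeps raising Empty.  The broker URL and the address string are assumptions for
illustration only; they are not the actual goferd configuration.

    # Minimal python-qpid (qpid.messaging) consumer sketch -- illustration only.
    # BROKER and ADDRESS are assumed values, not goferd's real configuration.
    from qpid.messaging import Connection, Empty

    BROKER = 'localhost:5672'
    ADDRESS = 'amq.topic/heartbeat; {link: {durable: True}}'

    connection = Connection.establish(BROKER)
    session = connection.session()
    receiver = session.receiver(ADDRESS)
    try:
        while True:
            try:
                message = receiver.fetch(timeout=10)
            except Empty:
                # The puzzling case: qpid-stat shows messages queued on the
                # subscription queue, yet fetch() keeps timing out with Empty.
                continue
            print message.content
            session.acknowledge(message)
    finally:
        connection.close()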


NEXT STEPS:

=== #1 ===
A. I've sent an urgent email to Rafael Schloming, who wrote the python-qpid driver, asking
him to take a look at the stack traces and try to shed some light on this.

B. Considering that the driver's state could be getting corrupted by concurrency issues, I
updated the gofer stress test over the weekend to hammer gofer/qpid using large numbers of
threads.  Each thread creates agent proxies and invokes a lot of methods.  Inspection of
the python-qpid code clearly shows that it /should/ be reentrant.  All tests ran clean.
(A rough sketch of this kind of threaded hammer appears at the end of this section.)

C. Replaced the goferd code profiler (cProfile) with YAPPI [3] in an attempt to get better
profiling information beyond the main thread.  cProfile & Profile suck for multi-threaded
applications, and you have to figure this out the hard way.  Any other suggestions are
appreciated.  Today, I'll reproduce condition #2 and at least try to see which
thread/function is eating the CPU.  (A sketch of the yappi wiring also appears below.)
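
For anyone curious what the hammer test in (B) looks like, here is a rough sketch.  It
assumes gofer's proxy API of the time (gofer.proxy.Agent); the agent uuid and the
Admin.echo() remote call are made-up placeholders, not the actual stress-test code.

    # Rough sketch of a threaded gofer/qpid hammer -- illustration only.
    # gofer.proxy.Agent is assumed; 'test-agent' and Admin.echo() are placeholders.
    import threading

    from gofer.proxy import Agent

    UUID = 'test-agent'
    THREADS = 50
    CALLS = 1000

    def hammer(n):
        for i in range(CALLS):
            agent = Agent(UUID)                               # new proxy each iteration
            agent.Admin.echo('thread %d, call %d' % (n, i))   # placeholder remote call

    threads = [threading.Thread(target=hammer, args=(n,)) for n in range(THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()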
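
And here is a minimal sketch of how yappi can be wired into a daemon so per-thread stats
can be dumped on demand.  The SIGUSR1 trigger is just an assumption for illustration, not
necessarily how goferd hooks it in, and the stats calls differ between yappi releases
(older versions expose yappi.print_stats() instead).

    # Minimal yappi wiring sketch -- illustration only.
    # The SIGUSR1 handler is an assumed trigger; adapt to goferd as needed.
    import signal
    import yappi

    def dump_profile(signum, frame):
        # Dump per-function and per-thread stats without stopping the daemon.
        yappi.get_func_stats().print_all()
        yappi.get_thread_stats().print_all()

    yappi.start()                                  # profiles all threads, not just main
    signal.signal(signal.SIGUSR1, dump_profile)    # kill -USR1 <pid> to dump stats

    # ... daemon main loop continues here ...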


=== #2 ===

A. So far, we only know about this happening on Preethi's and Sayli's machines.  I want to
recreate it on another box.
B. Re-engage the QPID development team to help resolve it.



[1] https://bugzilla.redhat.com/show_bug.cgi?id=688662
[2] https://bugzilla.redhat.com/show_bug.cgi?id=689573
[3] http://code.google.com/p/yappi/
Attachment: pulp.log
<http://listman.redhat.com/archives/pulp-list/attachments/20110328/55a8fa0d/attachment.log>

