[rhos-list] Problems with quantum and dhcp-agent

Gary Kotton gkotton at redhat.com
Thu Jun 6 11:21:29 UTC 2013


On 06/06/2013 11:29 AM, Gary Kotton wrote:
> On 06/05/2013 11:14 PM, Steven Dake wrote:
>> On 06/05/2013 12:14 AM, Gary Kotton wrote:
>>> On 06/05/2013 01:17 AM, Steven Dake wrote:
>>>> On 06/04/2013 01:09 PM, S Manoo wrote:
>>>>> Looking into this further, I'm observing the same error message 
>>>>> relating to timeouts talking to qpid in dhcp-agent.log after every 
>>>>> restart. Perhaps this is why I'm unable to get any DHCP responses 
>>>>> to instances? Any suggestions on what's causing this and where I 
>>>>> might look to troubleshoot this further?
>>>
>>> When a host is restarted, each process needs to register with the 
>>> message broker. If you are running all of the services on the same 
>>> host then they will only be able to connect once the qpid service is 
>>> up and running. This usually takes a few seconds after reboot. If a 
>>> service does not receive an answer from the qpid service then it 
>>> will wait and retry. This is why you see the timeouts. The 
>>> wait is incremental. I have seen that all services are usually able 
>>> to connect within a minute of booting a host (we should try to 
>>> reduce this time).
>>>
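
The wait-and-retry behaviour described above can be sketched roughly as follows. This is only an illustration of an incremental wait, not the code used by the qpid driver: the function name, the parameters and all default values are assumptions made for the example.

```python
import time


def connect_with_backoff(connect, initial_delay=1.0, step=1.0, max_delay=5.0,
                         max_attempts=10, sleep=time.sleep):
    """Call connect() until it succeeds, waiting a little longer each time.

    Every name and default value here is an illustrative assumption,
    not a setting taken from the qpid driver.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except IOError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the last failure
            sleep(delay)
            # Incremental wait: grow the delay linearly up to a ceiling.
            delay = min(delay + step, max_delay)
```

With these example values a service that comes up a few seconds after the broker would connect after a handful of attempts, which matches the "within a minute" observation only loosely; the real intervals may differ.
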
>>> Please note that the quantum cli has an option: quantum agent-list. 
>>> This provides the list of agents, their status and hosts that they 
>>> are running on.
>>>
>>> If you spin up an instance after the dhcp agent is up and running do 
>>> you see the problem?
>>>
>>>>>
>>>> S Manoo,
>>>>
>>>> We may have just fixed a bug related to this problem which is not 
>>>> fixed in the preview.  Please try the workaround in this bugzilla:
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=970453
>>>
>>> This fix is good for an all in one setup but will not help if the 
>>> DHCP agent is running on another host. In Quantum we have the notion 
>>> of a network node. Please look at 
>>> https://docs.google.com/drawings/d/167gegaoTBZpd318b2JTgF_Qi9YdkIX8pcQ6YBJLUtGY/edit?usp=sharing
>>>
>>> If the message broker goes down (say for example host reboot or 
>>> network problems) then the dhcp agent will try and reconnect.
>>>
>> Gary,
>>
>> I have found that the dhcp agent stops responding permanently in this 
>> condition on an all in one setup.  Perhaps the same is true for 
>> multinode (i.e. the retry logic doesn't work as expected).  I don't 
>> have multiple nodes to test, but it might be worth double-checking if 
>> you do.
>
> I have done the following check (on an all in one setup):
> 1. reboot host
> 2. stop quantum service
> 3. check that dhcp agent has a timeout with the quantum service
> 4. restart quantum service
>
> I see a number of issues which I am going to investigate:
>
> 1. The agent is up:
> [root at dhcp-4-126 ~(keystone_admin)]# quantum agent-list
> +--------------------------------------+--------------------+---------------------------+-------+----------------+
> | id                                   | agent_type         | host                      | alive | admin_state_up |
> +--------------------------------------+--------------------+---------------------------+-------+----------------+
> | 11e35126-6c07-4a2f-b681-399cdbc8210d | L3 agent           | dhcp-4-126.tlv.redhat.com | :-)   | True           |
> | 5e75d5d9-edb0-462e-850b-013ad7a518f4 | DHCP agent         | dhcp-4-126.tlv.redhat.com | :-)   | True           |
> | af51a50f-f45e-4736-8517-ed3cda759b3c | Open vSwitch agent | dhcp-4-126.tlv.redhat.com | :-)   | True           |
> | d7588cb1-b287-4b4c-a8a8-539d4c5129b2 | Open vSwitch agent | dhcp-4-227.tlv.redhat.com | :-)   | True           |
> +--------------------------------------+--------------------+---------------------------+-------+----------------+
> [root at dhcp-4-126 ~(keystone_admin)]#
> This means that the agent successfully sent a message to the plugin.
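
For anyone who wants to check this from a script rather than by eye, here is a minimal sketch that parses the table printed by `quantum agent-list`. It assumes each table row arrives on a single line (as the CLI prints it, before mail wrapping) and treats any `alive` value other than `:-)` as not alive. The function names are made up for the example.

```python
def parse_agent_table(text):
    """Parse the ASCII table printed by `quantum agent-list` into dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ borders and shell prompts
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first | row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def dead_agents(rows):
    """Return agents whose 'alive' column is not the ':-)' marker."""
    return [row for row in rows if row.get("alive") != ":-)"]
```

Feeding it the output above would give four rows and an empty list from `dead_agents`. Note that an agent showing as alive only tells you its state reports reach the plugin, which is exactly why the table looks healthy here.
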
>
> 2. In the DHCP log there are timeouts with the qpid service and no 
> notification of a resync (which used to happen in Folsom).
>
> I am on it and will post any progress.
>
> A few days ago I had a problem with nova compute which seemed similar to 
> this. I hope that it is not an issue with qpid and is solely related to 
> the dhcp agent (which is easier for me to address).

I have found the problem and am pushing patches.

Thanks
Gary

>
>>
>> Regards
>> -steve
>>
>>>>
>>>> Regards
>>>> -steve
>>>>
>>>>
>>>>> */var/log/quantum/dhcp-agent.log:*
>>>>> 2013-06-04 12:50:44     INFO [quantum.common.config] Logging enabled!
>>>>> 2013-06-04 12:50:44     INFO [quantum.openstack.common.rpc.impl_qpid] Connected to AMQP server on localhost:5672
>>>>> 2013-06-04 12:50:44     INFO [quantum.openstack.common.rpc.impl_qpid] Connected to AMQP server on localhost:5672
>>>>> 2013-06-04 12:50:44     INFO [quantum.agent.dhcp_agent] DHCP agent started
>>>>> 2013-06-04 12:51:44    ERROR [quantum.agent.dhcp_agent] Failed reporting state!
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/lib/python2.6/site-packages/quantum/agent/dhcp_agent.py", line 700, in _report_state
>>>>>     self.agent_state)
>>>>>   File "/usr/lib/python2.6/site-packages/quantum/agent/rpc.py", line 66, in report_state
>>>>>     topic=self.topic)
>>>>>   File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/proxy.py", line 80, in call
>>>>>     return rpc.call(context, self._get_topic(topic), msg, timeout)
>>>>>   File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/__init__.py", line 140, in call
>>>>>     return _get_impl().call(CONF, context, topic, msg, timeout)
>>>>>   File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 611, in call
>>>>>     rpc_amqp.get_connection_pool(conf, Connection))
>>>>>   File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 613, in call
>>>>>     rv = list(rv)
>>>>>   File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 555, in __iter__
>>>>>     self.done()
>>>>>   File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
>>>>>     self.gen.next()
>>>>>   File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 552, in __iter__
>>>>>     self._iterator.next()
>>>>>   File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 436, in iterconsume
>>>>>     yield self.ensure(_error_callback, _consume)
>>>>>   File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 380, in ensure
>>>>>     error_callback(e)
>>>>>   File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 421, in _error_callback
>>>>>     raise rpc_common.Timeout()
>>>>> Timeout: Timeout while waiting on RPC response.
>>>>> 2013-06-04 12:51:44  WARNING [quantum.openstack.common.loopingcall] task run outlasted interval by 56.108887 sec
>>>>> 2013-06-04 12:51:44     INFO [quantum.agent.dhcp_agent] Synchronizing state
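
To see how often the state report fails across restarts, a small script along these lines could scan the log. It is a sketch only: the line layout (`<date> <time> <LEVEL> [<logger>] <message>`) is inferred from the excerpts in this thread, and the function name is made up for the example.

```python
import re

# Layout assumed from the log excerpts in this thread:
# "<date> <time>  <LEVEL> [<logger>] <message>"
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\w+)\s+\[([^\]]+)\]\s*(.*)$"
)


def failed_reports(lines):
    """Return timestamps of 'Failed reporting state!' ERROR entries."""
    stamps = []
    for line in lines:
        match = ENTRY.match(line)
        if not match:
            continue  # traceback lines and other continuation text
        stamp, level, logger, message = match.groups()
        if level == "ERROR" and message.startswith("Failed reporting state"):
            stamps.append(stamp)
    return stamps
```

It accepts any iterable of lines, so an open file handle on /var/log/quantum/dhcp-agent.log can be passed directly.
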
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jun 3, 2013 at 11:28 PM, S Manoo <smanoo76 at gmail.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>     *dhcp-agent.log:*
>>>>>     [root at grizzly ~(keystone_admin)]# cat dhcp-agent.log
>>>>>     2013-06-03 22:27:09     INFO [quantum.common.config] Logging enabled!
>>>>>     2013-06-03 22:27:09     INFO [quantum.openstack.common.rpc.impl_qpid] Connected to AMQP server on 10.0.0.19:5672
>>>>>     2013-06-03 22:27:09     INFO [quantum.openstack.common.rpc.impl_qpid] Connected to AMQP server on 10.0.0.19:5672
>>>>>     2013-06-03 22:27:10     INFO [quantum.agent.dhcp_agent] DHCP agent started
>>>>>     2013-06-03 22:28:10    ERROR [quantum.agent.dhcp_agent] Failed reporting state!
>>>>>     Traceback (most recent call last):
>>>>>       File "/usr/lib/python2.6/site-packages/quantum/agent/dhcp_agent.py", line 700, in _report_state
>>>>>         self.agent_state)
>>>>>       File "/usr/lib/python2.6/site-packages/quantum/agent/rpc.py", line 66, in report_state
>>>>>         topic=self.topic)
>>>>>       File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/proxy.py", line 80, in call
>>>>>         return rpc.call(context, self._get_topic(topic), msg, timeout)
>>>>>       File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/__init__.py", line 140, in call
>>>>>         return _get_impl().call(CONF, context, topic, msg, timeout)
>>>>>       File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 611, in call
>>>>>         rpc_amqp.get_connection_pool(conf, Connection))
>>>>>       File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 613, in call
>>>>>         rv = list(rv)
>>>>>       File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 555, in __iter__
>>>>>         self.done()
>>>>>       File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
>>>>>         self.gen.next()
>>>>>       File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 552, in __iter__
>>>>>         self._iterator.next()
>>>>>       File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 436, in iterconsume
>>>>>         yield self.ensure(_error_callback, _consume)
>>>>>       File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 380, in ensure
>>>>>         error_callback(e)
>>>>>       File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 421, in _error_callback
>>>>>         raise rpc_common.Timeout()
>>>>>     Timeout: Timeout while waiting on RPC response.
>>>>>     2013-06-03 22:28:10  WARNING [quantum.openstack.common.loopingcall] task run outlasted interval by 56.133099 sec
>>>>>     2013-06-03 22:28:10     INFO [quantum.agent.dhcp_agent] Synchronizing state
>>>>>     [root at grizzly ~(keystone_admin)]#
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> rhos-list mailing list
>>>>> rhos-list at redhat.com
>>>>> https://www.redhat.com/mailman/listinfo/rhos-list
>>>>
>>>>
>>>>
>>>
>>
>
>
>
