[rdo-list] [TripleO] Newton large baremetal deployment issues

Mon Nov 14 00:20:30 UTC 2016

On 12/11/16 02:48, Charles Short wrote:
> Update -
> 
> I updated Undercloud to latest stable Newton release, and used the
> provided CentOS Overcloud images. I first completed a small test
> deployment with no problems (3 controllers, 3 compute).
> I then deployed again with a larger environment (40 compute, 3
> controllers).
> When the nodes were up and ACTIVE/pingable early in deployment I checked
> the hosts files. This time no formatting errors.
> 
> However during deployment there were lots of long pauses and I noticed
> plenty of these sorts of messages in the nova logs during the pauses -
> 
> /var/log/nova/nova-compute.log:2016-11-11 19:56:07.322 7840 ERROR
> nova.compute.manager [instance:
> f6fd4127-fc94-4a36-9b3b-5e4f21bd08ed]     raise
> exceptions.ConnectTimeout(msg)
> /var/log/nova/nova-compute.log:2016-11-11 19:56:07.322 7840 ERROR
> nova.compute.manager [instance: f6fd4127-fc94-4a36-9b3b-5e4f21bd08ed]
> ConnectTimeout: Request to
> http://192.168.0.1:9696/v2.0/ports.json?tenant_id=30401f505075414fbd700f028412977f&device_id=f6fd4127-fc94-4a36-9b3b-5e4f21bd08ed
> timed out
> 
> While this was happening I could not use nova from the Undercloud at all-
> 
> source stackrc
> nova list
> ERROR (ClientException): The server has either erred or is incapable of
> performing the requested operation. (HTTP 500) (Request-ID:
> req-367c9ac2-6f27-4e71-a451-681c8c3d2ce5)
> 
> After 2 hours of deployment and only on Step 1 of 5 the deployment fails
> with -
> 
> ERROR: Timed out waiting for a reply to message ID
> 971157a211e549998bb7a6f6e494688b
> 
> note - I have a timeout value way over 2 hours in the deployment
> commands (2000)
> 
> Post failed deployment I still cannot use nova. Looks like the
> Undercloud is very unhappy (same error as above)
> 
> The only way I can get the Undercloud working again is to restart all
> services (restarting nova alone does not work)
> sudo systemctl restart neutron*
> sudo systemctl restart openstack*
> 
> I think I may try OSP9 as I am running out of ideas. Either that or
> giving Openstack-Ansible a try.....
> 
> 
> 
> Charles

So the symptoms you describe above lead me to believe, almost certainly,
that neutron-server failed on the undercloud, which would explain why
both the deploy and nova stopped working. It could have failed before or
during the deploy. We regularly see instances where neutron-server times
out upon system boot (it takes slightly longer to start than systemd
expects), so we need to start it manually.
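
A quick way to tell whether neutron-server is listening at all is to
try a plain TCP connection to its API port. The following is only a
minimal sketch: the helper name port_is_open is made up for
illustration, and the address and port are taken from the
ConnectTimeout message quoted above (192.168.0.1:9696), so adjust them
to your undercloud.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration only. A successful connect shows
    that something is listening; it does not prove the API is healthy.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

On the undercloud, port_is_open("192.168.0.1", 9696) returning False
would be consistent with neutron-server being down. A True result only
means the port accepts connections, so the logs are still needed to
confirm the service is actually answering requests.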

To be clear, the undercloud has been installed using this repo:

http://buildlogs.centos.org/centos/7/cloud/x86_64/rdo-trunk-newton-tested/

Which overcloud images are you using? I'm not seeing any provided in
that repo, and I just want to make sure the undercloud and overcloud
packages match (as the tripleo-heat-templates package on the undercloud
has to align with the openstack-puppet-modules package on the overcloud
images).

Also, is it possible to get a copy of the full neutron-server log from
the undercloud? Understanding why neutron-server failed is the first
step towards getting a working deployment.
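
If the log is large, a rough summary of where the errors come from can
help narrow things down before sending the whole file. The sketch below
assumes the usual OpenStack log line layout seen in the nova excerpt
above (timestamp, pid, level, logger name); the function name
summarise_errors is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# 2016-11-11 19:56:07.322 7840 ERROR nova.compute.manager [instance: ...] ...
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) (\d+) ([A-Z]+) (\S+)"
)


def summarise_errors(lines, levels=("ERROR", "CRITICAL")):
    """Count log lines per logger for the given levels.

    Returns (counts, first_seen): a Counter keyed by logger name and the
    timestamp string of the first matching line (None if there is none).
    Lines that do not follow the assumed layout are skipped.
    """
    counts = Counter()
    first_seen = None
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        timestamp, _pid, level, logger = match.groups()
        if level in levels:
            counts[logger] += 1
            if first_seen is None:
                first_seen = timestamp
    return counts, first_seen
```

It accepts any iterable of lines, so an open file object works
directly, for example one opened on /var/log/neutron/server.log (that
path is my assumption for where the undercloud keeps the
neutron-server log). The first_seen timestamp gives a rough idea of
whether the failure started before or during the deploy.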

It would be great if we could get a full sosreport with all the system
logs, to check for other errors. I'm assuming there were no problems
with the 'openstack undercloud install' process?

Regards,

Graeme

> 
> 
> On 10/11/2016 08:20, Charles Short wrote:
>> Hi,
>>
>> Deploy command here
>>
>> http://pastebin.com/xNZXTWPE
>>
>> no output from rpm command.
>>
>> Yes re OSP9 images I was just interested how they behaved early on in
>> the deployment before any puppet errors (cloud init etc).
>> Not a good test, just morbid fascination out of desperation.
>>
>> No Windows involved, and I have not altered the main puppet template 
>> directory at all.
>>
>> I am going to try and update the Undercloud to the latest stable, use
>> the provided images and see how that goes.
>>
>> If all else fails I will install OSP9 and consider myself exhausted
>> from all the swimming upstream ;)
>>
>> Charles
>>
>> On 10/11/2016 03:13, Graeme Gillies wrote:
>>> On 10/11/16 02:18, Charles Short wrote:
>>>> Hi,
>>>>
>>>> Just some feedback on this thread.
>>>>
>>>> I have redeployed several times and have begun to suspect DNS as being
>>>> the cause for delays (just a guess as the deployment always completes
>>>> with no obvious errors)
>>>> I had a look at the local hosts files on the nodes during deployment
>>>> and
>>>> can see that lots of them (not all) are incorrectly formatted as they
>>>> contain '\n'.
>>>>
>>>> For example a small part of one hosts file -
>>>> <<
>>>> \n10.0.7.30 overcloud-novacompute-32.localdomain
>>>> overcloud-novacompute-32
>>>> 192.168.0.39 overcloud-novacompute-32.external.localdomain
>>>> overcloud-novacompute-32.external
>>>> 10.0.7.30 overcloud-novacompute-32.internalapi.localdomain
>>>> overcloud-novacompute-32.internalapi
>>>> 10.35.5.67 overcloud-novacompute-32.storage.localdomain
>>>> overcloud-novacompute-32.storage
>>>> 192.168.0.39 overcloud-novacompute-32.storagemgmt.localdomain
>>>> overcloud-novacompute-32.storagemgmt
>>>> 10.0.8.39 overcloud-novacompute-32.tenant.localdomain
>>>> overcloud-novacompute-32.tenant
>>>> 192.168.0.39 overcloud-novacompute-32.management.localdomain
>>>> overcloud-novacompute-32.management
>>>> 192.168.0.39 overcloud-novacompute-32.ctlplane.localdomain
>>>> overcloud-novacompute-32.ctlplane
>>>> \n10.0.7.21 overcloud-novacompute-33.localdomain
>>>> overcloud-novacompute-33
>>>> I wondered if maybe the image I was using was the issue so I tried the
>>>> RH OSP9 official image -  Same hosts file formatting issues in
>>>> deployment.
>>>> Maybe a workaround would be to change nsswitch.conf in the image to
>>>> look
>>>> up from DNS first  -  my Undercloud dnsmasq server - and have this
>>>> populated with the correct entries from a node (once all nodes are
>>>> pingable).
>>>>
>>>> Charles
>>> Hi Charles,
>>>
>>> If you are getting formatting issues in /etc/hosts, it's possible that
>>> the templates directory you are using might have problems, especially if
>>> it's been edited on windows machines. Are you using unmodified templates
>>> from /usr/share/openstack-tripleo-heat-templates? Also note that RHOS 9
>>> images will not match RDO Newton templates, as RHOS 9 is mitaka, and
>>> overcloud images contain puppet modules which must sync with the
>>> templates used on the undercloud.
>>>
>>> If you are using the templates in
>>> /usr/share/openstack-tripleo-heat-templates, can you give the output (if
>>> any) from
>>>
>>> rpm -V openstack-tripleo-heat-templates
>>>
>>> Also perhaps getting a copy of your full overcloud deploy command will
>>> help shed some light as well.
>>>
>>> Thanks in advance,
>>>
>>> Graeme
>>>
>>>> On 06/11/2016 23:25, Graeme Gillies wrote:
>>>>> Hi Charles,
>>>>>
>>>>> This definitely looks a bit strange to me, as we do deploys around 42
>>>>> nodes and it takes around 2 hours to do so, similar to your setup (1G
>>>>> link for provisioning, bonded 10G for everything else).
>>>>>
>>>>> Would it be possible for you to run an sosreport on your undercloud
>>>>> and
>>>>> provide it somewhere (if you are comfortable doing so). Also, can you
>>>>> show us the output of
>>>>>
>>>>> openstack stack list --nested
>>>>>
>>>>> And most importantly, if we can get a full copy of the output of the
>>>>> overcloud deploy command, that has timestamps against when every
>>>>> stack is
>>>>> created/finished, so we can try and narrow down where all the time is
>>>>> being spent.
>>>>>
>>>>> You note that you have quite a powerful undercloud (294GB of Memory
>>>>> and
>>>>> 64 cpus), and we have had issues in the past with very powerful
>>>>> underclouds, because the Openstack components try and tune themselves
>>>>> around the hardware they are running on and get it wrong for bigger
>>>>> servers.
>>>>>
>>>>> Are we able to get an output from "sar" or some other tool you are
>>>>> using
>>>>> to track cpu and memory usage during the deployment? I'd like to check
>>>>> those values look sane.
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Graeme
>>>>>
>>>>> On 05/11/16 01:31, Charles Short wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Each node has 2X HP 900GB 12G SAS 10K 2.5in SC ENT HDD.
>>>>>> The 1Gb deployment NIC is not really causing the delay. It is very
>>>>>> busy
>>>>>> for the time the overcloud image is rolled out (the first 30 to 45
>>>>>> mins
>>>>>> of deployment), but after that  (once all the nodes are up and active
>>>>>> with an ip address (pingable)) ,the bandwidth is a fraction of
>>>>>> 1Gbps on
>>>>>> average for the rest of the deployment. For info the NICS in the
>>>>>> nodes
>>>>>> for the Overcloud networks are dual bonded 10Gbit.
>>>>>>
>>>>>> The deployment I mentioned before (50 nodes) actually completed in 8
>>>>>> hours (which is double the time it took for 35 nodes!)
>>>>>>
>>>>>> I am in the process of a new  3 controller 59 compute node deployment
>>>>>> pinning all the nodes as you suggested. The initial overcloud
>>>>>> image roll
>>>>>> out took just under 1 hour (all nodes ACTIVE and pingable). I am
>>>>>> now 45
>>>>>> hours in and all is running (slowly). It is currently on Step2  (of 5
>>>>>> Steps). I would expect this deployment to take 10 hours on current
>>>>>> speed.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Charles
>>>>>>
>>>>>> On 04/11/2016 15:17, Justin Kilpatrick wrote:
>>>>>>> Hey Charles,
>>>>>>>
>>>>>>> What sort of issues are you seeing now? How did node pinning work
>>>>>>> out
>>>>>>> and did a slow scale up present any more problems?
>>>>>>>
>>>>>>> Deployments tend to be disk and network limited, you don't mention
>>>>>>> what sort of disks your machines have but you do note 1g nics, which
>>>>>>> are doable but might require some timeout adjustments or other
>>>>>>> considerations to give everything time to complete.
>>>>>>>
>>>>>>> On Fri, Nov 4, 2016 at 10:45 AM, Charles Short <cems at ebi.ac.uk
>>>>>>> <mailto:cems at ebi.ac.uk>> wrote:
>>>>>>>
>>>>>>>       Hi,
>>>>>>>
>>>>>>>       So you are implying that tripleO is not really currently
>>>>>>> able to
>>>>>>>       roll out large deployments easily as it is prone to scaling
>>>>>>>       delays/errors?
>>>>>>>       Is the same true for RH OSP9 (out of the box) as this also
>>>>>>> uses
>>>>>>>       tripleO?  I would expect exactly the same scaling issues. But
>>>>>>>       surely OSP9 is designed for large enterprise Openstack
>>>>>>> installations?
>>>>>>>       So if OSP9 does work well with large deployments, what are the
>>>>>>>       tripleO tweaks that make this work (if any)?
>>>>>>>
>>>>>>>       Many Thanks
>>>>>>>
>>>>>>>       Charles
>>>>>>>
>>>>>>>       On 03/11/2016 13:30, Justin Kilpatrick wrote:
>>>>>>>>       Hey Charles,
>>>>>>>>
>>>>>>>>       If you want to deploy a large number of machines, I
>>>>>>>> suggest you
>>>>>>>>       deploy a small configuration (maybe 3 controllers 1
>>>>>>>> compute) and
>>>>>>>>       then run the overcloud deploy command again with 2
>>>>>>>> computes, so
>>>>>>>>       on and so forth until you reach your full allocation
>>>>>>>>
>>>>>>>>       Realistically you can probably do a stride of 5 computes each
>>>>>>>>       time, experiment with it a bit, as you get up to the full
>>>>>>>>       allocation of nodes you might run into a race condition
>>>>>>>> bug with
>>>>>>>>       assigning computes to nodes and need to pin nodes (pinning is
>>>>>>>>       adding as an ironic property that overcloud-novacompute-0
>>>>>>>> goes
>>>>>>>>       here, 1 here, so on and so forth).
>>>>>>>>
>>>>>>>>       As for actually solving the deployment issues at scale
>>>>>>>> (instead
>>>>>>>>       of this horrible hack) I'm looking into adding some
>>>>>>>> robustness at
>>>>>>>>       the ironic or tripleo level to these operations. It sounds
>>>>>>>> like
>>>>>>>>       you're running more into node assignment issues rather
>>>>>>>> than pxe
>>>>>>>>       issues though.
>>>>>>>>
>>>>>>>>       2016-11-03 9:16 GMT-04:00 Luca 'remix_tj' Lorenzetto
>>>>>>>>       <lorenzetto.luca at gmail.com
>>>>>>>> <mailto:lorenzetto.luca at gmail.com>>:
>>>>>>>>
>>>>>>>>           On Wed, Nov 2, 2016 at 8:30 PM, Charles Short
>>>>>>>> <cems at ebi.ac.uk
>>>>>>>>           <mailto:cems at ebi.ac.uk>> wrote:
>>>>>>>>           > Some more testing of different amounts of nodes vs time
>>>>>>>>           taken for successful
>>>>>>>>           > deployments -
>>>>>>>>           >
>>>>>>>>           > 3 controller 3 compute = 1 hour
>>>>>>>>           > 3 controller 15 compute = 1 hour
>>>>>>>>           > 3 controller 25 compute  = 1 hour 45 mins
>>>>>>>>           > 3 controller 35 compute  = 4 hours
>>>>>>>>
>>>>>>>>           Hello,
>>>>>>>>
>>>>>>>>           i'm now preparing my deployment of 3+2 nodes. I'll check
>>>>>>>> what you
>>>>>>>>           reported and give you some feedback.
>>>>>>>>
>>>>>>>>           Luca
>>>>>>>>
>>>>>>>>
>>>>>>>>           --
>>>>>>>>           "E' assurdo impiegare gli uomini di intelligenza
>>>>>>>> eccellente
>>>>>>>>           per fare
>>>>>>>>           calcoli che potrebbero essere affidati a chiunque se si
>>>>>>>>           usassero delle
>>>>>>>>           macchine"
>>>>>>>>           Gottfried Wilhelm von Leibnitz, Filosofo e Matematico
>>>>>>>> (1646-1716)
>>>>>>>>
>>>>>>>>           "Internet è la più grande biblioteca del mondo.
>>>>>>>>           Ma il problema è che i libri sono tutti sparsi sul
>>>>>>>> pavimento"
>>>>>>>>           John Allen Paulos, Matematico (1945-vivente)
>>>>>>>>
>>>>>>>>           Luca 'remix_tj' Lorenzetto, http://www.remixtj.net ,
>>>>>>>>           <lorenzetto.luca at gmail.com
>>>>>>>> <mailto:lorenzetto.luca at gmail.com>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>>           rdo-list mailing list
>>>>>>>>           rdo-list at redhat.com <mailto:rdo-list at redhat.com>
>>>>>>>> https://www.redhat.com/mailman/listinfo/rdo-list
>>>>>>>> <https://www.redhat.com/mailman/listinfo/rdo-list>
>>>>>>>>
>>>>>>>>           To unsubscribe: rdo-list-unsubscribe at redhat.com
>>>>>>>> <mailto:rdo-list-unsubscribe at redhat.com>
>>>>>>>>
>>>>>>>>
>>>>>>>       --
>>>>>>>       Charles Short
>>>>>>>       Cloud Engineer
>>>>>>>       Virtualization and Cloud Team
>>>>>>>       European Bioinformatics Institute (EMBL-EBI)
>>>>>>>       Tel: +44 (0)1223 494205
>>>>>>>
>>>>>>>
>>>>>> -- 
>>>>>> Charles Short
>>>>>> Cloud Engineer
>>>>>> Virtualization and Cloud Team
>>>>>> European Bioinformatics Institute (EMBL-EBI)
>>>>>> Tel: +44 (0)1223 494205
>>>>>>
>>>>>>
>>>
>>
> 

-- 
Graeme Gillies
Principal Systems Administrator
Openstack Infrastructure
Red Hat Australia