[rdo-list] [TripleO] Newton large baremetal deployment issues

Wed Nov 2 19:30:17 UTC 2016

Some more testing of node counts against the time taken for a 
successful deployment:

3 controller  3 compute = 1 hour
3 controller 15 compute = 1 hour
3 controller 25 compute = 1 hour 45 mins
3 controller 35 compute = 4 hours
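
As a rough sketch of how these timings scale, the extra time per added compute node can be worked out from the four data points above. The function name is purely illustrative and is not part of any TripleO tooling; the figures are the approximate times listed above, converted to minutes.

```python
# Deployment timings reported above: (controllers, computes, minutes).
timings = [
    (3, 3, 60),
    (3, 15, 60),
    (3, 25, 105),
    (3, 35, 240),
]


def marginal_minutes_per_compute(rows):
    """Extra minutes per additional compute node between consecutive runs."""
    result = []
    for (_, c0, t0), (_, c1, t1) in zip(rows, rows[1:]):
        # Time difference divided by the number of compute nodes added.
        result.append((c1, (t1 - t0) / (c1 - c0)))
    return result


for computes, rate in marginal_minutes_per_compute(timings):
    print(f"up to {computes} computes: {rate:.1f} extra min per added node")
```

On these numbers the cost of each added node is not constant: it rises from nothing between 3 and 15 computes to several minutes per node between 25 and 35, which suggests the slowdown is not simply linear in node count.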

Charles

On 02/11/2016 09:44, Charles Short wrote:
> Hi,
>
> I am running TripleO Newton stable release and am deploying on 
> baremetal with CentOS.
> I have 64 nodes, and the Undercloud has plenty of resource as it is 
> one of the nodes with 294 GB Memory and 64 CPUs.
> The provisioning network is 1 Gbps.
>
> I have tried tuning the Undercloud using the tuning section (10.7) of 
> this document as a guide:
>
> https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/director-installation-and-usage/chapter-10-troubleshooting-director-issues 
>
>
> My Undercloud passes validations in Clapper
>
> https://github.com/rthallisey/clapper
>
> I am deploying with Network Isolation and 3 Controllers in HA.
>
> If I create a stack with 3 Controllers and 3 compute nodes, this takes 
> about 1 hour.
> If I create a stack with 3 Controllers and 15 compute nodes, this also 
> takes about 1 hour.
> Both stacks pass Clapper validations.
>
> During deployment I can see that the first 20 to 30 mins uses all 
> the bandwidth available for the overcloud image deployment, and then 
> hardly any bandwidth is used whilst the rest of the configuration takes 
> place.
>
> So I tried a stack with 40 nodes. This is where I have issues.
> I set the timeout to 4 hours and left it overnight to deploy.
> The deployment hits the timeout and fails every time.
>
> During the 40 node deployment the overcloud image is distributed to 
> all nodes in about 45 mins, and then all nodes appear ACTIVE and have 
> an IP address on the deployment network.
> So it would appear that the rest of the low bandwidth configuration is 
> taking well over 3 hours to complete. This seems excessive.
> I have configured nova.conf for deployment concurrency (from the 
> tuning link above) and configured the heat.conf 'num_engine_workers' 
> to be 32, taking into account this bug:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1370516
>
> So my question is how do I tune my Undercloud to speed up the deployment?
>
> Looking at htop during deployment I can see heat is using many CPUs, 
> but the work pattern is NOT distributed. What typically happens is all 
> the CPUs are at 0 to 1% used apart from one, which is at 50 to 100%. 
> The busy CPU id changes regularly, but there is no concurrent 
> distributed workload across all the CPUs that the heat processes are 
> running on. Is heat really multi-threaded, or does it have limitations 
> so that it can only really do proper work on one CPU at a time (which 
> is what I am seeing in htop)?
>
> Thanks
>
> Charles
>
>
>

-- 
Charles Short
Cloud Engineer
Virtualization and Cloud Team
European Bioinformatics Institute (EMBL-EBI)
Tel: +44 (0)1223 494205