[Rdo-list] [rdo-manager] - frustrating node

Thu Jan 28 13:32:06 UTC 2016

On 01/28/2016 07:12 AM, Mohammed Arafa wrote:
> hi all
> 
> i am attempting to build a 2 node basic overcloud. my previous emails have
> been talking about the problems i encountered.
> 
> what i have :
> - 1 vm called rdo with undercloud AND overcloud. this one has not been
> updated since november and i keep restoring snapshots to that date.
> - a 2nd vm called rdo2, fully updated, overcloud fails to deploy to a
> specific physical node
> 
> observations: (unscientific!)
> the 2 physical nodes are both good. i tested by redeploying on rdo again
> and again. i even swapped their order in instackenv.json and redeployed
> successfully from the instackenv.json step.
> however i have a particular machine that refuses to deploy. it doesn't
> matter what order. if it is the controller, it fails, if it is the compute
> it fails.
> i am using the same flavour on both rdo vms. but again, i believe i have
> ruled out that variable.
> 
> how far did i reach?
> over the past few days i have opened the console and watched this
> particular machine pxe boot, get an ip, reboot, change its hostname to
> reflect the ip, reboot to localhost.localdomain (?) and then power off. i am
> not saying i sat down and watched it for the entire 209 minutes but i have
> observed it unscientifically
> 
> last error:
> Deploying templates in the directory
> /usr/share/openstack-tripleo-heat-templates
> Stack failed with status: Resource CREATE failed: resources.Controller:
> ResourceInError: resources[0].resources.Controller: Went to status ERROR
> due to "Message: No valid host was found. There are not enough hosts
> available., Code: 500"
> Heat Stack create failed.
> 
> real    209m19.252s
> user    0m21.695s
> sys     0m2.402s
> 
> what am i looking for?:
> what do i look for in the logs? and my logs are huge; they don't get rotated
> for some reason
> i would like to know the reason this particular physical machine refuses to
> deploy, so i can fix it. i believe i have eliminated all variables except
> the machine itself and it has me puzzled and frustrated as i need to move
> on to the next stage of network isolation.
> 
> any ideas?
> 

The failure appears to occur before the node is fully deployed, that is,
before we start running puppet applies on it to configure it.

This is a helpful distinction, because we can limit the search space for
possible issues. This is almost certainly a Nova/Ironic issue. The best
log to look at for Nova in this case would be the scheduler log at
/var/log/nova/nova-scheduler.log, while the best log to look at for
Ironic would be the conductor log at /var/log/ironic/ironic-conductor.log.
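
As a rough aid for working through large logs, here is a minimal Python
sketch that streams the two files named above and prints only the lines
containing a marker string. The marker list ("ERROR" and "No valid host")
is an assumption based on the error quoted in the original message; the
helper names are made up for illustration, so adjust both to your setup.

```python
from pathlib import Path

# Paths taken from the message above.
LOG_PATHS = [
    "/var/log/nova/nova-scheduler.log",
    "/var/log/ironic/ironic-conductor.log",
]

# Assumed markers; adjust to whatever your logs actually contain.
MARKERS = ("ERROR", "No valid host")


def matching_lines(lines, markers=MARKERS):
    """Return the lines that contain at least one marker string."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in markers)
    ]


def scan(path, markers=MARKERS):
    """Stream one log file and return its matching lines (empty if missing)."""
    log_file = Path(path)
    if not log_file.is_file():
        return []
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with log_file.open(errors="replace") as fh:
        return matching_lines(fh, markers)


if __name__ == "__main__":
    for path in LOG_PATHS:
        for line in scan(path):
            print(f"{path}: {line}")
```

Run it on the undercloud as a user that can read /var/log; the timestamps
on the matching lines should show which service complained first.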

If your logs are very large, it may be better to delete them and
reproduce the issue in order to further limit the search space. Note
that the issue will most likely reproduce within the first 30 minutes of
that test, so you won't need to wait for the full 200+ minutes, which I
am guessing just hits the deploy timeout.
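
If you would rather keep the old logs, a possible alternative to deleting
them is to note each file's size before the test and read only what was
appended afterwards. The sketch below is one way to do that in Python;
the function names are hypothetical, and it assumes the log is not
rotated or truncated between the two steps.

```python
import os


def current_size(path):
    """Return the file size in bytes, or 0 if the file does not exist yet."""
    try:
        return os.path.getsize(path)
    except OSError:
        return 0


def read_since(path, offset):
    """Return the text appended to path after the given byte offset."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read().decode(errors="replace")


if __name__ == "__main__":
    path = "/var/log/nova/nova-scheduler.log"  # path from the message above
    start = current_size(path)
    input("Reproduce the failed deploy, then press Enter... ")
    print(read_since(path, start))
```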

> thanks