[rdo-list] Upgrade from Mitaka to Newton: my story

Luca 'remix_tj' Lorenzetto lorenzetto.luca at gmail.com
Mon Mar 13 15:10:03 UTC 2017


Dear RDO-ers,

I'm here today to tell you a story about an incredible adventure I
had in the last few weeks. Since our OpenStack test deployment (Mitaka
based) was getting old, we decided to try an upgrade to Newton. Nobody
is using this environment for anything serious yet, so downtime was acceptable.

Our environment was first deployed in September 2016 through TripleO,
in this way:
- 1 baremetal undercloud node with RHEL 7.2 and RDO Mitaka packages
(HP Blade Gen7)
- overcloud-full image built with diskimage-builder
- 6 baremetal overcloud nodes (HP Blade Gen7, 3 controllers + 3 compute)
- External Ceph backend for images/block/VMs (not managed by TripleO)


I started by reading the official TripleO docs and also some RHOSP 10
documentation, from which I extracted the following summary of the
major upgrade workflow.

The upgrade didn't really take weeks; it was an activity done in my
spare time. I think it could take about 6 hours without all the
full-filesystem issues and service crashes I ran into.

For the curious (aka TL;DR): the upgrade went almost well, even with
some problems I had to fix. At the moment everything that was already
running is still OK, but I cannot deploy new VMs due to a Glance
misconfiguration.

I hope someone finds this experience useful, and maybe someone can
share theirs to help me fix the remaining problems (preferably without
manual intervention on config files, obviously).

Workflow

The Overcloud upgrade process uses the following workflow:

Step 1. Run your deployment command including the
major-upgrade-ceilometer-wsgi-mitaka-newton.yaml environment file.

I had to install the Gnocchi packages manually because the stack
failed due to a missing gnocchi user. I don't know why Gnocchi is
required at this point, since as far as I understood this is a
procedure that runs on Mitaka and can be executed independently of the
upgrade.
While restarting the deploy I hit bug #1620696 and had to apply an
ugly workaround (reported).
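
For reference, the deployment command at these "run your deployment
command" steps looks roughly like the sketch below. The template path
is an assumption based on the packaged tripleo-heat-templates on the
undercloud; reuse the exact -e files from your original deployment and
only append the upgrade environment file (swapping in the
corresponding major-upgrade-*.yaml at the later steps):

[stack@tripleo ~]$ source ~/stackrc
[stack@tripleo ~]$ openstack overcloud deploy --templates \
    -e <the same environment files used for the original deployment> \
    -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-ceilometer-wsgi-mitaka-newton.yaml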

Step 2. Run your deployment command including the
major-upgrade-pacemaker-init.yaml environment file.

I had to create /root/.my.cnf with empty content: the upgrade script
checks for its existence, even though the file is not created by
Mitaka's TripleO and only gets created in later steps.
The script major_upgrade_check.sh looks for this file in
check_galera_root_password.
Please note that at this step the Galera root password was still empty.
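
A sketch of the workaround, touching the file on every controller from
the undercloud (the node-name filter and the ctlplane network name are
assumptions; adjust them to your environment):

[stack@tripleo ~]$ source ~/stackrc
[stack@tripleo ~]$ for IP in $(nova list | awk -F'ctlplane=' '/controller/ {print $2}' | tr -d ' |'); do
    ssh heat-admin@${IP} 'sudo touch /root/.my.cnf'
done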


Step 3. Run the upgrade-non-controller.sh on each Object Storage node.

Skipped; no standalone Object Storage nodes were deployed.


Step 4. Run your deployment command including the
major-upgrade-pacemaker.yaml and the optional
major-upgrade-remove-sahara.yaml environment file.

I had to free up some space on the controllers: nobody had set up a
cron job that cleans expired Keystone tokens, and that DB table was
using about 70 GB of disk.
/usr/bin/keystone-manage token_flush was too slow, so I had to
truncate the table manually.
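
For reference, the cleanup on one controller goes more or less like
the lines below (a sketch; the table and database names assume the
default Keystone setup, and the cron entry is only an example of what
should have been in place from the start):

[heat-admin@overcloud-controller-0 ~]$ sudo keystone-manage token_flush
# too slow with ~70 GB of expired tokens, so instead:
[heat-admin@overcloud-controller-0 ~]$ sudo mysql keystone -e 'TRUNCATE TABLE token;'

# example cron entry (e.g. in the keystone user's crontab) to avoid this in the future:
1 0 * * * keystone-manage token_flush >> /var/log/keystone/token-flush.log 2>&1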
Once the cleanup was done I restarted the deployment and ran into a
bug about unbound variables in a bash script. I reported bug 1667731
and added the fix https://review.openstack.org/#/c/437959/ , which has
been reviewed and merged. A better fix has since been proposed,
https://review.openstack.org/#/c/439749 , and is pending review+merge
(if you want to review it, you're welcome; it's a trivial fix).
I don't know why gnocchi-upgrade was called; it started giving errors
since it was unconfigured. Gnocchi wasn't present in my Mitaka setup,
and I only had to install it to force the migration of Ceilometer to
WSGI at step 1.

Sahara has not been removed; we're using it.


Step 5. Run the upgrade-non-controller.sh on each Ceph Storage node.

Skipped; the Ceph storage nodes are external and not managed by TripleO.


Step 6. Run the upgrade-non-controller.sh on each Compute node.

This went smoothly. As far as I can see the nodes have not been
rebooted, so I will need to reboot them manually to apply the kernel
upgrade, but the VMs that were running on those hosts are still OK.
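
The invocation, run from the undercloud, is along these lines (a
sketch; the node names are examples, check the script's help output
for the exact options):

[stack@tripleo ~]$ source ~/stackrc
[stack@tripleo ~]$ for NODE in overcloud-compute-0 overcloud-compute-1 overcloud-compute-2; do
    upgrade-non-controller.sh --upgrade ${NODE}
done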

Step 7. Run your deployment command including the
major-upgrade-pacemaker-converge.yaml environment file.

Since gnocchi-upgrade had been launched manually as root in step 4, I
had to fix the permissions on gnocchi-upgrade.log. This allowed the
upgrade procedure to go on.
During the upgrade the cluster was put into maintenance mode, and this
required manual recovery to allow the procedure to continue. Some
services failed (Galera in my case) and were not restarted because of
maintenance mode.
Maintenance mode can be removed with: pcs property set
maintenance-mode=false --wait
Once maintenance mode is disabled, services are managed again and
restarted if required.
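
In practice the recovery on a controller goes like this (a sketch;
galera is the name of the pacemaker resource in my cluster, check pcs
status for yours):

[heat-admin@overcloud-controller-0 ~]$ sudo pcs status
# shows maintenance mode and any failed resources
[heat-admin@overcloud-controller-0 ~]$ sudo pcs property set maintenance-mode=false --wait
[heat-admin@overcloud-controller-0 ~]$ sudo pcs resource cleanup galera
# clears the failure once the resource is managed again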

Step 8. Run your deployment command including the
major-upgrade-aodh-migration.yaml environment file.

This took some time because the procedure was quite slow. I had no
useful alarms to migrate, but forgot to clean up in advance, so the
procedure kept timing out on the migration of the alarm history, which
was very long.
I decided to skip the history migration, but the command line option
--migrate-history false was not honored. That's because the option is
defined with type=bool in the argparse configuration, so every value I
tried was ignored and the variable was always set to true. I made a
quick change to aodh/cmd/data_migration.py to set false as the default
value.
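
For the curious, this one-liner shows why type=bool can never be
turned off from the command line: argparse simply calls bool() on the
argument string, and any non-empty string is truthy:

[stack@tripleo ~]$ python -c 'print(bool("false"))'
True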
That change allowed the procedure to complete, even though I had to
re-run it several times because some services (Galera, MongoDB,
RabbitMQ) occasionally stopped due to a full filesystem and steps of
the procedure failed with errors. Rebooting the controller node where
the service was down helped fix this.
Also in this step I sometimes found the cluster set to maintenance mode.


Final status:

At the end of this procedure the environment seems to be OK, but I
found out that:

- Glance is not OK; new VMs cannot be instantiated because Glance returns:

[stack@tripleo ~]$ source ~/overcloudrc
[stack@tripleo ~]$ glance image-download cirros
400 Bad Request

Unknown scheme 'rbd' found in URI

(HTTP 400)


Our Glance backend is on Ceph, with the default configs (user
openstack, pool name images).
Even re-running the deploy command with the full environment
(removing, obviously, the yaml files that trigger the upgrade steps)
has not fixed this yet. It seems that the Glance configuration file is
no longer set up to manage the Ceph backend.
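
A quick sanity check on the controllers would look like the lines
below; the expected values are assumptions based on our pre-upgrade
setup (Ceph user openstack, pool images) and on the glance_store rbd
options, not on what TripleO actually writes in Newton:

[heat-admin@overcloud-controller-0 ~]$ sudo grep -E '^(stores|default_store|rbd_store)' /etc/glance/glance-api.conf
# expected something like:
# stores = glance.store.http.Store,glance.store.rbd.Store
# default_store = rbd
# rbd_store_pool = images
# rbd_store_user = openstack
# rbd_store_ceph_conf = /etc/ceph/ceph.conf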

I'm continuing with the checks to see whether everything is OK, but at
the moment this seems to be the only issue.

Hope you found it useful,

Luca

-- 
"E' assurdo impiegare gli uomini di intelligenza eccellente per fare
calcoli che potrebbero essere affidati a chiunque se si usassero delle
macchine"
Gottfried Wilhelm von Leibnitz, Filosofo e Matematico (1646-1716)

"Internet è la più grande biblioteca del mondo.
Ma il problema è che i libri sono tutti sparsi sul pavimento"
John Allen Paulos, Matematico (1945-vivente)

Luca 'remix_tj' Lorenzetto, http://www.remixtj.net , <lorenzetto.luca at gmail.com>



