aboutsummaryrefslogtreecommitdiffstats
path: root/extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh
AgeCommit message (Collapse)AuthorFilesLines
2016-11-15Replace ceilometer-dbsync by ceilometer-upgradeSteven Hardy1-1/+1
https://review.openstack.org/#/c/388688/ has removed ceilometer-dbsync so ceilometer-upgrade must be used instead. Additionally, ceilometer-dbsync enabled option --skip-gnocchi-resource-types and ceilometer-upgrade doesn't, so i'm setting it by default to ensure backwards compatibility. Note this is based on the corresponding fix to puppet-ceilometer ref https://review.openstack.org/#/c/396570 Change-Id: Ic0a15c75d1cd3e3f70eeafd9ba09d50c58cc1293 Closes-Bug: #1641076
2016-11-09Fix race during major-upgrade-pacemaker stepMichele Baldessari1-9/+60
Currently when we call the major-upgrade step we do the following: """ ... if [[ -n $(is_bootstrap_node) ]]; then check_clean_cluster fi ... if [[ -n $(is_bootstrap_node) ]]; then migrate_full_to_ng_ha fi ... for service in $(services_to_migrate); do manage_systemd_service stop "${service%%-clone}" ... done """ The problem with the above code is that it is open to the following race condition: 1. Code gets run first on a non-bootstrap controller node so we start stopping a bunch of services 2. Pacemaker notices will notice that services are down and will mark the service as stopped 3. Code gets run on the bootstrap node (controller-0) and the check_clean_cluster function will fail and exit 4. Eventually also the script on the non-bootstrap controller node will timeout and exit because the cluster never shut down (it never actually started the shutdown because we failed at 3) Let's make sure we first only call the HA NG migration step as a separate heat step. Only afterwards we start shutting down the systemd services on all nodes. We also need to move the STONITH_STATE variable into a file because it is being used across two different scripts (1 and 2) and we need to store that state. Co-Authored-By: Athlan-Guyot Sofer <sathlang@redhat.com> Closes-Bug: #1640407 Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
2016-11-01Rework gnocchi-upgrade to run in a separate upgrade stepPradeep Kilambi1-12/+3
gnocchi when configured with swift will require keystone to be available to authenticate to migrate to v3. At this step keystone is not available and gnocchi upgrade fails with auth error. Instead start apache in step 3, start apache first and then run gnocchi upgrade in a separate step and let upgrade happen here. Closes-Bug: #1634897 Change-Id: I22d02528420e4456f84b80905a7b3a80653fa7b0
2016-10-10Actually start the systemd services in step3 of the major-upgrade stepMichele Baldessari1-1/+1
We have the following function in the upgrade process after we updated the packages and called the db-sync commands: services=$(services_to_migrate) ... for service in $(services); do manage_systemd_service start "${service%%-clone}" check_resource_systemd "${service%%-clone}" started 600 done The above is broken because $services contains a list of services to start, so $(services) will return gibberish and the for loop will never execute anything. One of the symptoms for this is the openstack-nova-compute service not restarting on the compute nodes during the yum -y upgrade. The reason for this is that during the service restart, nova-compute waits for nova-conductor to show up in the rabbitmq queues, which cannot happen since the service was actually never started. Change-Id: I811ff19d7b44a935b2ec5c5e66e5b5191b259eb3 Closes-Bug: #1630580
2016-10-05Adds Environment File for Removing Sahara during M/N upgrademarios1-1/+5
The default path if the operator does nothing is to keep the sahara services on mitaka to newton upgrades. If the operator wishes to remove sahara services then they need to specify the provided major-upgrade-remove-sahara.yaml environment file in the stack upgrade commands. The existing migration to ha arch already removes the constraints and pcs resource for sahara api/engine so we just need to stop it from starting again if we want to remove it. This adds a KeepSaharaServiceOnUpgrade parameter to determine if Sahara is disabled from starting up after the controllers are upgraded (defaults true). Finally it is worth noting that we default the sahara services as 'on' during converge here in the resource_registry of the converge environment file; any subsequent stack updates where the deployment contains sahara services will need to include the -e /environments/services/sahara.yaml environment file. Related-Bug: 1630247 Change-Id: I59536cae3260e3df52589289b4f63e9ea0129407
2016-09-29Fix races in major-upgrade-pacemaker Step2Michele Baldessari1-0/+22
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh has the following code: ... check_resource mongod started 600 if [[ -n $(is_bootstrap_node) ]]; then ... tstart=$(date +%s) while ! clustercheck; do sleep 5 tnow=$(date +%s) if (( tnow-tstart > galera_sync_timeout )) ; then echo_error "ERROR galera sync timed out" exit 1 fi done # Run all the db syncs cinder-manage db sync ... fi start_or_enable_service rabbitmq check_resource rabbitmq started 600 start_or_enable_service redis check_resource redis started 600 start_or_enable_service openstack-cinder-volume check_resource openstack-cinder-volume started 600 systemctl_swift start for service in $(services_to_migrate); do manage_systemd_service start "${service%%-clone}" check_resource_systemd "${service%%-clone}" started 600 done """ The problem with the above code is that it is open to the following race condition: 1) Bootstrap node is busy checking the galera status via cluster check 2) Non-bootstrap node has already reached: start_or_enable_service rabbitmq and later lines. These lines will be skipped because start_or_enable_service is a noop on non-bootstrap nodes and check_resource rabbitmq only checks that pcs status |grep rabbitmq returns true. 3) Non-bootstrap node can then reach the manage_systemd_service start and it will fail with stuff like: "Job for openstack-nova-scheduler.service failed because the control process exited with error code. See \"systemctl status openstack-nova-scheduler.service\" and \"journalctl -xe\" for details.\n" (because the db tables are not migrated yet) This happens because 3) was started on non-bootstrap nodes before the db-sync statements are complete on the bootstrap node. I did not feel like changing the semantics of check_resource and remove the noop on non-bootstrap nodes as other parts of the tree might rely on this behaviour. Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9 Closes-Bug: #1627965