summaryrefslogtreecommitdiffstats
path: root/extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh
AgeCommit message (Collapse)AuthorFilesLines
2016-10-05Adds Environment File for Removing Sahara during M/N upgrademarios1-1/+5
The default path if the operator does nothing is to keep the sahara services on mitaka to newton upgrades. If the operator wishes to remove sahara services then they need to specify the provided major-upgrade-remove-sahara.yaml environment file in the stack upgrade commands. The existing migration to ha arch already removes the constraints and pcs resource for sahara api/engine so we just need to stop it from starting again if we want to remove it. This adds a KeepSaharaServiceOnUpgrade parameter to determine if Sahara is disabled from starting up after the controllers are upgraded (defaults true). Finally it is worth noting that we default the sahara services as 'on' during converge here in the resource_registry of the converge environment file; any subsequent stack updates where the deployment contains sahara services will need to include the -e /environments/services/sahara.yaml environment file. Related-Bug: 1630247 Change-Id: I59536cae3260e3df52589289b4f63e9ea0129407
2016-09-29Fix races in major-upgrade-pacemaker Step2Michele Baldessari1-0/+22
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh has the following code: ... check_resource mongod started 600 if [[ -n $(is_bootstrap_node) ]]; then ... tstart=$(date +%s) while ! clustercheck; do sleep 5 tnow=$(date +%s) if (( tnow-tstart > galera_sync_timeout )) ; then echo_error "ERROR galera sync timed out" exit 1 fi done # Run all the db syncs cinder-manage db sync ... fi start_or_enable_service rabbitmq check_resource rabbitmq started 600 start_or_enable_service redis check_resource redis started 600 start_or_enable_service openstack-cinder-volume check_resource openstack-cinder-volume started 600 systemctl_swift start for service in $(services_to_migrate); do manage_systemd_service start "${service%%-clone}" check_resource_systemd "${service%%-clone}" started 600 done """ The problem with the above code is that it is open to the following race condition: 1) Bootstrap node is busy checking the galera status via cluster check 2) Non-bootstrap node has already reached: start_or_enable_service rabbitmq and later lines. These lines will be skipped because start_or_enable_service is a noop on non-bootstrap nodes and check_resource rabbitmq only checks that pcs status |grep rabbitmq returns true. 3) Non-bootstrap node can then reach the manage_systemd_service start and it will fail with stuff like: "Job for openstack-nova-scheduler.service failed because the control process exited with error code. See \"systemctl status openstack-nova-scheduler.service\" and \"journalctl -xe\" for details.\n" (because the db tables are not migrated yet) This happens because 3) was started on non-bootstrap nodes before the db-sync statements are complete on the bootstrap node. I did not feel like changing the semantics of check_resource and remove the noop on non-bootstrap nodes as other parts of the tree might rely on this behaviour. Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9 Closes-Bug: #1627965