aboutsummaryrefslogtreecommitdiffstats
path: root/extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh
AgeCommit message (Collapse)AuthorFilesLines
2016-11-01Rework gnocchi-upgrade to run in a separate upgrade stepPradeep Kilambi1-12/+3
gnocchi when configured with swift will require keystone to be available to authenticate to migrate to v3. At this step keystone is not available and gnocchi upgrade fails with auth error. Instead start apache in step 3, start apache first and then run gnocchi upgrade in a separate step and let upgrade happen here. Closes-Bug: #1634897 Change-Id: I22d02528420e4456f84b80905a7b3a80653fa7b0
2016-10-10Actually start the systemd services in step3 of the major-upgrade stepMichele Baldessari1-1/+1
We have the following function in the upgrade process after we updated the packages and called the db-sync commands: services=$(services_to_migrate) ... for service in $(services); do manage_systemd_service start "${service%%-clone}" check_resource_systemd "${service%%-clone}" started 600 done The above is broken because $services contains a list of services to start, so $(services) will return gibberish and the for loop will never execute anything. One of the symptoms for this is the openstack-nova-compute service not restarting on the compute nodes during the yum -y upgrade. The reason for this is that during the service restart, nova-compute waits for nova-conductor to show up in the rabbitmq queues, which cannot happen since the service was actually never started. Change-Id: I811ff19d7b44a935b2ec5c5e66e5b5191b259eb3 Closes-Bug: #1630580
2016-10-05Adds Environment File for Removing Sahara during M/N upgrademarios1-1/+5
The default path if the operator does nothing is to keep the sahara services on mitaka to newton upgrades. If the operator wishes to remove sahara services then they need to specify the provided major-upgrade-remove-sahara.yaml environment file in the stack upgrade commands. The existing migration to ha arch already removes the constraints and pcs resource for sahara api/engine so we just need to stop it from starting again if we want to remove it. This adds a KeepSaharaServiceOnUpgrade parameter to determine if Sahara is disabled from starting up after the controllers are upgraded (defaults true). Finally it is worth noting that we default the sahara services as 'on' during converge here in the resource_registry of the converge environment file; any subsequent stack updates where the deployment contains sahara services will need to include the -e /environments/services/sahara.yaml environment file. Related-Bug: 1630247 Change-Id: I59536cae3260e3df52589289b4f63e9ea0129407
2016-09-29Fix races in major-upgrade-pacemaker Step2Michele Baldessari1-0/+22
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh has the following code: ... check_resource mongod started 600 if [[ -n $(is_bootstrap_node) ]]; then ... tstart=$(date +%s) while ! clustercheck; do sleep 5 tnow=$(date +%s) if (( tnow-tstart > galera_sync_timeout )) ; then echo_error "ERROR galera sync timed out" exit 1 fi done # Run all the db syncs cinder-manage db sync ... fi start_or_enable_service rabbitmq check_resource rabbitmq started 600 start_or_enable_service redis check_resource redis started 600 start_or_enable_service openstack-cinder-volume check_resource openstack-cinder-volume started 600 systemctl_swift start for service in $(services_to_migrate); do manage_systemd_service start "${service%%-clone}" check_resource_systemd "${service%%-clone}" started 600 done """ The problem with the above code is that it is open to the following race condition: 1) Bootstrap node is busy checking the galera status via cluster check 2) Non-bootstrap node has already reached: start_or_enable_service rabbitmq and later lines. These lines will be skipped because start_or_enable_service is a noop on non-bootstrap nodes and check_resource rabbitmq only checks that pcs status |grep rabbitmq returns true. 3) Non-bootstrap node can then reach the manage_systemd_service start and it will fail with stuff like: "Job for openstack-nova-scheduler.service failed because the control process exited with error code. See \"systemctl status openstack-nova-scheduler.service\" and \"journalctl -xe\" for details.\n" (because the db tables are not migrated yet) This happens because 3) was started on non-bootstrap nodes before the db-sync statements are complete on the bootstrap node. I did not feel like changing the semantics of check_resource and remove the noop on non-bootstrap nodes as other parts of the tree might rely on this behaviour. Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9 Closes-Bug: #1627965