diff options
author | Michele Baldessari <michele@acksyn.org> | 2016-09-27 18:18:33 +0200 |
---|---|---|
committer | Michele Baldessari <michele@acksyn.org> | 2016-09-29 07:41:28 +0200 |
commit | ad07a29f9474ef68d0edf218638fe6030c019629 (patch) | |
tree | 14bb0597325f6f77af01fe0d2cb6d0f2565b0aa3 | |
parent | 89efa79599ca4bd5669136eca1e5670926f02b76 (diff) |
Fix races in major-upgrade-pacemaker Step2
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
has the following code:
...
check_resource mongod started 600
if [[ -n $(is_bootstrap_node) ]]; then
...
tstart=$(date +%s)
while ! clustercheck; do
sleep 5
tnow=$(date +%s)
if (( tnow-tstart > galera_sync_timeout )) ; then
echo_error "ERROR galera sync timed out"
exit 1
fi
done
# Run all the db syncs
cinder-manage db sync
...
fi
start_or_enable_service rabbitmq
check_resource rabbitmq started 600
start_or_enable_service redis
check_resource redis started 600
start_or_enable_service openstack-cinder-volume
check_resource openstack-cinder-volume started 600
systemctl_swift start
for service in $(services_to_migrate); do
manage_systemd_service start "${service%%-clone}"
check_resource_systemd "${service%%-clone}" started 600
done
"""
The problem with the above code is that it is open to the following race
condition:
1) Bootstrap node is busy checking the galera status via cluster check
2) Non-bootstrap node has already reached: start_or_enable_service
rabbitmq and later lines. These lines will be skipped because
start_or_enable_service is a noop on non-bootstrap nodes and
check_resource rabbitmq only checks that pcs status |grep rabbitmq
returns true.
3) Non-bootstrap node can then reach the manage_systemd_service start
and it will fail with stuff like:
"Job for openstack-nova-scheduler.service failed because the control
process exited with error code. See \"systemctl status
openstack-nova-scheduler.service\" and \"journalctl -xe\" for
details.\n" (because the db tables are not migrated yet)
This happens because 3) was started on non-bootstrap nodes before the
db-sync statements are complete on the bootstrap node. I did not feel
like changing the semantics of check_resource and remove the noop on
non-bootstrap nodes as other parts of the tree might rely on this
behaviour.
Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
Closes-Bug: #1627965
-rwxr-xr-x | extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh | 17 | ||||
-rwxr-xr-x | extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh | 22 | ||||
-rw-r--r-- | extraconfig/tasks/major_upgrade_pacemaker.yaml | 19 |
3 files changed, 41 insertions, 17 deletions
diff --git a/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh b/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh index 80fab445..fc365939 100755 --- a/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh +++ b/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh @@ -68,20 +68,3 @@ if [[ -n $(is_bootstrap_node) ]]; then #TODO(marios):someone from sahara needs to check this: # sahara-db-manage --config-file /etc/sahara/sahara.conf upgrade head fi - -start_or_enable_service rabbitmq -check_resource rabbitmq started 600 -start_or_enable_service openstack-cinder-volume -check_resource openstack-cinder-volume started 600 - - -# Swift isn't controled by pacemaker -systemctl_swift start - -# We need to start the systemd services we explicitely stopped at step _1.sh -# FIXME: Should we let puppet during the convergence step do the service enabling or -# should we add it here? -for service in $(services_to_migrate); do - manage_systemd_service start "${service%%-clone}" - check_resource_systemd "${service%%-clone}" started 600 -done diff --git a/extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh b/extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh new file mode 100755 index 00000000..4d72fbd8 --- /dev/null +++ b/extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh @@ -0,0 +1,22 @@ +#!/bin/bash + +set -eu + +start_or_enable_service rabbitmq +check_resource rabbitmq started 600 +start_or_enable_service redis +check_resource redis started 600 +start_or_enable_service openstack-cinder-volume +check_resource openstack-cinder-volume started 600 + + +# Swift isn't controled by pacemaker +systemctl_swift start + +# We need to start the systemd services we explicitely stopped at step _1.sh +# FIXME: Should we let puppet during the convergence step do the service enabling or +# should we add it here? +for service in $(services_to_migrate); do + manage_systemd_service start "${service%%-clone}" + check_resource_systemd "${service%%-clone}" started 600 +done diff --git a/extraconfig/tasks/major_upgrade_pacemaker.yaml b/extraconfig/tasks/major_upgrade_pacemaker.yaml index a2a1bb5d..30ae8d1e 100644 --- a/extraconfig/tasks/major_upgrade_pacemaker.yaml +++ b/extraconfig/tasks/major_upgrade_pacemaker.yaml @@ -120,3 +120,22 @@ resources: config: {get_resource: ControllerPacemakerUpgradeConfig_Step2} input_values: {get_param: input_values} + ControllerPacemakerUpgradeConfig_Step3: + type: OS::Heat::SoftwareConfig + properties: + group: script + config: + list_join: + - '' + - - get_file: pacemaker_common_functions.sh + - get_file: major_upgrade_pacemaker_migrations.sh + - get_file: major_upgrade_controller_pacemaker_3.sh + + ControllerPacemakerUpgradeDeployment_Step3: + type: OS::Heat::SoftwareDeploymentGroup + depends_on: ControllerPacemakerUpgradeDeployment_Step2 + properties: + servers: {get_param: [servers, Controller]} + config: {get_resource: ControllerPacemakerUpgradeConfig_Step3} + input_values: {get_param: input_values} + |