Age | Commit message (Collapse) | Author | Files | Lines |
|
We have the following function in the upgrade process after we updated
the packages and called the db-sync commands:
services=$(services_to_migrate)
...
for service in $(services); do
manage_systemd_service start "${service%%-clone}"
check_resource_systemd "${service%%-clone}" started 600
done
The above is broken because $services contains a list of services to
start, so $(services) will return gibberish and the for loop will never
execute anything.
One of the symptoms for this is the openstack-nova-compute service not
restarting on the compute nodes during the yum -y upgrade. The reason
for this is that during the service restart, nova-compute waits for
nova-conductor to show up in the rabbitmq queues, which cannot happen
since the service was actually never started.
Change-Id: I811ff19d7b44a935b2ec5c5e66e5b5191b259eb3
Closes-Bug: #1630580
|
|
The default path if the operator does nothing is to keep the
sahara services on mitaka to newton upgrades.
If the operator wishes to remove sahara services then they
need to specify the provided major-upgrade-remove-sahara.yaml
environment file in the stack upgrade commands.
The existing migration to ha arch already removes the constraints
and pcs resource for sahara api/engine so we just need to stop
it from starting again if we want to remove it.
This adds a KeepSaharaServiceOnUpgrade parameter to determine if
Sahara is disabled from starting up after the controllers are
upgraded (defaults true).
Finally it is worth noting that we default the sahara services
as 'on' during converge here in the resource_registry of the
converge environment file; any subsequent stack updates where
the deployment contains sahara services will need to
include the -e /environments/services/sahara.yaml environment
file.
Related-Bug: 1630247
Change-Id: I59536cae3260e3df52589289b4f63e9ea0129407
|
|
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
has the following code:
...
check_resource mongod started 600
if [[ -n $(is_bootstrap_node) ]]; then
...
tstart=$(date +%s)
while ! clustercheck; do
sleep 5
tnow=$(date +%s)
if (( tnow-tstart > galera_sync_timeout )) ; then
echo_error "ERROR galera sync timed out"
exit 1
fi
done
# Run all the db syncs
cinder-manage db sync
...
fi
start_or_enable_service rabbitmq
check_resource rabbitmq started 600
start_or_enable_service redis
check_resource redis started 600
start_or_enable_service openstack-cinder-volume
check_resource openstack-cinder-volume started 600
systemctl_swift start
for service in $(services_to_migrate); do
manage_systemd_service start "${service%%-clone}"
check_resource_systemd "${service%%-clone}" started 600
done
"""
The problem with the above code is that it is open to the following race
condition:
1) Bootstrap node is busy checking the galera status via cluster check
2) Non-bootstrap node has already reached: start_or_enable_service
rabbitmq and later lines. These lines will be skipped because
start_or_enable_service is a noop on non-bootstrap nodes and
check_resource rabbitmq only checks that pcs status |grep rabbitmq
returns true.
3) Non-bootstrap node can then reach the manage_systemd_service start
and it will fail with stuff like:
"Job for openstack-nova-scheduler.service failed because the control
process exited with error code. See \"systemctl status
openstack-nova-scheduler.service\" and \"journalctl -xe\" for
details.\n" (because the db tables are not migrated yet)
This happens because 3) was started on non-bootstrap nodes before the
db-sync statements are complete on the bootstrap node. I did not feel
like changing the semantics of check_resource and remove the noop on
non-bootstrap nodes as other parts of the tree might rely on this
behaviour.
Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
Closes-Bug: #1627965
|