Age | Commit message (Collapse) | Author | Files | Lines |
|
https://review.openstack.org/#/c/388688/ has removed ceilometer-dbsync so
ceilometer-upgrade must be used instead.
Additionally, ceilometer-dbsync enabled option --skip-gnocchi-resource-types
and ceilometer-upgrade doesn't, so i'm setting it by default to ensure backwards compatibility.
Note this is based on the corresponding fix to puppet-ceilometer ref
https://review.openstack.org/#/c/396570
Change-Id: Ic0a15c75d1cd3e3f70eeafd9ba09d50c58cc1293
Closes-Bug: #1641076
|
|
Currently when we call the major-upgrade step we do the following:
"""
...
if [[ -n $(is_bootstrap_node) ]]; then
check_clean_cluster
fi
...
if [[ -n $(is_bootstrap_node) ]]; then
migrate_full_to_ng_ha
fi
...
for service in $(services_to_migrate); do
manage_systemd_service stop "${service%%-clone}"
...
done
"""
The problem with the above code is that it is open to the following race
condition:
1. Code gets run first on a non-bootstrap controller node so we start
stopping a bunch of services
2. Pacemaker notices will notice that services are down and will mark
the service as stopped
3. Code gets run on the bootstrap node (controller-0) and the
check_clean_cluster function will fail and exit
4. Eventually also the script on the non-bootstrap controller node will
timeout and exit because the cluster never shut down (it never actually
started the shutdown because we failed at 3)
Let's make sure we first only call the HA NG migration step as a
separate heat step. Only afterwards we start shutting down the systemd
services on all nodes.
We also need to move the STONITH_STATE variable into a file because it
is being used across two different scripts (1 and 2) and we need to
store that state.
Co-Authored-By: Athlan-Guyot Sofer <sathlang@redhat.com>
Closes-Bug: #1640407
Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
|
|
gnocchi when configured with swift will require keystone
to be available to authenticate to migrate to v3. At this
step keystone is not available and gnocchi upgrade fails
with auth error. Instead start apache in step 3, start
apache first and then run gnocchi upgrade in a separate
step and let upgrade happen here.
Closes-Bug: #1634897
Change-Id: I22d02528420e4456f84b80905a7b3a80653fa7b0
|
|
We have the following function in the upgrade process after we updated
the packages and called the db-sync commands:
services=$(services_to_migrate)
...
for service in $(services); do
manage_systemd_service start "${service%%-clone}"
check_resource_systemd "${service%%-clone}" started 600
done
The above is broken because $services contains a list of services to
start, so $(services) will return gibberish and the for loop will never
execute anything.
One of the symptoms for this is the openstack-nova-compute service not
restarting on the compute nodes during the yum -y upgrade. The reason
for this is that during the service restart, nova-compute waits for
nova-conductor to show up in the rabbitmq queues, which cannot happen
since the service was actually never started.
Change-Id: I811ff19d7b44a935b2ec5c5e66e5b5191b259eb3
Closes-Bug: #1630580
|
|
The default path if the operator does nothing is to keep the
sahara services on mitaka to newton upgrades.
If the operator wishes to remove sahara services then they
need to specify the provided major-upgrade-remove-sahara.yaml
environment file in the stack upgrade commands.
The existing migration to ha arch already removes the constraints
and pcs resource for sahara api/engine so we just need to stop
it from starting again if we want to remove it.
This adds a KeepSaharaServiceOnUpgrade parameter to determine if
Sahara is disabled from starting up after the controllers are
upgraded (defaults true).
Finally it is worth noting that we default the sahara services
as 'on' during converge here in the resource_registry of the
converge environment file; any subsequent stack updates where
the deployment contains sahara services will need to
include the -e /environments/services/sahara.yaml environment
file.
Related-Bug: 1630247
Change-Id: I59536cae3260e3df52589289b4f63e9ea0129407
|
|
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
has the following code:
...
check_resource mongod started 600
if [[ -n $(is_bootstrap_node) ]]; then
...
tstart=$(date +%s)
while ! clustercheck; do
sleep 5
tnow=$(date +%s)
if (( tnow-tstart > galera_sync_timeout )) ; then
echo_error "ERROR galera sync timed out"
exit 1
fi
done
# Run all the db syncs
cinder-manage db sync
...
fi
start_or_enable_service rabbitmq
check_resource rabbitmq started 600
start_or_enable_service redis
check_resource redis started 600
start_or_enable_service openstack-cinder-volume
check_resource openstack-cinder-volume started 600
systemctl_swift start
for service in $(services_to_migrate); do
manage_systemd_service start "${service%%-clone}"
check_resource_systemd "${service%%-clone}" started 600
done
"""
The problem with the above code is that it is open to the following race
condition:
1) Bootstrap node is busy checking the galera status via cluster check
2) Non-bootstrap node has already reached: start_or_enable_service
rabbitmq and later lines. These lines will be skipped because
start_or_enable_service is a noop on non-bootstrap nodes and
check_resource rabbitmq only checks that pcs status |grep rabbitmq
returns true.
3) Non-bootstrap node can then reach the manage_systemd_service start
and it will fail with stuff like:
"Job for openstack-nova-scheduler.service failed because the control
process exited with error code. See \"systemctl status
openstack-nova-scheduler.service\" and \"journalctl -xe\" for
details.\n" (because the db tables are not migrated yet)
This happens because 3) was started on non-bootstrap nodes before the
db-sync statements are complete on the bootstrap node. I did not feel
like changing the semantics of check_resource and remove the noop on
non-bootstrap nodes as other parts of the tree might rely on this
behaviour.
Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
Closes-Bug: #1627965
|