aboutsummaryrefslogtreecommitdiffstats
path: root/extraconfig/tasks/major_upgrade_check.sh
AgeCommit message (Collapse)AuthorFilesLines
2016-11-09Fix race during major-upgrade-pacemaker stepMichele Baldessari1-7/+12
Currently when we call the major-upgrade step we do the following: """ ... if [[ -n $(is_bootstrap_node) ]]; then check_clean_cluster fi ... if [[ -n $(is_bootstrap_node) ]]; then migrate_full_to_ng_ha fi ... for service in $(services_to_migrate); do manage_systemd_service stop "${service%%-clone}" ... done """ The problem with the above code is that it is open to the following race condition: 1. Code gets run first on a non-bootstrap controller node so we start stopping a bunch of services 2. Pacemaker notices will notice that services are down and will mark the service as stopped 3. Code gets run on the bootstrap node (controller-0) and the check_clean_cluster function will fail and exit 4. Eventually also the script on the non-bootstrap controller node will timeout and exit because the cluster never shut down (it never actually started the shutdown because we failed at 3) Let's make sure we first only call the HA NG migration step as a separate heat step. Only afterwards we start shutting down the systemd services on all nodes. We also need to move the STONITH_STATE variable into a file because it is being used across two different scripts (1 and 2) and we need to store that state. Co-Authored-By: Athlan-Guyot Sofer <sathlang@redhat.com> Closes-Bug: #1640407 Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
2016-09-29Relax pre-upgrade check for failed actionsMichele Baldessari1-2/+2
Before this change we checked the cluster for any failed actions and we stopped the upgrade process if there were any. This is likely eccessive as a failed action could have happened in the past and the cluster is now fully functional. Better to check if any of the resources are in Stopped state and break the upgrade process if any of them are. We also need to restrict this check to the bootstrap node because otherwise the following might happen: 1) Bootstrap node does the check, it is successful and it starts the full HA -> HA NG migration which *will* create failed actions and will start stopping resources 2) If the check now starts on a non-bootstrap node while 1) is ongoing, it will find either failed actions or stopped resources so it will fail. Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f Closes-Bug: #1628653
2016-09-12Refactor upgrade checks.Sofer Athlan-Guyot1-0/+104
We make it clear that recoverable checks happen before starting the upgrade to be able to run the upgrade after the offending error has been manually corrected. Add new check for the pcsd cluster status. Add new check for galera password file: BZ 1357112 Closes-Bug: 1614907 Change-Id: If736c79121e1ffe0eaeb814bdb73ccbc0b64edcd