path: root/extraconfig/tasks/major_upgrade_pacemaker.yaml
Age | Commit message | Author | Files | Lines
2016-09-29 | Fix races in major-upgrade-pacemaker Step2 | Michele Baldessari | 1 | -0/+19
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh has the following code:

    ...
    check_resource mongod started 600

    if [[ -n $(is_bootstrap_node) ]]; then
    ...
        tstart=$(date +%s)
        while ! clustercheck; do
            sleep 5
            tnow=$(date +%s)
            if (( tnow-tstart > galera_sync_timeout )) ; then
                echo_error "ERROR galera sync timed out"
                exit 1
            fi
        done

        # Run all the db syncs
        cinder-manage db sync
    ...
    fi

    start_or_enable_service rabbitmq
    check_resource rabbitmq started 600
    start_or_enable_service redis
    check_resource redis started 600
    start_or_enable_service openstack-cinder-volume
    check_resource openstack-cinder-volume started 600

    systemctl_swift start

    for service in $(services_to_migrate); do
        manage_systemd_service start "${service%%-clone}"
        check_resource_systemd "${service%%-clone}" started 600
    done

The problem with the above code is that it is open to the following race condition:

1) Bootstrap node is busy checking the galera status via clustercheck
2) Non-bootstrap node has already reached "start_or_enable_service rabbitmq" and later lines. These lines are effectively skipped because start_or_enable_service is a noop on non-bootstrap nodes and check_resource rabbitmq only checks that pcs status |grep rabbitmq returns true.
3) Non-bootstrap node can then reach the manage_systemd_service start and it will fail with errors like: "Job for openstack-nova-scheduler.service failed because the control process exited with error code. See \"systemctl status openstack-nova-scheduler.service\" and \"journalctl -xe\" for details.\n" (because the db tables are not migrated yet)

This happens because 3) was started on non-bootstrap nodes before the db-sync statements are complete on the bootstrap node. I did not feel like changing the semantics of check_resource and removing the noop on non-bootstrap nodes, as other parts of the tree might rely on this behaviour.
Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
Closes-Bug: #1627965
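The galera wait in the script quoted above is a poll-with-timeout loop. A minimal sketch of that pattern as a reusable function follows. wait_until and its arguments are hypothetical names for this illustration, not helpers from tripleo-heat-templates, and this is not the fix the commit applied.

```shell
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Hypothetical helper: run COMMAND every INTERVAL seconds until it succeeds.
# Returns 0 on success, 1 once TIMEOUT seconds have elapsed without success.
wait_until() {
    local timeout="$1"
    local interval="$2"
    shift 2
    local tstart tnow
    tstart=$(date +%s)
    while ! "$@"; do
        tnow=$(date +%s)
        if [ $((tnow - tstart)) -ge "$timeout" ]; then
            echo "ERROR timed out after ${timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
    done
    return 0
}

# Usage sketch, mirroring the loop in the commit message:
#   wait_until "$galera_sync_timeout" 5 clustercheck || exit 1
```

A loop like this only protects the node that runs it. The race described above arises because the non-bootstrap nodes have no comparable wait before they start the systemd services.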
2016-09-25 | get_param calls with multiple arguments need brackets around them | Michele Baldessari | 1 | -4/+4
This issue was spotted during major upgrade where we had calls like this:

    servers: {get_param: servers, Controller}

These get_param calls are hanging indefinitely and make the whole upgrade end in a timeout. We need to put brackets around the get_param function when there are multiple arguments:
http://docs.openstack.org/developer/heat/template_guide/hot_spec.html#get-param

This is already done in most of the tree, and the few places where this was not happening were parts not under CI. After this change the following grep returns only one false positive:

    grep -ir get_param: |grep -v -- '\[' |grep ','

Change-Id: I65b23bb44f37b93e017dd15a5212939ffac76614
Closes-Bug: #1626628
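To see what the grep pipeline from this commit message catches, here is a small sketch that runs it against two throwaway templates. The wrapper name find_unbracketed_get_param and the file names are made up for this example. The bracketed form in fixed.yaml is the list syntax the HOT specification describes for multi-argument get_param.

```shell
# Hypothetical wrapper around the grep pipeline quoted in the commit message.
# Prints every line under directory $1 that mentions get_param, has no '[',
# and contains a comma, i.e. a likely multi-argument call without brackets.
find_unbracketed_get_param() {
    grep -ir 'get_param:' "$1" | grep -v -- '\[' | grep ','
}

# Demo on two throwaway files (names are made up for this example).
demo_dir=$(mktemp -d)
printf 'servers: {get_param: servers, Controller}\n' > "$demo_dir/broken.yaml"
printf 'servers: {get_param: [servers, Controller]}\n' > "$demo_dir/fixed.yaml"
find_unbracketed_get_param "$demo_dir"   # reports only the line in broken.yaml
rm -rf "$demo_dir"
```

The check is purely textual, so any unrelated line that has a comma and no bracket will also match, which is consistent with the single false positive the commit message mentions.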
2016-09-16 | Merge "Refactor upgrade checks." | Jenkins | 1 | -0/+1
2016-09-16 | Merge "Convert UpdateWorkflow to support composable roles" | Jenkins | 1 | -13/+5
2016-09-16 | Fix use of batch_create in CephMon major upgrade template | Mathieu Bultel | 1 | -1/+2
The batch_create and rolling_update keys were incorrectly defined as properties of the resource instead of update policies.

Change-Id: I19261adc78e4cdc3616f16221e85490a6b48d47b
Closes-Bug: 1623506
2016-09-16 | Convert UpdateWorkflow to support composable roles | Steven Hardy | 1 | -13/+5
We need to remove the hard-coded roles from overcloud.j2.yaml as it is now valid to, e.g., remove BlockStorage completely. The previous behavior for the per-role upgrade scripts is maintained, but we'll need to rework this for newton->ocata upgrades, where we can no longer be sure the servers mapping will contain all roles.

Change-Id: I25e6c84757e3c00fba2aae834cd8206c62e44acf
Partially-Implements: blueprint custom-roles
2016-09-12 | Refactor upgrade checks. | Sofer Athlan-Guyot | 1 | -0/+1
We make it clear that recoverable checks happen before starting the upgrade, so that the upgrade can be run again after the offending error has been manually corrected.

Add new check for the pcsd cluster status.

Add new check for galera password file: BZ 1357112

Closes-Bug: 1614907
Change-Id: If736c79121e1ffe0eaeb814bdb73ccbc0b64edcd
2016-08-30 | Add Ceph cluster health validation on upgrade | Giulio Fidente | 1 | -1/+14
This will prevent the Ceph Mon upgrade script from starting if the Ceph cluster is in error state. It also adds a parameter to ignore warning states, useful when performing an upgrade of a cluster where the number of healthy OSDs does not guarantee the desired replica size.

Closes-Bug: 1618533
Change-Id: I1beb8ad0812f19b1018ba19b5a9fc85fa132d7f7
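The gate this commit describes can be sketched as a small decision function. This is an illustration only: ceph_health_allows_upgrade and its true/false flag are hypothetical stand-ins for the real script logic and the new template parameter, whose actual name is not given here. The sketch assumes the status is one of the HEALTH_OK, HEALTH_WARN and HEALTH_ERR values that `ceph health` reports.

```shell
# ceph_health_allows_upgrade STATUS [IGNORE_WARNINGS]
# STATUS: first word of `ceph health` output (HEALTH_OK, HEALTH_WARN, HEALTH_ERR).
# IGNORE_WARNINGS: hypothetical "true"/"false" flag standing in for the new
# parameter; defaults to "false".
# Returns 0 if the upgrade may start, 1 otherwise.
ceph_health_allows_upgrade() {
    local status="$1"
    local ignore_warnings="${2:-false}"
    case "$status" in
        HEALTH_OK)
            return 0 ;;
        HEALTH_WARN)
            # Warnings block the upgrade unless explicitly ignored.
            if [ "$ignore_warnings" = "true" ]; then
                return 0
            fi
            return 1 ;;
        *)
            # HEALTH_ERR or anything unexpected: refuse to start.
            return 1 ;;
    esac
}

# Usage sketch (requires a reachable cluster):
#   status=$(ceph health | awk '{print $1}')
#   ceph_health_allows_upgrade "$status" "$ignore_warnings" || exit 1
```

Treating any unrecognised status as a failure keeps the sketch conservative: the upgrade starts only on a state that is known to be acceptable.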
2016-08-29 | Upgrade ceph-mon | Giulio Fidente | 1 | -0/+18
Adds a pre-requisite software deployment to the pacemaker scenario upgrade which, before the openstack services are upgraded, upgrades the ceph-mon daemon from Hammer to Jewel.

Change-Id: I9855d80a6ae156b4a9e0409c3c927068b9db95a0
2016-06-29 | Dump and restore galera db during major upgrades | Michele Baldessari | 1 | -0/+12
When the overcloud is upgraded we do a yum update of the packages. This step might introduce a newer galera version. In such a situation we need to dump the db and restore it. The high-level workflow should be the following:

1) During the main upgrade step, before shutting down the cluster, we need to dump the db
2) We upgrade the packages
3) We briefly start mysql on a single node while making sure that /root/.my.cnf is briefly moved out of the way (because it contains a password) and import the data. After the import we shut down this mysql instance
4) We let the cluster start up normally

The above steps will take place in the following scenarios. Given a locally installed mariadb version X.Y.Z and release R, we will dump and restore the DB under the following conditions:

A) MySqlMajorUpgrade template parameter is set to 'auto' and the upgraded package differs in X, Y *or* Z. We basically don't dump automatically if only the release field changes.
B) MySqlMajorUpgrade template parameter is set to 'yes'

When MySqlMajorUpgrade is set to 'no', no dumping will be performed. Note that this will give a non-functional upgrade if a major mariadb upgrade is taking place.

Partial-Bug: #1587449
Co-Author: Damien Ciabrin <dciabrin@redhat.com>
Co-Author: Mike Bayer <mbayer@redhat.com>
Depends-On: I8cb4cb3193e6b823aad48ad7dbbbb227364d2a58
Depends-On: I38dcacfabc44539aab1f7da85168fe44a1b43a51
Change-Id: I374628547aed091129d0deaa29764bfc998d76ea
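The dump conditions above reduce to a small decision on the MySqlMajorUpgrade value and the two package versions. The sketch below assumes versions are passed as "X.Y.Z-R" strings. should_dump_db and that string format are hypothetical choices for this illustration, not the logic of the actual upgrade scripts.

```shell
# should_dump_db MODE INSTALLED NEW
# MODE: value of MySqlMajorUpgrade (auto, yes or no).
# INSTALLED, NEW: "X.Y.Z-R" version-release strings (format assumed here).
# Returns 0 if the db should be dumped and restored, 1 if not,
# 2 for an unrecognised MODE.
should_dump_db() {
    local mode="$1"
    local installed="${2%%-*}"   # strip the release, keep X.Y.Z
    local new="${3%%-*}"
    case "$mode" in
        yes)
            return 0 ;;
        no)
            return 1 ;;
        auto)
            # Dump only when X, Y or Z differs; a release-only change is ignored.
            if [ "$installed" != "$new" ]; then
                return 0
            fi
            return 1 ;;
        *)
            echo "unknown MySqlMajorUpgrade value: $mode" >&2
            return 2 ;;
    esac
}

# Usage sketch with made-up version strings:
#   should_dump_db auto 5.5.47-1.el7 10.1.12-4.el7 && echo "dump and restore"
```

Comparing the whole X.Y.Z string for equality is enough here because the rule only asks whether any of the three fields changed, not which version is newer.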
2016-03-10 | Merge "Upgrade of Cinder block storage nodes" | Jenkins | 1 | -1/+15
2016-03-07 | Merge "Function library for major upgrades" | Jenkins | 1 | -0/+2
2016-03-03 | Function library for major upgrades | Jiri Stransky | 1 | -0/+2
This commit introduces a bash file to be sourced into major upgrade scripts. Into this file we can put specific pieces of migration logic in the form of bash functions, which can then be called from the upgrade scripts.

Change-Id: Ibf7aa84d3880e9218c488dec9d707300e1784744
2016-03-03 | Introduce a UpgradeScriptDeliveryWorfklow as part of tripleo upgrades | marios | 1 | -25/+0
This splits the upgrade script delivery out of the UpgradeWorkflow and into a new task which delivers the upgrade script for compute and object-storage nodes. This is intended to be the first part of the upgrades process, since we need to upgrade swift nodes before the controllers and then only one at a time. So this will deliver the upgrade script which can be invoked by the operator using the existing script in tripleo-common 'upgrade-non-controller.sh'.

This can be invoked by passing the -e environments/major-upgrade-script-delivery.yaml (added here) to the openstack overcloud deploy command.

Change-Id: I20a0d4978e907111404f8108c502ab53b69a3296
2016-03-03 | Upgrade of Cinder block storage nodes | Jiri Stransky | 1 | -1/+15
This introduces upgrades for Cinder block storage nodes.

Currently Cinder doesn't support upgrade level pinning and cannot safely deal with version skew. This means that we have to upgrade Cinder storage nodes in sync with controller nodes (after they were taken down for upgrade, before they are brought back up) to ensure that Cinder services perform AMQP communication only within the same major version of Cinder.

According to our current knowledge, Cinder block storage nodes are the only node type that will have to be upgraded in sync with controllers.

Change-Id: Icec913c015eff744b0f31b513176b4b657df43af
2016-02-26 | Write the compute upgrade script for tripleo major upgrade workflow | marios | 1 | -0/+26
As part of the major upgrade workflow, non-controller nodes are to be updated by the operator, out-of-band and only after an initial heat stack-update that invokes the upgrade of the controller nodes.

This review adds a ComputeDeliverUpgradeConfigDeployment_Step3 SoftwareDeploymentGroup to be applied only to compute nodes, and that depends on the controllers having been upgraded after ControllerPacemakerUpgradeConfig_Step2. Its purpose is to deliver, but not invoke, the upgrade script on compute nodes at /root/tripleo_upgrade_node.sh.

The non-controller nodes will then be upgraded later by an operator who will run the script provided for that purpose, like at https://review.openstack.org/#/c/284722/1 for example.

Change-Id: Ic6115fc8cf5320abfcf500112ff563bde8b88661
2016-02-23 | Add UpgradeLevelNovaCompute parameter | Jiri Stransky | 1 | -1/+16
This parameter can be used for pinning (and later unpinning) the Nova Compute RPC version.

Change-Id: I2f181f3b01f0b8059566d01db0152a12bbbd1c3e
2016-02-23 | Introduce update/upgrade workflow | Jiri Stransky | 1 | -6/+14
Change-Id: I7226070aa87416e79f25625647f8e3076c9e2c9a
2016-02-23 | Add resources for major upgrade in Pacemaker scenario | Derek Higgins | 1 | -0/+45
Add Heat software deployments to be used to upgrade major versions of OpenStack on the controller nodes. All controller services are taken down while the upgrade is in progress.

The new updated yum repositories should be configured by another process, e.g. the deployment artifacts transfer via Swift.

Change-Id: Ia0a04e4a11d67e7a5acc53c1f8a8f01ed5ca8675
Co-Authored-By: Giulio Fidente <gfidente@redhat.com>
Co-Authored-By: Jiri Stransky <jistr@redhat.com>