aboutsummaryrefslogtreecommitdiffstats
path: root/extraconfig/tasks
AgeCommit message (Collapse)AuthorFilesLines
2016-03-09Upgrades: object storage node upgrade fixJiri Stransky1-3/+4
The variables in the heredoc should be escaped because they should evaluate only when the inner script runs, not when the outer "writer" script runs. Python-zaqarclient is installed for os-collect-config to work, as we do on the other node types. Swift-proxy is removed from list of services to stop/start, as swift-proxy isn't supposed to run on the swift storage nodes. Change-Id: I8426b859d11378ebdc3da94dcc090133dab0c628
2016-03-09Fixup systemctl_swift stop/start during the controller upgrademarios1-4/+17
During the controller upgrade in major_upgrade_controller_pacemaker_1.sh we use systemctl to stop all swift services and then start them again in _pacemaker_2.sh In the case of stand-alone swift nodes the deployer may have used the ControllerEnableSwiftStorage: false so that only the swift-proxy service is left on controllers (wrt swift). The systemctl_swift function used during upgrades is changed to factor this in. Change-Id: Ib22005123429f250324df389855d0dccd2343feb
2016-03-09Upgrades: initialization command/snippetJiri Stransky1-0/+52
This allows to run a command or a script snippet on all overcloud nodes at the beginning of the upgrade. The intended use is to switch to a new set of repositories on the overcloud. This is done differently in different contexts (e.g. upstream vs. downstream), but generally it should be simple enough to not warrant creation of switchable "UpgradeInit" resource in the resource registry, and a string command/snippet parameter should suffice. Change-Id: I72271170d3f53a5179b3212ec9bae9a6204e29e6
2016-03-08Add a ceph-storage node upgrade script for the upgrade workflowmarios2-4/+50
This adds delivery of an upgrade script to any ceph-storage nodes during the script delivery that comes first during the upgrade workflow. The controllers have the ceph-mon whilst the ceph-osds are on the ceph-storage nodes. The ceph-mons will be updated first as part of the heat-driven controller upgrade, and ceph-osds on ceph nodes are upgraded with the upgrade-non-controller.sh tripleo-common script as with compute and swift nodes. Also slight rename for the ObjectStorageConfig/Deployment here for consistency. Change-Id: I12abad5548dcb019ade9273da06fe66fd97f54cc
2016-03-07Merge "Revert "Deploy Aodh services, replacing Ceilometer Alarm""Jenkins1-7/+2
2016-03-07Merge "Function library for major upgrades"Jenkins2-0/+16
2016-03-07Merge "Introduce a UpgradeScriptDeliveryWorfklow as part of tripleo upgrades"Jenkins3-25/+103
2016-03-04Revert "Deploy Aodh services, replacing Ceilometer Alarm"James Slagle1-7/+2
This just a revert to see if reverting this gets back to a normal CI run time. This reverts commit f72aed85594f223b6f888e6d0af3c880ea581a66. Change-Id: I04a0893f6cf69f547a4db26261005e580e1fc90b
2016-03-03Merge "Deploy Aodh services, replacing Ceilometer Alarm"Jenkins1-2/+7
2016-03-03Deploy Aodh services, replacing Ceilometer AlarmEmilien Macchi1-2/+7
Ceilometer Alarm is deprecated in Liberty by Aodh. This patch: * manage Aodh Keystone resources * deploy Aodh API under WSGI, Notifier, Listener and Evaluator * manage new parameters to customize Aodh deployment * uses ceilometer DB for the upgrade path * pacemaker config Depends-On: I9e34485285829884d9c954b804e3bdd5d6e31635 Depends-On: I891985da9248a88c6ce2df1dd186881f582605ee Depends-On: Ied8ba5985f43a5c5b3be5b35a091aef6ed86572f Co-Authored-By: Pradeep Kilambi <pkilambi@redhat.com> Change-Id: I58d419173e80d2462accf7324c987c71420fd5f6
2016-03-03Function library for major upgradesJiri Stransky2-0/+16
This commit introduces a bash file to be sourced into major upgrade scripts. Into this file we can put specific pieces of migration logic in the form of bash functions, which can then be called from the upgrade scripts. Change-Id: Ibf7aa84d3880e9218c488dec9d707300e1784744
2016-03-03Introduce a UpgradeScriptDeliveryWorfklow as part of tripleo upgradesmarios3-25/+103
This splits the upgrade script delivery out of the UpgradeWorkflow and into a new task which delivers the upgrade script for compute and object-storage nodes. This is intended to be the first part of the upgrades process, since we need to upgrade swift nodes before the controllers and then only one at a time. So this will deliver the upgrade script which can be invoked by the operator using the existing script in tripleo-common 'upgrade-non-controller.sh'. This can be invoked by passing the -e environments/major-upgrade-script-delivery.yaml (added here) to the openstack overcloud deploy command. Change-Id: I20a0d4978e907111404f8108c502ab53b69a3296
2016-03-03Upgrade of Cinder block storage nodesJiri Stransky2-1/+23
This introduces upgrades for Cinder block storage nodes. Currently Cinder doesn't support upgrade level pinning and cannot safely deal with version skew. This means that we have to upgrade Cinder storage nodes in sync with controller nodes (after they were taken down for upgrade, before they are brought back up) to ensure that Cinder services perform AMQP communication only within the same major version of Cinder. According to our current knowledge, Cinder block storage nodes are the only node type that will have to be upgraded in sync with controllers. Change-Id: Icec913c015eff744b0f31b513176b4b657df43af
2016-03-02Moves the swift start/stop into the common_functions.sh filemarios3-10/+11
Since swift isn't managed by pacemaker we need to manually (systemctl) stop and start the swift services. This moves the duplicate blocks for start/stop into a common function (we already include that pacemaker_common_functions.sh here so may as well) Change-Id: Ic4f23212594c1bf9edc39143bf60c7f6d648fd1d
2016-03-02Upgrades: install zaqarclientJiri Stransky2-0/+3
Old overcloud images don't have python-zaqarclient installed, and new overclouds' os-collect-config are configured with Zaqar support. This together means that on upgrade we need to install python-zaqarclient, otherwise os-collect-config will be restarted during yum update and crash due to trying to import missing Python module from zaqarclient. Change-Id: I3e875e14cb60b1b78aec0d9ddc412ccf865abd01
2016-03-02Upgrades: quiet yum updateJiri Stransky1-1/+1
Quiet down yum during major upgrades to reduce the output size. This is consistent with what was introduced into minor updates in change I517271e8465885421a78b73c5af756816c37a977. Change-Id: Ie6b470e383fdf42870ac6f60ca43e44b4c446ebe
2016-02-29Merge "Write the compute upgrade script for tripleo major upgrade workflow"Jenkins2-0/+50
2016-02-26Merge "Add meta notify=true to rabbitmq resource"Jenkins1-0/+3
2016-02-26Write the compute upgrade script for tripleo major upgrade workflowmarios2-0/+50
As part of the major upgrade workflow non-controller nodes are to be updated by the operator, out-of-band and only after an initial heat stack-update that invokes the upgrade of the controller nodes. This review adds a ComputeDeliverUpgradeConfigDeployment_Step3 SoftwareDeploymentGroup to be applied only to compute nodes, and that depends on the controllers having been upgraded after ControllerPacemakerUpgradeConfig_Step2. Its purpose is to deliver but not invoke the upgrade script on compute nodes to /root/tripleo_upgrade_node.sh . The non-controller nodes will then be upgraded later by an operator that will run the script provided for that purpose, like at https://review.openstack.org/#/c/284722/1 for example. Change-Id: Ic6115fc8cf5320abfcf500112ff563bde8b88661
2016-02-23Add UpgradeLevelNovaCompute parameterJiri Stransky2-2/+17
This parameter can be used for pinning (and later unpinning) the Nova Compute RPC version. Change-Id: I2f181f3b01f0b8059566d01db0152a12bbbd1c3e
2016-02-23Introduce update/upgrade workflowJiri Stransky5-6/+59
Change-Id: I7226070aa87416e79f25625647f8e3076c9e2c9a
2016-02-23Add resources for major upgrade in Pacemaker scenarioDerek Higgins3-0/+174
Add Heat software deployments to be used to upgrade major versions of OpenStack on the controller nodes. All controller services are taken down while the upgrade is in progress. The new updated yum repositories should be configured by another process e.g. the deployment artifacts transfer via Swift. Change-Id: Ia0a04e4a11d67e7a5acc53c1f8a8f01ed5ca8675 Co-Authored-By: Giulio Fidente <gfidente@redhat.com> Co-Authored-By: Jiri Stransky <jistr@redhat.com>
2016-02-23Add meta notify=true to rabbitmq resourceMichele Baldessari1-0/+3
See RHBZ 1311005 and 1247303. In short: sometimes when a controller node gets fenced, rabbitmq is unable to rejoin the cluster. To fix this we need two steps: 1) The fix for the RA in BZ 1247303 2) Add notify=true to the meta parameters of the rabbitmq resource on fresh installs and updates Note that if this change is applied on systems that do not have the fix for the rabbitmq resource agent, no action is taken. So when the resource agent will be updated, the notify operation will start to work as soon as the first monitor action will take place. Fixes RH Bug #1311005 Change-Id: I513daf6d45e1a13d43d3c404cfd6e49d64e51d5a
2016-02-16Merge "Split pacemaker common check_service function out of _restart.sh"Jenkins3-34/+44
2016-02-16Merge "Use timeout to check for services status"Jenkins1-11/+12
2016-02-08Pass -q option to yumZane Bitter1-2/+2
The maximum payload size of the return signal from a Heat software deployment is 1MB, and the output of yum starts breaking this limit at ~1000 packages to update - which is not an atypical number. To prevent this, pass the -q (quiet) option to reduce the amount of output to a manageable level. Change-Id: I517271e8465885421a78b73c5af756816c37a977 Resolves-rhbz: #1304878 Closes-Bug: #1543034
2016-01-26Increase galera sync timeout in yum_update.shJiri Stransky1-1/+1
We've seen the 360 second threshold broken and a failed update because of that, even though Galera eventually synced fine, clusterchecks OK and pcs status clean. This will give Galera more time to perform the sync. Change-Id: I17207ec9b4038fb9540582c9b0b717f9b85a78b9 Closes-Bug: #1538218
2016-01-26Split pacemaker common check_service function out of _restart.shGiulio Fidente3-34/+44
Also split out echo_error function to DRY the error output code and allow changing the way we report errors in a single place. Change-Id: I448bf0eb49390f03155335736bb4ab4e979db128 Co-Authored-By: Jiri Stransky <jistr@redhat.com>
2016-01-26Use timeout to check for services statusGiulio Fidente1-11/+12
Replaces the bash loop with the timeout command in the piloted cluster restart to minimize downtime. Change-Id: I9067eed9626ae5aff833d7a9a9ad1e1a6c026327 Co-Authored-By: Jiri Stransky <jistr@redhat.com>
2016-01-17Let Puppet update all packages on non-controllersJames Slagle1-8/+5
With I02f7cf07792765359f19fdf357024d9e48690e42[1] in puppet-tripleo, puppet is capable of updating all packages itself on non controller nodes now. This is a safer mechanism than using the exclude logic in yum_update.sh since that can cause depdency problems across sub packages. [1] https://review.openstack.org/#/c/261041/ Closes-Bug: 1534785 Change-Id: I9075a1bb85baa65a9d0afc5d0fd31a1f99a98819
2016-01-06Merge "Ensure cluster remains stable during services restarts"Jenkins1-2/+3
2016-01-05Bump the pacemaker service op_params to 200s for start and stopmarios1-2/+2
Based on observed timeouts during updates bump the stop and start timeouts for pacemaker service resources (via op_params) to 200. This is based on the reasoning that the full timeout may be as long as two elapsed timeout intervals. After an initial timeout, the sigterm that follows is then allowed another DefaultTimeoutStopSec seconds. The 200s is produced by allowing this 2xDefaultTimeoutStopSec (@90s for systemd) and some scheduling delta. Many thanks to Michele Baldessari. Closes-Bug: 1531204 Change-Id: If6b43982c958f63bc78ad997400bf1279c23df7e
2016-01-05Ensure cluster remains stable during services restartsGiulio Fidente1-2/+3
Using crm_resource --wait we wait for the cluster to get into a stable state before moving into the next step of the piloted restart procedure. Change-Id: I80199653024383fd07900dad0b8d23fb8afade26 Co-Authored-By: Jiri Stransky <jistr@redhat.com>
2016-01-05Merge "Wait for cluster to settle in yum_update.sh"Jenkins1-1/+11
2016-01-04Wait for cluster to settle in yum_update.shJiri Stransky1-1/+11
Occasionally we hit "Error: unable to push cib" during update. This is probably due to the fact that when we try to replace cib in yum_update.sh, services on the previous updated controller are still coming up and changing cib, and racing/conflicting with the cib push from yum_update.sh. This commit adds waiting for the cluster to settle before exiting from yum_update.sh, to avoid this kind of conflict. Also a check for cib-push success is added, to make the update fail properly instead of hanging indefinitely as we've observed with this issue. Change-Id: I953087e0e565474ac553fd57bea2459d2e3a6081 Closes-Bug: #1527644
2015-12-15Add fixup for pcs order constraints after update to new templatesmarios1-0/+6
In https://review.openstack.org/#/c/248572/ yum_update.sh sets the pcs constraints before restarting the cluster. However after post-update pacemaker run, the previous constraint of neutron-server...neutron-ovs-cleanup is re-added. Explicitly remove this before the post-update restart of certain services Change-Id: I84dd650dcc66ce3f48926cf369b7d691014c2254
2015-12-14Pacemaker maintenance mode for the duration of Puppet run on updateSteven Hardy4-0/+147
This enables pacemaker maintenantce mode when running Puppet on stack update. Puppet can try to restart some overcloud services, which pacemaker tries to prevent, and this can result in a failed Puppet run. At the end of the puppet run, certain pacemaker resources are restarted in an additional SoftwareDeployment to make sure that any config changes have been fully applied. This is only done on stack updates (when UpdateIdentifier is set to something), because the assumption is that on stack create services already come up with the correct config. (Change I9556085424fa3008d7f596578b58e7c33a336f75 has been squashed into this one.) Change-Id: I4d40358c511fc1f95b78a859e943082aaea17899 Co-Authored-By: Jiri Stransky <jistr@redhat.com> Co-Authored-By: James Slagle <jslagle@redhat.com>
2015-12-03Merge "Add pcmk constraints against haproxy-clone only if applicable"Jenkins1-32/+34
2015-12-03Merge "Apply mongod timeout via cib-push"Jenkins1-1/+1
2015-12-02Add pcmk constraints against haproxy-clone only if applicableGiulio Fidente1-32/+34
When the Overcloud does not host an instance of haproxy, pcmk will not have any resource named haproxy-clone so we should not add any constraint relying on it. Change-Id: I801f07b7570f3805aa71c22998fec6b6f192b350
2015-11-25Merge "Update: clean keepalived and radvd instances after pcs cluster stop"Jenkins1-0/+7
2015-11-25Apply mongod timeout via cib-pushGiulio Fidente1-1/+1
We forgot to apply the mongod timeout in the cib dump first, to apply it later in a single cib-push step. Change-Id: Ib104e51782c6d3f646907cdb06c74fd4cbf9028c
2015-11-24Update: clean keepalived and radvd instances after pcs cluster stopJiri Stransky1-0/+7
Older neutron versions have a bug which makes them leave keepalived and radvd running even after all neutron services are stopped, preventing neutron router failover from happening. Router can then get stuck on the inactive node, like this: [stack@instack ~]$ neutron l3-agent-list-hosting-router default_router +--------------------------------------+------------------------------------+----------------+-------+----------+ | id | host | admin_state_up | alive | ha_state | +--------------------------------------+------------------------------------+----------------+-------+----------+ | 48ca9477-b93b-4305-9e6d-9f1c5d3388f0 | overcloud-controller-1.localdomain | True | :-) | standby | | eba0575c-654f-4da6-b1cd-f7fdf1cd3726 | overcloud-controller-2.localdomain | True | :-) | standby | | 68815390-251f-4425-a5f8-38bdbf3bdb90 | overcloud-controller-0.localdomain | True | xxx | active | +--------------------------------------+------------------------------------+----------------+-------+----------+ We need to kill the leftover processes manually to prevent the state described above from happening. See https://review.gerrithub.io/#/c/248931 Change-Id: I2deaa176222983daa0c33ab52a6aa5dbe7365302
2015-11-23Fixup neutron constraints in older overclouds before updatingmarios1-0/+10
The neutron pcs constraints were reworked in https://review.openstack.org/#/c/229466/ For overclouds deployed with older tripleo-heat-templates the current pcs ordering constraints will not have those changes, meaning that the behaviour discussed at https://bugs.launchpad.net/tripleo/+bug/1501378 is likely given we will stop and restart all services. This review applies those, in short, remove the ovs-cleanup after neutron-server and add openvswitch-agent instead. Detail in the bug report and linked BZ. Change-Id: I45822c5fe9029f11635400b7fbd386880ac80a4e Related-Bug: 1501378
2015-11-19Add constraints and timeouts from file in single stepGiulio Fidente1-78/+50
To avoid pcmk reconfiguring the resources on each config change, we want to apply the constraints and timeouts from file. We also *do not* want to alter the timeouts for a few ocf resources which are rabbitmq, neutron-netns-cleanup and neutron-ovs-cleanup Change-Id: I6875f19e1f34f0fdcf0928421f49b61d857ca7c8 Co-Authored-By: Andrew Beekhof <abeekhof@redhat.com>
2015-11-17Verify galera is sync'd in yum_update.shJames Slagle1-0/+12
When the cluster is brought back online after a yum update in yum_update.sh, we should verify that galera is fully sync'd before moving on. This ensures the sync is complete before moving on to update any other nodes in the cluster. Change-Id: Ie8fc2c5d5214deacea94ca658ac75359b318ced1
2015-11-13Set start/stop pacemaker resource timeouts for updatesJiri Stransky1-0/+72
This matches change I6fc18f1ad876c5a25723710a3b20d8ec9519dcba, but we need it to set it before attempting the cluster stop - yum update - cluster start cycle, to make sure this cycle doesn't hit the low timeout limits. This can be removed once updates from deployments made prior to I6fc18f1ad876c5a25723710a3b20d8ec9519dcba are no longer supported. Change-Id: I587136d8d045d213875c657ea5a405074f80c8ad
2015-11-11Add missing constraints in yum_update.shJames Slagle1-0/+30
Some missing pacemaker constraints were added in the following commits: https://review.openstack.org/#/c/219770/ https://review.openstack.org/#/c/219665/ https://review.openstack.org/#/c/218931/ https://review.openstack.org/#/c/218930/ Overclouds that were deployed prior to these constraints being added to tripleo-heat-templates still have the constraints missing. During an update, stopping and starting the cluster can fail without these constraints in place. As a workaround, conditionally add these contraints in yum_update.sh so that we're sure they're always present before updating. Change-Id: Id46c85dbbe5e85d362279661091b17ce1b697fe0
2015-10-01Force stop a single node pacemaker on yum updateSteve Baker1-1/+7
Currently package updates won't occur on a single node non-HA pacemaker managed Controller because stopping the node loses the quorum of 1. This change gets the count of current nodes in the cluster and if the count is 1 then specify --force when doing a pcs cluster stop. Change-Id: I0de2488e24f1ef53a935dbc90ec6de6142bb4264
2015-10-01Make package upgrade pacemaker-awareSteve Baker1-7/+45
This change adds alternative logic for handling package updates on a pacemaker managed node. "yum list updates" is now run and this script exits early if there are no packages to update. If the pacemaker service is not running then the previous puppet logic remains, so a package update is performed which excludes packages managed by puppet, and a flag is set to indicate that puppet should perform an ensure=>latest on all packages it manages. However if the pacemaker service is running, the following occurs: - pcs cluster stop is run for this node - a full yum update is performed - pcs cluster start is run for this node - pcs status is run until the hostname for this node appears in the Online list This means that puppet is not involved in the package update process when the node is managed by pacemaker. Change-Id: I5ad118552d053dbda280978751167d9fd9da9874