path: root/extraconfig/tasks
Age | Commit message | Author | Files | Lines
2016-12-23 | Bump template version for all templates to "ocata" | Steven Hardy | 9 | -9/+9
Heat now supports release name aliases, so we can replace the inconsistent mix of date related versions with one consistent version that aligns with the supported version of heat for this t-h-t branch. This should also help new users who sometimes copy/paste old templates and discover intrinsic functions in the t-h-t docs don't work because their template version is too old. Change-Id: Ib415e7290fea27447460baa280291492df197e54
2016-12-21 | Merge "Use df instead of findmnt in cephstorage upgrade scripts" | Jenkins | 1 | -1/+1
2016-12-14 | Make the openvswitch 2.4->2.5 upgrade more robust | marios | 10 | -80/+57
In I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae and I11fcf688982ceda5eef7afc8904afae44300c2d9 we added a manual step for upgrading openvswitch in order to specify the --nopostun as discussed in the bug below. This change adds a minor update to make this workaround more robust. It removes any existing rpms that may be around from an earlier run, and also checks that the rpms installed are at least newer than the version we are on. This also refactors the code into a common definition in the pacemaker_common_functions.sh which is included even for the heredocs generating upgrade scripts during init. Thanks Sofer Athlan-Guyot and Jirka Stransky for help with that. Change-Id: Idc863de7b5a8c116c990ee8c1472cfe377836d37 Related-Bug: 1635205
2016-12-12 | Use df instead of findmnt in cephstorage upgrade scripts | Giulio Fidente | 1 | -1/+1
There are scenarios in which findmnt will return a list of all mounted filesystems, which causes the upgrade script to fail in recognizing if the Ceph OSD is backed by ext4. Change-Id: Iadebdc32b523c05216202b782ceb54bec4389413 Closes-Bug: #1649407
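A minimal sketch of the idea, assuming GNU coreutils `df` (which supports `--output=fstype`). The function names and the default path are hypothetical and are not taken from the upgrade script. Asking `df` about one path yields a single filesystem type rather than a list of every mount:

```shell
#!/bin/sh
# Print the filesystem type backing a given path.
# `df --output=fstype PATH` prints a header line plus one data line;
# keep only the data line and strip any padding.
fs_type_of() {
    df --output=fstype "$1" | tail -n 1 | tr -d ' '
}

# Succeeds (returns 0) only when the path is backed by ext4.
is_backed_by_ext4() {
    [ "$(fs_type_of "$1")" = "ext4" ]
}

# Hypothetical usage; OSD_PATH and its default are assumptions for illustration.
osd_path="${OSD_PATH:-/}"
if is_backed_by_ext4 "$osd_path"; then
    echo "$osd_path is on ext4"
else
    echo "$osd_path is on $(fs_type_of "$osd_path")"
fi
```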
2016-11-24 | Merge "Run os-net-config before restarting cluster on update" | Jenkins | 1 | -0/+11
2016-11-23 | Run os-net-config before restarting cluster on update | Brent Eagles | 1 | -0/+11
Running os-net-config before restarting the cluster prevents changes to the interface files caused by changes to implementation from bouncing network interfaces after the cluster has restarted. Closes-Bug: #1644138 Change-Id: I65fb104465ff3d37ddc791634302994334136014
2016-11-23 | Explicitly set rabbit hosts so its not overridden during upgrade | Pradeep Kilambi | 1 | -1/+7
During the ceilometer pre-upgrade, the rabbit host config in the ceilometer conf gets overridden because it is reset to the defaults. This explicitly sets the host info in the standalone manifest. Closes-Bug: #1644278 Change-Id: I862ea7165c5d42ba1f9a19111a8be8934c0ef883
2016-11-22 | Fix ovs 2.4 to 2.5 upgrade - minor update non controllers | marios | 1 | -14/+13
In I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae and I11fcf688982ceda5eef7afc8904afae44300c2d9 we landed a workaround for the openvswitch 2.4 to 2.5 upgrade discussed in the bug below. Unfortunately testing has revealed a problem with the minor update case specifically for non controllers. It seems we would exit before the ovs workaround has had a chance to execute. This moves the block up a few lines to avoid this condition. As with the other two reviews noted here, this will need to go into newton and then mitaka too. Change-Id: If905de82d96302334ebe02de9c43f00faed9b72b Related-Bug: 1635205
2016-11-16 | Merge "Fix up Newton->Ocata rabbitmq ha policy" | Jenkins | 2 | -1/+21
2016-11-16 | Merge "Replace ceilometer-dbsync by ceilometer-upgrade" | Jenkins | 1 | -1/+1
2016-11-15 | Replace ceilometer-dbsync by ceilometer-upgrade | Steven Hardy | 1 | -1/+1
https://review.openstack.org/#/c/388688/ has removed ceilometer-dbsync, so ceilometer-upgrade must be used instead. Additionally, ceilometer-dbsync enabled the --skip-gnocchi-resource-types option and ceilometer-upgrade doesn't, so I'm setting it by default to ensure backwards compatibility. Note this is based on the corresponding fix to puppet-ceilometer, ref https://review.openstack.org/#/c/396570 Change-Id: Ic0a15c75d1cd3e3f70eeafd9ba09d50c58cc1293 Closes-Bug: #1641076
2016-11-15 | Fix external Load Balancer deployment | Michele Baldessari | 1 | -2/+1
Deployments using an external LB will fail like this:

    deploy_stderr: |
      + RESTART_FOLDER=/var/lib/tripleo/pacemaker-restarts
      + [[ -d /var/lib/tripleo/pacemaker-restarts ]]
      ++ systemctl is-active haproxy
      + haproxy_status=unknown
    deploy_status_code: 3
    openstack software deployment show 4f339ca4-7600-4ca0-b0ef-f798bc47b6cf

The reason is that via https://review.openstack.org/#/c/393644/ we introduced the haproxy restart like this:

    haproxy_status=$(systemctl is-active haproxy)
    if [ "$haproxy_status" = "active" ]; then
        systemctl reload haproxy
    fi

The problem is that if haproxy is not running/installed, systemctl is-active can fail and the script will terminate with an error return code. Let's just move the call inside the if so the script does not fail in case haproxy is not there.

The snippet before the change (on a system without haproxy installed):

    [root@mrg-09 tmp]# ./test.sh
    ++ systemctl is-active haproxy
    + haproxy_status=unknown
    [root@mrg-09 tmp]# echo $?
    3

After this change:

    [root@mrg-09 tmp]# ./test.sh
    ++ systemctl is-active haproxy
    + '[' unknown = active ']'
    [root@mrg-09 tmp]# echo $?
    0

Change-Id: I837c63a9dbcde8c922f843c442974fa79cf1eede Closes-Bug: #1641904
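The difference between the two shapes can be reproduced without systemd. The sketch below assumes the deployment script runs with `set -e` (suggested by the reported exit code 3, but not shown in the message). `PROBE` is a hypothetical stand-in for `systemctl is-active haproxy` on a host without haproxy: it prints "unknown" and exits with status 3. Each shape runs in a child shell so that its exit code can be observed:

```shell
#!/bin/sh
# Hypothetical stand-in for `systemctl is-active haproxy` when haproxy is absent.
PROBE='echo unknown; exit 3'
export PROBE

# Old shape: a plain assignment takes the exit status of its command
# substitution, so `set -e` aborts the script at the assignment.
run_old_shape() {
    sh -c '
        set -e
        haproxy_status=$(sh -c "$PROBE")
        if [ "$haproxy_status" = "active" ]; then echo reload; fi
    '
}

# New shape: the substitution is evaluated inside the `if` test, where only
# the result of the string comparison matters and `set -e` does not apply.
run_new_shape() {
    sh -c '
        set -e
        if [ "$(sh -c "$PROBE")" = "active" ]; then echo reload; fi
    '
}

old_rc=0; run_old_shape || old_rc=$?
new_rc=0; run_new_shape || new_rc=$?
echo "old shape exit code: $old_rc"
echo "new shape exit code: $new_rc"
```

The exit codes match the 3 and 0 shown in the before/after transcripts.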
2016-11-14 | Fix up Newton->Ocata rabbitmq ha policy | Michele Baldessari | 2 | -1/+21
In ocata we changed the ha policy to "ha-exactly" via the following changes: - tht: Iace6daf27a76cb8ef1050ada0de7ff1f530916c6 - puppet-tripleo: Ib62001c03e1e08f58cf0c6e0ba07a8879a584084 We initially also took care of changing this policy (which is set in the pacemaker resource agent) for the M/N upgrade path: I2468a096b5d7042bc801a742a7a85fb1521c1c02 In the end we decided against changing the policy in Newton as well (it was only for ocata) as it was too close to the release date and we took the safer path. This patch does two things: 1) It renames the upgrade function to "newton_ocata" since that is the only upgrade path we need to take care of 2) It reinstates the actual upgrade function which was mistakenly removed via an unrelated change in the ceilometer upgrade path: If9d6987cd0a8fc5d3f9de518ba422d97d5149732 Closes-Bug: #1628998 Change-Id: I3a97505d2ae1ae27f3080ffe74c33fdabffd2420
2016-11-10 | Merge "Fix race during major-upgrade-pacemaker step" | Jenkins | 8 | -263/+315
2016-11-09 | Merge "Reload haproxy configuration as a post-deployment step" | Jenkins | 1 | -3/+12
2016-11-09 | Fix race during major-upgrade-pacemaker step | Michele Baldessari | 8 | -263/+315
Currently when we call the major-upgrade step we do the following:

"""
...
if [[ -n $(is_bootstrap_node) ]]; then
    check_clean_cluster
fi
...
if [[ -n $(is_bootstrap_node) ]]; then
    migrate_full_to_ng_ha
fi
...
for service in $(services_to_migrate); do
    manage_systemd_service stop "${service%%-clone}"
    ...
done
"""

The problem with the above code is that it is open to the following race condition:

1. Code gets run first on a non-bootstrap controller node, so we start stopping a bunch of services
2. Pacemaker will notice that services are down and will mark the service as stopped
3. Code gets run on the bootstrap node (controller-0) and the check_clean_cluster function will fail and exit
4. Eventually the script on the non-bootstrap controller node will also time out and exit because the cluster never shut down (it never actually started the shutdown because we failed at 3)

Let's make sure we first only call the HA NG migration step as a separate heat step. Only afterwards do we start shutting down the systemd services on all nodes. We also need to move the STONITH_STATE variable into a file because it is being used across two different scripts (1 and 2) and we need to store that state.

Co-Authored-By: Athlan-Guyot Sofer <sathlang@redhat.com> Closes-Bug: #1640407 Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
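A shell variable does not survive from one script invocation to the next, which is why the state has to live in a file. A minimal sketch of that hand-off follows; the file location and both function names are hypothetical and are not taken from the change:

```shell
#!/bin/sh
# Hypothetical location for the state file; the real scripts may use another path.
STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/stonith_state.$$}"

# First script: record the current value before the upgrade changes it.
save_stonith_state() {
    printf '%s\n' "$1" > "$STATE_FILE"
}

# Second script: read the value back; report "unknown" if the file is missing.
load_stonith_state() {
    if [ -r "$STATE_FILE" ]; then
        cat "$STATE_FILE"
    else
        echo "unknown"
    fi
}

save_stonith_state "true"
echo "restored stonith state: $(load_stonith_state)"
```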
2016-11-08 | ceilometer compute agent needs restart on compute upgrade | Pradeep Kilambi | 1 | -0/+4
After compute nodes are upgraded, the ceilometer compute agent does not poll and throws warnings. Restarting the compute agent at this step gets the service back to its normal state. Closes-Bug: #1640177 Change-Id: I7392de43e933b1d16002e12e407748ae289d5e99
2016-11-08 | Reload haproxy configuration as a post-deployment step | Carlos Camacho | 1 | -3/+12
After deploying a freshly installed Overcloud or updating the stack, the haproxy configuration is updated correctly but no change shows up in the haproxy stats. This submission adds the missing resources to run pre and post puppet tasks. Closes-bug: 1640175 Change-Id: I2f08704daeee502c618256695a30ce244a1d7ba5
2016-11-04 | Merge "Update openstack-puppet-modules dependencies" | Jenkins | 1 | -1/+2
2016-11-04 | Merge "Fixup the start of swift services" | Jenkins | 1 | -1/+1
2016-11-03 | Merge "Rework gnocchi-upgrade to run in a separate upgrade step" | Jenkins | 5 | -18/+68
2016-11-03 | Fixup the start of swift services | marios | 1 | -1/+1
Seems the conditional has changed and we should pick up the tripleo::profile::base::swift::storage::enable_swift_storage hiera data. After controller nodes are upgraded, the swift services were down even though there was no stand-alone swift node (the current conditional was failing as that hiera isn't set any more). Closes-Bug: 1638821 Change-Id: Id1383c1e54f9cae13fd375e90da525230e5d23eb
2016-11-01 | Update openstack-puppet-modules dependencies | Lukas Bezdicka | 1 | -1/+2
The OPM package is a metadata package with unversioned requirements, which means that updating it does not update the dependencies. This leaves us with old puppet modules and old puppet during the puppet run. Change-Id: I80f8a73142a09bb4178bb5a396d256ba81ba98a8 Closes-Bug: #1638266 Resolves: rhbz#1390559
2016-11-01 | Rework gnocchi-upgrade to run in a separate upgrade step | Pradeep Kilambi | 5 | -18/+68
gnocchi, when configured with swift, requires keystone to be available to authenticate in order to migrate to v3. At this step keystone is not available and gnocchi upgrade fails with an auth error. Instead, start apache first in step 3 and then run gnocchi upgrade in a separate step, letting the upgrade happen there. Closes-Bug: #1634897 Change-Id: I22d02528420e4456f84b80905a7b3a80653fa7b0
2016-10-27 | Add replacepkgs to the manual ovs upgrade workaround and fix a typo | Mathieu Bultel | 6 | -16/+13
The rpm command will exit with status 1 if the ovs package is already there, which in turn exits the step_1.sh script. To get around this, force the update with --replacepkgs. Also remove the \ just before the $, which causes a syntax error for the ceph storage. Change-Id: I11fcf688982ceda5eef7afc8904afae44300c2d9 Closes-bug: 1636748
2016-10-25 | Merge "Fix the stonith property during upgrades" | Jenkins | 1 | -4/+8
2016-10-22 | Fix the rabbitmq/redis pacemaker resource timeouts on updates | Michele Baldessari | 1 | -0/+19
With the following two changes we increased the timeout for redis and rabbit for both starting and stopping to 200s: https://review.openstack.org/386618 newton (merged) https://review.openstack.org/385555 master (merged) We also want to fix that on minor updates on all our supported releases upstream and downstream (newton, mitaka, liberty, kilo). This way we can guarantee that we have a uniform timeout for start and stop for rabbit and redis across all our releases. Change-Id: If59bf3386832ee78d3a654f01077aff2e8be76e8 Closes-Bug: #1634851
2016-10-20 | Fix the stonith property during upgrades | Michele Baldessari | 1 | -4/+8
We currently set the stonith property from all controller nodes during upgrade. This is racy and can actually end up disabling stonith after the upgrade even if it was enabled. Let's set the property only from the bootstrap node. Change-Id: Id4afb867b485ac853be874a0179a7ed7cc914068 Closes-Bug: #1635294
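A sketch of the bootstrap-only guard, written as a dry run so it needs no cluster. The body of `is_bootstrap_node` here is a hypothetical stand-in (the real helper is not shown in this log); it is assumed to print something on the bootstrap node and nothing elsewhere. `run`, `restore_stonith`, `THIS_NODE` and `BOOTSTRAP_NODE` are likewise illustrative names:

```shell
#!/bin/sh
# Hypothetical stand-in for the real is_bootstrap_node helper.
is_bootstrap_node() {
    if [ "${THIS_NODE:-controller-1}" = "${BOOTSTRAP_NODE:-controller-0}" ]; then
        echo "true"
    fi
}

# Dry-run wrapper: prints the command instead of executing it.
run() {
    echo "would run: $*"
}

# Only one node writes the cluster-wide property, so concurrent writers
# cannot overwrite each other's value.
restore_stonith() {
    if [ -n "$(is_bootstrap_node)" ]; then
        run pcs property set stonith-enabled="$1"
    else
        echo "not the bootstrap node, leaving the cluster property alone"
    fi
}

THIS_NODE=controller-0
restore_stonith true
THIS_NODE=controller-2
restore_stonith true
```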
2016-10-20 | Add special case handling for OVS upgrade in updates and upgrades | marios | 6 | -0/+86
This adds special-case handling for the openvswitch package, as discussed in the related bug below. This is added/handled here for both the minor update and the major mitaka...newton upgrade. Change-Id: I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae Closes-Bug: 1635205
2016-10-10 | Actually start the systemd services in step3 of the major-upgrade step | Michele Baldessari | 1 | -1/+1
We have the following function in the upgrade process after we updated the packages and called the db-sync commands:

    services=$(services_to_migrate)
    ...
    for service in $(services); do
        manage_systemd_service start "${service%%-clone}"
        check_resource_systemd "${service%%-clone}" started 600
    done

The above is broken because $services contains a list of services to start, so $(services) will return gibberish and the for loop will never execute anything. One of the symptoms for this is the openstack-nova-compute service not restarting on the compute nodes during the yum -y upgrade. The reason for this is that during the service restart, nova-compute waits for nova-conductor to show up in the rabbitmq queues, which cannot happen since the service was actually never started.

Change-Id: I811ff19d7b44a935b2ec5c5e66e5b5191b259eb3 Closes-Bug: #1630580
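The distinction is between command substitution, `$(services)`, which tries to run a command called `services`, and variable expansion, `$services`. The sketch below shows one way the loop can be written so that it iterates. The body of `services_to_migrate` is a hypothetical stand-in, and the loop only records names instead of starting anything:

```shell
#!/bin/sh
# Hypothetical stand-in for services_to_migrate; the real list is not shown here.
services_to_migrate() {
    printf '%s\n' openstack-nova-api-clone openstack-nova-scheduler-clone
}

services=$(services_to_migrate)

# Broken form from the description (kept as a comment on purpose):
#   for service in $(services); do ...
# Working form: expand the variable, unquoted so that it word-splits.
started=""
for service in $services; do
    # ${service%%-clone} strips the trailing "-clone" suffix.
    started="$started ${service%%-clone}"
done
echo "would start:$started"
```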
2016-10-07 | Ceilometer Wsgi Mitaka->Newton upgrades | Pradeep Kilambi | 3 | -16/+166
In Newton, ceilometer api is changed to run under apache wsgi instead of eventlet. This will require upgrades for mitaka deployments to switch to wsgi. Closes-Bug: 1631297 Change-Id: If9d6987cd0a8fc5d3f9de518ba422d97d5149732
2016-10-05 | Adds Environment File for Removing Sahara during M/N upgrade | marios | 3 | -4/+18
The default path if the operator does nothing is to keep the sahara services on mitaka to newton upgrades. If the operator wishes to remove sahara services then they need to specify the provided major-upgrade-remove-sahara.yaml environment file in the stack upgrade commands. The existing migration to ha arch already removes the constraints and pcs resource for sahara api/engine so we just need to stop it from starting again if we want to remove it. This adds a KeepSaharaServiceOnUpgrade parameter to determine if Sahara is disabled from starting up after the controllers are upgraded (defaults true). Finally it is worth noting that we default the sahara services as 'on' during converge here in the resource_registry of the converge environment file; any subsequent stack updates where the deployment contains sahara services will need to include the -e /environments/services/sahara.yaml environment file. Related-Bug: 1630247 Change-Id: I59536cae3260e3df52589289b4f63e9ea0129407
2016-10-04 | Merge "Set ceph osd max object name and namespace len on upgrade when on ext4" | Jenkins | 1 | -0/+10
2016-10-03 | Merge "Update $service to $resource this variable does not exist in the context" | Jenkins | 1 | -1/+1
2016-10-03 | Update $service to $resource this variable does not exist in the context | Mathieu Bultel | 1 | -1/+1
Heat failed with the error "service: unbound variable". In this context $service is never set. Change-Id: If82ee4562612f2617b676732956396278ee40a88 Closes-Bug: #1629903
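"unbound variable" is the error a shell reports when a script running under `set -u` expands a variable that was never assigned. That the failing script uses `set -u` is an inference from the error text, not something the message states. A small sketch, with each case run in a child shell so the failure does not stop the demonstration:

```shell
#!/bin/sh
# Expanding $service, which is never assigned: fatal under `set -u`.
broken_rc=0
sh -c 'set -u; unset service; resource=rabbitmq; echo "checking $service"' 2>/dev/null || broken_rc=$?

# Expanding $resource, which is assigned: works as expected.
fixed_rc=0
fixed_out=$(sh -c 'set -u; resource=rabbitmq; echo "checking $resource"') || fixed_rc=$?

echo "broken exit code: $broken_rc"
echo "fixed exit code: $fixed_rc, output: $fixed_out"
```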
2016-10-03 | Change the rabbitmq ha policies during an M/N Upgrade | Michele Baldessari | 2 | -1/+24
This takes care of the M->N upgrade path when changing the ha rabbitmq policy. Partial-Bug: #1628998 Change-Id: I2468a096b5d7042bc801a742a7a85fb1521c1c02
2016-09-29 | Merge "Use -L with chown and set crush map tunables when upgrading Ceph" | Jenkins | 2 | -4/+8
2016-09-29 | Merge "Fix typo in fixing gnocchi upgrade." | Jenkins | 1 | -1/+1
2016-09-29 | Set ceph osd max object name and namespace len on upgrade when on ext4 | Giulio Fidente | 1 | -0/+10
As per [1] we need to lower osd max object name and namespace len when upgrading from Hammer and the OSD is backed by ext4. These could also be given via ExtraConfig but on upgrade we only run puppet apply after this script is executed, so the values won't be effective unless the daemon is restarted. Yet we do not want puppet to restart the daemon because we can't bring all OSDs down unconditionally or guests will die. 1. http://tracker.ceph.com/issues/16187 Co-Authored-By: Michele Baldessari <michele@acksyn.org> Co-Authored-By: Dimitri Savineau <dsavinea@redhat.com> Change-Id: I7fec4e2426bdacd5f364adbebd42ab23dcfa523a Closes-Bug: 1628874
2016-09-29 | Merge "Relax pre-upgrade check for failed actions" | Jenkins | 2 | -3/+5
2016-09-29 | Merge "Fix races in major-upgrade-pacemaker Step2" | Jenkins | 3 | -17/+41
2016-09-29 | Fix typo in fixing gnocchi upgrade. | Sofer Athlan-Guyot | 1 | -1/+1
Change-Id: I44451a280dd928cd694dd6845d5d83040ad1f482 Related-Bug: #1626592
2016-09-29 | Merge "Full HA->HA NG migration might fail setting maintenance-mode" | Jenkins | 1 | -8/+4
2016-09-29 | Use -L with chown and set crush map tunables when upgrading Ceph | Giulio Fidente | 2 | -4/+8
Previously the chown command wasn't traversing symlinks, causing the new ownership to not be set for some needed files. This change also ensures the crush map tunables are set to the 'default' profile after the upgrade. Finally, it redirects the output of a pidof to /dev/null to avoid spurious logging. Change-Id: Id4865ffff207edfc727d729f9cc04e6e81ad19d8
2016-09-29 | Relax pre-upgrade check for failed actions | Michele Baldessari | 2 | -3/+5
Before this change we checked the cluster for any failed actions and we stopped the upgrade process if there were any. This is likely excessive, as a failed action could have happened in the past and the cluster is now fully functional. Better to check if any of the resources are in Stopped state and break the upgrade process if any of them are. We also need to restrict this check to the bootstrap node because otherwise the following might happen: 1) Bootstrap node does the check, it is successful and it starts the full HA -> HA NG migration which *will* create failed actions and will start stopping resources 2) If the check now starts on a non-bootstrap node while 1) is ongoing, it will find either failed actions or stopped resources so it will fail. Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f Closes-Bug: #1628653
2016-09-29 | Fix races in major-upgrade-pacemaker Step2 | Michele Baldessari | 3 | -17/+41
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh has the following code:

"""
...
check_resource mongod started 600

if [[ -n $(is_bootstrap_node) ]]; then
    ...
    tstart=$(date +%s)
    while ! clustercheck; do
        sleep 5
        tnow=$(date +%s)
        if (( tnow-tstart > galera_sync_timeout )) ; then
            echo_error "ERROR galera sync timed out"
            exit 1
        fi
    done

    # Run all the db syncs
    cinder-manage db sync
    ...
fi

start_or_enable_service rabbitmq
check_resource rabbitmq started 600
start_or_enable_service redis
check_resource redis started 600
start_or_enable_service openstack-cinder-volume
check_resource openstack-cinder-volume started 600

systemctl_swift start

for service in $(services_to_migrate); do
    manage_systemd_service start "${service%%-clone}"
    check_resource_systemd "${service%%-clone}" started 600
done
"""

The problem with the above code is that it is open to the following race condition:

1) Bootstrap node is busy checking the galera status via clustercheck
2) Non-bootstrap node has already reached: start_or_enable_service rabbitmq and later lines. These lines will be skipped because start_or_enable_service is a noop on non-bootstrap nodes and check_resource rabbitmq only checks that pcs status |grep rabbitmq returns true.
3) Non-bootstrap node can then reach the manage_systemd_service start and it will fail with stuff like: "Job for openstack-nova-scheduler.service failed because the control process exited with error code. See \"systemctl status openstack-nova-scheduler.service\" and \"journalctl -xe\" for details.\n" (because the db tables are not migrated yet)

This happens because 3) was started on non-bootstrap nodes before the db-sync statements are complete on the bootstrap node. I did not feel like changing the semantics of check_resource and removing the noop on non-bootstrap nodes, as other parts of the tree might rely on this behaviour.
Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9 Closes-Bug: #1627965
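The galera wait loop quoted in this message follows a common poll-with-timeout pattern. Below is a generic sketch of that pattern only, not of the fix itself; `wait_until` is a hypothetical name and the one-second poll interval is chosen for illustration:

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS CMD [ARGS...]
# Re-run CMD until it succeeds; give up once more than TIMEOUT_SECONDS
# have elapsed since the first attempt.
wait_until() {
    timeout_s=$1
    shift
    tstart=$(date +%s)
    while ! "$@"; do
        sleep 1
        tnow=$(date +%s)
        if [ $((tnow - tstart)) -gt "$timeout_s" ]; then
            echo "ERROR: timed out waiting for: $*" >&2
            return 1
        fi
    done
}

# Succeeds immediately because `true` always returns 0.
wait_until 5 true && echo "condition met"
```

In a script like the one quoted above, the command would be something such as `clustercheck`, and a non-zero return would lead to `exit 1`.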
2016-09-28 | Update gnocchi database during M/N upgrade. | Sofer Athlan-Guyot | 1 | -2/+3
We call gnocchi-upgrade to make sure we update all the needed schemas during the major-upgrade-pacemaker step. We also make sure that redis is started before we call gnocchi-upgrade otherwise the command will be stuck in a loop trying to contact redis. Closes-Bug: #1626592 Change-Id: Ia016264b51f485b97fa150ebd357b109581342ed
2016-09-28 | Full HA->HA NG migration might fail setting maintenance-mode | Michele Baldessari | 1 | -8/+4
Currently we do the following in the migration path:

    pcs property set maintenance-mode=true
    if ! timeout -k 10 300 crm_resource --wait; then
        echo_error "ERROR: cluster remained unstable after setting maintenance-mode for more than 300 seconds, exiting."
        exit 1
    fi

crm_resource --wait can actually take forever under certain conditions. The property will be set atomically across the cluster nodes so we should be good without this.

Change-Id: I8f531d63479b81d65b572c4431c9db6f610f7e04 Closes-Bug: #1628393
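For readers unfamiliar with the guard being removed: with GNU coreutils, `timeout -k 10 300 CMD` sends TERM to CMD after 300 seconds and KILL 10 seconds later if it is still running, and `timeout` itself exits with 124 when the limit was hit. The sketch below shows that behaviour with `sleep` standing in for the long-running command and much shorter durations; it assumes GNU `timeout`:

```shell
#!/bin/sh
# Give `sleep 5` one second, then TERM it; KILL would follow one second
# later if it were still alive.
rc=0
timeout -k 1 1 sleep 5 || rc=$?

# GNU timeout reports 124 when the command timed out.
if [ "$rc" -eq 124 ]; then
    echo "command timed out"
else
    echo "command finished with status $rc"
fi
```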
2016-09-28 | Fix "Not all flavors have been migrated to the API database" | Michele Baldessari | 1 | -0/+1
After a successful upgrade to Newton, I ran the tripleo.sh --overcloud-pingtest and it failed with the following:

    resources.test_flavor: Not all flavors have been migrated to the API database (HTTP 409)

The issue is the fact that some tables have migrated to the nova_api db and we need to migrate the data as well. Currently we do:

    nova-manage db sync
    nova-manage api_db sync

We want to add:

    nova-manage db online_data_migrations

After launching this command the overcloud-pingtest works correctly:

    tripleo.sh -- Overcloud pingtest SUCCEEDED

Change-Id: Id2d5b28b5d4ade7dff6c5e760be0f509b4fe5096 Closes-Bug: #1628450
2016-09-27 | Merge "Remove deprecated scheduler_driver settings" | Jenkins | 1 | -0/+2