apex-tripleo-heat-templates - Unnamed repository

Age	Commit message	Author	Files	Lines
2016-11-10	Merge "Fix race during major-upgrade-pacemaker step"	Jenkins	8	-263/+315

2016-11-09	Merge "Reload haproxy configuration as a post-deployment step"	Jenkins	1	-3/+12

2016-11-09	Fix race during major-upgrade-pacemaker step	Michele Baldessari	8	-263/+315
	Currently when we call the major-upgrade step we do the following:
	"""
	...
	if [[ -n $(is_bootstrap_node) ]]; then
	    check_clean_cluster
	fi
	...
	if [[ -n $(is_bootstrap_node) ]]; then
	    migrate_full_to_ng_ha
	fi
	...
	for service in $(services_to_migrate); do
	    manage_systemd_service stop "${service%%-clone}"
	    ...
	done
	"""
	The problem with the above code is that it is open to the following race condition:
	1. The code runs first on a non-bootstrap controller node, so we start stopping a bunch of services
	2. Pacemaker will notice that the services are down and will mark them as stopped
	3. The code runs on the bootstrap node (controller-0) and the check_clean_cluster function fails and exits
	4. Eventually the script on the non-bootstrap controller node will also time out and exit because the cluster never shut down (it never actually started the shutdown because we failed at 3)
	Let's make sure we first only call the HA NG migration step as a separate heat step. Only afterwards do we start shutting down the systemd services on all nodes.
	We also need to move the STONITH_STATE variable into a file because it is being used across two different scripts (1 and 2) and we need to store that state.
	Co-Authored-By: Athlan-Guyot Sofer <sathlang@redhat.com>
	Closes-Bug: #1640407
	Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
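The need to keep STONITH_STATE in a file follows from the two scripts running as separate processes: a shell variable set in the first is gone by the time the second runs. A minimal sketch of that hand-off, assuming a hypothetical state file location and hypothetical helper names (`save_stonith_state`, `load_stonith_state`) that are not taken from the actual change:

```bash
#!/usr/bin/env bash
# Hypothetical location; the real change may store the state elsewhere.
STONITH_STATE_FILE="${STONITH_STATE_FILE:-$(mktemp)}"

# Script 1 would call this to record the stonith setting before changing it.
save_stonith_state() {
    printf '%s\n' "$1" > "$STONITH_STATE_FILE"
}

# Script 2 would call this to read the value back in a separate process.
load_stonith_state() {
    cat "$STONITH_STATE_FILE"
}

save_stonith_state "true"
echo "restored stonith state: $(load_stonith_state)"
```

Writing the value to disk keeps it available across heat steps, at the cost of having to choose a path that both scripts agree on.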
2016-11-08	ceilometer compute agent needs restart on compute upgrade	Pradeep Kilambi	1	-0/+4
	After compute nodes are upgraded, the ceilometer compute agent doesn't poll and throws warnings. Restarting the compute agent at this step gets the service back to its normal state. Closes-Bug: #1640177 Change-Id: I7392de43e933b1d16002e12e407748ae289d5e99
2016-11-08	Reload haproxy configuration as a post-deployment step	Carlos Camacho	1	-3/+12
	After deploying a freshly installed Overcloud or updating the stack, the haproxy configuration is updated correctly but no change shows up in the haproxy stats. This submission adds the missing resources to run pre and post puppet tasks. Closes-bug: 1640175 Change-Id: I2f08704daeee502c618256695a30ce244a1d7ba5
2016-11-04	Merge "Update openstack-puppet-modules dependencies"	Jenkins	1	-1/+2

2016-11-04	Merge "Fixup the start of swift services"	Jenkins	1	-1/+1

2016-11-03	Merge "Rework gnocchi-upgrade to run in a separate upgrade step"	Jenkins	5	-18/+68

2016-11-03	Fixup the start of swift services	marios	1	-1/+1
	It seems the conditional has changed and we should pick up the tripleo::profile::base::swift::storage::enable_swift_storage hiera data. After the controller nodes were upgraded, the swift services were down even though there was no stand-alone swift node (the current conditional was failing as that hiera isn't set any more). Closes-Bug: 1638821 Change-Id: Id1383c1e54f9cae13fd375e90da525230e5d23eb
2016-11-01	Update openstack-puppet-modules dependencies	Lukas Bezdicka	1	-1/+2
	The OPM package is a metadata package with unversioned requirements, which means that an update does not update the dependencies. This leaves us with old puppet modules and old puppet during the puppet run. Change-Id: I80f8a73142a09bb4178bb5a396d256ba81ba98a8 Closes-Bug: #1638266 Resolves: rhbz#1390559
2016-11-01	Rework gnocchi-upgrade to run in a separate upgrade step	Pradeep Kilambi	5	-18/+68
	When configured with swift, gnocchi requires keystone to be available to authenticate in order to migrate to v3. At this step keystone is not available and the gnocchi upgrade fails with an auth error. Instead, start apache first in step 3 and then run the gnocchi upgrade in a separate step, letting the upgrade happen there. Closes-Bug: #1634897 Change-Id: I22d02528420e4456f84b80905a7b3a80653fa7b0
2016-10-27	Add replacepkgs to the manual ovs upgrade workaround and fix a typo	Mathieu Bultel	6	-16/+13
	The rpm command returns exit code 1 if the ovs package is already installed, which makes the step_1.sh script exit. To get around this, force the update with --replacepkgs. Also remove the \ just before the $, which caused a syntax error for the ceph storage. Change-Id: I11fcf688982ceda5eef7afc8904afae44300c2d9 Closes-bug: 1636748
2016-10-25	Merge "Fix the stonith property during upgrades"	Jenkins	1	-4/+8

2016-10-22	Fix the rabbitmq/redis pacemaker resource timeouts on updates	Michele Baldessari	1	-0/+19
	With the following two changes we increased the timeout for redis and rabbit for both starting and stopping to 200s: https://review.openstack.org/386618 newton (merged) https://review.openstack.org/385555 master (merged) We also want to fix that on minor updates on all our supported releases upstream and downstream (newton, mitaka, liberty, kilo). This way we can guarantee that we have a uniform timeout for start and stop for rabbit and redis across all our releases. Change-Id: If59bf3386832ee78d3a654f01077aff2e8be76e8 Closes-Bug: #1634851
2016-10-20	Fix the stonith property during upgrades	Michele Baldessari	1	-4/+8
	We currently set the stonith property from all controller nodes during upgrade. This is racy and can actually end up disabling stonith after the upgrade even if it was enabled. Let's set the property only from the bootstrap node. Change-Id: Id4afb867b485ac853be874a0179a7ed7cc914068 Closes-Bug: #1635294
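Restricting an action to the bootstrap node can be sketched with the same `[[ -n $(is_bootstrap_node) ]]` test that other messages in this log quote. The `is_bootstrap_node` below is a hypothetical stub driven by two illustrative variables (`NODE_NAME`, `BOOTSTRAP_NODE`); the real helper lives in the upgrade scripts and may decide differently. `on_bootstrap_node` is also an assumed name:

```bash
#!/usr/bin/env bash
# Hypothetical stub: prints something only on the bootstrap node, which is
# all the [[ -n $(...) ]] test relies on.
is_bootstrap_node() {
    if [[ "${NODE_NAME:-}" == "${BOOTSTRAP_NODE:-controller-0}" ]]; then
        echo "${NODE_NAME}"
    fi
}

# Run the given command only on the bootstrap node; do nothing elsewhere.
on_bootstrap_node() {
    if [[ -n $(is_bootstrap_node) ]]; then
        "$@"
    fi
}

NODE_NAME=controller-1
on_bootstrap_node echo "would set the stonith property here"   # prints nothing
NODE_NAME=controller-0
on_bootstrap_node echo "would set the stonith property here"   # prints the message
```

With a single writer there is no longer a window in which one node's stale value overwrites another's.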
2016-10-20	Add special case handling for OVS upgrade in updates and upgrades	marios	6	-0/+86
	This adds special case handling for the openvswitch package as discussed at the related bug below. This is added/handled here for both the minor update and the major mitaka...newton upgrade. Change-Id: I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae Closes-Bug: 1635205
2016-10-10	Actually start the systemd services in step3 of the major-upgrade step	Michele Baldessari	1	-1/+1
	We have the following function in the upgrade process after we updated the packages and called the db-sync commands:
	services=$(services_to_migrate)
	...
	for service in $(services); do
	    manage_systemd_service start "${service%%-clone}"
	    check_resource_systemd "${service%%-clone}" started 600
	done
	The above is broken because $services contains a list of services to start, so $(services) will return gibberish and the for loop will never execute anything.
	One of the symptoms for this is the openstack-nova-compute service not restarting on the compute nodes during the yum -y upgrade. The reason for this is that during the service restart, nova-compute waits for nova-conductor to show up in the rabbitmq queues, which cannot happen since the service was actually never started.
	Change-Id: I811ff19d7b44a935b2ec5c5e66e5b5191b259eb3
	Closes-Bug: #1630580
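The difference between `$services` (expand a variable) and `$(services)` (run a command named `services`) can be reproduced in isolation. This is a sketch of one plausible shape of the corrected loop, not the actual patch; `services_to_migrate` is stubbed with made-up service names and the start call is replaced by collecting names in an array:

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the real helper; the names are illustrative.
services_to_migrate() {
    echo "openstack-nova-scheduler-clone openstack-nova-conductor-clone"
}

services=$(services_to_migrate)

started=()
# Expand the variable, deliberately unquoted so the list is word-split.
# Writing $(services) here would try to run a command called "services".
for service in $services; do
    # ${service%%-clone} strips the "-clone" suffix from the resource name.
    started+=("${service%%-clone}")
done

printf '%s\n' "${started[@]}"
```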
2016-10-07	Ceilometer Wsgi Mitaka->Newton upgrades	Pradeep Kilambi	3	-16/+166
	In Newton, ceilometer api is changed to run under apache wsgi instead of eventlet. This will require upgrades for mitaka deployments to switch to wsgi. Closes-Bug: 1631297 Change-Id: If9d6987cd0a8fc5d3f9de518ba422d97d5149732
2016-10-05	Adds Environment File for Removing Sahara during M/N upgrade	marios	3	-4/+18
	The default path if the operator does nothing is to keep the sahara services on mitaka to newton upgrades. If the operator wishes to remove sahara services then they need to specify the provided major-upgrade-remove-sahara.yaml environment file in the stack upgrade commands. The existing migration to ha arch already removes the constraints and pcs resource for sahara api/engine so we just need to stop it from starting again if we want to remove it. This adds a KeepSaharaServiceOnUpgrade parameter to determine if Sahara is disabled from starting up after the controllers are upgraded (defaults true). Finally it is worth noting that we default the sahara services as 'on' during converge here in the resource_registry of the converge environment file; any subsequent stack updates where the deployment contains sahara services will need to include the -e /environments/services/sahara.yaml environment file. Related-Bug: 1630247 Change-Id: I59536cae3260e3df52589289b4f63e9ea0129407
2016-10-04	Merge "Set ceph osd max object name and namespace len on upgrade when on ext4"	Jenkins	1	-0/+10

2016-10-03	Merge "Update $service to $resource this variable does not exist in the context"	Jenkins	1	-1/+1

2016-10-03	Update $service to $resource this variable does not exist in the context	Mathieu Bultel	1	-1/+1
	Heat failed with "service: unbound variable". In this context $service is never set. Change-Id: If82ee4562612f2617b676732956396278ee40a88 Closes-Bug: #1629903
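An "unbound variable" error is what bash reports when an unset name is expanded while the nounset option (`set -u`) is active. A small sketch, assuming the failing script runs with that option; the value given to `resource` is purely illustrative:

```bash
#!/usr/bin/env bash
unset service            # make sure the name really is unset
resource="rabbitmq"      # illustrative value only

# With nounset enabled, expanding an unset variable aborts the subshell.
if ( set -u; : "$service" ) 2>/dev/null; then
    service_ok=yes
else
    service_ok=no
fi

# A variable that is actually set expands without error.
if ( set -u; : "$resource" ) 2>/dev/null; then
    resource_ok=yes
else
    resource_ok=no
fi

echo "service: $service_ok, resource: $resource_ok"   # prints: service: no, resource: yes
```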
2016-10-03	Change the rabbitmq ha policies during an M/N Upgrade	Michele Baldessari	2	-1/+24
	This takes care of the M->N upgrade path when changing the ha rabbitmq policy. Partial-Bug: #1628998 Change-Id: I2468a096b5d7042bc801a742a7a85fb1521c1c02
2016-09-29	Merge "Use -L with chown and set crush map tunables when upgrading Ceph"	Jenkins	2	-4/+8

2016-09-29	Merge "Fix typo in fixing gnocchi upgrade."	Jenkins	1	-1/+1

2016-09-29	Set ceph osd max object name and namespace len on upgrade when on ext4	Giulio Fidente	1	-0/+10
	As per [1] we need to lower osd max object name and namespace len when upgrading from Hammer and the OSD is backed by ext4. These could also be given via ExtraConfig but on upgrade we only run puppet apply after this script is executed, so the values won't be effective unless the daemon is restarted. Yet we do not want puppet to restart the daemon because we can't bring all OSDs down unconditionally or guests will die. 1. http://tracker.ceph.com/issues/16187 Co-Authored-By: Michele Baldessari <michele@acksyn.org> Co-Authored-By: Dimitri Savineau <dsavinea@redhat.com> Change-Id: I7fec4e2426bdacd5f364adbebd42ab23dcfa523a Closes-Bug: 1628874
2016-09-29	Merge "Relax pre-upgrade check for failed actions"	Jenkins	2	-3/+5

2016-09-29	Merge "Fix races in major-upgrade-pacemaker Step2"	Jenkins	3	-17/+41

2016-09-29	Fix typo in fixing gnocchi upgrade.	Sofer Athlan-Guyot	1	-1/+1
	Change-Id: I44451a280dd928cd694dd6845d5d83040ad1f482 Related-Bug: #1626592
2016-09-29	Merge "Full HA->HA NG migration might fail setting maintenance-mode"	Jenkins	1	-8/+4

2016-09-29	Use -L with chown and set crush map tunables when upgrading Ceph	Giulio Fidente	2	-4/+8
	Previously the chown command wasn't traversing symlinks, causing the new ownership to not be set for some needed files. This change also ensures the crush map tunables are set to the 'default' profile after the upgrade. Finally redirects the output of a pidof to /dev/null to avoid spurious logging. Change-Id: Id4865ffff207edfc727d729f9cc04e6e81ad19d8
2016-09-29	Relax pre-upgrade check for failed actions	Michele Baldessari	2	-3/+5
	Before this change we checked the cluster for any failed actions and we stopped the upgrade process if there were any. This is likely excessive as a failed action could have happened in the past and the cluster is now fully functional. Better to check if any of the resources are in Stopped state and break the upgrade process if any of them are. We also need to restrict this check to the bootstrap node because otherwise the following might happen: 1) Bootstrap node does the check, it is successful and it starts the full HA -> HA NG migration which will create failed actions and will start stopping resources 2) If the check now starts on a non-bootstrap node while 1) is ongoing, it will find either failed actions or stopped resources so it will fail. Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f Closes-Bug: #1628653
2016-09-29	Fix races in major-upgrade-pacemaker Step2	Michele Baldessari	3	-17/+41
	tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh has the following code:
	"""
	...
	check_resource mongod started 600
	if [[ -n $(is_bootstrap_node) ]]; then
	    ...
	    tstart=$(date +%s)
	    while ! clustercheck; do
	        sleep 5
	        tnow=$(date +%s)
	        if (( tnow-tstart > galera_sync_timeout )) ; then
	            echo_error "ERROR galera sync timed out"
	            exit 1
	        fi
	    done
	    # Run all the db syncs
	    cinder-manage db sync
	    ...
	fi
	start_or_enable_service rabbitmq
	check_resource rabbitmq started 600
	start_or_enable_service redis
	check_resource redis started 600
	start_or_enable_service openstack-cinder-volume
	check_resource openstack-cinder-volume started 600
	systemctl_swift start
	for service in $(services_to_migrate); do
	    manage_systemd_service start "${service%%-clone}"
	    check_resource_systemd "${service%%-clone}" started 600
	done
	"""
	The problem with the above code is that it is open to the following race condition:
	1) Bootstrap node is busy checking the galera status via clustercheck
	2) Non-bootstrap node has already reached: start_or_enable_service rabbitmq and later lines. These lines will be skipped because start_or_enable_service is a noop on non-bootstrap nodes and check_resource rabbitmq only checks that pcs status | grep rabbitmq returns true.
	3) Non-bootstrap node can then reach the manage_systemd_service start and it will fail with stuff like: "Job for openstack-nova-scheduler.service failed because the control process exited with error code. See \"systemctl status openstack-nova-scheduler.service\" and \"journalctl -xe\" for details.\n" (because the db tables are not migrated yet)
	This happens because 3) was started on non-bootstrap nodes before the db-sync statements are complete on the bootstrap node. I did not feel like changing the semantics of check_resource and removing the noop on non-bootstrap nodes as other parts of the tree might rely on this behaviour.
	Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
	Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
	Closes-Bug: #1627965
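The galera wait loop quoted in this message is an instance of a general "poll until a command succeeds or a deadline passes" pattern. A generic sketch of that pattern follows; `wait_until` is a hypothetical helper name, not something the upgrade scripts are known to define, and it does not by itself solve the cross-node ordering problem described above:

```bash
#!/usr/bin/env bash
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL seconds until it succeeds. Returns 1 once
# more than TIMEOUT seconds have elapsed without a success.
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    local tstart tnow
    tstart=$(date +%s)
    while ! "$@"; do
        sleep "$interval"
        tnow=$(date +%s)
        if (( tnow - tstart > timeout )); then
            return 1
        fi
    done
}

# "true" stands in for a readiness probe such as clustercheck.
if wait_until 5 1 true; then
    echo "ready"
fi
```

Returning a status instead of calling `exit` leaves the decision to abort with the caller.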
2016-09-28	Update gnocchi database during M/N upgrade.	Sofer Athlan-Guyot	1	-2/+3
	We call gnocchi-upgrade to make sure we update all the needed schemas during the major-upgrade-pacemaker step. We also make sure that redis is started before we call gnocchi-upgrade otherwise the command will be stuck in a loop trying to contact redis. Closes-Bug: #1626592 Change-Id: Ia016264b51f485b97fa150ebd357b109581342ed
2016-09-28	Full HA->HA NG migration might fail setting maintenance-mode	Michele Baldessari	1	-8/+4
	Currently we do the following in the migration path:
	pcs property set maintenance-mode=true
	if ! timeout -k 10 300 crm_resource --wait; then
	    echo_error "ERROR: cluster remained unstable after setting maintenance-mode for more than 300 seconds, exiting."
	    exit 1
	fi
	crm_resource --wait can actually take forever under certain conditions. The property will be set atomically across the cluster nodes so we should be good without this.
	Change-Id: I8f531d63479b81d65b572c4431c9db6f610f7e04
	Closes-Bug: #1628393
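For readers unfamiliar with the `timeout -k 10 300` form in the removed code: GNU `timeout` sends TERM once the duration expires, follows up with KILL after the `-k` delay if the command is still running, and exits with status 124 when the command timed out. A short sketch with small durations and `sleep` standing in for `crm_resource --wait`:

```bash
#!/usr/bin/env bash
# 1 s limit and 1 s kill delay stand in for the 300 s / 10 s values above.
if timeout -k 1 1 sleep 5; then
    rc=0
else
    rc=$?
fi

# 124 is the status GNU timeout reports when the command timed out.
echo "exit status: $rc"
```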
2016-09-28	Fix "Not all flavors have been migrated to the API database"	Michele Baldessari	1	-0/+1
	After a successful upgrade to Newton, I ran the tripleo.sh --overcloud-pingtest and it failed with the following: resources.test_flavor: Not all flavors have been migrated to the API database (HTTP 409) The issue is the fact that some tables have migrated to the nova_api db and we need to migrate the data as well. Currently we do: nova-manage db sync nova-manage api_db sync We want to add: nova-manage db online_data_migrations After launching this command the overcloud-pingtest works correctly: tripleo.sh -- Overcloud pingtest SUCCEEDED Change-Id: Id2d5b28b5d4ade7dff6c5e760be0f509b4fe5096 Closes-Bug: #1628450
2016-09-27	Merge "Remove deprecated scheduler_driver settings"	Jenkins	1	-0/+2

2016-09-27	Merge "Disable openstack-cinder-volume in step1 and reenable it in step2"	Jenkins	2	-0/+5

2016-09-27	Merge "Fix ignore warning on ceph major upgrade."	Jenkins	1	-1/+1

2016-09-27	Merge "A few major-upgrade issues"	Jenkins	3	-25/+46

2016-09-27	Merge "Start mongod before calling ceilometer-dbsync"	Jenkins	1	-0/+7

2016-09-27	Merge "Reinstantiate parts of code that were accidentally removed"	Jenkins	2	-0/+9

2016-09-26	Fix ignore warning on ceph major upgrade.	Sofer Athlan-Guyot	1	-1/+1
	The parameter IgnoreCephUpgradeWarnings is type cast into a boolean which is rendered as the string 'True' or 'False', not 'true' or 'false'. This fixes the check. Change-Id: I8840c384d07f9d185a72bde5f91a3872a321f623 Closes-Bug: 1627736
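A string comparison against 'true' silently fails when the value arrives as 'True'. One way to make such a check robust is to lower-case the value before comparing. This is only a sketch: `is_true` and the variable name are hypothetical, the actual fix may simply compare against 'True', and the `${1,,}` expansion needs bash 4 or later:

```bash
#!/usr/bin/env bash
# Succeeds for 'true' in any letter case, fails for everything else.
is_true() {
    [[ "${1,,}" == "true" ]]
}

ignore_ceph_upgrade_warnings="True"   # the form a rendered boolean takes

if is_true "$ignore_ceph_upgrade_warnings"; then
    echo "ceph upgrade warnings will be ignored"
fi
```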
2016-09-25	get_param calls with multiple arguments need brackets around them	Michele Baldessari	6	-11/+11
	This issue was spotted during major upgrade where we had calls like this: servers: {get_param: servers, Controller} These get_param calls are hanging indefinitely and make the whole upgrade end in a timeout. We need to put brackets around the get_param function when there are multiple arguments: http://docs.openstack.org/developer/heat/template_guide/hot_spec.html#get-param This is already done in most of the tree, and the few places where this was not happening were parts not under CI. After this change the following grep returns only one false positive: grep -ir get_param: | grep -v -- '\[' | grep ',' Change-Id: I65b23bb44f37b93e017dd15a5212939ffac76614 Closes-Bug: #1626628
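The grep pipeline quoted in this message can be tried on a two-line sample that contains one unbracketed and one bracketed multi-argument call. The sample file is made up for illustration, and the recursive `-r` flag is dropped because a single file is searched:

```bash
#!/usr/bin/env bash
sample=$(mktemp)
cat > "$sample" <<'EOF'
servers: {get_param: servers, Controller}
servers: {get_param: [servers, Controller]}
EOF

# Keep get_param lines, drop those with a bracket, keep those with a comma:
# what remains are multi-argument calls that lack the brackets.
unbracketed=$(grep -i 'get_param:' "$sample" | grep -v -- '\[' | grep ',')
rm -f "$sample"

echo "$unbracketed"   # prints: servers: {get_param: servers, Controller}
```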
2016-09-25	A few major-upgrade issues	Michele Baldessari	3	-25/+46
	This commit does the following: 1. We now explicitly disable/stop and then remove the resources that are moving to systemd. We do this because we want to make sure they are all stopped before doing a yum upgrade, which otherwise would take ages due to rabbitmq and galera being down. It is best if we do this via pcs while we do the HA Full -> HA NG migration because it is simpler to make sure all the services are stopped at that stage. For extra safety we can still do a check by hand. By doing it via pacemaker we have the guarantee that all the migrated services are down already when we stop the cluster (which happens to be a synchronization point between all controller nodes). That way we can be certain that they are all down on all nodes before starting the yum upgrade process. 2. We actually need to start the systemd services in major_upgrade_controller_pacemaker_2.sh and not stop them. 3. We need to use the proper bash variable name 4. Use is_bootstrap_node everywhere to make the code more consistent Change-Id: Ic565c781b80357bed9483df45a4a94ec0423487c Closes-Bug: #1627490
2016-09-25	Disable openstack-cinder-volume in step1 and reenable it in step2	Michele Baldessari	2	-0/+5
	Currently we do not disable openstack-cinder-volume during our major-upgrade-pacemaker step. This leads to the following scenario. In major_upgrade_controller_pacemaker_2.sh we do: start_or_enable_service galera check_resource galera started 600 .... if [[ -n $(is_bootstrap_node) ]]; then ... cinder-manage db sync ... What happens here is that since openstack-cinder-volume was never disabled it will already be started by pacemaker before we call cinder-manage and this will give us the following errors during the start: 06:05:21.861 19482 ERROR cinder.cmd.volume DBError: (pymysql.err.InternalError) (1054, u"Unknown column 'services.cluster_name' in 'field list'") Change-Id: I01b2daf956c30b9a4985ea62cbf4c941ec66dcdf Closes-Bug: #1627470
2016-09-25	Start mongod before calling ceilometer-dbsync	Michele Baldessari	1	-0/+7
	Currently in major_upgrade_controller_pacemaker_2.sh we are calling ceilometer-dbsync before mongod is actually started (only galera is started at this point). This will make the dbsync hang indefinitely until the heat stack times out. Now this approach should be okay, but do note that when we start mongod via systemctl we are not guaranteed that it will be up on all nodes before we call ceilometer-dbsync. This should be okay because ceilometer-dbsync keeps retrying and eventually one of the nodes will be available. A completely clean fix here would be to add another step in heat to have the guarantee that all mongo servers are up and running before the dbsync call. Change-Id: I10c960b1e0efdeb1e55d77c25aebf1e3e67f17ca Closes-Bug: #1627453
2016-09-25	Remove deprecated scheduler_driver settings	Michele Baldessari	1	-0/+2
	In bug https://bugs.launchpad.net/tripleo/+bug/1615035 we fixed the scheduler_host setting which got deprecated in newton. It seems also the scheduler_driver settings needs tweaking: systemctl status openstack-nova-scheduler.service: 2016-09-24 20:24:54.337 15278 WARNING stevedore.named [-] Could not load nova.scheduler.filter_scheduler.FilterScheduler 2016-09-24 20:24:54.338 15278 CRITICAL nova [-] RuntimeError: (u'Cannot load scheduler driver from configuration %(conf)s.', {'conf': 'nova.scheduler.filter_scheduler.FilterScheduler'}) Let's set this to default during the upgrade step. From newton's nova.conf: The class of the driver used by the scheduler. This should be chosen from one of the entrypoints under the namespace 'nova.scheduler.driver' of file 'setup.cfg'. If nothing is specified in this option, the 'filter_scheduler' is used. This option also supports deprecated full Python path to the class to be used. For example, "nova.scheduler.filter_scheduler.FilterScheduler". But note: this support will be dropped in the N Release. Change-Id: Ic384292ad05a57757158995ec4c1a269fe4b00f1 Depends-On: I89124ead8928ff33e6b6907a7c2178169e91f4e6 Closes-Bug: #1627450
2016-09-25	Reinstantiate parts of code that were accidentally removed	Michele Baldessari	2	-0/+9
	With commit fb25385d34e604d2f670cebe3e03fd57c14fa6be "Rework the pacemaker_common_functions for M..N upgrades" we accidentally removed some lines that fixed M/N upgrade issues. Namely:
	extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh
	-# https://bugzilla.redhat.com/show_bug.cgi?id=1284047
	-# Change-Id: Ib3f6c12ff5471e1f017f28b16b1e6496a4a4b435
	-crudini --set /etc/ceilometer/ceilometer.conf DEFAULT rpc_backend rabbit
	-# https://bugzilla.redhat.com/show_bug.cgi?id=1284058
	-# Ifd1861e3df46fad0e44ff9b5cbd58711bbc87c97 Swift Ceilometer middleware no longer exists
	-crudini --set /etc/swift/proxy-server.conf pipeline:main pipeline "catch_errors healthcheck cache ratelimit tempurl formpost authtoken keystone staticweb proxy-logging proxy-server"
	-# LP: 1615035, required only for M/N upgrade.
	-crudini --set /etc/nova/nova.conf DEFAULT scheduler_host_manager host_manager
	extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
	 nova-manage db sync
	- nova-manage api_db sync
	This patch simply puts that code back without reverting the whole commit that broke things, because that is needed.
	Closes-Bug: #1627448
	Change-Id: I89124ead8928ff33e6b6907a7c2178169e91f4e6
2016-09-21	Make sure major upgrade script fails.	Sofer Athlan-Guyot	2	-0/+3
	Running upgrade-non-controller.sh against compute and object storage did not fail if /root/tripleo_upgrade_node.sh failed. This makes it harder to detect errors in a CI system, for instance. Change-Id: I12b7d640547d3b8ec1f70104d159d6052b7638ff Closes-Bug: 1620973
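Making a wrapper fail when the script it runs fails comes down to capturing and returning the inner exit status. A minimal sketch of that idea follows; `run_node_upgrade` is a hypothetical name, the actual three-line change is not shown in this log, and `true`/`false` stand in for the node upgrade script:

```bash
#!/usr/bin/env bash
# Run a command and hand its exit status back to the caller, logging
# failures on stderr so they are visible in CI output.
run_node_upgrade() {
    local rc=0
    "$@" || rc=$?
    if (( rc != 0 )); then
        echo "ERROR: '$*' failed with exit status $rc" >&2
    fi
    return "$rc"
}

run_node_upgrade true  && echo "upgrade step succeeded"
run_node_upgrade false || echo "upgrade step failed, caller can abort"
```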