aboutsummaryrefslogtreecommitdiffstats
path: root/extraconfig/tasks
AgeCommit message (Collapse)AuthorFilesLines
2016-09-29Merge "Fix typo in fixing gnocchi upgrade."Jenkins1-1/+1
2016-09-29Set ceph osd max object name and namespace len on upgrade when on ext4Giulio Fidente1-0/+10
As per [1] we need to lower osd max object name and namespace len when upgrading from Hammer and the OSD is backed by ext4. These could also be given via ExtraConfig but on upgrade we only run puppet apply after this script is executed, so the values won't be effective unless the daemon is restarted. Yet we do not want puppet to restart the daemon because we can't bring all OSDs down unconditionally or guests will die. 1. http://tracker.ceph.com/issues/16187 Co-Authored-By: Michele Baldessari <michele@acksyn.org> Co-Authored-By: Dimitri Savineau <dsavinea@redhat.com> Change-Id: I7fec4e2426bdacd5f364adbebd42ab23dcfa523a Closes-Bug: 1628874
2016-09-29Merge "Relax pre-upgrade check for failed actions"Jenkins2-3/+5
2016-09-29Merge "Fix races in major-upgrade-pacemaker Step2"Jenkins3-17/+41
2016-09-29Fix typo in fixing gnocchi upgrade.Sofer Athlan-Guyot1-1/+1
Change-Id: I44451a280dd928cd694dd6845d5d83040ad1f482 Related-Bug: #1626592
2016-09-29Merge "Full HA->HA NG migration might fail setting maintenance-mode"Jenkins1-8/+4
2016-09-29Use -L with chown and set crush map tunables when upgrading CephGiulio Fidente2-4/+8
Previously the chown command wasn't traversing symlinks, causing the new ownership to not be set for some needed files. This change also ensures the crush map tunables are set to the 'default' profile after the upgrade. Finally redirects the output of a pidof to /dev/null to avoid spurious logging. Change-Id: Id4865ffff207edfc727d729f9cc04e6e81ad19d8
2016-09-29Relax pre-upgrade check for failed actionsMichele Baldessari2-3/+5
Before this change we checked the cluster for any failed actions and we stopped the upgrade process if there were any. This is likely eccessive as a failed action could have happened in the past and the cluster is now fully functional. Better to check if any of the resources are in Stopped state and break the upgrade process if any of them are. We also need to restrict this check to the bootstrap node because otherwise the following might happen: 1) Bootstrap node does the check, it is successful and it starts the full HA -> HA NG migration which *will* create failed actions and will start stopping resources 2) If the check now starts on a non-bootstrap node while 1) is ongoing, it will find either failed actions or stopped resources so it will fail. Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f Closes-Bug: #1628653
2016-09-29Fix races in major-upgrade-pacemaker Step2Michele Baldessari3-17/+41
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh has the following code: ... check_resource mongod started 600 if [[ -n $(is_bootstrap_node) ]]; then ... tstart=$(date +%s) while ! clustercheck; do sleep 5 tnow=$(date +%s) if (( tnow-tstart > galera_sync_timeout )) ; then echo_error "ERROR galera sync timed out" exit 1 fi done # Run all the db syncs cinder-manage db sync ... fi start_or_enable_service rabbitmq check_resource rabbitmq started 600 start_or_enable_service redis check_resource redis started 600 start_or_enable_service openstack-cinder-volume check_resource openstack-cinder-volume started 600 systemctl_swift start for service in $(services_to_migrate); do manage_systemd_service start "${service%%-clone}" check_resource_systemd "${service%%-clone}" started 600 done """ The problem with the above code is that it is open to the following race condition: 1) Bootstrap node is busy checking the galera status via cluster check 2) Non-bootstrap node has already reached: start_or_enable_service rabbitmq and later lines. These lines will be skipped because start_or_enable_service is a noop on non-bootstrap nodes and check_resource rabbitmq only checks that pcs status |grep rabbitmq returns true. 3) Non-bootstrap node can then reach the manage_systemd_service start and it will fail with stuff like: "Job for openstack-nova-scheduler.service failed because the control process exited with error code. See \"systemctl status openstack-nova-scheduler.service\" and \"journalctl -xe\" for details.\n" (because the db tables are not migrated yet) This happens because 3) was started on non-bootstrap nodes before the db-sync statements are complete on the bootstrap node. I did not feel like changing the semantics of check_resource and remove the noop on non-bootstrap nodes as other parts of the tree might rely on this behaviour. Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9 Closes-Bug: #1627965
2016-09-28Update gnocchi database during M/N upgrade.Sofer Athlan-Guyot1-2/+3
We call gnocchi-upgrade to make sure we update all the needed schemas during the major-upgrade-pacemaker step. We also make sure that redis is started before we call gnocchi-upgrade otherwise the command will be stuck in a loop trying to contact redis. Closes-Bug: #1626592 Change-Id: Ia016264b51f485b97fa150ebd357b109581342ed
2016-09-28Full HA->HA NG migration might fail setting maintenance-modeMichele Baldessari1-8/+4
Currently we do the following in the migration path: pcs property set maintenance-mode=true if ! timeout -k 10 300 crm_resource --wait; then echo_error "ERROR: cluster remained unstable after setting maintenance-mode for more than 300 seconds, exiting." exit 1 fi crm_resource --wait can actually take forever under certain conditions. The property will be set atomically across the cluster nodes so we should be good without this. Change-Id: I8f531d63479b81d65b572c4431c9db6f610f7e04 Closes-Bug: #1628393
2016-09-28Fix "Not all flavors have been migrated to the API database"Michele Baldessari1-0/+1
After a successful upgrade to Newton, I ran the tripleo.sh --overcloud-pingtest and it failed with the following: resources.test_flavor: Not all flavors have been migrated to the API database (HTTP 409) The issue is the fact that some tables have migrated to the nova_api db and we need to migrate the data as well. Currently we do: nova-manage db sync nova-manage api_db sync We want to add: nova-manage db online_data_migrations After launching this command the overcloud-pingtest works correctly: tripleo.sh -- Overcloud pingtest SUCCEEDED Change-Id: Id2d5b28b5d4ade7dff6c5e760be0f509b4fe5096 Closes-Bug: #1628450
2016-09-27Merge "Remove deprecated scheduler_driver settings"Jenkins1-0/+2
2016-09-27Merge "Disable openstack-cinder-volume in step1 and reenable it in step2"Jenkins2-0/+5
2016-09-27Merge "Fix ignore warning on ceph major upgrade."Jenkins1-1/+1
2016-09-27Merge "A few major-upgrade issues"Jenkins3-25/+46
2016-09-27Merge "Start mongod before calling ceilometer-dbsync"Jenkins1-0/+7
2016-09-27Merge "Reinstantiate parts of code that were accidentally removed"Jenkins2-0/+9
2016-09-26Fix ignore warning on ceph major upgrade.Sofer Athlan-Guyot1-1/+1
The paramater IgnoreCephUpgradeWarnings is type cast into a boolean which is rendered as 'True' or 'False' as a string not 'true' or 'false'. This fix the check. Change-Id: I8840c384d07f9d185a72bde5f91a3872a321f623 Closes-Bug: 1627736
2016-09-25get_param calls with multiple arguments need brackets around themMichele Baldessari2-5/+5
This issue was spotted during major upgrade where we had calls like this: servers: {get_param: servers, Controller} These get_param calls are hanging indefinitely and make the whole upgrade end in a timeout. We need to put brackets around the get_param function when there are multiple arguments: http://docs.openstack.org/developer/heat/template_guide/hot_spec.html#get-param This is already done in most of the tree, and the few places where this was not happening were parts not under CI. After this change the following grep returns only one false positive: grep -ir get_param: |grep -v -- '\[' |grep ',' Change-Id: I65b23bb44f37b93e017dd15a5212939ffac76614 Closes-Bug: #1626628
2016-09-25A few major-upgrade issuesMichele Baldessari3-25/+46
This commit does the following: 1. We now explicitly disable/stop and then remove the resources that are moving to systemd. We do this because we want to make sure they are all stopped before doing a yum upgrade, which otherwise would take ages due to rabbitmq and galera being down. It is best if we do this via pcs while we do the HA Full -> HA NG migration because it is simpler to make sure all the services are stopped at that stage. For extra safety we can still do a check by hand. By doing it via pacemaker we have the guarantee that all the migrated services are down already when we stop the cluster (which happens to be a syncronization point between all controller nodes). That way we can be certain that they are all down on all nodes before starting the yum upgrade process. 2. We actually need to start the systemd services in major_upgrade_controller_pacemaker_2.sh and not stop them. 3. We need to use the proper bash variable name 4. Use is_bootstrap_node everywhere to make the code more consistent Change-Id: Ic565c781b80357bed9483df45a4a94ec0423487c Closes-Bug: #1627490
2016-09-25Disable openstack-cinder-volume in step1 and reenable it in step2Michele Baldessari2-0/+5
Currently we do not disable openstack-cinder-volume during our major-upgrade-pacemaker step. This leads to the following scenario. In major_upgrade_controller_pacemaker_2.sh we do: start_or_enable_service galera check_resource galera started 600 .... if [[ -n $(is_bootstrap_node) ]]; then ... cinder-manage db sync ... What happens here is that since openstack-cinder-volume was never disabled it will already be started by pacemaker before we call cinder-manage and this will give us the following errors during the start: 06:05:21.861 19482 ERROR cinder.cmd.volume DBError: (pymysql.err.InternalError) (1054, u"Unknown column 'services.cluster_name' in 'field list'") Change-Id: I01b2daf956c30b9a4985ea62cbf4c941ec66dcdf Closes-Bug: #1627470
2016-09-25Start mongod before calling ceilometer-dbsyncMichele Baldessari1-0/+7
Currently we in major_upgrade_controller_pacemaker_2.sh we are calling ceilometer-dbsync before mongod is actually started (only galera is started at this point). This will make the dbsync hang indefinitely until the heat stack times out. Now this approach should be okay, but do note that when we start mongod via systemctl we are not guaranteed that it will be up on all nodes before we call ceilometer-dbsync. This *should* be okay because ceilometer-dbsync keeps retrying and eventually one of the nodes will be available. A completely clean fix here would be to add another step in heat to have the guarantee that all mongo servers are up and running before the dbsync call. Change-Id: I10c960b1e0efdeb1e55d77c25aebf1e3e67f17ca Closes-Bug: #1627453
2016-09-25Remove deprecated scheduler_driver settingsMichele Baldessari1-0/+2
In bug https://bugs.launchpad.net/tripleo/+bug/1615035 we fixed the scheduler_host setting which got deprecated in newton. It seems also the scheduler_driver settings needs tweaking: systemctl status openstack-nova-scheduler.service: 2016-09-24 20:24:54.337 15278 WARNING stevedore.named [-] Could not load nova.scheduler.filter_scheduler.FilterScheduler 2016-09-24 20:24:54.338 15278 CRITICAL nova [-] RuntimeError: (u'Cannot load scheduler driver from configuration %(conf)s.', {'conf': 'nova.scheduler.filter_scheduler.FilterScheduler'}) Let's set this to default during the upgrade step. From newton's nova.conf: The class of the driver used by the scheduler. This should be chosen from one of the entrypoints under the namespace 'nova.scheduler.driver' of file 'setup.cfg'. If nothing is specified in this option, the 'filter_scheduler' is used. This option also supports deprecated full Python path to the class to be used. For example, "nova.scheduler.filter_scheduler.FilterScheduler". But note: this support will be dropped in the N Release. Change-Id: Ic384292ad05a57757158995ec4c1a269fe4b00f1 Depends-On: I89124ead8928ff33e6b6907a7c2178169e91f4e6 Closes-Bug: #1627450
2016-09-25Reinstantiate parts of code that were accidentally removedMichele Baldessari2-0/+9
With commit fb25385d34e604d2f670cebe3e03fd57c14fa6be "Rework the pacemaker_common_functions for M..N upgrades" we accidentally removed some lines that fixed M/N upgrade issues. Namely: extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh -# https://bugzilla.redhat.com/show_bug.cgi?id=1284047 -# Change-Id: Ib3f6c12ff5471e1f017f28b16b1e6496a4a4b435 -crudini --set /etc/ceilometer/ceilometer.conf DEFAULT rpc_backend rabbit -# https://bugzilla.redhat.com/show_bug.cgi?id=1284058 -# Ifd1861e3df46fad0e44ff9b5cbd58711bbc87c97 Swift Ceilometer middleware no longer exists -crudini --set /etc/swift/proxy-server.conf pipeline:main pipeline "catch_errors healthcheck cache ratelimit tempurl formpost authtoken keystone staticweb proxy-logging proxy-server" -# LP: 1615035, required only for M/N upgrade. -crudini --set /etc/nova/nova.conf DEFAULT scheduler_host_manager host_manager extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh nova-manage db sync - nova-manage api_db sync This patch simply puts that code back without reverting the whole commit that broke things, because that is needed. Closes-Bug: #1627448 Change-Id: I89124ead8928ff33e6b6907a7c2178169e91f4e6
2016-09-21Make sure major upgrade script fails.Sofer Athlan-Guyot2-0/+3
Running upgrade-non-controller.sh against compute and object storage did not fail if the /root/tripleo_upgrade_node.sh failed. This make it harder to detect error in CI system for instance. Change-Id: I12b7d640547d3b8ec1f70104d159d6052b7638ff Closes-Bug: 1620973
2016-09-19Add a function to upgrade from full HA to NG HAMichele Baldessari3-16/+137
This is the initial work to have a function that migrates a full HA architecture as deployed in Mitaka to the HA architecture as deployed in Newton where only a few resources are managed by pacemaker. The sequence is the following: 1) We remove the desired services from pacemaker's control. The services at this point are still running normally via the systemd service as invoked by pacemaker 2) We do a "systemctl stop <service>" on all controllers for all the services that were removed from pacemaker's control. We do this to make sure that during the yum upgrade, the %post sections that call "systemctl try-restart" do not take ages, because at this point during the upgrade rabbit is down. The only exceptions are "openstack-core" and "delay" which are dummy pacemaker resources that do not exist on the system 3) We do a "systemctl start <service>" on all nodes for all the services mentioned above. We should probably merge this patch only when newton has branched as it is very specific to the M/N upgrade. Closes-Bug: 1617520 Change-Id: I4c409ce58c1a57b6e0decc3cf168b62698b32e39
2016-09-17M/N upgrade sahara-api fails to restart.Sofer Athlan-Guyot1-0/+2
Change-Id: I7a041dab8b1b1edc9c80248e1eef3ce7ab272292 Closes-Bug: 1615056
2016-09-17Rework the pacemaker_common_functions for M..N upgradesmarios4-61/+290
For N we cannot assume services are managed by pacemaker. This adds functions to check if a service is systemd or pcmk managed and start/stops it accordingly. For pcmk, only stop/disable on bootstrap node for example, whereas systemd should stop/start on all controllers. There is also an equivalent change to the check_resource which has been reworked to allow both pcmk and systemd. Implements: blueprint overcloud-upgrades-workflow-mitaka-to-newton Change-Id: Ic8252736781dc906b3aef8fc756eb8b2f3bb1f02
2016-09-16Merge "Refactor upgrade checks."Jenkins3-62/+111
2016-09-16Merge "Convert UpdateWorkflow to support composable roles"Jenkins3-84/+24
2016-09-16Merge "Fix use of batch_create in CephMon major upgrade template"Jenkins1-1/+2
2016-09-16Fix use of batch_create in CephMon major upgrade templateMathieu Bultel1-1/+2
The batch_create and rolling_update keys were incorrectly defined as properties of the resource instead of update policies. Change-Id: I19261adc78e4cdc3616f16221e85490a6b48d47b Closes-Bug: 1623506
2016-09-16Fixes the Ceph upgrade scriptsGiulio Fidente2-5/+5
The Ceph upgrade scripts was failing on the following: 1. a syntax error in an if condition 2. an attempt to read a possibly unbound variable 3. an attempt to chown a directory which might not exist this change aims at fixing all of the above. Closes-Bug: 1623942 Change-Id: I9e9d63d4ab7626893aaf2a25dccfcafbb97ccbdf
2016-09-16Convert UpdateWorkflow to support composable rolesSteven Hardy3-84/+24
We need to remove the hard-coded roles from overcloud.j2.yaml as now it's valid to e.g remove BlockStorage completely. The previous behavior for the per-role upgrade scripts is maintained but we'll need to rework this for newton->ocata upgrades where we can no longer be sure the servers mapping will contain all roles. Change-Id: I25e6c84757e3c00fba2aae834cd8206c62e44acf Partially-Implements: blueprint custom-roles
2016-09-12Refactor upgrade checks.Sofer Athlan-Guyot3-62/+111
We make it clear that recoverable checks happen before starting the upgrade to be able to run the upgrade after the offending error has been manually corrected. Add new check for the pcsd cluster status. Add new check for galera password file: BZ 1357112 Closes-Bug: 1614907 Change-Id: If736c79121e1ffe0eaeb814bdb73ccbc0b64edcd
2016-09-09Merge "Add Ceph cluster health validation on upgrade"Jenkins2-4/+32
2016-09-02Merge "Upgrade ceph-osd"Jenkins1-10/+67
2016-09-01Merge "Restart only services that need it"Jenkins1-8/+16
2016-08-31Restart only services that need itJiri Stransky1-8/+16
With new pacemaker architecture, Puppet handles restarts of most of the services. There are several still managed by pacemaker which need special restart handling utilizing pacemaker and its resource agents. The counterpart in puppet-tripleo requests restarts for individual pacemaker-managed services by writing out "restart flag" files, and the pacemaker_resource_restart.sh script then performs the restarts. Change-Id: Ia4e6a9f88181f1981993f046cf415dbbcdc9570e Closes-Bug: #1614967 Closes-Bug: #1587015 Depends-On: I6369ab0c82dbf3c8f21043f8aa9ab810744ddc12
2016-08-30Merge "M/N upgrade fix galera restart."Jenkins1-11/+16
2016-08-30Add Ceph cluster health validation on upgradeGiulio Fidente2-4/+32
This will prevent the Ceph Mon upgrade script from starting if the Ceph cluster is in error state. It also adds a parameter to ignore warning states, useful when performing an upgrade of a cluster where the number of healthy OSDs does not guarantee the desired replica size. Closes-Bug: 1618533 Change-Id: I1beb8ad0812f19b1018ba19b5a9fc85fa132d7f7
2016-08-30Upgrade ceph-osdGiulio Fidente1-10/+67
Upgrades the ceph-osd daemon from Hammer to Jewel Change-Id: Idfa90fdc0052c53f448401c85c5d13a2ba68acd1
2016-08-30Merge "Upgrade ceph-mon"Jenkins2-0/+81
2016-08-30Merge "Fix check of rpm-python."Jenkins1-1/+1
2016-08-29Upgrade ceph-monGiulio Fidente2-0/+81
Adds a pre-requisite software deployment to the pacemaker scenario upgrade which, before the openstack services are upgraded, upgrades the ceph-mon daemon from Hammer to Jewel. Change-Id: I9855d80a6ae156b4a9e0409c3c927068b9db95a0
2016-08-29Merge "M/N upgrade set scheduler_host_manager right."Jenkins1-0/+2
2016-08-26Merge "M/N upgrade fail to restart nova-scheduler."Jenkins1-0/+1
2016-08-26M/N upgrade fix galera restart.Sofer Athlan-Guyot1-11/+16
We have to recreate the /var/lib/mysql directory on all controller node, not just the boostrap node for the cluster to be able to restart. Adding a warning on the fact that those script are local and know nothing about the good upgrade state of the other nodes. Closes-Bug: 1612642 Change-Id: I48e2812d7df80bbf2db53a8b71dc434d4209a160
2016-08-26Fix check of rpm-python.Sofer Athlan-Guyot1-1/+1
There is a typo in the code, making this test always successful. Closes-Bug: 1614437 Change-Id: Ia6b0b156294de9fcb8f66fc46aa8801555775a56