2016-09-29  Balance Rabbitmq Queue Master Location on queue declaration with min-masters strategy  (Michele Baldessari, 1 file, -0/+1)

One of the controllers may become unavailable, in which case queue masters are placed on the remaining controllers as queues are declared. Once the lost controller becomes available again, masters of newly declared queues are not preferentially placed on it, even though it hosts an obviously lower number of queue masters, so the distribution can become unbalanced and one controller may end up under significantly higher load after multiple fail-overs.

RabbitMQ 3.6.0 introduced a new HA feature for distributing queue masters; one of the strategies is min-masters, which picks the node hosting the minimum number of masters. One way to turn the min-masters strategy on is to add the following to the configuration file, rabbitmq.config:

    {rabbit, [
      ...
      {queue_master_locator, <<"min-masters">>},
      ...
    ]},

Change-Id: I61bcab0e93027282b62f2a97bec87cbb0a6e6551
Closes-Bug: #1629010
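After applying the setting, the active strategy can be read back from the running broker; a minimal sketch, assuming rabbitmqctl is available on the controller:

    # Read the active queue-master-location strategy from the broker;
    # rabbitmqctl eval runs an Erlang expression on the node.
    rabbitmqctl eval 'application:get_env(rabbit, queue_master_locator).'
    # expected output: {ok,<<"min-masters">>}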
2016-09-29Merge "j2 template per-role things in default registry"Jenkins1-58/+20
2016-09-29Merge "Relax pre-upgrade check for failed actions"Jenkins2-3/+5
2016-09-29Merge "Fix races in major-upgrade-pacemaker Step2"Jenkins3-17/+41
2016-09-29Merge "Full HA->HA NG migration might fail setting maintenance-mode"Jenkins1-8/+4
2016-09-29Merge "Update gnocchi database during M/N upgrade."Jenkins1-2/+3
2016-09-29Merge "Move db::mysql into service_config_settings"Jenkins23-105/+111
2016-09-29  j2 template per-role things in default registry  (Steven Hardy, 1 file, -58/+20)

The default resource-registry file contains a bunch of per-role entries, which means you must cut and paste them into a custom environment file for custom roles, even if you only want the same defaults as the built-in roles. Using j2 we can template these just like in overcloud.j2.yaml and other files.

Change-Id: I52a9bffd043ca8fb0f05077c8a401a68def82926
Partial-Bug: #1626976
2016-09-29  Relax pre-upgrade check for failed actions  (Michele Baldessari, 2 files, -3/+5)

Before this change we checked the cluster for any failed actions and stopped the upgrade process if there were any. This is likely excessive, as a failed action could have happened in the past while the cluster is now fully functional. It is better to check whether any resources are in the Stopped state and break the upgrade process if any of them are.

We also need to restrict this check to the bootstrap node, because otherwise the following might happen:

1) The bootstrap node does the check, it is successful, and it starts the full HA -> HA NG migration, which *will* create failed actions and will start stopping resources.
2) If the check now starts on a non-bootstrap node while 1) is ongoing, it will find either failed actions or stopped resources, so it will fail.

Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f
Closes-Bug: #1628653
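A minimal sketch of the relaxed check described above, assuming the pcs status output format and reusing the is_bootstrap_node/echo_error helpers quoted elsewhere in this log:

    # Run only on the bootstrap node; abort if any pacemaker resource is Stopped.
    if [[ -n $(is_bootstrap_node) ]]; then
        if pcs status | grep -q 'Stopped'; then
            echo_error "ERROR: stopped resources detected, aborting the upgrade"
            exit 1
        fi
    fi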
2016-09-29  Fix races in major-upgrade-pacemaker Step2  (Michele Baldessari, 3 files, -17/+41)

tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh has the following code:

    ...
    check_resource mongod started 600

    if [[ -n $(is_bootstrap_node) ]]; then
        ...
        tstart=$(date +%s)
        while ! clustercheck; do
            sleep 5
            tnow=$(date +%s)
            if (( tnow-tstart > galera_sync_timeout )) ; then
                echo_error "ERROR galera sync timed out"
                exit 1
            fi
        done

        # Run all the db syncs
        cinder-manage db sync
        ...
    fi

    start_or_enable_service rabbitmq
    check_resource rabbitmq started 600
    start_or_enable_service redis
    check_resource redis started 600
    start_or_enable_service openstack-cinder-volume
    check_resource openstack-cinder-volume started 600

    systemctl_swift start

    for service in $(services_to_migrate); do
        manage_systemd_service start "${service%%-clone}"
        check_resource_systemd "${service%%-clone}" started 600
    done

The problem with the above code is that it is open to the following race condition:

1) The bootstrap node is busy checking the galera status via clustercheck.
2) A non-bootstrap node has already reached the start_or_enable_service rabbitmq line and those following it. These lines are effectively skipped, because start_or_enable_service is a noop on non-bootstrap nodes and check_resource rabbitmq only checks that pcs status | grep rabbitmq returns true.
3) The non-bootstrap node can then reach manage_systemd_service start, which will fail with errors like:

    "Job for openstack-nova-scheduler.service failed because the control process exited with error code. See \"systemctl status openstack-nova-scheduler.service\" and \"journalctl -xe\" for details.\n"

(because the db tables are not migrated yet). This happens because 3) starts on non-bootstrap nodes before the db-sync statements are complete on the bootstrap node. I did not feel like changing the semantics of check_resource and removing the noop on non-bootstrap nodes, as other parts of the tree might rely on this behaviour.

Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
Closes-Bug: #1627965
2016-09-28  Update gnocchi database during M/N upgrade.  (Sofer Athlan-Guyot, 1 file, -2/+3)

We call gnocchi-upgrade to make sure we update all the needed schemas during the major-upgrade-pacemaker step. We also make sure that redis is started before we call gnocchi-upgrade; otherwise the command will be stuck in a loop trying to contact redis.

Closes-Bug: #1626592
Change-Id: Ia016264b51f485b97fa150ebd357b109581342ed
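A sketch of the ordering this change enforces, reusing the start_or_enable_service/check_resource helpers quoted elsewhere in this log:

    # Make sure redis is up before gnocchi-upgrade, which otherwise loops
    # waiting to contact it.
    start_or_enable_service redis
    check_resource redis started 600
    gnocchi-upgrade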
2016-09-28Merge "Fix predictable placement indexing"Jenkins1-0/+14
2016-09-28  Move db::mysql into service_config_settings  (Dan Prince, 23 files, -105/+111)

This patch moves the various db::mysql hiera settings into a 'mysql'-specific service_config_settings section for each service, so that these will only get applied on the MySQL service node. This follows a similar puppet-tripleo change where we create the actual databases for all services locally on the MySQL service node to avoid permission issues.

Change-Id: Ic0692b1f7aa8409699630ef3924c4be98ca6ffb2
Closes-bug: #1620595
Depends-On: I05cc0afa9373429a3197c194c3e8f784ae96de5f
Depends-On: I5e1ef2dc6de6f67d7c509e299855baec371f614d
2016-09-28  Full HA->HA NG migration might fail setting maintenance-mode  (Michele Baldessari, 1 file, -8/+4)

Currently we do the following in the migration path:

    pcs property set maintenance-mode=true
    if ! timeout -k 10 300 crm_resource --wait; then
        echo_error "ERROR: cluster remained unstable after setting maintenance-mode for more than 300 seconds, exiting."
        exit 1
    fi

crm_resource --wait can actually take forever under certain conditions. The property will be set atomically across the cluster nodes, so we should be good without this.

Change-Id: I8f531d63479b81d65b572c4431c9db6f610f7e04
Closes-Bug: #1628393
2016-09-28Fix "Not all flavors have been migrated to the API database"Michele Baldessari1-0/+1
After a successful upgrade to Newton, I ran the tripleo.sh --overcloud-pingtest and it failed with the following: resources.test_flavor: Not all flavors have been migrated to the API database (HTTP 409) The issue is the fact that some tables have migrated to the nova_api db and we need to migrate the data as well. Currently we do: nova-manage db sync nova-manage api_db sync We want to add: nova-manage db online_data_migrations After launching this command the overcloud-pingtest works correctly: tripleo.sh -- Overcloud pingtest SUCCEEDED Change-Id: Id2d5b28b5d4ade7dff6c5e760be0f509b4fe5096 Closes-Bug: #1628450
2016-09-28Merge "Deprecate the NeutronL3HA parameter"Jenkins1-7/+23
2016-09-27  Fix NTP servers hieradata  (Marius Cornea, 1 file, -1/+1)

This patch enables correctly setting the NTP server passed via --ntp-server in the overcloud nodes' /etc/ntp.conf.

Change-Id: Iff644b9da51fb8cd1946ad9d297ba0e94d3d782b
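A quick way to verify the fix on a deployed node; a sketch, with an example server address:

    # The --ntp-server value should now show up as a server line in ntp.conf.
    grep '^server' /etc/ntp.conf
    # e.g.: server 10.0.0.1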
2016-09-27Merge "Remove deprecated scheduler_driver settings"Jenkins1-0/+2
2016-09-27Merge "Add metricd workers support in gnocchi"Jenkins2-0/+6
2016-09-27Merge "Use parameter name to configure gmcast_listen_addr"Jenkins1-0/+8
2016-09-27Merge "Set manila::keystone::auth::tenant"Jenkins1-0/+1
2016-09-27Merge "Disable openstack-cinder-volume in step1 and reenable it in step2"Jenkins2-0/+5
2016-09-27Merge "Activate StorageMgmtPort on computes in HCI environment"Jenkins1-5/+4
2016-09-26  Set manila::keystone::auth::tenant  (Tom Barron, 1 file, -0/+1)

Without setting this parameter, overcloud deploy fails and 'openstack stack failures list overcloud' reveals the following error:

    Error: Puppet::Type::Keystone_user_role::ProviderOpenstack: Could not find project with name [services] and domain [Default]
    Error: /Stage[main]/Manila::Keystone::Auth/Keystone::Resource::Service_identity[manilav2]/Keystone_user_role[manilav2@services]: Could not evaluate: undefined method `[]' for nil:NilClass

When we set manila::keystone::auth::tenant to 'service', analogous to cinder, nova, etc., the overcloud deploy completes successfully.

Change-Id: I996ac2ff602c632a9f9ea9c293472a6f2f92fd72
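To confirm the value lands in hiera on the overcloud node, a sketch (the hiera CLI ships with puppet; you may need to point it at the node's hiera.yaml):

    hiera manila::keystone::auth::tenant
    # => "service"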
2016-09-27Merge "Add FixedIPs parameter to from_service.yaml"Jenkins2-0/+12
2016-09-27Merge "Fix ignore warning on ceph major upgrade."Jenkins1-1/+1
2016-09-27Merge "Add integration with Manila CephFS Native driver"Jenkins4-0/+81
2016-09-27Merge "A few major-upgrade issues"Jenkins3-25/+46
2016-09-27Merge "Start mongod before calling ceilometer-dbsync"Jenkins1-0/+7
2016-09-27Merge "Reinstantiate parts of code that were accidentally removed"Jenkins2-0/+9
2016-09-27Merge "Neutron metadata agent worker count fix"Jenkins1-3/+10
2016-09-27Merge "Remove double definition of config_settings key in keystone"Jenkins1-1/+0
2016-09-26  Fix predictable placement indexing  (Ben Nemec, 1 file, -0/+14)

As noted in the bug, predictable placement is broken right now because the %index% in the scheduler hint isn't being interpolated. This is because the parameter was moved from overcloud.yaml to the service-specific files, which don't provide the index value. Because the Compute role's parameter is named NovaCompute... we also have to include some backwards compatibility logic to handle the mismatch.

Change-Id: Ibee2949fe4c6c707203d7250e2ce169c769b1dcd
Closes-Bug: 1627858
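For context, a sketch of the kind of environment file that exercises this path; the parameter name follows the NovaCompute naming noted above, and the hint syntax is assumed from TripleO's predictable placement docs:

    # scheduler-hints-env.yaml is a hypothetical file name.
    cat > scheduler-hints-env.yaml <<'EOF'
    parameter_defaults:
      NovaComputeSchedulerHints:
        'capabilities:node': 'compute-%index%'
    EOF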
2016-09-26  Fix ignore warning on ceph major upgrade.  (Sofer Athlan-Guyot, 1 file, -1/+1)

The parameter IgnoreCephUpgradeWarnings is type-cast into a boolean, which is rendered as the string 'True' or 'False', not 'true' or 'false'. This fixes the check.

Change-Id: I8840c384d07f9d185a72bde5f91a3872a321f623
Closes-Bug: 1627736
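A sketch of the corrected comparison (the variable name is hypothetical); heat renders the boolean as the Python-style strings 'True'/'False':

    # Compare against the capitalized string heat actually emits.
    if [[ "$ignore_ceph_upgrade_warnings" != "True" ]]; then
        echo_error "ERROR: Ceph upgrade warnings present, aborting"
        exit 1
    fi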
2016-09-26Merge "Bind MySQL address to hostname appropriate to its network"Jenkins2-1/+14
2016-09-26  Use parameter name to configure gmcast_listen_addr  (Juan Antonio Osorio Robles, 1 file, -0/+8)

This used to use mysql_bind_ip, but that parameter name is quite misleading, since what it actually configures is not the bind-ip itself but the gmcast.listen_addr parameter. This fixes that confusion.

Depends-On: Iea4bd67074824e5dc6732fd7e408743e693d80b3
Change-Id: I2b114600e622491ccff08a07946926734b50ac70
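To see what the parameter actually controls on a deployed controller, a sketch (the config path is assumed and the address is an example):

    grep gmcast.listen_addr /etc/my.cnf.d/galera.cnf
    # wsrep_provider_options = "gmcast.listen_addr=tcp://172.16.2.10:4567"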
2016-09-26  Remove double definition of config_settings key in keystone  (Juan Antonio Osorio Robles, 1 file, -1/+0)

Change-Id: I291bfb1e5736864ea504cd82eea1d4001fcdd931
2016-09-26  Bind MySQL address to hostname appropriate to its network  (Juan Antonio Osorio Robles, 2 files, -1/+14)

This now uses the mysql_bind_host key to set an appropriate FQDN for MySQL to bind to.

Closes-Bug: #1627060
Change-Id: I50f4082ea968d93b240b6b5541d84f27afd6e2a3
Depends-On: I316acfd514aac63b84890e20283c4ca611ccde8b
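Similarly, a sketch for checking the resulting bind address (path and hostname are examples):

    grep bind-address /etc/my.cnf.d/galera.cnf
    # bind-address = overcloud-controller-0.internalapi.localdomain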
2016-09-26  Add metricd workers support in gnocchi  (Carlos Camacho, 2 files, -0/+6)

Depending on the environment, gnocchi metricd workers can consume significant controller RAM/CPU; this option makes the worker count configurable. It is also set to 1 in environments/low-memory-usage.yaml, which reduces the service footprint in, e.g., CI.

Change-Id: Ia008b32151f4d8fec586cf89994ac836751b7cce
Closes-bug: #1626473
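A sketch of overriding the new knob per deployment; the parameter name is assumed to match the one set in low-memory-usage.yaml:

    # gnocchi-workers-env.yaml is a hypothetical file name.
    cat > gnocchi-workers-env.yaml <<'EOF'
    parameter_defaults:
      GnocchiMetricdWorkers: 1
    EOF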
2016-09-25  get_param calls with multiple arguments need brackets around them  (Michele Baldessari, 8 files, -19/+19)

This issue was spotted during major upgrade, where we had calls like this:

    servers: {get_param: servers, Controller}

These get_param calls hang indefinitely and make the whole upgrade end in a timeout. We need to put brackets around the arguments to the get_param function when there is more than one, i.e. {get_param: [servers, Controller]}:

http://docs.openstack.org/developer/heat/template_guide/hot_spec.html#get-param

This is already done in most of the tree, and the few places where it was not happening were parts not under CI. After this change the following grep returns only one false positive:

    grep -ir get_param: | grep -v -- '\[' | grep ','

Change-Id: I65b23bb44f37b93e017dd15a5212939ffac76614
Closes-Bug: #1626628
2016-09-25  A few major-upgrade issues  (Michele Baldessari, 3 files, -25/+46)

This commit does the following:

1. We now explicitly disable/stop and then remove the resources that are moving to systemd. We do this because we want to make sure they are all stopped before doing a yum upgrade, which would otherwise take ages due to rabbitmq and galera being down. It is best to do this via pcs while we do the HA Full -> HA NG migration, because it is simpler to make sure all the services are stopped at that stage. For extra safety we can still do a check by hand. By doing it via pacemaker we have the guarantee that all the migrated services are already down when we stop the cluster (which happens to be a synchronization point between all controller nodes). That way we can be certain that they are all down on all nodes before starting the yum upgrade process. (A sketch of this pattern follows this entry.)
2. We actually need to start the systemd services in major_upgrade_controller_pacemaker_2.sh, not stop them.
3. We need to use the proper bash variable name.
4. Use is_bootstrap_node everywhere to make the code more consistent.

Change-Id: Ic565c781b80357bed9483df45a4a94ec0423487c
Closes-Bug: #1627490
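A sketch of the explicit disable-then-remove pattern from point 1 (the resource name is illustrative):

    # Stop the clone cluster-wide, then drop it from pacemaker so systemd
    # can manage the service after the upgrade.
    pcs resource disable openstack-sahara-api-clone
    pcs resource delete openstack-sahara-api-clone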
2016-09-25  Disable openstack-cinder-volume in step1 and reenable it in step2  (Michele Baldessari, 2 files, -0/+5)

Currently we do not disable openstack-cinder-volume during our major-upgrade-pacemaker step. This leads to the following scenario. In major_upgrade_controller_pacemaker_2.sh we do:

    start_or_enable_service galera
    check_resource galera started 600
    ....
    if [[ -n $(is_bootstrap_node) ]]; then
        ...
        cinder-manage db sync
        ...

What happens here is that, since openstack-cinder-volume was never disabled, it will already have been started by pacemaker before we call cinder-manage, and this gives us the following errors during the start:

    06:05:21.861 19482 ERROR cinder.cmd.volume DBError: (pymysql.err.InternalError) (1054, u"Unknown column 'services.cluster_name' in 'field list'")

Change-Id: I01b2daf956c30b9a4985ea62cbf4c941ec66dcdf
Closes-Bug: #1627470
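A sketch of the resulting sequencing, reusing the helper names quoted above:

    # Step 1: keep cinder-volume down across the upgrade.
    pcs resource disable openstack-cinder-volume

    # Step 2: re-enable it only after the bootstrap node has run cinder-manage db sync.
    start_or_enable_service openstack-cinder-volume
    check_resource openstack-cinder-volume started 600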
2016-09-25  Start mongod before calling ceilometer-dbsync  (Michele Baldessari, 1 file, -0/+7)

Currently, in major_upgrade_controller_pacemaker_2.sh, we call ceilometer-dbsync before mongod is actually started (only galera is started at that point). This makes the dbsync hang indefinitely until the heat stack times out.

Now this approach should be okay, but do note that when we start mongod via systemctl we are not guaranteed that it will be up on all nodes before we call ceilometer-dbsync. This *should* be okay because ceilometer-dbsync keeps retrying and eventually one of the nodes will be available. A completely clean fix here would be to add another step in heat to guarantee that all mongo servers are up and running before the dbsync call.

Change-Id: I10c960b1e0efdeb1e55d77c25aebf1e3e67f17ca
Closes-Bug: #1627453
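A sketch of the stricter guard the message alludes to, waiting for mongod to answer before the dbsync (the readiness loop is hypothetical; ceilometer-dbsync itself also retries):

    systemctl start mongod
    # Hypothetical readiness loop against the local mongod.
    until mongo --quiet --eval 'db.serverStatus().ok' >/dev/null 2>&1; do
        sleep 2
    done
    ceilometer-dbsync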
2016-09-25  Remove deprecated scheduler_driver settings  (Michele Baldessari, 1 file, -0/+2)

In bug https://bugs.launchpad.net/tripleo/+bug/1615035 we fixed the scheduler_host setting, which was deprecated in newton. It seems the scheduler_driver setting also needs tweaking; systemctl status openstack-nova-scheduler.service shows:

    2016-09-24 20:24:54.337 15278 WARNING stevedore.named [-] Could not load nova.scheduler.filter_scheduler.FilterScheduler
    2016-09-24 20:24:54.338 15278 CRITICAL nova [-] RuntimeError: (u'Cannot load scheduler driver from configuration %(conf)s.', {'conf': 'nova.scheduler.filter_scheduler.FilterScheduler'})

Let's set this to the default during the upgrade step. From newton's nova.conf:

    The class of the driver used by the scheduler. This should be chosen from
    one of the entrypoints under the namespace 'nova.scheduler.driver' of file
    'setup.cfg'. If nothing is specified in this option, the 'filter_scheduler'
    is used.

    This option also supports deprecated full Python path to the class to be
    used. For example, "nova.scheduler.filter_scheduler.FilterScheduler". But
    note: this support will be dropped in the N Release.

Change-Id: Ic384292ad05a57757158995ec4c1a269fe4b00f1
Depends-On: I89124ead8928ff33e6b6907a7c2178169e91f4e6
Closes-Bug: #1627450
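One way to express the default explicitly during the upgrade, a sketch using crudini as elsewhere in these scripts (the patch may instead simply drop the option):

    # 'filter_scheduler' is the entrypoint name newton resolves by default.
    crudini --set /etc/nova/nova.conf DEFAULT scheduler_driver filter_scheduler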
2016-09-25  Reinstantiate parts of code that were accidentally removed  (Michele Baldessari, 2 files, -0/+9)

With commit fb25385d34e604d2f670cebe3e03fd57c14fa6be "Rework the pacemaker_common_functions for M..N upgrades" we accidentally removed some lines that fixed M/N upgrade issues. Namely:

    extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh
    -# https://bugzilla.redhat.com/show_bug.cgi?id=1284047
    -# Change-Id: Ib3f6c12ff5471e1f017f28b16b1e6496a4a4b435
    -crudini --set /etc/ceilometer/ceilometer.conf DEFAULT rpc_backend rabbit
    -# https://bugzilla.redhat.com/show_bug.cgi?id=1284058
    -# Ifd1861e3df46fad0e44ff9b5cbd58711bbc87c97 Swift Ceilometer middleware no longer exists
    -crudini --set /etc/swift/proxy-server.conf pipeline:main pipeline "catch_errors healthcheck cache ratelimit tempurl formpost authtoken keystone staticweb proxy-logging proxy-server"
    -# LP: 1615035, required only for M/N upgrade.
    -crudini --set /etc/nova/nova.conf DEFAULT scheduler_host_manager host_manager

    extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
       nova-manage db sync
    -  nova-manage api_db sync

This patch simply puts that code back without reverting the whole commit that broke things, because that commit is needed.

Closes-Bug: #1627448
Change-Id: I89124ead8928ff33e6b6907a7c2178169e91f4e6
2016-09-23  Add FixedIPs parameter to from_service.yaml  (Ben Nemec, 2 files, -0/+12)

Without this, deployments using the from_service.yaml port for service VIPs will fail with:

    "Property error: : resources.RedisVirtualIP.properties: : Unknown Property FixedIPs"

Change-Id: Ie0d3b940a87741c56fe022c9e50da0d3ae9b583b
Closes-Bug: 1627189
2016-09-23Merge "Remove hard-coded roles in EnabledServices output"Jenkins1-5/+3
2016-09-23  Add integration with Manila CephFS Native driver  (Erno Kuvaja, 4 files, -0/+81)

Enables configuring the CephFS Native backend for Manila. This change is based on the usage of environments like the one in review https://review.openstack.org/#/c/354019 for the NetApp driver.

Co-Authored-By: Marios Andreou <marios@redhat.com>
Change-Id: If013d796bcdfe48b2c995bcab462c89c360b7367
Depends-On: I918f6f23ae0bd3542bcfe1bf0c797d4e6aa8f4d9
Depends-On: I2b537f735b8d1be8f39e8c274be3872b193c1014
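Usage would follow the same pattern as other backend environments; a sketch, with the environment file name assumed from this change:

    openstack overcloud deploy --templates \
      -e /usr/share/openstack-tripleo-heat-templates/environments/manila-cephfsnative-config.yaml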
2016-09-23  Move keystone::auth into service_config_settings  (Dan Prince, 19 files, -101/+152)

This patch moves the keystone::auth settings for all services into the new service_config_settings section. This is important because we execute the keystone commands via puppet only on the role containing the keystone service, and without these settings it would fail.

Note that yaql merging/filtering is used here to ensure that service_config_settings is optional in service templates, and also that we'll only deploy hieradata for a given service on a node running the service (the key in the service_config_settings map must match the service_name in the service template for this to work). E.g. the following will result in deploying keystone: 123 into hiera only on the role running the "keystone" service, regardless of which service template defines it:

    service_config_settings:
      keystone:
        keystone: 123

Co-Authored-By: Steven Hardy <shardy@redhat.com>
Change-Id: I0c2fce037a1a38772f998d582a816b4b703f8265
Closes-bug: 1620829
2016-09-23Merge "Tolerate missing keys from role_data in service templates"Jenkins1-6/+10