Age | Commit message | Author | Files | Lines
2016-10-04Merge "Make keystone api network hiera composable"Jenkins3-24/+25
2016-10-04Merge "Set ceph osd max object name and namespace len on upgrade when on ext4"Jenkins1-0/+10
2016-10-03Merge "reload HAProxy config in HA setups when certificate is updated"Jenkins1-4/+2
2016-10-03Merge "Update $service to $resource this variable does not exist in the context"Jenkins1-1/+1
2016-10-03Merge "Cinder volume service is not managed by Pacemaker on BlockStorage"Jenkins4-2/+3
2016-10-03Merge "Change the rabbitmq ha policies during an M/N Upgrade"Jenkins2-1/+24
2016-10-03Update $service to $resource this variable does not exist in the contextMathieu Bultel1-1/+1
Heat failed with "service: unbound variable" because $service is never set in this context.

Change-Id: If82ee4562612f2617b676732956396278ee40a88
Closes-Bug: #1629903
2016-10-03reload HAProxy config in HA setups when certificate is updatedJuan Antonio Osorio Robles1-4/+2
When updating a certificate for HAProxy, we only do a reload of the configuration on non-HA setups. This means that if we try the same in an HA setup, the cloud will still serve the old certificate and that leads to several issues, such as serving a revoked or even a compromised certificate for some time, or just SSL issues that the certificate doesn't match. This enables a reload for HA cases too. Change-Id: Ib8ca2fe91be345ef4324fc8265c45df8108add7a Closes-Bug: #1629886
2016-10-03Merge "Fixed NoneType issue when monitoring-environment.yaml"Jenkins1-1/+1
2016-10-03Merge "Balance Rabbitmq Queue Master Location on queue declaration with min-masters strategy"Jenkins1-0/+1
2016-10-03Merge "Change rabbitmq queues HA mode from ha-all to ha-exactly"Jenkins1-0/+9
2016-10-03Change the rabbitmq ha policies during an M/N UpgradeMichele Baldessari2-1/+24
This takes care of the M->N upgrade path when changing the ha rabbitmq policy. Partial-Bug: #1628998 Change-Id: I2468a096b5d7042bc801a742a7a85fb1521c1c02
2016-10-03Merge "Fixed NoneType issue when logging-environment.yaml is used"Jenkins1-1/+1
2016-10-01Change rabbitmq queues HA mode from ha-all to ha-exactlyMichele Baldessari1-0/+9
It turns out that reducing the number of rabbitmq queue copies in the cluster significantly improves cluster performance, especially failover recovery time. Right now the cluster uses ha-all mode for rabbitmq queues. It is best to change this to "ha-exactly" mode and reduce the number of queue copies to ceil(N/2), where N is the number of controllers in the cluster, so in the typical scenario of 3 controllers it would be 2 by default. It does not make much sense to keep copies of queues across the whole cluster, since if the quorum of nodes is lost then the rest of the cluster nodes will be stopped anyway. We let the user override this with a parameter.

I.e. for a 3 node controlplane cluster we will go from this:

    pcs resource show rabbitmq
     Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
      Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"

To this:

    pcs resource show rabbitmq
     Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
      Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"exactly","ha-params":2}"

According to Marin Krcmarik's testing, recovery time from failure was reduced significantly.

Partial-Bug: #1628998
Change-Id: Iace6daf27a76cb8ef1050ada0de7ff1f530916c6
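The ceil(N/2) rule described in this commit message can be sketched in bash as follows. This is an illustration only: the function names `queue_copies` and `ha_policy` are hypothetical and are not taken from the templates, which may compute the value differently.

```shell
#!/bin/bash
# Hypothetical helper (not from the templates): number of queue copies
# for ha-exactly mode, i.e. ceil(N/2) for N controllers.
queue_copies() {
    local controllers=$1
    # Integer ceiling of N/2 without floating point arithmetic.
    echo $(( (controllers + 1) / 2 ))
}

# Hypothetical helper: build the policy definition for N controllers.
ha_policy() {
    printf '{"ha-mode":"exactly","ha-params":%d}\n' "$(queue_copies "$1")"
}

ha_policy 3
```

For 3 controllers this yields the "ha-params":2 value shown in the commit message; the user-facing override parameter mentioned there is not modelled here.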
2016-09-30Merge "telemetry: remove coordination_url hiera settings"Jenkins3-25/+1
2016-09-30Merge "Telemetry: add redis_password hiera parameter"Jenkins3-0/+3
2016-09-30Merge "Replace per role manifests with a common role manifest"Jenkins10-101/+37
2016-09-30Make keystone api network hiera composableSteven Hardy3-24/+25
These hard-coded references to the Controller role mean that things won't work if the keystone service is moved to any other role, so we need to generate the lists dynamically based on the enabled services for each role. Change-Id: I5f1250a8a1a38cb3909feeb7d4c1000fd0fabd14 Closes-Bug: #1629096
2016-09-30Replace per role manifests with a common role manifestSteven Hardy10-101/+37
This removes the (nearly empty) per role manifests, and replaces them with a generic manifest, where we use str_replace to substitute the role name at runtime (or in some cases a subset of the name for backwards compatibility) Change-Id: I79da0f523189959b783bbcbb3b0f37be778e02fe Partial-Bug: #1626976
2016-09-30telemetry: remove coordination_url hiera settingsEmilien Macchi3-25/+1
They are now normalized and set in puppet-tripleo. Change-Id: I197481c577b85894178e7899a55869da47847755 Closes-Bug: #1629279 Depends-On: Ic6de09acf0d36ca90cc2041c0add1bc2b4a369a5
2016-09-30Telemetry: add redis_password hiera parameterEmilien Macchi3-0/+3
Add redis_password parameter in Hiera so we can re-use it from puppet-tripleo later for Aodh, Ceilometer and Gnocchi. Change-Id: I038e2bac22e3bfa5047d2e76e23cff664546464d Partial-Bug: #1629279
2016-09-30Fixed NoneType issue when monitoring-environment.yamlJuan Badia Payno1-1/+1
When you tried to use environments/monitoring-environment.yaml as part of the overcloud deployment, you hit the following error, which stops the deployment:

    ***
    Deploying templates in the directory /home/stack/tripleo-heat-templates
    'NoneType' object does not support item assignment
    ***

Closes-Bug: #1629323
Change-Id: I8cf2e7d8f3a4e79cc71a1566ec17d0a977c38d60
Signed-off-by: Juan Badia Payno <jbadiapa@redhat.com>
2016-09-30Fixed NoneType issue when logging-environment.yaml is usedJuan Badia Payno1-1/+1
When you tried to use environments/logging-environment.yaml as part of the overcloud deployment, you hit the following error, which stops the deployment:

    ***
    Deploying templates in the directory /home/stack/tripleo-heat-templates
    'NoneType' object does not support item assignment
    ***

Closes-Bug: #1629315
Change-Id: I55e5c7f20ddf30f3e48247b734f6fa47f5de3750
Signed-off-by: Juan Badia Payno <jbadiapa@redhat.com>
2016-09-30Merge "Add option to specify Certmonger CA"Jenkins1-0/+8
2016-09-30Merge "Move the rest of static roles resource registry entries to j2"Jenkins4-14/+4
2016-09-29Merge "Use -L with chown and set crush map tunables when upgrading Ceph"Jenkins2-4/+8
2016-09-29Merge "Fix typo in fixing gnocchi upgrade."Jenkins1-1/+1
2016-09-29Merge "Add gateway_ip in OS::Neutron::Subnet"Jenkins11-1/+24
2016-09-29Add option to specify Certmonger CAJuan Antonio Osorio Robles1-0/+8
This will be used for internal (or even public) TLS, for when certmonger is generating the certificates. This same setting is used for the undercloud with the generate_service_certificate option. Change-Id: Ic54fe512b9ed5c71417a66491b7954e653f660b6
2016-09-29Balance Rabbitmq Queue Master Location on queue declaration with min-masters strategyMichele Baldessari1-0/+1
One of the controllers may become unavailable, in which case Queue Masters will be located on the remaining available controllers during queue declarations. Once the lost controller becomes available again, masters of newly declared queues are not placed preferentially on that controller, even though it obviously hosts fewer queue masters. The distribution may therefore be unbalanced, and one of the controllers may come under significantly higher load after multiple fail-overs.

With rabbit 3.6.0, rabbitmq introduced a new HA feature for Queue Master distribution. One of the strategies is min-masters, which picks the node hosting the minimum number of masters.

One way to turn on the min-masters strategy is to add the following to the configuration file rabbitmq.config:

    {rabbit,[
      ..
      {queue_master_locator, <<"min-masters">>},
      ..
    ]},

Change-Id: I61bcab0e93027282b62f2a97bec87cbb0a6e6551
Closes-Bug: #1629010
2016-09-29Set ceph osd max object name and namespace len on upgrade when on ext4Giulio Fidente1-0/+10
As per [1] we need to lower osd max object name and namespace len when upgrading from Hammer and the OSD is backed by ext4. These could also be given via ExtraConfig but on upgrade we only run puppet apply after this script is executed, so the values won't be effective unless the daemon is restarted. Yet we do not want puppet to restart the daemon because we can't bring all OSDs down unconditionally or guests will die. 1. http://tracker.ceph.com/issues/16187 Co-Authored-By: Michele Baldessari <michele@acksyn.org> Co-Authored-By: Dimitri Savineau <dsavinea@redhat.com> Change-Id: I7fec4e2426bdacd5f364adbebd42ab23dcfa523a Closes-Bug: 1628874
2016-09-29Cinder volume service is not managed by Pacemaker on BlockStorageGiulio Fidente4-2/+3
We do not want cinder-volume to be managed by Pacemaker on BlockStorage nodes, where Pacemaker is not running at all. This change adds a new BlockStorageCinderVolume service name which can (and is, by default) mapped to the non Pacemaker implementation of the service. The error was: Could not find dependency Exec[wait-for-settle] for Pacemaker::Resource::Systemd[openstack-cinder-volume] Also moves cinder::host setting into the Pacemaker specific service definition because we only want to set a shared host= string when the service is managed by Pacemaker. Closes-Bug: #1628912 Change-Id: I2f7e82db4fdfd5f161e44d65d17893c3e19a89c9
2016-09-29Move the rest of static roles resource registry entries to j2Carlos Camacho4-14/+4
Moves the rest of the static resource registry entries to j2, which allows the content of the template to be extended to the roles_list. Also moves the templates so that their names correspond to the role name. Partial-Bug: #1626976 Change-Id: I1cbe101eb4ce5a89cba5f2cc45cace43d3380f22
2016-09-29Merge "j2 template per-role things in default registry"Jenkins1-58/+20
2016-09-29Merge "Relax pre-upgrade check for failed actions"Jenkins2-3/+5
2016-09-29Merge "Fix races in major-upgrade-pacemaker Step2"Jenkins3-17/+41
2016-09-29Fix typo in fixing gnocchi upgrade.Sofer Athlan-Guyot1-1/+1
Change-Id: I44451a280dd928cd694dd6845d5d83040ad1f482 Related-Bug: #1626592
2016-09-29Merge "Full HA->HA NG migration might fail setting maintenance-mode"Jenkins1-8/+4
2016-09-29Merge "Update gnocchi database during M/N upgrade."Jenkins1-2/+3
2016-09-29Use -L with chown and set crush map tunables when upgrading CephGiulio Fidente2-4/+8
Previously the chown command wasn't traversing symlinks, causing the new ownership to not be set for some needed files. This change also ensures the crush map tunables are set to the 'default' profile after the upgrade. Finally redirects the output of a pidof to /dev/null to avoid spurious logging. Change-Id: Id4865ffff207edfc727d729f9cc04e6e81ad19d8
2016-09-29Merge "Move db::mysql into service_config_settings"Jenkins23-105/+111
2016-09-29j2 template per-role things in default registrySteven Hardy1-58/+20
The default resource-registry file contains a bunch of per-role things which mean you need to cut/paste into a custom environment file for custom roles, even if you only want the defaults like the built-in roles. Using j2 we can template these just like in the overcloud.j2.yaml and other files. Change-Id: I52a9bffd043ca8fb0f05077c8a401a68def82926 Partial-Bug: #1626976
2016-09-29Relax pre-upgrade check for failed actionsMichele Baldessari2-3/+5
Before this change we checked the cluster for any failed actions and we stopped the upgrade process if there were any. This is likely excessive, as a failed action could have happened in the past and the cluster is now fully functional. It is better to check whether any of the resources are in Stopped state and break the upgrade process if any of them are.

We also need to restrict this check to the bootstrap node because otherwise the following might happen:

1) Bootstrap node does the check, it is successful and it starts the full HA -> HA NG migration which *will* create failed actions and will start stopping resources
2) If the check now starts on a non-bootstrap node while 1) is ongoing, it will find either failed actions or stopped resources so it will fail.

Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f
Closes-Bug: #1628653
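The relaxed check described above could look roughly like the following bash sketch. It is not the code from the change: `no_stopped_resources` and `pre_upgrade_check` are hypothetical names, the status text is passed in as an argument rather than read from the cluster, and matching on the word "Stopped" is an assumption about how resource state is reported.

```shell
#!/bin/bash
# Hypothetical sketch: succeed only if no resource in the given
# status text is reported as Stopped.
no_stopped_resources() {
    local status_text=$1
    if grep -q 'Stopped' <<< "$status_text"; then
        return 1
    fi
    return 0
}

# Hypothetical wrapper: only the bootstrap node performs the check.
# $1 is non-empty on the bootstrap node, $2 is the status text.
pre_upgrade_check() {
    local is_bootstrap=$1 status_text=$2
    if [[ -z $is_bootstrap ]]; then
        return 0    # non-bootstrap nodes skip the check entirely
    fi
    no_stopped_resources "$status_text"
}
```

Skipping the check on non-bootstrap nodes is what avoids the false failure in scenario 2): a node that arrives late never inspects a cluster that the bootstrap node is already migrating.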
2016-09-29Fix races in major-upgrade-pacemaker Step2Michele Baldessari3-17/+41
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh has the following code:

    ...
    check_resource mongod started 600

    if [[ -n $(is_bootstrap_node) ]]; then
        ...
        tstart=$(date +%s)
        while ! clustercheck; do
            sleep 5
            tnow=$(date +%s)
            if (( tnow-tstart > galera_sync_timeout )) ; then
                echo_error "ERROR galera sync timed out"
                exit 1
            fi
        done

        # Run all the db syncs
        cinder-manage db sync
        ...
    fi

    start_or_enable_service rabbitmq
    check_resource rabbitmq started 600
    start_or_enable_service redis
    check_resource redis started 600
    start_or_enable_service openstack-cinder-volume
    check_resource openstack-cinder-volume started 600

    systemctl_swift start

    for service in $(services_to_migrate); do
        manage_systemd_service start "${service%%-clone}"
        check_resource_systemd "${service%%-clone}" started 600
    done

The problem with the above code is that it is open to the following race condition:

1) Bootstrap node is busy checking the galera status via clustercheck
2) Non-bootstrap node has already reached: start_or_enable_service rabbitmq and later lines. These lines will be skipped because start_or_enable_service is a noop on non-bootstrap nodes and check_resource rabbitmq only checks that pcs status |grep rabbitmq returns true.
3) Non-bootstrap node can then reach the manage_systemd_service start and it will fail with errors like: "Job for openstack-nova-scheduler.service failed because the control process exited with error code. See \"systemctl status openstack-nova-scheduler.service\" and \"journalctl -xe\" for details.\n" (because the db tables are not migrated yet)

This happens because 3) was started on non-bootstrap nodes before the db-sync statements are complete on the bootstrap node. I did not feel like changing the semantics of check_resource and removing the noop on non-bootstrap nodes, as other parts of the tree might rely on this behaviour.
Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9 Closes-Bug: #1627965
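The polling pattern used for clustercheck in the quoted script can be factored into a reusable helper, which is one way a non-bootstrap node could wait for a condition before starting services. The sketch below is an assumption for illustration: `wait_until` is a hypothetical name, and this is not necessarily how the commit fixed the race.

```shell
#!/bin/bash
# Hypothetical helper modelled on the clustercheck loop in the script:
# poll a command until it succeeds or a timeout (in seconds) expires.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    local tstart tnow
    tstart=$(date +%s)
    while ! "$@"; do
        tnow=$(date +%s)
        if (( tnow - tstart >= timeout )); then
            echo "ERROR: timed out waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
    done
    return 0
}
```

With such a helper, the loop in the script would read `wait_until "$galera_sync_timeout" 5 clustercheck`. Returning a status instead of calling `exit` leaves the decision to abort with the caller.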
2016-09-28Update gnocchi database during M/N upgrade.Sofer Athlan-Guyot1-2/+3
We call gnocchi-upgrade to make sure we update all the needed schemas during the major-upgrade-pacemaker step. We also make sure that redis is started before we call gnocchi-upgrade otherwise the command will be stuck in a loop trying to contact redis. Closes-Bug: #1626592 Change-Id: Ia016264b51f485b97fa150ebd357b109581342ed
2016-09-28Merge "Fix predictable placement indexing"Jenkins1-0/+14
2016-09-28Move db::mysql into service_config_settingsDan Prince23-105/+111
This patch moves the various db::mysql hiera settings into a 'mysql' specific service_config_settings section for each service so that these will only get applied on the MySQL service node. This follows a similar puppet-tripleo change where we create the actual databases for all services locally on the MySQL service node to avoid permission issues. Change-Id: Ic0692b1f7aa8409699630ef3924c4be98ca6ffb2 Closes-bug: #1620595 Depends-On: I05cc0afa9373429a3197c194c3e8f784ae96de5f Depends-On: I5e1ef2dc6de6f67d7c509e299855baec371f614d
2016-09-28Full HA->HA NG migration might fail setting maintenance-modeMichele Baldessari1-8/+4
Currently we do the following in the migration path:

    pcs property set maintenance-mode=true
    if ! timeout -k 10 300 crm_resource --wait; then
        echo_error "ERROR: cluster remained unstable after setting maintenance-mode for more than 300 seconds, exiting."
        exit 1
    fi

crm_resource --wait can actually take forever under certain conditions. The property will be set atomically across the cluster nodes so we should be good without this.

Change-Id: I8f531d63479b81d65b572c4431c9db6f610f7e04
Closes-Bug: #1628393
2016-09-28Fix "Not all flavors have been migrated to the API database"Michele Baldessari1-0/+1
After a successful upgrade to Newton, I ran the tripleo.sh --overcloud-pingtest and it failed with the following:

    resources.test_flavor: Not all flavors have been migrated to the API database (HTTP 409)

The issue is the fact that some tables have migrated to the nova_api db and we need to migrate the data as well. Currently we do:

    nova-manage db sync
    nova-manage api_db sync

We want to add:

    nova-manage db online_data_migrations

After launching this command the overcloud-pingtest works correctly:

    tripleo.sh -- Overcloud pingtest SUCCEEDED

Change-Id: Id2d5b28b5d4ade7dff6c5e760be0f509b4fe5096
Closes-Bug: #1628450
2016-09-28Merge "Deprecate the NeutronL3HA parameter"Jenkins1-7/+23