path: root/manifests/profile/base/pacemaker.pp
Age | Commit message | Author | Files | Lines
2017-05-19 | Use verify_on_create when creating pacemaker remote resources | Michele Baldessari | 1 | -0/+1
We currently create remote resources without waiting for their creation. This leads to the following potential race (spotted by Marian Mkrcmari):
- On Step1 the pacemaker bootstrap node creates the resource but the remote resource is not yet created
- Step1 completes and Step2 starts
- On Step2 the remote node sets a property (or calls pcs cib) but the remote is not yet set up, so 'pcs cluster cib' will fail there with:
  (err): Could not evaluate: backup_cib: Running: /usr/sbin/pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20170506-15994-1swnk1i failed with code: 1 ->

Note that when verify_on_create is set to true we are not using the cib dump/push mechanism. That is fine because we create the remotes on step1 and the dump/push mechanism is only needed starting from step2, when multiple nodes set cluster properties at the same time.

Tested by Marian Mkrcmari successfully as well.

Closes-Bug: #1689028
Change-Id: I764526b3f3c06591d477cc92779d83a19802368e
Depends-On: I1db31dcc92b8695ab0522bba91df729b37f34e0f
(cherry picked from commit b6d02fd5001153b53b3061d63d2cb686b0646f18)
2017-04-07 | Make the cluster-check property configurable | Michele Baldessari | 1 | -0/+25
This change makes the global cluster-recheck-interval property configurable and picks a lower default (60s) when a pacemaker remote node is deployed.

Pacemaker sets cluster-recheck-interval to 15 minutes by default. This value is too high when a pacemaker remote service is deployed: with it, a rebooted pacemaker remote node can be reported as offline by pacemaker for up to 15 minutes.

With this change we do the following:
1) Do nothing in case pacemaker remote is not deployed.
2) When pacemaker remote is deployed and the operator has not specified otherwise, we set the recheck interval to 60s.
3) When the operator specifies the recheck interval, we set that.

Change-Id: I900952b33317b7998a1f26a65f4d70c1726df19c
Closes-Bug: #1679753
(cherry picked from commit f464e9f703b824f8971ade50c32884748caffefc)
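The three cases above can be sketched in Puppet roughly as follows. This is an illustration under stated assumptions, not the code from this commit: the class name `example::recheck_interval` and both parameter names are hypothetical, `empty()` is assumed to be available (for example from puppetlabs-stdlib), and `pacemaker::property` is assumed to accept `property` and `value`.

```puppet
# Hypothetical wrapper class; all names here are illustrative only.
class example::recheck_interval (
  $remote_node_names        = [],
  $cluster_recheck_interval = undef,
) {
  if $cluster_recheck_interval != undef {
    # Case 3: the operator specified an interval, so use it as given.
    pacemaker::property { 'cluster-recheck-interval-property':
      property => 'cluster-recheck-interval',
      value    => $cluster_recheck_interval,
    }
  } elsif ! empty($remote_node_names) {
    # Case 2: remote nodes are deployed and nothing was specified,
    # so fall back to the lower 60s default.
    pacemaker::property { 'cluster-recheck-interval-property':
      property => 'cluster-recheck-interval',
      value    => '60s',
    }
  }
  # Case 1: no remote nodes and no operator value. Nothing is declared,
  # so pacemaker keeps its own default.
}
```

Checking the operator value first keeps the branches mutually exclusive, so the property is declared at most once per catalog.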
2017-01-24 | pacemaker remote profile support | Michele Baldessari | 1 | -2/+58
This support enables a base profile called pacemaker_remote which will allow the operator to automatically configure the pacemaker_remote service on such nodes. This manifest also automatically adds any pacemaker_remote nodes to the pacemaker cluster.

Depends-On: I0c01ecb7df1a0f9856fdc866b9d06acf0283fa4f
Depends-On: Ic0488f4fc63e35b9aede60fae1e2cab34b1fbdd5
Change-Id: I92953afcc7d536d387381f08164cae8b52f41605
2017-01-19 | Add retries to the ::pacemaker::stonith property | Michele Baldessari | 1 | -1/+7
Let's also set a default number of retries for the stonith property creation, just like we do for most of the composable HA resource creation.

Change-Id: Ie6e19cc838a3f45100f6c98a350bdf6a37d40590
Depends-On: I20098c5d69cde356fe79f6d8dbdc03ae42ecb3ef
2017-01-18 | Do not depend on bootstrap_nodeid for any pacemaker profile | Michele Baldessari | 1 | -1/+1
When we create a pacemaker resource it must happen from a single node. If it happens from multiple nodes an immediate error will be returned by pcs. For the pacemaker roles we enforce this by leveraging the recently introduced <SERVICE_NAME_bootstrap_short_node_name> which gives us the first hostname per-service, regardless of the role (introduced via I03e8685f939e8ae1fcd8b16883b559615042505d).

With this approach, if a pacemaker service belongs to two different roles (say role Controller on node A and role galera on node B), it will only create the resource from one of the two and not both (which would return an error).

Only setting Partial-Bug for this one, because it addresses the issue from the pacemaker resource creation POV (which is always affected). But the issue itself is a race that we're theoretically affected by since the composable roles work landed. While I have tried to fix the more general case in previous attempts, I think it is best if we start a discussion on how to fix it, because each approach has a bunch of potential drawbacks and is quite invasive on how we do things. A discussion slot for this has been proposed for the Atlanta PTG.

Change-Id: I662398cab60d523d204b57a5674ca8f5c0f2e68a
Partial-Bug: #1615983
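The single-node rule described above can be sketched as a hostname comparison. This is a minimal sketch, not the manifest's actual code: the class name `example::ha_service` and the `$bootstrap_node` parameter are hypothetical stand-ins for the per-service bootstrap hostname, and `downcase()` is assumed to be available (for example from puppetlabs-stdlib).

```puppet
# Hypothetical profile; the class and parameter names are illustrative.
class example::ha_service (
  $bootstrap_node = undef,
) {
  # Only the node whose short hostname matches the per-service bootstrap
  # node may create cluster resources; every other node skips creation.
  if $bootstrap_node != undef and $::hostname == downcase($bootstrap_node) {
    $is_bootstrap = true
  } else {
    $is_bootstrap = false
  }

  if $is_bootstrap {
    # Resource creation would go here; notice() stands in for it.
    notice("${::hostname} creates the pacemaker resource")
  }
}
```

Because the comparison is per service rather than per role, two roles that both carry the service still resolve to one creating node.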
2016-12-08 | Do not use hardcoded controller_node_names when setting up the cluster | Michele Baldessari | 1 | -1/+2
Currently we choose the pacemaker cluster nodes by simply taking hiera('controller_node_names'). We should instead use the pacemaker_short_node_names array, which is built dynamically from all the nodes that include the pacemaker service.

Change-Id: I0a3e4acaab11e078da5eeb2ef2adde5387785927
2016-10-13 | Ensure presence of pacemaker restart directory. | Sofer Athlan-Guyot | 1 | -4/+0
Currently the /var/lib/tripleo/pacemaker-restarts directory is created only when the base/pacemaker.pp file is included in the manifest. There is a notification that ensures precedence order and triggers the touch.

The trigger and the dependency on base/pacemaker.pp should not be required, as someone using tripleo::pacemaker::resource_restart_flag would expect the file to be created no matter what.

For instance, the Cinder upgrade in the convergence step has this defined:

    Cinder_config<||> ~> Tripleo::Pacemaker::Resource_restart_flag["${::cinder::params::volume_service}"]

but in the convergence step base/pacemaker.pp is not included, and the above trigger fails because the directory is not created. It looks the same for manilla.pp.

This patch removes the trigger and ensures the directory is created when needed.

Change-Id: Ic3aa82c818662e9e88e21c8381d657adef5b43ac
Closes-Bug: #1632232
2016-10-03 | Fix the timeout for pacemaker systemd resources | Michele Baldessari | 1 | -7/+0
Back in the Mitaka cycle, via the change If6b43982c958f63bc78ad997400bf1279c23df7e, we made sure that the default start and stop timeouts for pacemaker systemd resources are 200s (>= twice the default 90s DefaultTimeoutStopSec in systemd). We did this change by setting puppet resource defaults for the Pacemaker::Resource::Service class:

    Pacemaker::Resource::Service {
      op_params => 'start timeout=200s stop timeout=200s',
    }

The problem is that after the composable services rework this does not work anymore, and the pacemaker systemd resources that still exist do not have these timeouts set. We want to move away from resource defaults for this because their results depend on the inclusion order, which in tripleo is no longer guaranteed (https://docs.puppet.com/puppet/latest/reference/lang_scope.html#scope-lookup-rules).

The only services affected in Newton are: cinder-volume, cinder-backup, manila-share, haproxy. I preferred fixing all the pacemaker resources because it seems the cleanest and most logical commit.

Change-Id: If89a95706514e536a7a2949871a0002c79b6046e
Closes-Bug: #1629366
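One way to avoid depending on resource defaults is to pass the timeouts explicitly on each resource declaration. A minimal sketch, assuming `pacemaker::resource::service` takes `op_params` as in the resource default quoted above; `haproxy` is used as the title only because it is one of the affected services listed, and any other parameters a real declaration needs are omitted here.

```puppet
# Timeouts set on the resource itself, so the result no longer depends
# on which class happened to be included first.
pacemaker::resource::service { 'haproxy':
  op_params => 'start timeout=200s stop timeout=200s',
}
```

The cost is some repetition across profiles, in exchange for behaviour that stays the same regardless of inclusion order.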
2016-08-30 | Write restart flags to restart services only when necessary | Jiri Stransky | 1 | -0/+4
Write restart flag file for services managed by Pacemaker into /var/lib/tripleo/pacemaker-restarts directory. The name of the file must match the name of the clone resource defined in pacemaker. The post-puppet restart script will restart each service having a restart flag file and remove those files.

This approach focuses on $pacemaker_master only (we don't want to restart the pacemaker services 3 times when we have 3 controllers), so it relies on the assumption that we're making the matching config changes across the pacemaker nodes.

Change-Id: I6369ab0c82dbf3c8f21043f8aa9ab810744ddc12
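The flag-file mechanism can be sketched with a small defined type. This is an assumption-laden illustration rather than the code added by this commit: the define name `example::restart_flag` and the resource title `my-service-clone` are hypothetical, and the /var/lib/tripleo/pacemaker-restarts directory is assumed to exist already.

```puppet
# Hypothetical define; the title must match the name of the pacemaker
# clone resource so the post-puppet restart script can find it.
define example::restart_flag () {
  exec { "${title} resource restart flag":
    command     => "touch /var/lib/tripleo/pacemaker-restarts/${title}",
    path        => ['/bin', '/usr/bin'],
    # Run only when notified by a config change, not on every catalog run.
    refreshonly => true,
  }
}

# Declare one flag per pacemaker-managed service.
example::restart_flag { 'my-service-clone': }
```

A config resource would then notify the flag with the `~>` arrow, so the file is written only when that service's configuration actually changes.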
2016-08-08 | Fix parameters and headers inconsistency in the puppet manifests. | Carlos Camacho | 1 | -1/+0
As we are starting to manually check overcloud services, the first step is to check that the puppet profiles are all aligned.

Changes applied:
- No logic added or removed in this submission.
- Removed unused parameters.
- Aligned header comments structure.
- All profile parameters sorted following: "Mandatory params first sorted alphabetically then optional params sorted alphabetically."

Note: Following submissions will check pacemaker, cinder, mistral and redis services in the base profiles, as some of them have the $pacemaker_master parameter defaulted to true.

Change-Id: I2f91c3f6baa33f74b5625789eec83233179a9655
2016-07-27 | Remove global openstack-core resource | Giulio Fidente | 1 | -6/+0
The openstack-core resource is not needed by the NG Pacemaker architecture. It was moved into an isolated role by [1] so that it could optionally be enabled when wanting the older architecture. This submission removes the old openstack-core global resource.

1. I74a62973146c0261385ecf5fd3d06db51e079caa

Change-Id: I16a786ce167c57848551c7245f4344c382c55b3d
2016-07-22 | use parameter to lookup the step instead of hiera again | Emilien Macchi | 1 | -3/+3
In some profiles, we were looking up the $step by using Hiera again, while we already do it in the parameter definition. When using this class outside THT, that second lookup fails; with this patch, we can just set the $step parameter and the rest of the manifest will work.

Change-Id: I7082f47204fb4e529b164e4c4f1032e7bdd88f02
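The pattern described above can be sketched as follows. The class name `example::profile` is hypothetical, and the body is a placeholder; the point is only that the Hiera lookup happens once, as the parameter default, and the body reads the parameter.

```puppet
# Hypothetical profile class, for illustration only.
class example::profile (
  # Hiera is consulted only when the caller does not pass a value.
  $step = hiera('step'),
) {
  # Reuse the parameter rather than calling hiera('step') a second time,
  # so callers without that Hiera data can pass the step explicitly.
  if $step >= 2 {
    notice("running step ${step} configuration")
  }
}
```

Outside THT the class could then be declared with an explicit value, for example `class { 'example::profile': step => 2 }`, without any Hiera data for `step`.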
2016-07-15 | openstack-core resource does not have interleave=true | Michele Baldessari | 1 | -1/+1
The dummy openstack-core resource was meant to replace keystone so that restarting keystone would not restart the whole cloud. When this resource was introduced the parameter interleave=true was mistakenly left out. This causes a simple promote operation on the galera resource to restart openstack-core and its children.

Change-Id: Ic590005a9419be87e6e6ea131b0ac0630c5afc19
Closes-Bug: 1603381
2016-07-12 | Implement Pacemaker service profile | Emilien Macchi | 1 | -0/+93
Change-Id: I46215f82480854b5e04aef1ac1609dd99455181b
Closes-Bug: #1601970