|
We currently create remote resources without waiting for them to
actually be created.
This leads to the following potential race (spotted by Marian Krcmarik):
- On step 1 the pacemaker bootstrap node creates the resource, but the
remote resource is not yet created
- Step 1 completes and step 2 starts
- On step 2 the remote node sets a property (or calls pcs cib) but the
remote is not yet set up, so 'pcs cluster cib' will fail there with:
(err): Could not evaluate: backup_cib: Running: /usr/sbin/pcs cluster
cib /var/lib/pacemaker/cib/puppet-cib-backup20170506-15994-1swnk1i failed
with code: 1 ->
Note that when verify_on_create is set to true we do not use the cib
dump/push mechanism. That is fine because we create the remotes at
step 1, and the dump/push mechanism is only needed from step 2 onwards,
when multiple nodes set cluster properties at the same time.
Tested successfully by Marian Krcmarik as well.
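A minimal sketch of the intended usage, assuming the puppet-pacemaker
remote define exposes a verify_on_create flag (the define name, the
parameter name and $remote_short_node_name are illustrative, not the
exact interface):
  # Create the remote at step 1 and poll until pacemaker confirms the
  # resource exists, instead of relying on the cib dump/push mechanism.
  pacemaker::resource::remote { $remote_short_node_name:
    verify_on_create => true,
  }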
Closes-Bug: #1689028
Change-Id: I764526b3f3c06591d477cc92779d83a19802368e
Depends-On: I1db31dcc92b8695ab0522bba91df729b37f34e0f
(cherry picked from commit b6d02fd5001153b53b3061d63d2cb686b0646f18)
|
|
This change makes the global cluster-recheck-interval property
configurable and picks a lower default (60s) when a pacemaker remote
node is deployed.
Pacemaker defaults cluster-recheck-interval to 15 minutes. This value
is too high when a pacemaker remote service is deployed: with that
default, a rebooted pacemaker remote node can be reported as offline
by pacemaker for up to 15 minutes.
With this change we do the following:
1) Do nothing when pacemaker remote is not deployed
2) When pacemaker remote is deployed and the operator has not
specified otherwise, set the recheck interval to 60s
3) When the operator specifies a recheck interval, set that value
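Roughly, the resulting logic looks like this sketch (the
$pacemaker_remote_nodes and $cluster_recheck_interval variables are
illustrative, and the pacemaker::property interface is assumed; pick()
and empty() come from puppetlabs-stdlib):
  if $pacemaker_remote_nodes and !empty($pacemaker_remote_nodes) {
    # Operator override wins, otherwise fall back to the lower 60s default.
    $recheck = pick($cluster_recheck_interval, '60s')
    pacemaker::property { 'cluster-recheck-interval':
      property => 'cluster-recheck-interval',
      value    => $recheck,
    }
  }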
Change-Id: I900952b33317b7998a1f26a65f4d70c1726df19c
Closes-Bug: #1679753
(cherry picked from commit f464e9f703b824f8971ade50c32884748caffefc)
|
|
This change adds a base profile called pacemaker_remote which allows
the operator to automatically configure the pacemaker_remote service
on pacemaker remote nodes. The manifest also automatically adds any
pacemaker_remote nodes to the pacemaker cluster.
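A hedged sketch of how such a node might include the profile (the
remote_authkey parameter name and hiera key are assumptions):
  # On a pacemaker remote node: run pacemaker_remoted with the shared authkey.
  class { 'tripleo::profile::base::pacemaker_remote':
    remote_authkey => hiera('pacemaker_remote_authkey'),
  }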
Depends-On: I0c01ecb7df1a0f9856fdc866b9d06acf0283fa4f
Depends-On: Ic0488f4fc63e35b9aede60fae1e2cab34b1fbdd5
Change-Id: I92953afcc7d536d387381f08164cae8b52f41605
|
|
Let's set a default number of retries for the stonith property
creation as well, just like we do for most of the composable HA
resource creation.
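In practice this means passing retry defaults when creating the stonith
property, along these lines (the pacemaker::property interface and the
retry values are assumptions used for illustration):
  pacemaker::property { 'stonith-enabled':
    property  => 'stonith-enabled',
    value     => false,
    tries     => 10,   # illustrative defaults, mirroring the composable HA resources
    try_sleep => 3,
  }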
Change-Id: Ie6e19cc838a3f45100f6c98a350bdf6a37d40590
Depends-On: I20098c5d69cde356fe79f6d8dbdc03ae42ecb3ef
|
|
When we create a pacemaker resource it must happen from a single node.
If it happens from multiple nodes an immediate error will be returned by
pcs.
For the pacemaker roles we enforce this by leveraging the recently
introduced <SERVICE_NAME_bootstrap_short_node_name> which gives us
the first hostname per-service, regardless of the role.
(introduced via I03e8685f939e8ae1fcd8b16883b559615042505d)
With this approach, if a pacemaker service belongs to two different
roles (say role Controller on node A and role galera on node B), the
resource will be created from only one of the two nodes and not both
(which would return an error).
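The per-service guard looks roughly like this (haproxy is just an
example of the <SERVICE_NAME_bootstrap_short_node_name> pattern):
  # Only the per-service bootstrap node creates the pacemaker resource.
  if $::hostname == downcase(hiera('haproxy_bootstrap_short_node_name')) {
    $pacemaker_master = true
  } else {
    $pacemaker_master = false
  }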
This is only marked Partial-Bug, because it addresses the issue from
the pacemaker resource creation POV (which is always affected). But the
issue itself is a race that we have theoretically been affected by since
the composable roles work landed. While I have tried to fix the more
general case in previous attempts, I think it is best if we start a
discussion on how to fix it, because each approach has a bunch of
potential drawbacks and is quite invasive in how we do things. A
discussion slot for this has been proposed for the Atlanta PTG.
Change-Id: I662398cab60d523d204b57a5674ca8f5c0f2e68a
Partial-Bug: #1615983
|
|
Currently we choose the pacemaker cluster nodes by simply taking
hiera('controller_node_names'). We should instead use the
pacemaker_short_node_names array, which is built dynamically
from all the nodes that include the pacemaker service.
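The change boils down to something like the following (the join
separator and the corosync class interface are assumptions for
illustration):
  # Before: $pacemaker_cluster_members = hiera('controller_node_names')
  $pacemaker_cluster_members = join(hiera('pacemaker_short_node_names'), ' ')
  class { '::pacemaker::corosync':
    cluster_members => $pacemaker_cluster_members,
  }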
Change-Id: I0a3e4acaab11e078da5eeb2ef2adde5387785927
|
|
Currently the /var/lib/tripleo/pacemaker-restarts directory is created
only when the base/pacemaker.pp file is included in the manifest. There
is a notification that ensures precedence order and triggers the touch.
The trigger and the dependency on base/pacemaker.pp should not be
required, as someone using tripleo::pacemaker::resource_restart_flag
would expect the file to be created no matter what.
For instance, the Cinder upgrade in the convergence step has this
defined:
Cinder_config<||> ~> Tripleo::Pacemaker::Resource_restart_flag["${::cinder::params::volume_service}"]
but in the convergence step base/pacemaker.pp is not included, and the
above trigger fails because the directory is not created.
It looks the same for manila.pp.
This patch removes the trigger and ensures the directory is created when
needed.
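One way to make the defined type self-contained is to ensure the
directory inside it, e.g. with stdlib's ensure_resource so that
multiple flag instances do not collide (a sketch, not necessarily the
exact code of this patch):
  define tripleo::pacemaker::resource_restart_flag () {
    ensure_resource('file', '/var/lib/tripleo/pacemaker-restarts', {
      'ensure' => 'directory',
    })
    exec { "${title} resource restart flag":
      command     => "touch /var/lib/tripleo/pacemaker-restarts/${title}",
      path        => ['/bin', '/usr/bin'],
      refreshonly => true,
      require     => File['/var/lib/tripleo/pacemaker-restarts'],
    }
  }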
Change-Id: Ic3aa82c818662e9e88e21c8381d657adef5b43ac
Closes-Bug: #1632232
|
|
Back in the Mitaka cycle, via change
If6b43982c958f63bc78ad997400bf1279c23df7e, we made sure that the default
start and stop timeouts for pacemaker systemd resources are 200s
(>= twice the default 90s DefaultTimeoutStopSec in systemd). We did this
by setting puppet resource defaults for Pacemaker::Resource::Service:
  Pacemaker::Resource::Service {
    op_params => 'start timeout=200s stop timeout=200s',
  }
The problem is that after the composable services rework this no longer
works, and the pacemaker systemd resources that still exist do not have
these timeouts set.
We want to move away from resource defaults for this because their
effect depends on inclusion order, which in tripleo is no longer
guaranteed
(https://docs.puppet.com/puppet/latest/reference/lang_scope.html#scope-lookup-rules).
The only services affected in Newton are cinder-volume, cinder-backup,
manila-share and haproxy. I preferred fixing all the pacemaker
resources because it seems the cleanest and most logical commit.
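Concretely, the timeouts move from a class-wide resource default onto
each resource declaration, e.g. for haproxy (the service name variable
is illustrative):
  pacemaker::resource::service { $::haproxy::params::service_name:
    op_params => 'start timeout=200s stop timeout=200s',
  }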
Change-Id: If89a95706514e536a7a2949871a0002c79b6046e
Closes-Bug: #1629366
|
|
Write restart flag file for services managed by Pacemaker into
/var/lib/tripleo/pacemaker-restarts directory. The name of the file must
match the name of the clone resource defined in pacemaker. The
post-puppet restart script will restart each service that has a restart
flag file and then remove those files.
This approach focuses on $pacemaker_master only (we don't want to
restart the pacemaker services 3 times when we have 3 controllers), so
it relies on the assumption that we're making the matching config
changes across the pacemaker nodes.
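A service profile then only needs to declare its flag and notify it
from the master node, roughly (mirroring the Cinder example quoted
earlier in this log):
  if $pacemaker_master {
    tripleo::pacemaker::resource_restart_flag { "${::cinder::params::volume_service}": }
    Cinder_config<||> ~> Tripleo::Pacemaker::Resource_restart_flag["${::cinder::params::volume_service}"]
  }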
Change-Id: I6369ab0c82dbf3c8f21043f8aa9ab810744ddc12
|
|
As we are starting to manually check the overcloud services,
the first step is to check that the puppet profiles
are all aligned.
Changes applied:
No logic added or removed in this submission.
Removed unused parameters.
Aligned the header comments structure.
All profile parameters sorted following:
"Mandatory params first, sorted alphabetically,
then optional params, sorted alphabetically."
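For illustration, the convention applied to a hypothetical profile:
  class tripleo::profile::base::example (
    # Mandatory params, sorted alphabetically
    $bind_address,
    $servers,
    # Optional params, sorted alphabetically
    $enable_ssl = false,
    $step       = hiera('step'),
  ) {
    # ...
  }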
Note: Following submissions will check the pacemaker,
cinder, mistral and redis services in the base profiles,
as some of them have the $pacemaker_master parameter
defaulted to true.
Change-Id: I2f91c3f6baa33f74b5625789eec83233179a9655
|
|
The openstack-core resource is not needed by the NG Pacemaker
architecture. It was moved into an isolated role by [1] so that
it could optionally be enabled when the older architecture is wanted.
This submission removes the old openstack-core global resource.
1. I74a62973146c0261385ecf5fd3d06db51e079caa
Change-Id: I16a786ce167c57848551c7245f4344c382c55b3d
|
|
In some profiles we were looking up the $step value with Hiera again,
even though we already do it in the parameter definition.
When using such a class outside THT it would fail; with this patch we
can just set the $step parameter and the rest of the manifest will
work.
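The pattern being fixed, sketched on a hypothetical profile:
  class tripleo::profile::base::example (
    $step = hiera('step'),
  ) {
    # Before: if hiera('step') >= 4 { ... }  (a second Hiera lookup)
    # After: use the parameter, so a caller outside THT can simply pass $step.
    if $step >= 4 {
      # ...
    }
  }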
Change-Id: I7082f47204fb4e529b164e4c4f1032e7bdd88f02
|
|
The dummy openstack-core resource was meant to replace keystone so that
restarting keystone would not restart the whole cloud. When this
resource was introduced, the parameter interleave=true was mistakenly
left out.
This causes a simple promote operation on the galera resource to restart
openstack-core and its children.
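The fix boils down to adding interleave=true to the clone of the dummy
resource, roughly (the exact define parameters are assumptions):
  pacemaker::resource::ocf { 'openstack-core':
    ocf_agent_name => 'heartbeat:Dummy',
    clone_params   => 'interleave=true',
  }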
Change-Id: Ic590005a9419be87e6e6ea131b0ac0630c5afc19
Closes-Bug: 1603381
|
|
Change-Id: I46215f82480854b5e04aef1ac1609dd99455181b
Closes-Bug: #1601970
|