Based on observed timeouts during updates, bump the stop and start
timeouts for pacemaker service resources (via op_params) to 200s.
The reasoning is that the full timeout may be as long as two elapsed
timeout intervals: after an initial timeout, the SIGTERM that follows
is allowed another DefaultTimeoutStopSec seconds. The 200s allows for
2 x DefaultTimeoutStopSec (90s by default for systemd) plus some
scheduling delta. Many thanks to Michele Baldessari.
Closes-Bug: 1531204
Change-Id: If6b43982c958f63bc78ad997400bf1279c23df7e
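As an illustrative sketch only (the resource name and exact pcs invocation are assumptions, not taken from this change), the equivalent adjustment via the pcs CLI would look roughly like:
  # hedged sketch: bump the start/stop operation timeouts on a resource;
  # "openstack-nova-api" is a hypothetical resource name
  pcs resource update openstack-nova-api op start timeout=200s op stop timeout=200s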
|
|
Using crm_resource --wait, we wait for the cluster to get into
a stable state before moving on to the next step of the piloted
restart procedure.
Change-Id: I80199653024383fd07900dad0b8d23fb8afade26
Co-Authored-By: Jiri Stransky <jistr@redhat.com>
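A minimal sketch of that wait step, assuming a 10 minute cap via coreutils timeout:
  # hedged sketch: block until pacemaker reports no pending actions,
  # and fail the update if the cluster does not settle in time
  if ! timeout 600 crm_resource --wait; then
      echo "ERROR: cluster did not settle" >&2
      exit 1
  fi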
|
|
|
|
Occasionally we hit "Error: unable to push cib" during update. This is
probably because, when we try to replace the cib in yum_update.sh,
services on the previously updated controller are still coming up and
changing the cib, racing/conflicting with the cib push from
yum_update.sh.
This commit adds waiting for the cluster to settle before exiting from
yum_update.sh, to avoid this kind of conflict.
Also a check for cib-push success is added, to make the update fail
properly instead of hanging indefinitely as we've observed with this
issue.
Change-Id: I953087e0e565474ac553fd57bea2459d2e3a6081
Closes-Bug: #1527644
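A hedged sketch of the pattern described above (the CIB file path is illustrative):
  # hedged sketch: fail loudly if the cib push is rejected, then wait
  # for the cluster to settle before exiting yum_update.sh
  if ! pcs cluster cib-push /tmp/cib_dump.xml; then
      echo "ERROR: unable to push cib" >&2
      exit 1
  fi
  crm_resource --wait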
|
|
In https://review.openstack.org/#/c/248572/ yum_update.sh
sets the pcs constraints before restarting the cluster. However,
after the post-update pacemaker run, the previous constraint of
neutron-server...neutron-ovs-cleanup is re-added. Explicitly
remove it before the post-update restart of certain services.
Change-Id: I84dd650dcc66ce3f48926cf369b7d691014c2254
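Such an explicit removal could look roughly like the sketch below; the constraint id is hypothetical and would normally be looked up via pcs constraint order show --full:
  # hedged sketch: drop the re-added neutron-server -> neutron-ovs-cleanup
  # ordering constraint; the id below is illustrative only
  if pcs constraint order show --full | grep -q neutron-ovs-cleanup-clone; then
      pcs constraint remove order-neutron-server-clone-neutron-ovs-cleanup-clone-mandatory
  fi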
|
|
This enables pacemaker maintenance mode when running Puppet on stack
update. Puppet can try to restart some overcloud services, which
pacemaker tries to prevent, and this can result in a failed Puppet run.
At the end of the puppet run, certain pacemaker resources are restarted
in an additional SoftwareDeployment to make sure that any config changes
have been fully applied. This is only done on stack updates (when
UpdateIdentifier is set to something), because the assumption is that on
stack create services already come up with the correct config.
(Change I9556085424fa3008d7f596578b58e7c33a336f75 has been squashed into
this one.)
Change-Id: I4d40358c511fc1f95b78a859e943082aaea17899
Co-Authored-By: Jiri Stransky <jistr@redhat.com>
Co-Authored-By: James Slagle <jslagle@redhat.com>
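The maintenance-mode toggle itself is a single cluster property; a hedged sketch of wrapping the Puppet run (the manifest path is a placeholder):
  # hedged sketch: keep pacemaker from reacting to restarts while puppet runs
  pcs property set maintenance-mode=true
  puppet apply /etc/puppet/manifests/overcloud_controller.pp
  pcs property set maintenance-mode=false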
|
|
There are two reasons the name property should always be set for deployment
resources:
- The name often shows up in logs, files and API calls; the default
derived name is long and unhelpful
- Sorting by name determines the merge order of os-apply-config, and the
execution order of puppet/shell scripts (note this is different to
resource dependency order) so leaving the default name results in an
undetermined order which could lead to unpredictable deployment of
configs
This change simply sets the name to the resource name, but a future change
should prepend each name with a run-parts style 2 digit prefix so that the
order is explicitly stated. Documentation for extraconfig needs to clearly
state what prefix is needed to override which merge/execution order.
For existing overcloud stacks, heat currently replaces deployment resources
when the name changes, so this change depends on the Heat fix referenced
below to avoid that replacement.
Depends-On: I95037191915ccd32b2efb72203b146897a4edbc9
Change-Id: Ic4bcd56aa65b981275c3d4214588bfc4de63b3b0
|
|
|
|
|
|
When the Overcloud does not host an instance of haproxy, pcmk will
not have any resource named haproxy-clone so we should not add
any constraint relying on it.
Change-Id: I801f07b7570f3805aa71c22998fec6b6f192b350
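A hedged sketch of such a guard (the dependent keystone resource is an assumption used for illustration):
  # hedged sketch: only add constraints against haproxy-clone when it exists
  if pcs resource show haproxy-clone > /dev/null 2>&1; then
      pcs constraint order start haproxy-clone then openstack-keystone-clone
  fi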
|
|
|
|
We forgot to apply the mongod timeout to the cib dump first, so that
it could be applied later in a single cib-push step.
Change-Id: Ib104e51782c6d3f646907cdb06c74fd4cbf9028c
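A sketch of applying the timeout against the dumped CIB instead of the live cluster (the file name and timeout value are illustrative):
  # hedged sketch: edit the offline CIB copy so the change rides along
  # with the single cib-push
  pcs cluster cib /tmp/cib_dump.xml
  pcs -f /tmp/cib_dump.xml resource update mongod op start timeout=370s
  pcs cluster cib-push /tmp/cib_dump.xml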
|
|
Older neutron versions have a bug which makes them leave keepalived and
radvd running even after all neutron services are stopped, preventing
neutron router failover from happening. The router can then get stuck on
the inactive node, like this:
[stack@instack ~]$ neutron l3-agent-list-hosting-router default_router
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 48ca9477-b93b-4305-9e6d-9f1c5d3388f0 | overcloud-controller-1.localdomain | True | :-) | standby |
| eba0575c-654f-4da6-b1cd-f7fdf1cd3726 | overcloud-controller-2.localdomain | True | :-) | standby |
| 68815390-251f-4425-a5f8-38bdbf3bdb90 | overcloud-controller-0.localdomain | True | xxx | active |
+--------------------------------------+------------------------------------+----------------+-------+----------+
We need to kill the leftover processes manually to prevent the state
described above from happening.
See https://review.gerrithub.io/#/c/248931
Change-Id: I2deaa176222983daa0c33ab52a6aa5dbe7365302
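A hedged sketch of the manual cleanup (matching on process name only; the real script may match more precisely, e.g. on neutron-managed pid files):
  # hedged sketch: kill keepalived/radvd instances the old neutron left behind
  pkill keepalived || :
  pkill radvd || :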
|
|
The neutron pcs constraints were reworked in
https://review.openstack.org/#/c/229466/
For overclouds deployed with older tripleo-heat-templates the
current pcs ordering constraints will not have those changes,
meaning that the behaviour discussed at
https://bugs.launchpad.net/tripleo/+bug/1501378 is likely,
given that we will stop and restart all services. This review
applies those changes; in short, it removes the ovs-cleanup
ordering after neutron-server and adds openvswitch-agent
instead. Details are in the bug report and the linked BZ.
Change-Id: I45822c5fe9029f11635400b7fbd386880ac80a4e
Related-Bug: 1501378
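In pcs terms the rework amounts to something like the following hedged sketch (the constraint id and exact clone names are assumptions):
  # hedged sketch: swap the ovs-cleanup ordering for the openvswitch agent
  pcs constraint remove order-neutron-server-clone-neutron-ovs-cleanup-clone-mandatory
  pcs constraint order start neutron-server-clone then neutron-openvswitch-agent-clone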
|
|
To avoid pcmk reconfiguring the resources on each config change,
we want to apply the constraints and timeouts from file.
We also *do not* want to alter the timeouts for a few ocf resources,
namely rabbitmq, neutron-netns-cleanup and neutron-ovs-cleanup.
Change-Id: I6875f19e1f34f0fdcf0928421f49b61d857ca7c8
Co-Authored-By: Andrew Beekhof <abeekhof@redhat.com>
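A hedged sketch of the from-file approach with those resources skipped (RESOURCE_NAMES, the file path and the timeout value are placeholders):
  # hedged sketch: batch all timeout updates into one offline CIB edit,
  # leaving rabbitmq and the neutron cleanup agents untouched
  pcs cluster cib /tmp/cib_dump.xml
  for res in $RESOURCE_NAMES; do
      case "$res" in
          rabbitmq|neutron-netns-cleanup|neutron-ovs-cleanup) continue ;;
      esac
      pcs -f /tmp/cib_dump.xml resource update "$res" op start timeout=100s op stop timeout=100s
  done
  pcs cluster cib-push /tmp/cib_dump.xml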
|
|
When the cluster is brought back online after a yum update in
yum_update.sh, we should verify that galera is fully sync'd before
moving on. This ensures the sync is complete before moving on to update
any other nodes in the cluster.
Change-Id: Ie8fc2c5d5214deacea94ca658ac75359b318ced1
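One way to express that check, as a hedged sketch (the SQL probe is an assumption; the script may use a different mechanism such as clustercheck):
  # hedged sketch: block until this galera node reports itself as Synced
  until [ "$(mysql -Nse "SHOW STATUS LIKE 'wsrep_local_state_comment'" | awk '{print $2}')" = "Synced" ]; do
      echo "Waiting for galera to sync..."
      sleep 5
  done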
|
|
This matches change I6fc18f1ad876c5a25723710a3b20d8ec9519dcba, but we
need to set it before attempting the cluster stop - yum update -
cluster start cycle, to make sure this cycle doesn't hit the low timeout
limits.
This can be removed once updates from deployments made prior to
I6fc18f1ad876c5a25723710a3b20d8ec9519dcba are no longer supported.
Change-Id: I587136d8d045d213875c657ea5a405074f80c8ad
|
|
Some missing pacemaker constraints were added in the following commits:
https://review.openstack.org/#/c/219770/
https://review.openstack.org/#/c/219665/
https://review.openstack.org/#/c/218931/
https://review.openstack.org/#/c/218930/
Overclouds that were deployed prior to these constraints being added to
tripleo-heat-templates still have the constraints missing. During an
update, stopping and starting the cluster can fail without these
constraints in place. As a workaround, conditionally add these
constraints in yum_update.sh so that we're sure they're always present
before updating.
Change-Id: Id46c85dbbe5e85d362279661091b17ce1b697fe0
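The conditional pattern is roughly as follows (the memcached/keystone pair is illustrative, not one of the constraints from the linked reviews):
  # hedged sketch: add an ordering constraint only if it is not already present
  if ! pcs constraint order show | grep -q "start memcached-clone then start openstack-keystone-clone"; then
      pcs constraint order start memcached-clone then openstack-keystone-clone
  fi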
|
|
|
|
|
|
Currently, we have a problem because the unregistration happens in the
"post deploy" phase, which works fine when the top-level stack is being
deleted, but not when the ResourceGroup of servers is being scaled down,
because then the normal "post deploy" update ordering is respected and
we try to unregister after the corresponding server has been deleted.
So, instead, register/unregister each node inside the unit of scale,
e.g. the role template being scaled down. This is possible via the new
NodesExtraConfig interface, and it means unregistration will take
place at the right time both on stack delete and on scale-down.
Change-Id: I8f117a49fd128f268659525dd03ad46ba3daa1bc
|
|
Currently package updates won't occur on a single-node
non-HA pacemaker-managed Controller, because stopping
the node loses the quorum of 1.
This change gets the count of current nodes in the cluster and
if the count is 1 then specify --force when doing a pcs cluster stop.
Change-Id: I0de2488e24f1ef53a935dbc90ec6de6142bb4264
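A hedged sketch of that check (parsing pcs status xml this way is an assumption about its output format):
  # hedged sketch: a one-node cluster cannot keep quorum during the stop, so force it
  node_count=$(pcs status xml | grep -o 'nodes_configured number="[0-9]*"' | grep -o '[0-9]\+')
  if [ "$node_count" = "1" ]; then
      pcs cluster stop --force
  else
      pcs cluster stop
  fi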
|
|
This change adds alternative logic for handling package updates
on a pacemaker managed node.
"yum list updates" is now run and this script exits early if
there are no packages to update.
If the pacemaker service is not running then the previous puppet
logic remains, so a package update is performed which excludes packages
managed by puppet, and a flag is set to indicate that puppet should
perform an ensure=>latest on all packages it manages.
However if the pacemaker service is running, the following occurs:
- pcs cluster stop is run for this node
- a full yum update is performed
- pcs cluster start is run for this node
- pcs status is run until the hostname for this node appears in the
Online list
This means that puppet is not involved in the package update process when
the node is managed by pacemaker.
Change-Id: I5ad118552d053dbda280978751167d9fd9da9874
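A condensed, hedged sketch of that pacemaker code path; the no-op check uses yum check-update rather than parsing yum list updates, and the Online matching is an assumption about the pcs status output:
  # exit early when there is nothing to update
  # (yum check-update returns 100 when updates are available)
  yum -q check-update
  [ $? -eq 100 ] || exit 0

  if systemctl is-active pacemaker > /dev/null; then
      pcs cluster stop
      yum -y update
      pcs cluster start
      # wait for this node to appear in the Online list again
      until pcs status | grep -w Online | grep -q "$(hostname -s)"; do
          sleep 10
      done
  fi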
|
|
This change updates yum_update.sh so that we set a boolean
output when "managed" packages should get updated. The
output is named 'update_managed_packages' and for the
puppet implementation it is wired up so that it
directly sets tripleo::packages::enable_upgrade to
control whether packages are updated.
It also modifies yum_update.sh to build a yum update excludes list for
packages managed by puppet. The exclude lists are likewise generated
via puppet-tripleo, using the new 'write_package_names' function that
is now wired into all the role manifests.
This change does not actually trigger the puppet apply. The fix for
Related-Bug: #1463092 will be used to trigger the puppet run when the
hiera changes. As a minor tweak to this logic we append the
UpdateIdentifier to the config_identifier so that we ensure
puppet gets executed on an update where other (non-related)
hiera changes also occur.
Co-Authored-By: Dan Prince <dprince@redhat.com>
Change-Id: I343c3959517eae38bbcd43648ed56f610272864d
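A hedged sketch of the exclude-list handling (the package-names file path and the output-signalling convention shown here are assumptions):
  # hedged sketch: skip puppet-managed packages during the yum run and signal
  # the deployment output so puppet upgrades them itself
  yum_excludes=$(awk '{print "--exclude " $1}' /root/package_names | tr '\n' ' ')
  yum -y update $yum_excludes
  echo -n "true" > "${heat_outputs_path}.update_managed_packages"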
|
|
Adds hook to enable additional "AllNodes" config to be performed prior
to applying puppet - this is useful when you need to build
configuration data which requires knowledge of all nodes in a cluster,
or of the entire deployment.
As an example, there is a sample config template which collects the
hostname and mac addresses for all nodes in the deployment then writes
the data to all Controller nodes. Something similar to this may be
required to enable creation of the nexus_config in
https://review.openstack.org/#/c/198754/
There's also another, simpler, example which shows how you could share
the output of an OS::Heat::RandomString between nodes.
Change-Id: I8342a238f50142d8c7426f2b96f4ef1635775509
|
|
In the case of using portal registration with an
activation key, the RHEL registration script is still
executing a `subscription-manager attach` command. This
should not happen if an activation key is provided. This
is because an activation key already provides the
subscriptions to attach.
Change-Id: I2907bede28a9b7bef71cedeea69c876eb4949df0
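A hedged sketch of the intended guard (the variable name is illustrative, not necessarily what the registration script uses):
  # hedged sketch: an activation key already carries its subscriptions,
  # so only auto-attach when no key was provided
  if [ -z "$REG_ACTIVATION_KEY" ]; then
      subscription-manager attach --auto
  fi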
|
|
The recently added cinder-netapp extraconfig contains some additional
hieradata which needs to be applied during the initial pre-deployment
phase, e.g. in controller-puppet.yaml (before the manifests are applied),
so wire in a new OS::TripleO::ControllerExtraConfigPre provider resource
which allows passing in a nested stack (empty by default) containing
any required "pre deployment" extraconfig, such as applying this hieradata.
Some changes were required to the cinder-netapp extraconfig and environment
such that now the hieradata is actually applied, and the parameter_defaults
specified will be correctly mapped into the StructuredDeployment.
Change-Id: I8838a71db9447466cc84283b0b257bdb70353ffd
|
|
Currently we've got a mix of SoftwareConfig resources with
StructuredDeployments resources - while this will work, it's
inconsistent, and normally using the corresponding
SoftwareDeployments resource is encouraged instead.
Change-Id: I308d62d4ff491c073e3e8650fd4c2c65bf96d14a
|
|
|
|
This change adds config and deployment resources to trigger package
updates on nodes. The deployments are triggered by doing a stack-update
and setting one of the parameters to a unique value.
The intent is that rolling update will be controlled by setting
breakpoints on all of the UpdateDeployment resources inside the
role resource groups.
Change-Id: I56bbf944ecd6cbdbf116021b8a53f9f9111c134f
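Triggering the update then looks something like this hedged example (template and environment file names are placeholders):
  # hedged example: a stack-update with a fresh UpdateIdentifier value
  # re-triggers the UpdateDeployment resources on every node
  heat stack-update overcloud -f overcloud.yaml -e overcloud-env.yaml \
      -P "UpdateIdentifier=$(date +%s)"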
|
|
Enables support for configuring Cinder with a NetApp backend.
This change adds all relevant parameters for:
- Clustered Data ONTAP (NFS, iSCSI, FC)
- Data ONTAP 7-Mode (NFS, iSCSI, FC)
- E-Series (iSCSI)
Change-Id: If6c6e511ef2d26c4794e3b37c61e5318485ff4db
|
|
Adds a potential usage of the post-deploy hooks to register a server
with RHN or a satellite.
Note this requires some additional parameters, which can be specified in
environment_rhel_reg.yaml, and this must be passed into the call to heat
via another -e parameter. An alternative may be to have a global
extraconfig_env.yaml at the top level, which the scripts always pass, or
to use the global environment (/etc/heat/environment.d/default.yaml) on
the seed.
Co-Authored-By: James Slagle <jslagle@redhat.com>
Change-Id: Ia6fd270122cbc2e51beb672654e5e1ebd3bd2966
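For example (hedged; the stack name and template paths are placeholders):
  # hedged example: pass the registration parameters as an extra environment
  heat stack-create overcloud -f overcloud.yaml \
      -e overcloud-env.yaml -e environment_rhel_reg.yaml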
|
|
Adds optional hooks which can run operator defined additional config on
nodes after the application deployment has completed.
Change-Id: I3f99e648efad82ce2cd51e2d5168c716f0cee8fe
|