Age | Commit message (Collapse) | Author | Files | Lines |
|
See RHBZ 1311005 and 1247303. In short: sometimes when a controller
node gets fenced, rabbitmq is unable to rejoin the cluster. To fix this
we need two steps:
1) The fix for the RA in BZ 1247303
2) Add notify=true to the meta parameters of the rabbitmq resource on
fresh installs and updates
Note that if this change is applied on systems that do not
have the fix for the rabbitmq resource agent, no action is taken.
So when the resource agent will be updated, the notify
operation will start to work as soon as the first monitor
action will take place.
Fixes RH Bug #1311005
Change-Id: I513daf6d45e1a13d43d3c404cfd6e49d64e51d5a
|
|
The maximum payload size of the return signal from a Heat software
deployment is 1MB, and the output of yum starts breaking this limit at
~1000 packages to update - which is not an atypical number. To prevent
this, pass the -q (quiet) option to reduce the amount of output to a
manageable level.
Change-Id: I517271e8465885421a78b73c5af756816c37a977
Resolves-rhbz: #1304878
Closes-Bug: #1543034
|
|
With I02f7cf07792765359f19fdf357024d9e48690e42[1] in puppet-tripleo,
puppet is capable of updating all packages itself on non controller
nodes now.
This is a safer mechanism than using the exclude logic in yum_update.sh
since that can cause depdency problems across sub packages.
[1] https://review.openstack.org/#/c/261041/
Closes-Bug: 1534785
Change-Id: I9075a1bb85baa65a9d0afc5d0fd31a1f99a98819
|
|
Based on observed timeouts during updates bump the stop and start
timeouts for pacemaker service resources (via op_params) to 200.
This is based on the reasoning that the full timeout may be as
long as two elapsed timeout intervals. After an initial timeout,
the sigterm that follows is then allowed another
DefaultTimeoutStopSec seconds. The 200s is produced by allowing
this 2xDefaultTimeoutStopSec (@90s for systemd) and some
scheduling delta. Many thanks to Michele Baldessari.
Closes-Bug: 1531204
Change-Id: If6b43982c958f63bc78ad997400bf1279c23df7e
|
|
Occasionally we hit "Error: unable to push cib" during update. This is
probably due to the fact that when we try to replace cib in
yum_update.sh, services on the previous updated controller are still
coming up and changing cib, and racing/conflicting with the cib push
from yum_update.sh.
This commit adds waiting for the cluster to settle before exiting from
yum_update.sh, to avoid this kind of conflict.
Also a check for cib-push success is added, to make the update fail
properly instead of hanging indefinitely as we've observed with this
issue.
Change-Id: I953087e0e565474ac553fd57bea2459d2e3a6081
Closes-Bug: #1527644
|
|
|
|
|
|
When the Overcloud does not host an instance of haproxy, pcmk will
not have any resource named haproxy-clone so we should not add
any constraint relying on it.
Change-Id: I801f07b7570f3805aa71c22998fec6b6f192b350
|
|
|
|
We forgot to apply the mongod timeout in the cib dump first, to
apply it later in a single cib-push step.
Change-Id: Ib104e51782c6d3f646907cdb06c74fd4cbf9028c
|
|
Older neutron versions have a bug which makes them leave keepalived and
radvd running even after all neutron services are stopped, preventing
neutron router failover from happening. Router can then get stuck on the
inactive node, like this:
[stack@instack ~]$ neutron l3-agent-list-hosting-router default_router
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 48ca9477-b93b-4305-9e6d-9f1c5d3388f0 | overcloud-controller-1.localdomain | True | :-) | standby |
| eba0575c-654f-4da6-b1cd-f7fdf1cd3726 | overcloud-controller-2.localdomain | True | :-) | standby |
| 68815390-251f-4425-a5f8-38bdbf3bdb90 | overcloud-controller-0.localdomain | True | xxx | active |
+--------------------------------------+------------------------------------+----------------+-------+----------+
We need to kill the leftover processes manually to prevent the state
described above from happening.
See https://review.gerrithub.io/#/c/248931
Change-Id: I2deaa176222983daa0c33ab52a6aa5dbe7365302
|
|
The neutron pcs constraints were reworked in
https://review.openstack.org/#/c/229466/
For overclouds deployed with older tripleo-heat-templates the
current pcs ordering constraints will not have those changes,
meaning that the behaviour discussed at
https://bugs.launchpad.net/tripleo/+bug/1501378 is likely
given we will stop and restart all services. This review
applies those, in short, remove the ovs-cleanup after
neutron-server and add openvswitch-agent instead. Detail in
the bug report and linked BZ.
Change-Id: I45822c5fe9029f11635400b7fbd386880ac80a4e
Related-Bug: 1501378
|
|
To avoid pcmk reconfiguring the resources on each config change,
we want to apply the constraints and timeouts from file.
We also *do not* want to alter the timeouts for a few ocf resources
which are rabbitmq, neutron-netns-cleanup and neutron-ovs-cleanup
Change-Id: I6875f19e1f34f0fdcf0928421f49b61d857ca7c8
Co-Authored-By: Andrew Beekhof <abeekhof@redhat.com>
|
|
When the cluster is brought back online after a yum update in
yum_update.sh, we should verify that galera is fully sync'd before
moving on. This ensures the sync is complete before moving on to update
any other nodes in the cluster.
Change-Id: Ie8fc2c5d5214deacea94ca658ac75359b318ced1
|
|
This matches change I6fc18f1ad876c5a25723710a3b20d8ec9519dcba, but we
need it to set it before attempting the cluster stop - yum update -
cluster start cycle, to make sure this cycle doesn't hit the low timeout
limits.
This can be removed once updates from deployments made prior to
I6fc18f1ad876c5a25723710a3b20d8ec9519dcba are no longer supported.
Change-Id: I587136d8d045d213875c657ea5a405074f80c8ad
|
|
Some missing pacemaker constraints were added in the following commits:
https://review.openstack.org/#/c/219770/
https://review.openstack.org/#/c/219665/
https://review.openstack.org/#/c/218931/
https://review.openstack.org/#/c/218930/
Overclouds that were deployed prior to these constraints being added to
tripleo-heat-templates still have the constraints missing. During an
update, stopping and starting the cluster can fail without these
constraints in place. As a workaround, conditionally add these
contraints in yum_update.sh so that we're sure they're always present
before updating.
Change-Id: Id46c85dbbe5e85d362279661091b17ce1b697fe0
|
|
Currently package updates won't occur on a single node
non-HA pacemaker managed Controller because stopping
the node loses the quorum of 1.
This change gets the count of current nodes in the cluster and
if the count is 1 then specify --force when doing a pcs cluster stop.
Change-Id: I0de2488e24f1ef53a935dbc90ec6de6142bb4264
|
|
This change adds alternative logic for handling package updates
on a pacemaker managed node.
"yum list updates" is now run and this script exits early if
there are no packages to update.
If the pacemaker service is not running then the previous puppet
logic remains, so a package update is performed which excludes packages
managed by puppet, and a flag is set to indicate that puppet should
perform an ensure=>latest on all packages it manages.
However if the pacemaker service is running, the following occurs:
- pcs cluster stop is run for this node
- a full yum update is performed
- pcs cluster start is run for this node
- pcs status is run until the hostname for this node appears in the
Online list
This means that puppet is not involved in the package update process when
the node is managed by pacemaker.
Change-Id: I5ad118552d053dbda280978751167d9fd9da9874
|
|
This change updates yum_update.sh so that we set set a boolean
output when "managed" packages should get updated. The
output is named 'update_managed_packages' and for the
puppet implementation it is wired up so that it
directly sets tripleo::packages::enable_upgrade to
control whether packages are updated.
It also modifies yum_update.sh to build a yum update excludes list for
packages managed by puppet. The exclude lists are being
generated via puppet-tripleo as well via the new 'write_package_names'
function that is now wired into all the role manifests.
This change does not actually trigger the puppet apply. The fix for
Related-Bug: #1463092 will be used to trigger the puppet run when the
hiera changes. As a minor tweak to this logic we append the
UpdateIdentifier to the config_identifier so that we ensure
puppet gets executed on an update where other (non-related)
hiera changes also occur.
Co-Authored-By: Dan Prince <dprince@redhat.com>
Change-Id: I343c3959517eae38bbcd43648ed56f610272864d
|
|
This change adds config and deployment resources to trigger package
updates on nodes. The deployments are triggered by doing a stack-update
and setting one of the parameters to a unique value.
The intent is that rolling update will be controlled by setting
breakpoints on all of the UpdateDeployment resources inside the
role resource groups.
Change-Id: I56bbf944ecd6cbdbf116021b8a53f9f9111c134f
|