Age | Commit message (Collapse) | Author | Files | Lines |
|
In change I2aae4e2fdfec526c835f8967b54e1db3757bca17 we did the
following:
-pacemaker_status=$(systemctl is-active pacemaker || :)
+pacemaker_status=""
+if hiera -c /etc/puppet/hiera.yaml service_names | grep -q pacemaker;
then
+ pacemaker_status=$(systemctl is-active pacemaker)
+fi
we did that so due to LP#1668266: we did not want systemctl is-active to
fail on non pacemaker nodes. The problem with the above hiera check is
that it will match on pacemaker_remote nodes as well.
We cannot piggyback the pacemaker_enabled hiera key because that is true
on all nodes. So let's make the test check only for pacemaker service
without matching pacemaker remote. Tested with:
1) Test on a controller node with pacemaker service enabled
[root@overcloud-controller-0 ~]# hiera -c /etc/puppet/hiera.yaml -a service_names |grep '\bpacemaker\b'
"pacemaker",
[root@overcloud-controller-0 ~]# echo $?
0
2) Test on a compute node without pacemaker:
[root@overcloud-novacompute-0 puppet]# hiera -c /etc/puppet/hiera.yaml service_names |grep '\bpacemaker\b'
[root@overcloud-novacompute-0 puppet]# echo $?
1
3) Test on a node with pacemaker_remote in the service_names key:
[root@overcloud-novacompute-0 puppet]# hiera -c /etc/puppet/hiera.yaml service_names |grep '\bpacemaker\b'
[root@overcloud-novacompute-0 puppet]# echo $?
1
[root@overcloud-novacompute-0 puppet]# hiera -c /etc/puppet/hiera.yaml service_names |grep '\bpacemaker_remote\b'
"pacemaker_remote"]
[root@overcloud-novacompute-0 puppet]# echo $?
0
Change-Id: I54c5756ba6dea791aef89a79bc0b538ba02ae48a
Closes-Bug: #1688214
|
|
To test this change we deployed a stock master with ipv6 which created a bunch
of ipv6 with /64 netmask:
[root@overcloud-controller-0 ~]# pcs resource show ip-fd00.fd00.fd00.2000..18
Resource: ip-fd00.fd00.fd00.2000..18 (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=fd00:fd00:fd00:2000::18 cidr_netmask=64
Operations: start interval=0s timeout=20s (ip-fd00.fd00.fd00.2000..18-start-interval-0s)
stop interval=0s timeout=20s (ip-fd00.fd00.fd00.2000..18-stop-interval-0s)
monitor interval=10s timeout=20s (ip-fd00.fd00.fd00.2000..18-monitor-interval-10s)
Then we update the THT folder with this patch and upload the new scripts on the undercloud via:
openstack overcloud deploy --update-plan-only ....
Then we kick off the minor update workflow:
openstack overcloud update stack -i overcloud
Once the controller-0 node (bootstrap node for pacemaker) is completed we have the
correct VIP configuration:
[root@overcloud-controller-0 heat-config-script]# pcs resource show ip-fd00.fd00.fd00.2000..18
Resource: ip-fd00.fd00.fd00.2000..18 (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=fd00:fd00:fd00:2000::18 cidr_netmask=128 nic=vlan20 lvs_ipv6_addrlabel=true lvs_ipv6_addrlabel_value=99
Operations: start interval=0s timeout=20s (ip-fd00.fd00.fd00.2000..18-start-interval-0s)
stop interval=0s timeout=20s (ip-fd00.fd00.fd00.2000..18-stop-interval-0s)
monitor interval=10s timeout=20s (ip-fd00.fd00.fd00.2000..18-monitor-interval-10s)
Also verified that running the script a second time does not alter the
(already fixed) VIPs.
Co-Authored-By: Damien Ciabrini <dciabrin@redhat.com>
Change-Id: I765cd5c9b57134dff61f67ce726bf88af90f8090
|
|
We fixed pcs resources start/stop timeouts via
I587136d8d045d213875c657ea5a405074f80c8ad in Nov 2015.for mitaka.
And there we stated:
This can be removed once updates from deployments made prior to
I6fc18f1ad876c5a25723710a3b20d8ec9519dcba are no longer supported.
We can now safely remove these updates as they are useless and cost time
anyway.
Change-Id: Ibad2b3eed0d08560d52d5ebe700746b61e5b8f51
|
|
The current check tends to produce a false positive causing unnecessary
service restarts. yum check-update will exit with return code 100 if
updated packages are available.
Change-Id: I8bd89f2b24bafc6c991382b9eb484cfa9a2f8968
|
|
|
|
In [1] we removed the previously used special case upgrade code.
However we have since discovered that for openvswitch 2.5.0-14
the special case is still required with an extra flag to prevent
the restart. This adds the upgrade code back into the minor
update and 'manual upgrade' scripts for compute/swift. The
review at If998704b3c4199bbae8a1d068c31a71763f5c8a2 is adding
this logic for the ansible upgrade steps.
Related-Bug: 1669714
[1] https://review.openstack.org/#/q/59e5f9597eb37f69045e470eb457b878728477d7
Change-Id: I3e5899e2d831b89745b2f37e61ff69dbf83ff595
|
|
Attempt to check galera's cluster status fails when galera service
is not running on the same node.
Change-Id: I27fb0841d85cd0dc86e92ac2e21eedf5f8f863ab
|
|
The UpdateDeployment already depends on NetworkDeployment.
We should not run os-net-config unconditionally before update.
Closes-Bug: #1666227
Change-Id: I48cbf5de00d47c6fdad71ff24c00e9db05cec5d5
|
|
|
|
Removed from the tripleo_upgrade_node.sh (major upgrade) & yum_update.sh
(minor update). The workaround is no longer needed and in fact has the
opposite effect killing connectitivity to the node. The 'normal' yum
update on nodes delivers the latest openvswitch 2.6.1 with no drama.
Also adds a 'complete' message, some extra debug echo for logs
and removes the python-zaqarclient install no longer needed
Closes-Bug: 1669714
Change-Id: Icd1517bcade36781fa0da21d045ffd9ec68efc38
|
|
Package update fails on compute node, when yum_update checks for
pacemaker status via systemctl command. Because exit on error (-e)
option has been enabled recently, this issue is happening. Fixing
by, executing the command only on nodes where pacemaker is enabled.
Closes-Bug: #1668266
Change-Id: I2aae4e2fdfec526c835f8967b54e1db3757bca17
|
|
|
|
We only need to know if pacemaker service is in active state.
Change-Id: Id5e16f2bbbe51b8a0c250eb5d35e89e61a7b3383
Resolves: rhbz#1414779
Closes-Bug: #1656980
|
|
For now, don't run anything in yum_update.sh when it is run from
inside the heat-agents container. A mechanism for doing a yum update
on the host can be worked out later, but for now a yum update should
never be run inside a container.
Change-Id: I73d37578f8b2dc9c3029b968b1ef74ef4894100a
|
|
In I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae and
I11fcf688982ceda5eef7afc8904afae44300c2d9 we added a manual step
for upgrading openvswitch in order to specify the --nopostun
as discussed in the bug below.
This change adds a minor update to make this workaround more
robust. It removes any existing rpms that may be around from
an earlier run, and also checks that the rpms installed are
at least newer than the version we are on.
This also refactors the code into a common definition in the
pacemaker_common_functions.sh which is included even for the
heredocs generating upgrade scripts during init. Thanks
Sofer Athlan-Guyot and Jirka Stransky for help with that.
Change-Id: Idc863de7b5a8c116c990ee8c1472cfe377836d37
Related-Bug: 1635205
|
|
Running os-net-config before restarting the cluster prevents changes to
the interface files caused by changes to implementation from bouncing
network interfaces after the cluster has restarted.
Closes-Bug: #1644138
Change-Id: I65fb104465ff3d37ddc791634302994334136014
|
|
In I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae and
I11fcf688982ceda5eef7afc8904afae44300c2d9 we landed a workaround
for the openvswitch 2.4 to 2.5 upgrade discussed in the bug below.
Unfortunately testing has revealed a problem with the minor update
case specifically for non controllers. It seems we would exit
before the ovs workaround has had a chance to execute. This moves
the block up a few lines to avoid this condition. As with the
other two reviews noted here, this will need to go into newton
and then mitaka too.
Change-Id: If905de82d96302334ebe02de9c43f00faed9b72b
Related-Bug: 1635205
|
|
OPM package is metadata package with unversioned requirements which
means that update does not update the dependencies. This leaves us
with old puppet modules and old puppet during the puppet run.
Change-Id: I80f8a73142a09bb4178bb5a396d256ba81ba98a8
Closes-Bug: #1638266
Resolves: rhbz#1390559
|
|
rpm command will return an exit 1 if ovs package is already
there and will exit the step_1.sh script. To get around this
force the update with --replacepkgs
Also remove the \ just before the $ which cause a syntax
error for the ceph storage
Change-Id: I11fcf688982ceda5eef7afc8904afae44300c2d9
Closes-bug: 1636748
|
|
With the following two changes we increased the timeout for redis and
rabbit for both starting and stopping to 200s:
https://review.openstack.org/386618 newton (merged)
https://review.openstack.org/385555 master (merged)
We want to also fix that on minor updates on all our supported
releases upstream and downstream (newton, mitaka, liberty, kilo).
This way we can guarantee that we have a uniform timeout for
sart and stop for rabbit and redis across all our releases.
Change-Id: If59bf3386832ee78d3a654f01077aff2e8be76e8
Closes-Bug: #1634851
|
|
This adds a special case handling for the opensvswitch package
as discussed at the related bug below.
This is added/handled here for both the minor update and the
major mitaka...newton upgrade.
Change-Id: I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae
Closes-Bug: 1635205
|
|
|
|
|
|
The update and upgrade shell scripts were still referencing the
old openstack-keystone service which got removed with
Ie26908ac9bfc0b84b6b65ae3bda711236b03d9d4
Also removes kilo and liberty specific workarounds and config changes.
Change-Id: Icc80904908ee3558930d4639a21812f14b2fd12e
|
|
Ceilometer Alarm is deprecated in Liberty by Aodh.
This patch:
* manage Aodh Keystone resources
* deploy Aodh API under WSGI, Notifier, Listener and Evaluator
* manage new parameters to customize Aodh deployment
* uses ceilometer DB for the upgrade path
* pacemaker config
* Add migration logic to remove pcs resources
Depends-On: I5333faa72e52d2aa2a622ac2d4b60825aadc52b5
Depends-On: Ib6c9c4c35da3fb55e0ca8e2d5a58ebaf4204d792
Co-Authored-By: Emilien Macchi <emilien@redhat.com>
Change-Id: Ib47a22884afb032ebc1655e1a4a06bfe70249134
|
|
This just a revert to see if reverting this gets back to a normal CI run time.
This reverts commit f72aed85594f223b6f888e6d0af3c880ea581a66.
Change-Id: I04a0893f6cf69f547a4db26261005e580e1fc90b
|
|
Ceilometer Alarm is deprecated in Liberty by Aodh.
This patch:
* manage Aodh Keystone resources
* deploy Aodh API under WSGI, Notifier, Listener and Evaluator
* manage new parameters to customize Aodh deployment
* uses ceilometer DB for the upgrade path
* pacemaker config
Depends-On: I9e34485285829884d9c954b804e3bdd5d6e31635
Depends-On: I891985da9248a88c6ce2df1dd186881f582605ee
Depends-On: Ied8ba5985f43a5c5b3be5b35a091aef6ed86572f
Co-Authored-By: Pradeep Kilambi <pkilambi@redhat.com>
Change-Id: I58d419173e80d2462accf7324c987c71420fd5f6
|
|
See RHBZ 1311005 and 1247303. In short: sometimes when a controller
node gets fenced, rabbitmq is unable to rejoin the cluster. To fix this
we need two steps:
1) The fix for the RA in BZ 1247303
2) Add notify=true to the meta parameters of the rabbitmq resource on
fresh installs and updates
Note that if this change is applied on systems that do not
have the fix for the rabbitmq resource agent, no action is taken.
So when the resource agent will be updated, the notify
operation will start to work as soon as the first monitor
action will take place.
Fixes RH Bug #1311005
Change-Id: I513daf6d45e1a13d43d3c404cfd6e49d64e51d5a
|
|
The maximum payload size of the return signal from a Heat software
deployment is 1MB, and the output of yum starts breaking this limit at
~1000 packages to update - which is not an atypical number. To prevent
this, pass the -q (quiet) option to reduce the amount of output to a
manageable level.
Change-Id: I517271e8465885421a78b73c5af756816c37a977
Resolves-rhbz: #1304878
Closes-Bug: #1543034
|
|
We've seen the 360 second threshold broken and a failed update because
of that, even though Galera eventually synced fine, clusterchecks OK and
pcs status clean. This will give Galera more time to perform the sync.
Change-Id: I17207ec9b4038fb9540582c9b0b717f9b85a78b9
Closes-Bug: #1538218
|
|
With I02f7cf07792765359f19fdf357024d9e48690e42[1] in puppet-tripleo,
puppet is capable of updating all packages itself on non controller
nodes now.
This is a safer mechanism than using the exclude logic in yum_update.sh
since that can cause depdency problems across sub packages.
[1] https://review.openstack.org/#/c/261041/
Closes-Bug: 1534785
Change-Id: I9075a1bb85baa65a9d0afc5d0fd31a1f99a98819
|
|
Based on observed timeouts during updates bump the stop and start
timeouts for pacemaker service resources (via op_params) to 200.
This is based on the reasoning that the full timeout may be as
long as two elapsed timeout intervals. After an initial timeout,
the sigterm that follows is then allowed another
DefaultTimeoutStopSec seconds. The 200s is produced by allowing
this 2xDefaultTimeoutStopSec (@90s for systemd) and some
scheduling delta. Many thanks to Michele Baldessari.
Closes-Bug: 1531204
Change-Id: If6b43982c958f63bc78ad997400bf1279c23df7e
|
|
Occasionally we hit "Error: unable to push cib" during update. This is
probably due to the fact that when we try to replace cib in
yum_update.sh, services on the previous updated controller are still
coming up and changing cib, and racing/conflicting with the cib push
from yum_update.sh.
This commit adds waiting for the cluster to settle before exiting from
yum_update.sh, to avoid this kind of conflict.
Also a check for cib-push success is added, to make the update fail
properly instead of hanging indefinitely as we've observed with this
issue.
Change-Id: I953087e0e565474ac553fd57bea2459d2e3a6081
Closes-Bug: #1527644
|
|
|
|
|
|
When the Overcloud does not host an instance of haproxy, pcmk will
not have any resource named haproxy-clone so we should not add
any constraint relying on it.
Change-Id: I801f07b7570f3805aa71c22998fec6b6f192b350
|
|
|
|
We forgot to apply the mongod timeout in the cib dump first, to
apply it later in a single cib-push step.
Change-Id: Ib104e51782c6d3f646907cdb06c74fd4cbf9028c
|
|
Older neutron versions have a bug which makes them leave keepalived and
radvd running even after all neutron services are stopped, preventing
neutron router failover from happening. Router can then get stuck on the
inactive node, like this:
[stack@instack ~]$ neutron l3-agent-list-hosting-router default_router
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 48ca9477-b93b-4305-9e6d-9f1c5d3388f0 | overcloud-controller-1.localdomain | True | :-) | standby |
| eba0575c-654f-4da6-b1cd-f7fdf1cd3726 | overcloud-controller-2.localdomain | True | :-) | standby |
| 68815390-251f-4425-a5f8-38bdbf3bdb90 | overcloud-controller-0.localdomain | True | xxx | active |
+--------------------------------------+------------------------------------+----------------+-------+----------+
We need to kill the leftover processes manually to prevent the state
described above from happening.
See https://review.gerrithub.io/#/c/248931
Change-Id: I2deaa176222983daa0c33ab52a6aa5dbe7365302
|
|
The neutron pcs constraints were reworked in
https://review.openstack.org/#/c/229466/
For overclouds deployed with older tripleo-heat-templates the
current pcs ordering constraints will not have those changes,
meaning that the behaviour discussed at
https://bugs.launchpad.net/tripleo/+bug/1501378 is likely
given we will stop and restart all services. This review
applies those, in short, remove the ovs-cleanup after
neutron-server and add openvswitch-agent instead. Detail in
the bug report and linked BZ.
Change-Id: I45822c5fe9029f11635400b7fbd386880ac80a4e
Related-Bug: 1501378
|
|
To avoid pcmk reconfiguring the resources on each config change,
we want to apply the constraints and timeouts from file.
We also *do not* want to alter the timeouts for a few ocf resources
which are rabbitmq, neutron-netns-cleanup and neutron-ovs-cleanup
Change-Id: I6875f19e1f34f0fdcf0928421f49b61d857ca7c8
Co-Authored-By: Andrew Beekhof <abeekhof@redhat.com>
|
|
When the cluster is brought back online after a yum update in
yum_update.sh, we should verify that galera is fully sync'd before
moving on. This ensures the sync is complete before moving on to update
any other nodes in the cluster.
Change-Id: Ie8fc2c5d5214deacea94ca658ac75359b318ced1
|
|
This matches change I6fc18f1ad876c5a25723710a3b20d8ec9519dcba, but we
need it to set it before attempting the cluster stop - yum update -
cluster start cycle, to make sure this cycle doesn't hit the low timeout
limits.
This can be removed once updates from deployments made prior to
I6fc18f1ad876c5a25723710a3b20d8ec9519dcba are no longer supported.
Change-Id: I587136d8d045d213875c657ea5a405074f80c8ad
|
|
Some missing pacemaker constraints were added in the following commits:
https://review.openstack.org/#/c/219770/
https://review.openstack.org/#/c/219665/
https://review.openstack.org/#/c/218931/
https://review.openstack.org/#/c/218930/
Overclouds that were deployed prior to these constraints being added to
tripleo-heat-templates still have the constraints missing. During an
update, stopping and starting the cluster can fail without these
constraints in place. As a workaround, conditionally add these
contraints in yum_update.sh so that we're sure they're always present
before updating.
Change-Id: Id46c85dbbe5e85d362279661091b17ce1b697fe0
|
|
Currently package updates won't occur on a single node
non-HA pacemaker managed Controller because stopping
the node loses the quorum of 1.
This change gets the count of current nodes in the cluster and
if the count is 1 then specify --force when doing a pcs cluster stop.
Change-Id: I0de2488e24f1ef53a935dbc90ec6de6142bb4264
|
|
This change adds alternative logic for handling package updates
on a pacemaker managed node.
"yum list updates" is now run and this script exits early if
there are no packages to update.
If the pacemaker service is not running then the previous puppet
logic remains, so a package update is performed which excludes packages
managed by puppet, and a flag is set to indicate that puppet should
perform an ensure=>latest on all packages it manages.
However if the pacemaker service is running, the following occurs:
- pcs cluster stop is run for this node
- a full yum update is performed
- pcs cluster start is run for this node
- pcs status is run until the hostname for this node appears in the
Online list
This means that puppet is not involved in the package update process when
the node is managed by pacemaker.
Change-Id: I5ad118552d053dbda280978751167d9fd9da9874
|
|
This change updates yum_update.sh so that we set set a boolean
output when "managed" packages should get updated. The
output is named 'update_managed_packages' and for the
puppet implementation it is wired up so that it
directly sets tripleo::packages::enable_upgrade to
control whether packages are updated.
It also modifies yum_update.sh to build a yum update excludes list for
packages managed by puppet. The exclude lists are being
generated via puppet-tripleo as well via the new 'write_package_names'
function that is now wired into all the role manifests.
This change does not actually trigger the puppet apply. The fix for
Related-Bug: #1463092 will be used to trigger the puppet run when the
hiera changes. As a minor tweak to this logic we append the
UpdateIdentifier to the config_identifier so that we ensure
puppet gets executed on an update where other (non-related)
hiera changes also occur.
Co-Authored-By: Dan Prince <dprince@redhat.com>
Change-Id: I343c3959517eae38bbcd43648ed56f610272864d
|
|
This change adds config and deployment resources to trigger package
updates on nodes. The deployments are triggered by doing a stack-update
and setting one of the parameters to a unique value.
The intent is that rolling update will be controlled by setting
breakpoints on all of the UpdateDeployment resources inside the
role resource groups.
Change-Id: I56bbf944ecd6cbdbf116021b8a53f9f9111c134f
|