Age | Commit message (Collapse) | Author | Files | Lines |
|
If "pcs cluster stop --all" is executed on a controller that
happens to have a VIP on the internal network, pcs may use the
VIP as the source address for communication with another cluster
node. When pacemaker is stopped this VIP goes away, and pcs never
receives a response from the other node. This causes pcs to hang
indefinitely; eventually the upgrade times out and fails.
Disabling the VIPs before stopping the cluster avoids this
situation.
Change-Id: I6bc59120211af28456018640033ce3763c373bbb
Closes-Bug: 1577570
|
|
|
|
|
|
Previously ceilometer-notification, aodh-listener and sahara-engine
didn't have constraints that would anchor them under openstack-core
dummy resource. Such constraints are added now. (sahara-engine starting
after sahara-api, aodh-listener after aodh-evaluator, and
ceilometer-notification after openstack-core.) Openstack-core ->
heat-api constraint has been removed because heat-api depends on
ceilometer-notification, so there's a transitive dependency on
openstack-core already.
Change-Id: Ided7321ebbf2c3556726343b4bb466fd8759b43a
Closes-Bug: #1569444
|
|
Previously we tried to use UpdateIdentifier for two different things:
tell whether to perform package update, and also to tell whether the
top-level stack is being created or updated (which was incorrect and
resulted in bug 1567384, and an attempt to work around that bug resulted
in bug 1567385).
We cannot use Heat's "action" conditionals in some cases, because they
refer to the direct parent stack, which can yield undesirable results
when introducing new nested stacks or temporarily no-opping something
and then adding it back (in both these cases, "action" would be
considered "CREATE", even though the top-level stack is in "UPDATE").
So tripleoclient passes a new parameter StackAction to tell whether the
top-level stack is being created or updated, and we make use of
that. (It seems there's no better way of getting this info from within
the nested Heat stacks.)
Change-Id: Ie14ddbff15e7ed21aaa3fcdacf36e0040f912382
Depends-On: I9dc3b4cd8a6a71df34d8babf0e4c6505041f5311
Closes-Bug: #1567384
Related-Bug: #1567385
|
|
The update and upgrade shell scripts were still referencing the
old openstack-keystone service which got removed with
Ie26908ac9bfc0b84b6b65ae3bda711236b03d9d4
Also removes kilo and liberty specific workarounds and config changes.
Change-Id: Icc80904908ee3558930d4639a21812f14b2fd12e
|
|
Removes the old noop nested stack template for extraconfig
tasks and instead uses OS::Heat::None. This should avoid a few
extra resource checks on create and update.
Change-Id: I5a42fc78ece2553e86385236e214aa1e3c91cd85
|
|
The change at https://review.openstack.org/#/c/302352/ should stop
the if up/down scripts from making changes to resolv.conf as
discussed in that review and the related bug below. However during
upgrades, as we are moving from a version of the ifcfg-vlanXX files
that don't have the PEERDNS=no added by /#/c/302352 the if up
script will restore the /etc/resolv.conf.save to /etc/resolv.conf
and overwrite it. This removes the .save file during the upgrade
init command which gets delivered to all nodes as the first stage
of a major upgrade.
Change-Id: I91dd139f43be4912c20d8661691bee2b662964d4
Related-Bug: 1567004
|
|
While having extra customizations inside a TripleO deployed
Pacemaker environment, say you have instance HA with
pacemaker_remoted or you need to configure an external arbitrator
for something, then the status of the resources for remote nodes
is "Stopped".
This leads to failures while, for example, scaling up.
This fixes the way status is checked, filtering just local nodes.
Co-Authored-By: Giulio Fidente <gfidente@redhat.com>
Change-Id: I8dc25f5d7031c265858afd5a266fda5315ae37a0
|
|
If a certificate expires, the user will need to update it. However,
because we only restart services at the end of a stack-update the
new certificate doesn't take effect until after puppet has run.
This is a problem because puppet makes OpenStack calls, which will
fail if the certificate is expired. In that case we never get to
the service restart so the stack is wedged until the user manually
restart haproxy.
This patch addresses the problem by reloading haproxy before puppet
runs. This is done in a pre-puppet script for pacemaker after pacemaker
is maintenance mode because we need to make sure it happens after all of
the certs have been installed on the controllers, but before puppet
runs.
For non-pacemaker, haproxy is simply reloaded.
Change-Id: Id5ed05b3a20d06af8ae7a3d6f859b03399b0d77d
|
|
We'd like to let the post puppet pacemaker controller services
restart to happen for the convergence step so set the
UpdateIdentifier. However also set the PackageUpdate to noop so the
yum_update.sh doesn't happen.
Since a full haproxy restart is expected, we no longer need the
systemctl reload added at Iae3bad745ecdf952a7a0314fe1375d07eb47c454
so remove that too.
Some more context at
https://bugzilla.redhat.com/show_bug.cgi?id=1321036
Co-Authored-By: marios <marios@redhat.com>
Change-Id: I31c2d97d68c97b435f63863fae2c89f18f99681d
|
|
|
|
|
|
As discussed in the related bug below, after upgrading your
environment to latest liberty the haproxy config isn't picked
up. This adds a systemctl reload haproxy in the pacemaker
resource restart we run as part of the post-puppet-pacemaker.
Related-Bug: 1561012
Change-Id: Iae3bad745ecdf952a7a0314fe1375d07eb47c454
|
|
Ceilometer Alarm is deprecated in Liberty by Aodh.
This patch:
* manage Aodh Keystone resources
* deploy Aodh API under WSGI, Notifier, Listener and Evaluator
* manage new parameters to customize Aodh deployment
* uses ceilometer DB for the upgrade path
* pacemaker config
* Add migration logic to remove pcs resources
Depends-On: I5333faa72e52d2aa2a622ac2d4b60825aadc52b5
Depends-On: Ib6c9c4c35da3fb55e0ca8e2d5a58ebaf4204d792
Co-Authored-By: Emilien Macchi <emilien@redhat.com>
Change-Id: Ib47a22884afb032ebc1655e1a4a06bfe70249134
|
|
|
|
|
|
|
|
|
|
Yum update on cinder nodes should be quiet, as it is on controllers,
because results of these updates are sent to Heat. I mistakenly left
this out in the first patch because i used one of the standalone node
upgrade scripts as a copy/paste base for the cinder node upgrade script.
Change-Id: Id13190dc4d242317829c7994088183f52d21461d
|
|
|
|
The variables in the heredoc should be escaped because they should
evaluate only when the inner script runs, not when the outer "writer"
script runs.
Python-zaqarclient is installed for os-collect-config to work, as we do
on the other node types.
Swift-proxy is removed from list of services to stop/start, as
swift-proxy isn't supposed to run on the swift storage nodes.
Change-Id: I8426b859d11378ebdc3da94dcc090133dab0c628
|
|
During the controller upgrade in
major_upgrade_controller_pacemaker_1.sh we use systemctl to stop
all swift services and then start them again in _pacemaker_2.sh
In the case of stand-alone swift nodes the deployer may have
used the ControllerEnableSwiftStorage: false so that only the
swift-proxy service is left on controllers (wrt swift). The
systemctl_swift function used during upgrades is changed to factor
this in.
Change-Id: Ib22005123429f250324df389855d0dccd2343feb
|
|
This allows to run a command or a script snippet on all overcloud nodes
at the beginning of the upgrade. The intended use is to switch to a new
set of repositories on the overcloud. This is done differently in
different contexts (e.g. upstream vs. downstream), but generally it
should be simple enough to not warrant creation of switchable
"UpgradeInit" resource in the resource registry, and a string
command/snippet parameter should suffice.
Change-Id: I72271170d3f53a5179b3212ec9bae9a6204e29e6
|
|
This adds delivery of an upgrade script to any ceph-storage nodes
during the script delivery that comes first during the upgrade
workflow.
The controllers have the ceph-mon whilst the ceph-osds are on the
ceph-storage nodes. The ceph-mons will be updated first as part of
the heat-driven controller upgrade, and ceph-osds on ceph nodes are
upgraded with the upgrade-non-controller.sh tripleo-common script
as with compute and swift nodes.
Also slight rename for the ObjectStorageConfig/Deployment here for
consistency.
Change-Id: I12abad5548dcb019ade9273da06fe66fd97f54cc
|
|
|
|
|
|
|
|
This just a revert to see if reverting this gets back to a normal CI run time.
This reverts commit f72aed85594f223b6f888e6d0af3c880ea581a66.
Change-Id: I04a0893f6cf69f547a4db26261005e580e1fc90b
|
|
|
|
Ceilometer Alarm is deprecated in Liberty by Aodh.
This patch:
* manage Aodh Keystone resources
* deploy Aodh API under WSGI, Notifier, Listener and Evaluator
* manage new parameters to customize Aodh deployment
* uses ceilometer DB for the upgrade path
* pacemaker config
Depends-On: I9e34485285829884d9c954b804e3bdd5d6e31635
Depends-On: I891985da9248a88c6ce2df1dd186881f582605ee
Depends-On: Ied8ba5985f43a5c5b3be5b35a091aef6ed86572f
Co-Authored-By: Pradeep Kilambi <pkilambi@redhat.com>
Change-Id: I58d419173e80d2462accf7324c987c71420fd5f6
|
|
This commit introduces a bash file to be sourced into major upgrade
scripts. Into this file we can put specific pieces of migration logic in
the form of bash functions, which can then be called from the upgrade
scripts.
Change-Id: Ibf7aa84d3880e9218c488dec9d707300e1784744
|
|
This splits the upgrade script delivery out of the UpgradeWorkflow
and into a new task which delivers the upgrade script for
compute and object-storage nodes. This is intended to be the first
part of the upgrades process, since we need to upgrade swift nodes
before the controllers and then only one at a time. So this will
deliver the upgrade script which can be invoked by the operator
using the existing script in tripleo-common
'upgrade-non-controller.sh'.
This can be invoked by passing the -e
environments/major-upgrade-script-delivery.yaml (added here) to
the openstack overcloud deploy command.
Change-Id: I20a0d4978e907111404f8108c502ab53b69a3296
|
|
This introduces upgrades for Cinder block storage nodes. Currently
Cinder doesn't support upgrade level pinning and cannot safely deal with
version skew. This means that we have to upgrade Cinder storage nodes in
sync with controller nodes (after they were taken down for upgrade,
before they are brought back up) to ensure that Cinder services perform
AMQP communication only within the same major version of Cinder.
According to our current knowledge, Cinder block storage nodes are the
only node type that will have to be upgraded in sync with controllers.
Change-Id: Icec913c015eff744b0f31b513176b4b657df43af
|
|
Since swift isn't managed by pacemaker we need to manually (systemctl)
stop and start the swift services. This moves the duplicate blocks for
start/stop into a common function (we already include that
pacemaker_common_functions.sh here so may as well)
Change-Id: Ic4f23212594c1bf9edc39143bf60c7f6d648fd1d
|
|
Old overcloud images don't have python-zaqarclient installed, and new
overclouds' os-collect-config are configured with Zaqar support. This
together means that on upgrade we need to install python-zaqarclient,
otherwise os-collect-config will be restarted during yum update and
crash due to trying to import missing Python module from zaqarclient.
Change-Id: I3e875e14cb60b1b78aec0d9ddc412ccf865abd01
|
|
Quiet down yum during major upgrades to reduce the output size. This is
consistent with what was introduced into minor updates in change
I517271e8465885421a78b73c5af756816c37a977.
Change-Id: Ie6b470e383fdf42870ac6f60ca43e44b4c446ebe
|
|
|
|
|
|
As part of the major upgrade workflow non-controller nodes are to
be updated by the operator, out-of-band and only after an initial
heat stack-update that invokes the upgrade of the controller nodes.
This review adds a ComputeDeliverUpgradeConfigDeployment_Step3
SoftwareDeploymentGroup to be applied only to compute nodes, and
that depends on the controllers having been upgraded after
ControllerPacemakerUpgradeConfig_Step2.
Its purpose is to deliver but not invoke the upgrade script on
compute nodes to /root/tripleo_upgrade_node.sh .
The non-controller nodes will then be upgraded later by an
operator that will run the script provided for that purpose, like
at https://review.openstack.org/#/c/284722/1 for example.
Change-Id: Ic6115fc8cf5320abfcf500112ff563bde8b88661
|
|
This parameter can be used for pinning (and later unpinning) the Nova
Compute RPC version.
Change-Id: I2f181f3b01f0b8059566d01db0152a12bbbd1c3e
|
|
Change-Id: I7226070aa87416e79f25625647f8e3076c9e2c9a
|
|
Add Heat software deployments to be used to upgrade major versions of
OpenStack on the controller nodes. All controller services are taken
down while the upgrade is in progress.
The new updated yum repositories should be configured by another process
e.g. the deployment artifacts transfer via Swift.
Change-Id: Ia0a04e4a11d67e7a5acc53c1f8a8f01ed5ca8675
Co-Authored-By: Giulio Fidente <gfidente@redhat.com>
Co-Authored-By: Jiri Stransky <jistr@redhat.com>
|
|
See RHBZ 1311005 and 1247303. In short: sometimes when a controller
node gets fenced, rabbitmq is unable to rejoin the cluster. To fix this
we need two steps:
1) The fix for the RA in BZ 1247303
2) Add notify=true to the meta parameters of the rabbitmq resource on
fresh installs and updates
Note that if this change is applied on systems that do not
have the fix for the rabbitmq resource agent, no action is taken.
So when the resource agent will be updated, the notify
operation will start to work as soon as the first monitor
action will take place.
Fixes RH Bug #1311005
Change-Id: I513daf6d45e1a13d43d3c404cfd6e49d64e51d5a
|
|
|
|
|
|
The maximum payload size of the return signal from a Heat software
deployment is 1MB, and the output of yum starts breaking this limit at
~1000 packages to update - which is not an atypical number. To prevent
this, pass the -q (quiet) option to reduce the amount of output to a
manageable level.
Change-Id: I517271e8465885421a78b73c5af756816c37a977
Resolves-rhbz: #1304878
Closes-Bug: #1543034
|
|
We've seen the 360 second threshold broken and a failed update because
of that, even though Galera eventually synced fine, clusterchecks OK and
pcs status clean. This will give Galera more time to perform the sync.
Change-Id: I17207ec9b4038fb9540582c9b0b717f9b85a78b9
Closes-Bug: #1538218
|
|
Also split out echo_error function to DRY the error output code and
allow changing the way we report errors in a single place.
Change-Id: I448bf0eb49390f03155335736bb4ab4e979db128
Co-Authored-By: Jiri Stransky <jistr@redhat.com>
|
|
Replaces the bash loop with the timeout command in the piloted
cluster restart to minimize downtime.
Change-Id: I9067eed9626ae5aff833d7a9a9ad1e1a6c026327
Co-Authored-By: Jiri Stransky <jistr@redhat.com>
|