Age | Commit message | Author | Files | Lines |
|
This commit introduces a bash file to be sourced into major upgrade
scripts. Into this file we can put specific pieces of migration logic in
the form of bash functions, which can then be called from the upgrade
scripts.
Change-Id: Ibf7aa84d3880e9218c488dec9d707300e1784744
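
A minimal sketch of the pattern described above, assuming a hypothetical function (migrate_rename_key) and a made-up key=value config layout; neither comes from this commit. The file only defines functions, so sourcing it has no side effects until an upgrade script calls one.

```shell
#!/bin/bash
# Hypothetical library file meant to be sourced by the upgrade scripts.
# It defines functions only; nothing runs at source time.

# Illustrative migration step: rename a key in a key=value config file
# if the old key is still present. Running it twice is harmless.
migrate_rename_key() {
    local conf_file=$1
    local old_key=$2
    local new_key=$3
    if grep -q "^${old_key}=" "$conf_file"; then
        sed -i "s/^${old_key}=/${new_key}=/" "$conf_file"
    fi
}
```

An upgrade script would source the file and then call only the functions relevant to the release being upgraded.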
|
|
This splits the upgrade script delivery out of the UpgradeWorkflow
and into a new task which delivers the upgrade script for
compute and object-storage nodes. This is intended to be the first
part of the upgrades process, since we need to upgrade swift nodes
before the controllers and then only one at a time. So this will
deliver the upgrade script which can be invoked by the operator
using the existing script in tripleo-common
'upgrade-non-controller.sh'.
This can be invoked by passing the -e
environments/major-upgrade-script-delivery.yaml (added here) to
the openstack overcloud deploy command.
Change-Id: I20a0d4978e907111404f8108c502ab53b69a3296
|
|
This introduces upgrades for Cinder block storage nodes. Currently
Cinder doesn't support upgrade level pinning and cannot safely deal with
version skew. This means that we have to upgrade Cinder storage nodes in
sync with controller nodes (after they were taken down for upgrade,
before they are brought back up) to ensure that Cinder services perform
AMQP communication only within the same major version of Cinder.
According to our current knowledge, Cinder block storage nodes are the
only node type that will have to be upgraded in sync with controllers.
Change-Id: Icec913c015eff744b0f31b513176b4b657df43af
|
|
Since swift isn't managed by pacemaker, we need to manually stop and
start the swift services with systemctl. This moves the duplicate
start/stop blocks into a common function (we already include
pacemaker_common_functions.sh here, so we may as well put it there).
Change-Id: Ic4f23212594c1bf9edc39143bf60c7f6d648fd1d
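
One possible shape for such a common function is sketched below. The function name, the openstack-swift* unit pattern and the filter on enabled units are assumptions for illustration, not the code from this commit.

```shell
# Sketch only: apply one systemctl action (stop, start, ...) to every
# enabled unit whose name matches the assumed openstack-swift* pattern.
systemctl_swift() {
    local action=$1
    local unit
    for unit in $(systemctl list-unit-files 'openstack-swift*' --no-legend \
                  | awk '$2 == "enabled" {print $1}'); do
        systemctl "$action" "$unit"
    done
}
```

Callers would then use "systemctl_swift stop" before the package update and "systemctl_swift start" afterwards, instead of repeating the loop.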
|
|
Old overcloud images don't have python-zaqarclient installed, and new
overclouds' os-collect-config is configured with Zaqar support. Together
this means that on upgrade we need to install python-zaqarclient;
otherwise os-collect-config will be restarted during yum update and
crash while trying to import the missing Python module from zaqarclient.
Change-Id: I3e875e14cb60b1b78aec0d9ddc412ccf865abd01
|
|
Quiet down yum during major upgrades to reduce the output size. This is
consistent with what was introduced into minor updates in change
I517271e8465885421a78b73c5af756816c37a977.
Change-Id: Ie6b470e383fdf42870ac6f60ca43e44b4c446ebe
|
|
Create a new SoftwareDeployment that can be used to add a swap file to
all nodes. The amount of swap and the location of the swap file can be
customized via parameter_defaults and the swap_size_megabytes/swap_path
parameters.
Change-Id: I1fb14c0fab2255410fceb26c3a7d5cfe0ba57b3b
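
What such a deployment script might do on each node can be sketched as follows. The function name and the skip-if-present behaviour are assumptions; only the two parameter names come from the commit message.

```shell
# Sketch: create and enable a swap file of the requested size.
# Arguments mirror the swap_size_megabytes and swap_path parameters.
add_swap_file() {
    local swap_size_megabytes=$1
    local swap_path=$2
    if [ -f "$swap_path" ]; then
        echo "$swap_path already exists, skipping"
        return 0
    fi
    dd if=/dev/zero of="$swap_path" bs=1M count="$swap_size_megabytes" || return 1
    chmod 0600 "$swap_path" || return 1
    mkswap "$swap_path" || return 1
    swapon "$swap_path"
}
```

Skipping an existing file keeps the step safe to re-run on later stack updates.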
|
|
Add Satellite 5 support to the RHEL registration environment and
resources. The registration script is updated to support both satellite
versions in place given the similarity of the options for both
scenarios.
The satellite version is detected based on $REG_SAT_URL, and that
determines whether subscription-manager or rhnreg_ks is used.
Change-Id: Ic261c8a16a7d6d3978f8bfc6e53f75dbe1b716db
|
|
As part of the major upgrade workflow non-controller nodes are to
be updated by the operator, out-of-band and only after an initial
heat stack-update that invokes the upgrade of the controller nodes.
This review adds a ComputeDeliverUpgradeConfigDeployment_Step3
SoftwareDeploymentGroup to be applied only to compute nodes, and
that depends on the controllers having been upgraded after
ControllerPacemakerUpgradeConfig_Step2.
Its purpose is to deliver but not invoke the upgrade script on
compute nodes to /root/tripleo_upgrade_node.sh.
The non-controller nodes will then be upgraded later by an
operator that will run the script provided for that purpose, like
at https://review.openstack.org/#/c/284722/1 for example.
Change-Id: Ic6115fc8cf5320abfcf500112ff563bde8b88661
|
|
This parameter can be used for pinning (and later unpinning) the Nova
Compute RPC version.
Change-Id: I2f181f3b01f0b8059566d01db0152a12bbbd1c3e
|
|
Change-Id: I7226070aa87416e79f25625647f8e3076c9e2c9a
|
|
Add Heat software deployments to be used to upgrade major versions of
OpenStack on the controller nodes. All controller services are taken
down while the upgrade is in progress.
The new updated yum repositories should be configured by another process
e.g. the deployment artifacts transfer via Swift.
Change-Id: Ia0a04e4a11d67e7a5acc53c1f8a8f01ed5ca8675
Co-Authored-By: Giulio Fidente <gfidente@redhat.com>
Co-Authored-By: Jiri Stransky <jistr@redhat.com>
|
|
See RHBZ 1311005 and 1247303. In short: sometimes when a controller
node gets fenced, rabbitmq is unable to rejoin the cluster. To fix this
we need two steps:
1) The fix for the RA in BZ 1247303
2) Add notify=true to the meta parameters of the rabbitmq resource on
fresh installs and updates
Note that if this change is applied on systems that do not
have the fix for the rabbitmq resource agent, no action is taken.
Once the resource agent is updated, the notify
operation starts to work as soon as the first monitor
action takes place.
Fixes RH Bug #1311005
Change-Id: I513daf6d45e1a13d43d3c404cfd6e49d64e51d5a
|
|
The maximum payload size of the return signal from a Heat software
deployment is 1MB, and the output of yum starts breaking this limit at
~1000 packages to update - which is not an atypical number. To prevent
this, pass the -q (quiet) option to reduce the amount of output to a
manageable level.
Change-Id: I517271e8465885421a78b73c5af756816c37a977
Resolves-rhbz: #1304878
Closes-Bug: #1543034
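
The change amounts to something like the wrapper below. The function name is hypothetical; -q and -y are standard yum options.

```shell
# Sketch: run the package update quietly so that the captured output
# stays well below the deployment signal size limit.
quiet_yum_update() {
    # -q: quiet output, -y: assume yes to prompts
    yum -q -y update "$@"
}
```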
|
|
We've seen the 360 second threshold broken and a failed update because
of that, even though Galera eventually synced fine, clusterchecks OK and
pcs status clean. This will give Galera more time to perform the sync.
Change-Id: I17207ec9b4038fb9540582c9b0b717f9b85a78b9
Closes-Bug: #1538218
|
|
Also split out echo_error function to DRY the error output code and
allow changing the way we report errors in a single place.
Change-Id: I448bf0eb49390f03155335736bb4ab4e979db128
Co-Authored-By: Jiri Stransky <jistr@redhat.com>
|
|
Replaces the bash loop with the timeout command in the piloted
cluster restart to minimize downtime.
Change-Id: I9067eed9626ae5aff833d7a9a9ad1e1a6c026327
Co-Authored-By: Jiri Stransky <jistr@redhat.com>
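
As a generic illustration of the technique (not the code from this commit), a hand-written counter loop can be replaced by coreutils timeout wrapped around a simple polling loop. The wait_until name is hypothetical.

```shell
# Sketch: poll a command until it succeeds, under a hard deadline.
# timeout exits with status 124 if the deadline passes first.
wait_until() {
    local deadline_secs=$1
    shift
    timeout "$deadline_secs" sh -c 'until "$@"; do sleep 1; done' sh "$@"
}
```

For example, "wait_until 60 some_check" returns as soon as some_check succeeds, so no time is wasted sleeping through a fixed interval.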
|
|
With I02f7cf07792765359f19fdf357024d9e48690e42[1] in puppet-tripleo,
puppet is now capable of updating all packages itself on non-controller
nodes.
This is a safer mechanism than using the exclude logic in yum_update.sh,
since that can cause dependency problems across sub packages.
[1] https://review.openstack.org/#/c/261041/
Closes-Bug: 1534785
Change-Id: I9075a1bb85baa65a9d0afc5d0fd31a1f99a98819
|
|
Based on observed timeouts during updates bump the stop and start
timeouts for pacemaker service resources (via op_params) to 200.
This is based on the reasoning that the full timeout may be as
long as two elapsed timeout intervals. After an initial timeout,
the sigterm that follows is then allowed another
DefaultTimeoutStopSec seconds. The 200s is produced by allowing
this 2xDefaultTimeoutStopSec (@90s for systemd) and some
scheduling delta. Many thanks to Michele Baldessari.
Closes-Bug: 1531204
Change-Id: If6b43982c958f63bc78ad997400bf1279c23df7e
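
The arithmetic behind the 200s figure, spelled out. The 20s scheduling delta is inferred here as 200 minus 2x90; the commit only says "some scheduling delta".

```shell
# Two full systemd stop timeouts plus an assumed scheduling delta.
default_timeout_stop_sec=90
scheduling_delta=20   # assumption: derived as 200 - 2 * 90
op_timeout=$((2 * default_timeout_stop_sec + scheduling_delta))
echo "$op_timeout"    # prints 200
```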
|
|
Using crm_resource --wait we wait for the cluster to get into
a stable state before moving into the next step of the piloted
restart procedure.
Change-Id: I80199653024383fd07900dad0b8d23fb8afade26
Co-Authored-By: Jiri Stransky <jistr@redhat.com>
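
A minimal sketch of such a wait step, assuming a hypothetical wrapper name and a 600s default that is not taken from the commit. crm_resource --wait blocks until the cluster has no pending actions; timeout bounds how long we are willing to wait.

```shell
# Sketch: block until pacemaker reports a stable state, or give up.
wait_for_cluster_to_settle() {
    local settle_timeout=${1:-600}   # assumed default, in seconds
    # -k 10: send SIGKILL if the command ignores SIGTERM for 10s
    if ! timeout -k 10 "$settle_timeout" crm_resource --wait; then
        echo "ERROR: cluster did not settle within ${settle_timeout}s" >&2
        return 1
    fi
}
```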
|
|
Occasionally we hit "Error: unable to push cib" during update. This is
probably due to the fact that when we try to replace cib in
yum_update.sh, services on the previous updated controller are still
coming up and changing cib, and racing/conflicting with the cib push
from yum_update.sh.
This commit adds waiting for the cluster to settle before exiting from
yum_update.sh, to avoid this kind of conflict.
Also a check for cib-push success is added, to make the update fail
properly instead of hanging indefinitely as we've observed with this
issue.
Change-Id: I953087e0e565474ac553fd57bea2459d2e3a6081
Closes-Bug: #1527644
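
The success check could look roughly like this. The function name and error text are illustrative; pcs cluster cib-push is the real subcommand.

```shell
# Sketch: push a modified CIB file and fail loudly if pcs rejects it,
# rather than carrying on with a cluster in an unknown state.
push_cib_or_fail() {
    local cib_file=$1
    if ! pcs cluster cib-push "$cib_file"; then
        echo "ERROR: failed to push CIB from $cib_file" >&2
        return 1
    fi
}
```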
|
|
In https://review.openstack.org/#/c/248572/ yum_update.sh
sets the pcs constraints before restarting the cluster. However,
after the post-update pacemaker run, the previous constraint of
neutron-server...neutron-ovs-cleanup is re-added. Explicitly
remove this before the post-update restart of certain services.
Change-Id: I84dd650dcc66ce3f48926cf369b7d691014c2254
|
|
This enables pacemaker maintenance mode when running Puppet on stack
update. Puppet can try to restart some overcloud services, which
pacemaker tries to prevent, and this can result in a failed Puppet run.
At the end of the puppet run, certain pacemaker resources are restarted
in an additional SoftwareDeployment to make sure that any config changes
have been fully applied. This is only done on stack updates (when
UpdateIdentifier is set to something), because the assumption is that on
stack create services already come up with the correct config.
(Change I9556085424fa3008d7f596578b58e7c33a336f75 has been squashed into
this one.)
Change-Id: I4d40358c511fc1f95b78a859e943082aaea17899
Co-Authored-By: Jiri Stransky <jistr@redhat.com>
Co-Authored-By: James Slagle <jslagle@redhat.com>
|
|
There are two reasons the name property should always be set for deployment
resources:
- The name often shows up in logs, files and API calls, the default
derived name is long and unhelpful
- Sorting by name determines the merge order of os-apply-config, and the
execution order of puppet/shell scripts (note this is different to
resource dependency order) so leaving the default name results in an
undetermined order which could lead to unpredictable deployment of
configs
This change simply sets the name to the resource name, but a future change
should prepend each name with a run-parts style 2 digit prefix so that the
order is explicitly stated. Documentation for extraconfig needs to clearly
state what prefix is needed to override which merge/execution order.
For existing overcloud stacks, heat currently replaces deployment resources
when the name changes, so this change
Depends-On: I95037191915ccd32b2efb72203b146897a4edbc9
Change-Id: Ic4bcd56aa65b981275c3d4214588bfc4de63b3b0
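
The effect of a run-parts style prefix on name-sorted ordering can be illustrated with a plain sort. The deployment names below are hypothetical, and a byte-wise sort is assumed as a stand-in for the real ordering logic.

```shell
# Sketch: print names in the order a byte-wise sort by name would give.
execution_order() {
    printf '%s\n' "$@" | LC_ALL=C sort
}

# Without prefixes the order follows the alphabet, not the intent:
execution_order NetworkDeployment ControllerDeployment
# With 2 digit prefixes the intended order is stated in the name:
execution_order 20_ControllerDeployment 10_NetworkDeployment
```

In the second call the network deployment sorts first because of its 10_ prefix, even though its name would otherwise sort after the controller one.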
|
|
When the Overcloud does not host an instance of haproxy, pcmk will
not have any resource named haproxy-clone so we should not add
any constraint relying on it.
Change-Id: I801f07b7570f3805aa71c22998fec6b6f192b350
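
A guard of this kind might be written as below. The function name and the openstack-keystone-clone target are assumptions chosen for illustration; the commit does not say which constraints are affected.

```shell
# Sketch: only add a constraint that references haproxy-clone when
# pacemaker actually manages a resource with that name.
add_haproxy_constraint_if_present() {
    if pcs status | grep -q haproxy-clone; then
        pcs constraint order start haproxy-clone then openstack-keystone-clone
    fi
}
```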
|
|
We forgot to first apply the mongod timeout to the cib dump, so that it
is applied later in a single cib-push step.
Change-Id: Ib104e51782c6d3f646907cdb06c74fd4cbf9028c
|
|
Older neutron versions have a bug which makes them leave keepalived and
radvd running even after all neutron services are stopped, preventing
neutron router failover from happening. Router can then get stuck on the
inactive node, like this:
[stack@instack ~]$ neutron l3-agent-list-hosting-router default_router
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 48ca9477-b93b-4305-9e6d-9f1c5d3388f0 | overcloud-controller-1.localdomain | True | :-) | standby |
| eba0575c-654f-4da6-b1cd-f7fdf1cd3726 | overcloud-controller-2.localdomain | True | :-) | standby |
| 68815390-251f-4425-a5f8-38bdbf3bdb90 | overcloud-controller-0.localdomain | True | xxx | active |
+--------------------------------------+------------------------------------+----------------+-------+----------+
We need to kill the leftover processes manually to prevent the state
described above from happening.
See https://review.gerrithub.io/#/c/248931
Change-Id: I2deaa176222983daa0c33ab52a6aa5dbe7365302
|
|
The neutron pcs constraints were reworked in
https://review.openstack.org/#/c/229466/
For overclouds deployed with older tripleo-heat-templates the
current pcs ordering constraints will not have those changes,
meaning that the behaviour discussed at
https://bugs.launchpad.net/tripleo/+bug/1501378 is likely
given we will stop and restart all services. This review
applies those changes: in short, it removes the ovs-cleanup constraint
after neutron-server and adds openvswitch-agent instead. Details are in
the bug report and linked BZ.
Change-Id: I45822c5fe9029f11635400b7fbd386880ac80a4e
Related-Bug: 1501378
|
|
To avoid pcmk reconfiguring the resources on each config change,
we want to apply the constraints and timeouts from a file.
We also *do not* want to alter the timeouts for a few ocf resources:
rabbitmq, neutron-netns-cleanup and neutron-ovs-cleanup.
Change-Id: I6875f19e1f34f0fdcf0928421f49b61d857ca7c8
Co-Authored-By: Andrew Beekhof <abeekhof@redhat.com>
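
The dump, edit offline, push once workflow can be sketched with standard pcs subcommands. The function name, the resource names and the 200s value are placeholders, not the ones used in this change.

```shell
# Sketch: batch several cluster changes into a single CIB push so that
# pacemaker reacts once instead of after every individual change.
apply_changes_via_cib_file() {
    local cib_file
    cib_file=$(mktemp) || return 1
    # Dump the live CIB to a file.
    pcs cluster cib "$cib_file" || return 1
    # With -f, pcs edits the file only; the running cluster is untouched.
    pcs -f "$cib_file" constraint order start neutron-server-clone \
        then neutron-openvswitch-agent-clone || return 1
    pcs -f "$cib_file" resource update mongod op start timeout=200s || return 1
    # Push all queued changes back in one step.
    pcs cluster cib-push "$cib_file"
}
```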
|
|
When the cluster is brought back online after a yum update in
yum_update.sh, we should verify that galera is fully sync'd before
moving on. This ensures the sync is complete before moving on to update
any other nodes in the cluster.
Change-Id: Ie8fc2c5d5214deacea94ca658ac75359b318ced1
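
A minimal sketch of such a check, assuming clustercheck is available as the health probe and using hypothetical function and parameter names. The check command is a parameter so the loop can be exercised without a Galera node.

```shell
# Sketch: poll until the Galera health check passes or a deadline hits.
wait_for_galera_sync() {
    local timeout_secs=$1
    local check_cmd=${2:-clustercheck}   # assumed probe command
    local interval=${3:-5}
    local start now
    start=$(date +%s)
    until $check_cmd >/dev/null 2>&1; do
        now=$(date +%s)
        if [ $((now - start)) -ge "$timeout_secs" ]; then
            echo "ERROR: galera sync timed out after ${timeout_secs}s" >&2
            return 1
        fi
        sleep "$interval"
    done
}
```

Returning non-zero on timeout lets the caller abort the update before moving on to the next node.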
|
|
This matches change I6fc18f1ad876c5a25723710a3b20d8ec9519dcba, but we
need to set it before attempting the cluster stop - yum update -
cluster start cycle, to make sure this cycle doesn't hit the low timeout
limits.
This can be removed once updates from deployments made prior to
I6fc18f1ad876c5a25723710a3b20d8ec9519dcba are no longer supported.
Change-Id: I587136d8d045d213875c657ea5a405074f80c8ad
|
|
Some missing pacemaker constraints were added in the following commits:
https://review.openstack.org/#/c/219770/
https://review.openstack.org/#/c/219665/
https://review.openstack.org/#/c/218931/
https://review.openstack.org/#/c/218930/
Overclouds that were deployed prior to these constraints being added to
tripleo-heat-templates still have the constraints missing. During an
update, stopping and starting the cluster can fail without these
constraints in place. As a workaround, conditionally add these
constraints in yum_update.sh so that we're sure they're always present
before updating.
Change-Id: Id46c85dbbe5e85d362279661091b17ce1b697fe0
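
Conditionally adding a constraint can be sketched as below. The helper name is hypothetical, and the grep pattern assumes that pcs constraint order show prints lines of the form "start A then start B".

```shell
# Sketch: add an ordering constraint only if it is not already listed,
# so the step is safe on both old and new deployments.
ensure_order_constraint() {
    local first=$1
    local second=$2
    if ! pcs constraint order show | grep -q "start $first then start $second"; then
        pcs constraint order start "$first" then "$second"
    fi
}
```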
|
|
Currently, we have a problem because the unregistration happens in the
"post deploy" phase, which works fine when the top-level stack is being
deleted, but not when the ResourceGroup of servers is being scaled down,
because then the normal "post deploy" update ordering is respected and
we try to unregister after the corresponding server has been deleted.
So, instead, register/unregister each node inside the unit of scale,
e.g. the role template being scaled down, which is possible via the new
NodesExtraConfig interface, which means unregistration will take
place at the right time both on stack delete and on scale-down.
Change-Id: I8f117a49fd128f268659525dd03ad46ba3daa1bc
|