Age | Commit message (Collapse) | Author | Files | Lines |
|
This issue was spotted during major upgrade where we had calls like
this:
servers: {get_param: servers, Controller}
These get_param calls are hanging indefinitely and make the whole
upgrade end in a timeout. We need to put brackets around the get_param
function when there are multiple arguments:
http://docs.openstack.org/developer/heat/template_guide/hot_spec.html#get-param
This is already done in most of the tree, and the few places where this
was not happening were parts not under CI. After this change the
following grep returns only one false positive:
grep -ir get_param: |grep -v -- '\[' |grep ','
Change-Id: I65b23bb44f37b93e017dd15a5212939ffac76614
Closes-Bug: #1626628
|
|
Running upgrade-non-controller.sh against compute and object storage did
not fail if the /root/tripleo_upgrade_node.sh failed.
This make it harder to detect error in CI system for instance.
Change-Id: I12b7d640547d3b8ec1f70104d159d6052b7638ff
Closes-Bug: 1620973
|
|
This is the initial work to have a function that migrates a full HA
architecture as deployed in Mitaka to the HA architecture as deployed in
Newton where only a few resources are managed by pacemaker.
The sequence is the following:
1) We remove the desired services from pacemaker's control. The services
at this point are still running normally via the systemd service as
invoked by pacemaker
2) We do a "systemctl stop <service>" on all controllers for all the
services that were removed from pacemaker's control. We do this to make
sure that during the yum upgrade, the %post sections that call
"systemctl try-restart" do not take ages, because at this point during
the upgrade rabbit is down. The only exceptions are "openstack-core"
and "delay" which are dummy pacemaker resources that do not exist on
the system
3) We do a "systemctl start <service>" on all nodes for all the services
mentioned above.
We should probably merge this patch only when newton has branched as it
is very specific to the M/N upgrade.
Closes-Bug: 1617520
Change-Id: I4c409ce58c1a57b6e0decc3cf168b62698b32e39
|
|
Change-Id: I7a041dab8b1b1edc9c80248e1eef3ce7ab272292
Closes-Bug: 1615056
|
|
For N we cannot assume services are managed by pacemaker.
This adds functions to check if a service is systemd or
pcmk managed and start/stops it accordingly. For pcmk,
only stop/disable on bootstrap node for example, whereas
systemd should stop/start on all controllers.
There is also an equivalent change to the check_resource
which has been reworked to allow both pcmk and systemd.
Implements: blueprint overcloud-upgrades-workflow-mitaka-to-newton
Change-Id: Ic8252736781dc906b3aef8fc756eb8b2f3bb1f02
|
|
|
|
|
|
|
|
The batch_create and rolling_update keys were incorrectly defined
as properties of the resource instead of update policies.
Change-Id: I19261adc78e4cdc3616f16221e85490a6b48d47b
Closes-Bug: 1623506
|
|
The Ceph upgrade scripts was failing on the following:
1. a syntax error in an if condition
2. an attempt to read a possibly unbound variable
3. an attempt to chown a directory which might not exist
this change aims at fixing all of the above.
Closes-Bug: 1623942
Change-Id: I9e9d63d4ab7626893aaf2a25dccfcafbb97ccbdf
|
|
We need to remove the hard-coded roles from overcloud.j2.yaml
as now it's valid to e.g remove BlockStorage completely.
The previous behavior for the per-role upgrade scripts is maintained
but we'll need to rework this for newton->ocata upgrades where we
can no longer be sure the servers mapping will contain all roles.
Change-Id: I25e6c84757e3c00fba2aae834cd8206c62e44acf
Partially-Implements: blueprint custom-roles
|
|
We make it clear that recoverable checks happen before starting the
upgrade to be able to run the upgrade after the offending error has been
manually corrected.
Add new check for the pcsd cluster status.
Add new check for galera password file: BZ 1357112
Closes-Bug: 1614907
Change-Id: If736c79121e1ffe0eaeb814bdb73ccbc0b64edcd
|
|
|
|
|
|
|
|
With new pacemaker architecture, Puppet handles restarts of most of the
services. There are several still managed by pacemaker which need
special restart handling utilizing pacemaker and its resource
agents.
The counterpart in puppet-tripleo requests restarts for individual
pacemaker-managed services by writing out "restart flag" files, and the
pacemaker_resource_restart.sh script then performs the restarts.
Change-Id: Ia4e6a9f88181f1981993f046cf415dbbcdc9570e
Closes-Bug: #1614967
Closes-Bug: #1587015
Depends-On: I6369ab0c82dbf3c8f21043f8aa9ab810744ddc12
|
|
|
|
This will prevent the Ceph Mon upgrade script from starting if the
Ceph cluster is in error state.
It also adds a parameter to ignore warning states, useful when
performing an upgrade of a cluster where the number of healthy
OSDs does not guarantee the desired replica size.
Closes-Bug: 1618533
Change-Id: I1beb8ad0812f19b1018ba19b5a9fc85fa132d7f7
|
|
Upgrades the ceph-osd daemon from Hammer to Jewel
Change-Id: Idfa90fdc0052c53f448401c85c5d13a2ba68acd1
|
|
|
|
|
|
Adds a pre-requisite software deployment to the pacemaker scenario
upgrade which, before the openstack services are upgraded,
upgrades the ceph-mon daemon from Hammer to Jewel.
Change-Id: I9855d80a6ae156b4a9e0409c3c927068b9db95a0
|
|
|
|
|
|
We have to recreate the /var/lib/mysql directory on all controller node,
not just the boostrap node for the cluster to be able to restart.
Adding a warning on the fact that those script are local and know
nothing about the good upgrade state of the other nodes.
Closes-Bug: 1612642
Change-Id: I48e2812d7df80bbf2db53a8b71dc434d4209a160
|
|
There is a typo in the code, making this test always successful.
Closes-Bug: 1614437
Change-Id: Ia6b0b156294de9fcb8f66fc46aa8801555775a56
|
|
scheduler_host_manager doesn't take
nova.scheduler.host_manager.HostManager as a value anymore. This fix it
before restarting the service.
Change-Id: Ia9adcfd5a898f0c712b4a37ae33db88a44630f0d
Closes-Bug: 1615035
|
|
|
|
|
|
These are not needed any more as they were specific to
mitaka upgrades.
Change-Id: I0d421b942e620403f88374e1c82105747d8d84c9
|
|
The nova api db need to be synchronized as well.
Change-Id: I2628b24ff1153c84cbf388455666ae42570cb10f
Closes-Bug: 1615042
|
|
The MySqlMajorUpgrade parameter has validation on it allowing only
values yes/no/auto, however in the script we checked for '0' instead of
'no', which means the only effective values were yes/auto. This is now
fixed to allow switching the migration off.
Change-Id: I5d64734894c6bfd9003ad643f3747e34e62465cc
Closes-Bug: #1616429
|
|
When upgrading from mariadb X.Y.Z to mariadb X.Y.Ž (X.Y part stays the
same), the dump/restore of mariadb shouldn't be necessary. Therefore we
now only check for up to the first 2 fields of the version string when
determining if we should trigger the dump/restore operation.
Closes-Bug: #1615721
Change-Id: Ib7af8bfb121f5c83184d51b3c6dc657108c25973
|
|
In Newton, Aodh will be using its own mysql DB rather than
using ceilometer's mongo instance. This means we need to
migrate any existing alarm and alrm history data from
ceilometer DB to aodh mysqlDB. Upstream aodh provides us
with a aodh-data-migration utility. We need to invoke this
during the mitaka->newton upgrade procedure so data is
migrated as expected and aodh mysql backend takes over.
Closes-bug: #1611794
Change-Id: I17888b57ecf98cd83e92af2f9cdbead066b03aa3
|
|
Given the new HA architecture with less pacemaker managed resources, we
need to update this script to reflect those changes. Without these
changes, stack-updates using the exact same templates will fail since
this script is always executed on update.
Change-Id: I2ce1681d19d4a24a7561e3dd9c5efdae40d030b7
Closes-Bug: #1612667
|
|
|
|
|
|
When the overcloud is upgraded we do a yum update of the packages.
This step might introduce a newer galera version. In such a situation
we need to dump the db and restore it. The high-level workflow should
be the following:
1) During the main upgrade step, before shutting down the cluster
we need to dump the db
2) We upgrade the packages
3) We briefly start mysql on a single node while making sure that
/root/.my.cnf is briefly moved out of the way (because it contains
a password) and import the data. After the import we shutdown this
mysql instance
4) We let the cluster start up normally
The above steps will take place in the following scenarios.
Given a locally installed mariadb version X.Y.Z and release R,
we will dump and restore the DB under the following conditions:
A) MySqlMajorUpgrade template parameter is set to 'auto' and
the upgraded package differs in X, Y *or* Z. We basically don't
dump automatically if the release field changes.
B) MySqlMajorUpgrade template parameter is set to 'yes'
When MySqlMajorUpgrade is set to 'no', no dumping will be performed.
Note that this will give a non functional upgrade if a major mariadb
upgrade is taking place.
Partial-Bug: #1587449
Co-Author: Damien Ciabrin <dciabrin@redhat.com>
Co-Author: Mike Bayer <mbayer@redhat.com>
Depends-On: I8cb4cb3193e6b823aad48ad7dbbbb227364d2a58
Depends-On: I38dcacfabc44539aab1f7da85168fe44a1b43a51
Change-Id: I374628547aed091129d0deaa29764bfc998d76ea
|
|
Since the Liberty release, the number of services managed by pacemaker
on HA Overcloud has increased. This has an impact on
major_upgrade_controller_pacemaker_1.sh, where cluster sync timeout
value tuned for older releases is now becoming too low.
Raise the cluster sync timeout value to a sensible limit to
give pacemaker enough time to stop the cluster during major upgrade.
Change-Id: I821d354ba30ce39134982ba12a82c429faa3ce62
Closes-Bug: #1597506
|
|
It is best if we disable stonith if a cluster has it configured and on,
before we call "pcs cluster stop --all", because should a service fail
to stop for whatever reason, pacemaker will fence the node where it
happened. This is something that we unlikely want during an upgrade as
it will make things worse.
Once the cluster is stopped we can reenable stonith (if it was enabled
to start with) in the CIB while the cluster is shut down.
Closes-Bug: #1596065
Change-Id: I38dcacfabc44539aab1f7da85168fe44a1b43a51
|
|
Restarting services after Puppet is vital to ensure that config changes
go applied. However, it can be sometimes desirable to prevent these
restarts to avoid downtime, if the operator is sure that no config
changes need applying. This can be a case e.g. when scaling compute
nodes. Passing the puppet-pacemaker-no-restart.yaml environment file *in
addition* to puppet-pacemaker.yaml should allow this.
This is a stop gap solution before we have proper communication between
Puppet and Pacemaker to allow selective restarts.
Change-Id: I9c3c5c10ed6ecd5489a59d7e320c3c69af9e19f4
|
|
|
|
|
|
If "pcs cluster stop --all" is executed on a controller that
happens to have a VIP on the internal network, pcs may use the
VIP as the source address for communication with another cluster
node. When pacemaker is stopped this VIP goes away, and pcs never
receives a response from the other node. This causes pcs to hang
indefinitely; eventually the upgrade times out and fails.
Disabling the VIPs before stopping the cluster avoids this
situation.
Change-Id: I6bc59120211af28456018640033ce3763c373bbb
Closes-Bug: 1577570
|
|
|
|
|
|
Previously ceilometer-notification, aodh-listener and sahara-engine
didn't have constraints that would anchor them under openstack-core
dummy resource. Such constraints are added now. (sahara-engine starting
after sahara-api, aodh-listener after aodh-evaluator, and
ceilometer-notification after openstack-core.) Openstack-core ->
heat-api constraint has been removed because heat-api depends on
ceilometer-notification, so there's a transitive dependency on
openstack-core already.
Change-Id: Ided7321ebbf2c3556726343b4bb466fd8759b43a
Closes-Bug: #1569444
|
|
Previously we tried to use UpdateIdentifier for two different things:
tell whether to perform package update, and also to tell whether the
top-level stack is being created or updated (which was incorrect and
resulted in bug 1567384, and an attempt to work around that bug resulted
in bug 1567385).
We cannot use Heat's "action" conditionals in some cases, because they
refer to the direct parent stack, which can yield undesirable results
when introducing new nested stacks or temporarily no-opping something
and then adding it back (in both these cases, "action" would be
considered "CREATE", even though the top-level stack is in "UPDATE").
So tripleoclient passes a new parameter StackAction to tell whether the
top-level stack is being created or updated, and we make use of
that. (It seems there's no better way of getting this info from within
the nested Heat stacks.)
Change-Id: Ie14ddbff15e7ed21aaa3fcdacf36e0040f912382
Depends-On: I9dc3b4cd8a6a71df34d8babf0e4c6505041f5311
Closes-Bug: #1567384
Related-Bug: #1567385
|
|
The update and upgrade shell scripts were still referencing the
old openstack-keystone service which got removed with
Ie26908ac9bfc0b84b6b65ae3bda711236b03d9d4
Also removes kilo and liberty specific workarounds and config changes.
Change-Id: Icc80904908ee3558930d4639a21812f14b2fd12e
|
|
Removes the old noop nested stack template for extraconfig
tasks and instead uses OS::Heat::None. This should avoid a few
extra resource checks on create and update.
Change-Id: I5a42fc78ece2553e86385236e214aa1e3c91cd85
|