Age | Commit message (Collapse) | Author | Files | Lines |
|
In ocata we changed the ha policy to "ha-exactly" via the following changes:
- tht: Iace6daf27a76cb8ef1050ada0de7ff1f530916c6
- puppet-tripleo: Ib62001c03e1e08f58cf0c6e0ba07a8879a584084
We initially also took care of changing this policy (which is set in the
pacemaker resource agent) for the M/N upgrade path:
I2468a096b5d7042bc801a742a7a85fb1521c1c02
In the end we decided against changing the policy in Newton as well (it
was only for ocata) as it was too close to the release date and we took
the safer path.
This patch does two things:
1) It renames the upgrade function to "newton_ocata" since that is the
only upgrade path we need to take care of
2) It reinstates the actual upgrade function which was mistakenly
removed via an unrelated change in the ceilometer upgrade path:
If9d6987cd0a8fc5d3f9de518ba422d97d5149732
Closes-Bug: #1628998
Change-Id: I3a97505d2ae1ae27f3080ffe74c33fdabffd2420
|
|
Currently when we call the major-upgrade step we do the following:
"""
...
if [[ -n $(is_bootstrap_node) ]]; then
check_clean_cluster
fi
...
if [[ -n $(is_bootstrap_node) ]]; then
migrate_full_to_ng_ha
fi
...
for service in $(services_to_migrate); do
manage_systemd_service stop "${service%%-clone}"
...
done
"""
The problem with the above code is that it is open to the following race
condition:
1. Code gets run first on a non-bootstrap controller node so we start
stopping a bunch of services
2. Pacemaker notices will notice that services are down and will mark
the service as stopped
3. Code gets run on the bootstrap node (controller-0) and the
check_clean_cluster function will fail and exit
4. Eventually also the script on the non-bootstrap controller node will
timeout and exit because the cluster never shut down (it never actually
started the shutdown because we failed at 3)
Let's make sure we first only call the HA NG migration step as a
separate heat step. Only afterwards we start shutting down the systemd
services on all nodes.
We also need to move the STONITH_STATE variable into a file because it
is being used across two different scripts (1 and 2) and we need to
store that state.
Co-Authored-By: Athlan-Guyot Sofer <sathlang@redhat.com>
Closes-Bug: #1640407
Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
|
|
rpm command will return an exit 1 if ovs package is already
there and will exit the step_1.sh script. To get around this
force the update with --replacepkgs
Also remove the \ just before the $ which cause a syntax
error for the ceph storage
Change-Id: I11fcf688982ceda5eef7afc8904afae44300c2d9
Closes-bug: 1636748
|
|
|
|
We currently set the stonith property from all controller nodes during
upgrade. This is racy and can actually end up disabling stonith after
the upgrade even if when it was enabled.
Let's set the property only from the bootstrap node.
Change-Id: Id4afb867b485ac853be874a0179a7ed7cc914068
Closes-Bug: #1635294
|
|
This adds a special case handling for the opensvswitch package
as discussed at the related bug below.
This is added/handled here for both the minor update and the
major mitaka...newton upgrade.
Change-Id: I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae
Closes-Bug: 1635205
|
|
This takes care of the M->N upgrade path when changing
the ha rabbitmq policy.
Partial-Bug: #1628998
Change-Id: I2468a096b5d7042bc801a742a7a85fb1521c1c02
|
|
Before this change we checked the cluster for any failed actions and
we stopped the upgrade process if there were any.
This is likely eccessive as a failed action could have happened in the
past and the cluster is now fully functional.
Better to check if any of the resources are in Stopped state and break
the upgrade process if any of them are.
We also need to restrict this check to the bootstrap node because
otherwise the following might happen:
1) Bootstrap node does the check, it is successful and it starts
the full HA -> HA NG migration which *will* create failed actions
and will start stopping resources
2) If the check now starts on a non-bootstrap node while 1) is ongoing,
it will find either failed actions or stopped resources so it will
fail.
Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f
Closes-Bug: #1628653
|
|
|
|
|
|
|
|
This commit does the following:
1. We now explicitly disable/stop and then remove the resources that are
moving to systemd. We do this because we want to make sure they are all
stopped before doing a yum upgrade, which otherwise would take ages due
to rabbitmq and galera being down. It is best if we do this via pcs
while we do the HA Full -> HA NG migration because it is simpler to make
sure all the services are stopped at that stage. For extra safety we can
still do a check by hand. By doing it via pacemaker we have the
guarantee that all the migrated services are down already when we stop
the cluster (which happens to be a syncronization point between all
controller nodes). That way we can be certain that they are all down on
all nodes before starting the yum upgrade process.
2. We actually need to start the systemd services in
major_upgrade_controller_pacemaker_2.sh and not stop them.
3. We need to use the proper bash variable name
4. Use is_bootstrap_node everywhere to make the code more consistent
Change-Id: Ic565c781b80357bed9483df45a4a94ec0423487c
Closes-Bug: #1627490
|
|
Currently we do not disable openstack-cinder-volume during our
major-upgrade-pacemaker step. This leads to the following scenario. In
major_upgrade_controller_pacemaker_2.sh we do:
start_or_enable_service galera
check_resource galera started 600
....
if [[ -n $(is_bootstrap_node) ]]; then
...
cinder-manage db sync
...
What happens here is that since openstack-cinder-volume was never
disabled it will already be started by pacemaker before we call
cinder-manage and this will give us the following errors during the
start:
06:05:21.861 19482 ERROR cinder.cmd.volume DBError:
(pymysql.err.InternalError) (1054, u"Unknown column 'services.cluster_name' in 'field list'")
Change-Id: I01b2daf956c30b9a4985ea62cbf4c941ec66dcdf
Closes-Bug: #1627470
|
|
In bug https://bugs.launchpad.net/tripleo/+bug/1615035 we fixed the
scheduler_host setting which got deprecated in newton. It seems also the
scheduler_driver settings needs tweaking:
systemctl status openstack-nova-scheduler.service:
2016-09-24 20:24:54.337 15278 WARNING stevedore.named [-] Could not load nova.scheduler.filter_scheduler.FilterScheduler
2016-09-24 20:24:54.338 15278 CRITICAL nova [-] RuntimeError: (u'Cannot load scheduler driver from configuration %(conf)s.',
{'conf': 'nova.scheduler.filter_scheduler.FilterScheduler'})
Let's set this to default during the upgrade step. From newton's nova.conf:
The class of the driver used by the scheduler. This should be chosen
from one of the entrypoints under the namespace 'nova.scheduler.driver'
of file 'setup.cfg'. If nothing is specified in this option, the
'filter_scheduler' is used.
This option also supports deprecated full Python path to the class to
be used. For example, "nova.scheduler.filter_scheduler.FilterScheduler".
But note: this support will be dropped in the N Release.
Change-Id: Ic384292ad05a57757158995ec4c1a269fe4b00f1
Depends-On: I89124ead8928ff33e6b6907a7c2178169e91f4e6
Closes-Bug: #1627450
|
|
With commit fb25385d34e604d2f670cebe3e03fd57c14fa6be
"Rework the pacemaker_common_functions for M..N upgrades" we
accidentally removed some lines that fixed M/N upgrade issues.
Namely:
extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh
-# https://bugzilla.redhat.com/show_bug.cgi?id=1284047
-# Change-Id: Ib3f6c12ff5471e1f017f28b16b1e6496a4a4b435
-crudini --set /etc/ceilometer/ceilometer.conf DEFAULT rpc_backend rabbit
-# https://bugzilla.redhat.com/show_bug.cgi?id=1284058
-# Ifd1861e3df46fad0e44ff9b5cbd58711bbc87c97 Swift Ceilometer middleware no longer exists
-crudini --set /etc/swift/proxy-server.conf pipeline:main pipeline "catch_errors healthcheck cache ratelimit tempurl formpost authtoken keystone staticweb proxy-logging proxy-server"
-# LP: 1615035, required only for M/N upgrade.
-crudini --set /etc/nova/nova.conf DEFAULT scheduler_host_manager host_manager
extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
nova-manage db sync
- nova-manage api_db sync
This patch simply puts that code back without reverting the
whole commit that broke things, because that is needed.
Closes-Bug: #1627448
Change-Id: I89124ead8928ff33e6b6907a7c2178169e91f4e6
|
|
This is the initial work to have a function that migrates a full HA
architecture as deployed in Mitaka to the HA architecture as deployed in
Newton where only a few resources are managed by pacemaker.
The sequence is the following:
1) We remove the desired services from pacemaker's control. The services
at this point are still running normally via the systemd service as
invoked by pacemaker
2) We do a "systemctl stop <service>" on all controllers for all the
services that were removed from pacemaker's control. We do this to make
sure that during the yum upgrade, the %post sections that call
"systemctl try-restart" do not take ages, because at this point during
the upgrade rabbit is down. The only exceptions are "openstack-core"
and "delay" which are dummy pacemaker resources that do not exist on
the system
3) We do a "systemctl start <service>" on all nodes for all the services
mentioned above.
We should probably merge this patch only when newton has branched as it
is very specific to the M/N upgrade.
Closes-Bug: 1617520
Change-Id: I4c409ce58c1a57b6e0decc3cf168b62698b32e39
|
|
Change-Id: I7a041dab8b1b1edc9c80248e1eef3ce7ab272292
Closes-Bug: 1615056
|
|
For N we cannot assume services are managed by pacemaker.
This adds functions to check if a service is systemd or
pcmk managed and start/stops it accordingly. For pcmk,
only stop/disable on bootstrap node for example, whereas
systemd should stop/start on all controllers.
There is also an equivalent change to the check_resource
which has been reworked to allow both pcmk and systemd.
Implements: blueprint overcloud-upgrades-workflow-mitaka-to-newton
Change-Id: Ic8252736781dc906b3aef8fc756eb8b2f3bb1f02
|
|
We make it clear that recoverable checks happen before starting the
upgrade to be able to run the upgrade after the offending error has been
manually corrected.
Add new check for the pcsd cluster status.
Add new check for galera password file: BZ 1357112
Closes-Bug: 1614907
Change-Id: If736c79121e1ffe0eaeb814bdb73ccbc0b64edcd
|
|
|
|
|
|
|
|
We have to recreate the /var/lib/mysql directory on all controller node,
not just the boostrap node for the cluster to be able to restart.
Adding a warning on the fact that those script are local and know
nothing about the good upgrade state of the other nodes.
Closes-Bug: 1612642
Change-Id: I48e2812d7df80bbf2db53a8b71dc434d4209a160
|
|
There is a typo in the code, making this test always successful.
Closes-Bug: 1614437
Change-Id: Ia6b0b156294de9fcb8f66fc46aa8801555775a56
|
|
scheduler_host_manager doesn't take
nova.scheduler.host_manager.HostManager as a value anymore. This fix it
before restarting the service.
Change-Id: Ia9adcfd5a898f0c712b4a37ae33db88a44630f0d
Closes-Bug: 1615035
|
|
The MySqlMajorUpgrade parameter has validation on it allowing only
values yes/no/auto, however in the script we checked for '0' instead of
'no', which means the only effective values were yes/auto. This is now
fixed to allow switching the migration off.
Change-Id: I5d64734894c6bfd9003ad643f3747e34e62465cc
Closes-Bug: #1616429
|
|
|
|
When the overcloud is upgraded we do a yum update of the packages.
This step might introduce a newer galera version. In such a situation
we need to dump the db and restore it. The high-level workflow should
be the following:
1) During the main upgrade step, before shutting down the cluster
we need to dump the db
2) We upgrade the packages
3) We briefly start mysql on a single node while making sure that
/root/.my.cnf is briefly moved out of the way (because it contains
a password) and import the data. After the import we shutdown this
mysql instance
4) We let the cluster start up normally
The above steps will take place in the following scenarios.
Given a locally installed mariadb version X.Y.Z and release R,
we will dump and restore the DB under the following conditions:
A) MySqlMajorUpgrade template parameter is set to 'auto' and
the upgraded package differs in X, Y *or* Z. We basically don't
dump automatically if the release field changes.
B) MySqlMajorUpgrade template parameter is set to 'yes'
When MySqlMajorUpgrade is set to 'no', no dumping will be performed.
Note that this will give a non functional upgrade if a major mariadb
upgrade is taking place.
Partial-Bug: #1587449
Co-Author: Damien Ciabrin <dciabrin@redhat.com>
Co-Author: Mike Bayer <mbayer@redhat.com>
Depends-On: I8cb4cb3193e6b823aad48ad7dbbbb227364d2a58
Depends-On: I38dcacfabc44539aab1f7da85168fe44a1b43a51
Change-Id: I374628547aed091129d0deaa29764bfc998d76ea
|
|
Since the Liberty release, the number of services managed by pacemaker
on HA Overcloud has increased. This has an impact on
major_upgrade_controller_pacemaker_1.sh, where cluster sync timeout
value tuned for older releases is now becoming too low.
Raise the cluster sync timeout value to a sensible limit to
give pacemaker enough time to stop the cluster during major upgrade.
Change-Id: I821d354ba30ce39134982ba12a82c429faa3ce62
Closes-Bug: #1597506
|
|
It is best if we disable stonith if a cluster has it configured and on,
before we call "pcs cluster stop --all", because should a service fail
to stop for whatever reason, pacemaker will fence the node where it
happened. This is something that we unlikely want during an upgrade as
it will make things worse.
Once the cluster is stopped we can reenable stonith (if it was enabled
to start with) in the CIB while the cluster is shut down.
Closes-Bug: #1596065
Change-Id: I38dcacfabc44539aab1f7da85168fe44a1b43a51
|
|
If "pcs cluster stop --all" is executed on a controller that
happens to have a VIP on the internal network, pcs may use the
VIP as the source address for communication with another cluster
node. When pacemaker is stopped this VIP goes away, and pcs never
receives a response from the other node. This causes pcs to hang
indefinitely; eventually the upgrade times out and fails.
Disabling the VIPs before stopping the cluster avoids this
situation.
Change-Id: I6bc59120211af28456018640033ce3763c373bbb
Closes-Bug: 1577570
|
|
The update and upgrade shell scripts were still referencing the
old openstack-keystone service which got removed with
Ie26908ac9bfc0b84b6b65ae3bda711236b03d9d4
Also removes kilo and liberty specific workarounds and config changes.
Change-Id: Icc80904908ee3558930d4639a21812f14b2fd12e
|
|
Since swift isn't managed by pacemaker we need to manually (systemctl)
stop and start the swift services. This moves the duplicate blocks for
start/stop into a common function (we already include that
pacemaker_common_functions.sh here so may as well)
Change-Id: Ic4f23212594c1bf9edc39143bf60c7f6d648fd1d
|
|
Old overcloud images don't have python-zaqarclient installed, and new
overclouds' os-collect-config are configured with Zaqar support. This
together means that on upgrade we need to install python-zaqarclient,
otherwise os-collect-config will be restarted during yum update and
crash due to trying to import missing Python module from zaqarclient.
Change-Id: I3e875e14cb60b1b78aec0d9ddc412ccf865abd01
|
|
Quiet down yum during major upgrades to reduce the output size. This is
consistent with what was introduced into minor updates in change
I517271e8465885421a78b73c5af756816c37a977.
Change-Id: Ie6b470e383fdf42870ac6f60ca43e44b4c446ebe
|
|
This parameter can be used for pinning (and later unpinning) the Nova
Compute RPC version.
Change-Id: I2f181f3b01f0b8059566d01db0152a12bbbd1c3e
|
|
Change-Id: I7226070aa87416e79f25625647f8e3076c9e2c9a
|