Age | Commit message (Collapse) | Author | Files | Lines |
|
|
|
In I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae and
I11fcf688982ceda5eef7afc8904afae44300c2d9 we added a manual step
for upgrading openvswitch in order to specify the --nopostun
as discussed in the bug below.
This change adds a minor update to make this workaround more
robust. It removes any existing rpms that may be around from
an earlier run, and also checks that the rpms installed are
at least newer than the version we are on.
This also refactors the code into a common definition in the
pacemaker_common_functions.sh which is included even for the
heredocs generating upgrade scripts during init. Thanks
Sofer Athlan-Guyot and Jirka Stransky for help with that.
Change-Id: Idc863de7b5a8c116c990ee8c1472cfe377836d37
Related-Bug: 1635205
|
|
There are scenarios in which findmnt will return a list of all
mounted filesystems, which causes the upgrade script to fail in
recognizing if the Ceph OSD is backed by ext4.
Change-Id: Iadebdc32b523c05216202b782ceb54bec4389413
Closes-Bug: #1649407
|
|
|
|
Running os-net-config before restarting the cluster prevents changes to
the interface files caused by changes to implementation from bouncing
network interfaces after the cluster has restarted.
Closes-Bug: #1644138
Change-Id: I65fb104465ff3d37ddc791634302994334136014
|
|
During ceilometer pre upgrade, rabbit host config gets overridden in
ceilometer conf as its setting to defaults. This explicitly sets the
host info in standalone manifest.
Closes-Bug: #1644278
Change-Id: I862ea7165c5d42ba1f9a19111a8be8934c0ef883
|
|
In I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae and
I11fcf688982ceda5eef7afc8904afae44300c2d9 we landed a workaround
for the openvswitch 2.4 to 2.5 upgrade discussed in the bug below.
Unfortunately testing has revealed a problem with the minor update
case specifically for non controllers. It seems we would exit
before the ovs workaround has had a chance to execute. This moves
the block up a few lines to avoid this condition. As with the
other two reviews noted here, this will need to go into newton
and then mitaka too.
Change-Id: If905de82d96302334ebe02de9c43f00faed9b72b
Related-Bug: 1635205
|
|
|
|
|
|
https://review.openstack.org/#/c/388688/ has removed ceilometer-dbsync so
ceilometer-upgrade must be used instead.
Additionally, ceilometer-dbsync enabled option --skip-gnocchi-resource-types
and ceilometer-upgrade doesn't, so i'm setting it by default to ensure backwards compatibility.
Note this is based on the corresponding fix to puppet-ceilometer ref
https://review.openstack.org/#/c/396570
Change-Id: Ic0a15c75d1cd3e3f70eeafd9ba09d50c58cc1293
Closes-Bug: #1641076
|
|
Deployments using external LB will file like this:
deploy_stderr: |
+ RESTART_FOLDER=/var/lib/tripleo/pacemaker-restarts
+ [[ -d /var/lib/tripleo/pacemaker-restarts ]]
++ systemctl is-active haproxy
+ haproxy_status=unknown
deploy_status_code: 3
openstack software deployment show 4f339ca4-7600-4ca0-b0ef-f798bc47b6cf
The reason is that via https://review.openstack.org/#/c/393644/ we
introducted the haproxy restart like this:
haproxy_status=$(systemctl is-active haproxy)
if [ "$haproxy_status" = "active" ]; then
systemctl reload haproxy
fi
The problem is that if haproxy is not running/installed systemctl
is-active can fail and the script will terminate with an error return
code. Let's just move the call inside the if so the script does not fail
in case haproxy is not there.
The snippet before the change (on a system without haproxy installed):
[root@mrg-09 tmp]# ./test.sh
++ systemctl is-active haproxy
+ haproxy_status=unknown
[root@mrg-09 tmp]# echo $?
3
After this change:
[root@mrg-09 tmp]# ./test.sh
++ systemctl is-active haproxy
+ '[' unknown = active ']'
[root@mrg-09 tmp]# echo $?
0
Change-Id: I837c63a9dbcde8c922f843c442974fa79cf1eede
Closes-Bug: #1641904
|
|
In ocata we changed the ha policy to "ha-exactly" via the following changes:
- tht: Iace6daf27a76cb8ef1050ada0de7ff1f530916c6
- puppet-tripleo: Ib62001c03e1e08f58cf0c6e0ba07a8879a584084
We initially also took care of changing this policy (which is set in the
pacemaker resource agent) for the M/N upgrade path:
I2468a096b5d7042bc801a742a7a85fb1521c1c02
In the end we decided against changing the policy in Newton as well (it
was only for ocata) as it was too close to the release date and we took
the safer path.
This patch does two things:
1) It renames the upgrade function to "newton_ocata" since that is the
only upgrade path we need to take care of
2) It reinstates the actual upgrade function which was mistakenly
removed via an unrelated change in the ceilometer upgrade path:
If9d6987cd0a8fc5d3f9de518ba422d97d5149732
Closes-Bug: #1628998
Change-Id: I3a97505d2ae1ae27f3080ffe74c33fdabffd2420
|
|
|
|
|
|
Currently when we call the major-upgrade step we do the following:
"""
...
if [[ -n $(is_bootstrap_node) ]]; then
check_clean_cluster
fi
...
if [[ -n $(is_bootstrap_node) ]]; then
migrate_full_to_ng_ha
fi
...
for service in $(services_to_migrate); do
manage_systemd_service stop "${service%%-clone}"
...
done
"""
The problem with the above code is that it is open to the following race
condition:
1. Code gets run first on a non-bootstrap controller node so we start
stopping a bunch of services
2. Pacemaker notices will notice that services are down and will mark
the service as stopped
3. Code gets run on the bootstrap node (controller-0) and the
check_clean_cluster function will fail and exit
4. Eventually also the script on the non-bootstrap controller node will
timeout and exit because the cluster never shut down (it never actually
started the shutdown because we failed at 3)
Let's make sure we first only call the HA NG migration step as a
separate heat step. Only afterwards we start shutting down the systemd
services on all nodes.
We also need to move the STONITH_STATE variable into a file because it
is being used across two different scripts (1 and 2) and we need to
store that state.
Co-Authored-By: Athlan-Guyot Sofer <sathlang@redhat.com>
Closes-Bug: #1640407
Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
|
|
After compute nodes are upgraded, the ceilometer compute agent
doesnt poll and throws warnings. Restarting the compute agent
at this step gets the service back to its normal state.
Closes-Bug: #1640177
Change-Id: I7392de43e933b1d16002e12e407748ae289d5e99
|
|
After deploying a fresh installed Overcloud or updating the stack
the haproxy configuration is updated correctly but no change in the
HA proxy stats happens.
This submission will add the missing resources to run pre and post
puppet tasks.
Closes-bug: 1640175
Change-Id: I2f08704daeee502c618256695a30ce244a1d7ba5
|
|
|
|
|
|
|
|
Seems the conditional has changed and we should pickup the
tripleo::profile::base::swift::storage::enable_swift_storage
hiera data.
After controller nodes are upgraded the swift services were down
even though there was no stand-alone swift node (the current
conditional was failing as that hiera isn't set any more)
Closes-Bug: 1638821
Change-Id: Id1383c1e54f9cae13fd375e90da525230e5d23eb
|
|
OPM package is metadata package with unversioned requirements which
means that update does not update the dependencies. This leaves us
with old puppet modules and old puppet during the puppet run.
Change-Id: I80f8a73142a09bb4178bb5a396d256ba81ba98a8
Closes-Bug: #1638266
Resolves: rhbz#1390559
|
|
gnocchi when configured with swift will require keystone
to be available to authenticate to migrate to v3. At this
step keystone is not available and gnocchi upgrade fails
with auth error. Instead start apache in step 3, start
apache first and then run gnocchi upgrade in a separate
step and let upgrade happen here.
Closes-Bug: #1634897
Change-Id: I22d02528420e4456f84b80905a7b3a80653fa7b0
|
|
rpm command will return an exit 1 if ovs package is already
there and will exit the step_1.sh script. To get around this
force the update with --replacepkgs
Also remove the \ just before the $ which cause a syntax
error for the ceph storage
Change-Id: I11fcf688982ceda5eef7afc8904afae44300c2d9
Closes-bug: 1636748
|
|
|
|
With the following two changes we increased the timeout for redis and
rabbit for both starting and stopping to 200s:
https://review.openstack.org/386618 newton (merged)
https://review.openstack.org/385555 master (merged)
We want to also fix that on minor updates on all our supported
releases upstream and downstream (newton, mitaka, liberty, kilo).
This way we can guarantee that we have a uniform timeout for
sart and stop for rabbit and redis across all our releases.
Change-Id: If59bf3386832ee78d3a654f01077aff2e8be76e8
Closes-Bug: #1634851
|
|
We currently set the stonith property from all controller nodes during
upgrade. This is racy and can actually end up disabling stonith after
the upgrade even if when it was enabled.
Let's set the property only from the bootstrap node.
Change-Id: Id4afb867b485ac853be874a0179a7ed7cc914068
Closes-Bug: #1635294
|
|
This adds a special case handling for the opensvswitch package
as discussed at the related bug below.
This is added/handled here for both the minor update and the
major mitaka...newton upgrade.
Change-Id: I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae
Closes-Bug: 1635205
|
|
We have the following function in the upgrade process after we updated
the packages and called the db-sync commands:
services=$(services_to_migrate)
...
for service in $(services); do
manage_systemd_service start "${service%%-clone}"
check_resource_systemd "${service%%-clone}" started 600
done
The above is broken because $services contains a list of services to
start, so $(services) will return gibberish and the for loop will never
execute anything.
One of the symptoms for this is the openstack-nova-compute service not
restarting on the compute nodes during the yum -y upgrade. The reason
for this is that during the service restart, nova-compute waits for
nova-conductor to show up in the rabbitmq queues, which cannot happen
since the service was actually never started.
Change-Id: I811ff19d7b44a935b2ec5c5e66e5b5191b259eb3
Closes-Bug: #1630580
|
|
In Newton, ceilometer api is changed to run under apache wsgi
instead of eventlet. This will require upgrades for mitaka
deployments to switch to wsgi.
Closes-Bug: 1631297
Change-Id: If9d6987cd0a8fc5d3f9de518ba422d97d5149732
|
|
The default path if the operator does nothing is to keep the
sahara services on mitaka to newton upgrades.
If the operator wishes to remove sahara services then they
need to specify the provided major-upgrade-remove-sahara.yaml
environment file in the stack upgrade commands.
The existing migration to ha arch already removes the constraints
and pcs resource for sahara api/engine so we just need to stop
it from starting again if we want to remove it.
This adds a KeepSaharaServiceOnUpgrade parameter to determine if
Sahara is disabled from starting up after the controllers are
upgraded (defaults true).
Finally it is worth noting that we default the sahara services
as 'on' during converge here in the resource_registry of the
converge environment file; any subsequent stack updates where
the deployment contains sahara services will need to
include the -e /environments/services/sahara.yaml environment
file.
Related-Bug: 1630247
Change-Id: I59536cae3260e3df52589289b4f63e9ea0129407
|
|
|
|
|
|
heat failed due to a:
service: unbound variable
In the context $service is never set.
Change-Id: If82ee4562612f2617b676732956396278ee40a88
Closes-Bug: #1629903
|
|
This takes care of the M->N upgrade path when changing
the ha rabbitmq policy.
Partial-Bug: #1628998
Change-Id: I2468a096b5d7042bc801a742a7a85fb1521c1c02
|
|
|
|
|
|
As per [1] we need to lower osd max object name and namespace len when
upgrading from Hammer and the OSD is backed by ext4.
These could also be given via ExtraConfig but on upgrade we only run
puppet apply after this script is executed, so the values won't be
effective unless the daemon is restarted. Yet we do not want puppet
to restart the daemon because we can't bring all OSDs down
unconditionally or guests will die.
1. http://tracker.ceph.com/issues/16187
Co-Authored-By: Michele Baldessari <michele@acksyn.org>
Co-Authored-By: Dimitri Savineau <dsavinea@redhat.com>
Change-Id: I7fec4e2426bdacd5f364adbebd42ab23dcfa523a
Closes-Bug: 1628874
|
|
|
|
|
|
Change-Id: I44451a280dd928cd694dd6845d5d83040ad1f482
Related-Bug: #1626592
|
|
|
|
Previously the chown command wasn't traversing symlinks, causing
the new ownership to not be set for some needed files.
This change also ensures the crush map tunables are set to the 'default'
profile after the upgrade.
Finally redirects the output of a pidof to /dev/null to avoid spurious
logging.
Change-Id: Id4865ffff207edfc727d729f9cc04e6e81ad19d8
|
|
Before this change we checked the cluster for any failed actions and
we stopped the upgrade process if there were any.
This is likely eccessive as a failed action could have happened in the
past and the cluster is now fully functional.
Better to check if any of the resources are in Stopped state and break
the upgrade process if any of them are.
We also need to restrict this check to the bootstrap node because
otherwise the following might happen:
1) Bootstrap node does the check, it is successful and it starts
the full HA -> HA NG migration which *will* create failed actions
and will start stopping resources
2) If the check now starts on a non-bootstrap node while 1) is ongoing,
it will find either failed actions or stopped resources so it will
fail.
Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f
Closes-Bug: #1628653
|
|
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
has the following code:
...
check_resource mongod started 600
if [[ -n $(is_bootstrap_node) ]]; then
...
tstart=$(date +%s)
while ! clustercheck; do
sleep 5
tnow=$(date +%s)
if (( tnow-tstart > galera_sync_timeout )) ; then
echo_error "ERROR galera sync timed out"
exit 1
fi
done
# Run all the db syncs
cinder-manage db sync
...
fi
start_or_enable_service rabbitmq
check_resource rabbitmq started 600
start_or_enable_service redis
check_resource redis started 600
start_or_enable_service openstack-cinder-volume
check_resource openstack-cinder-volume started 600
systemctl_swift start
for service in $(services_to_migrate); do
manage_systemd_service start "${service%%-clone}"
check_resource_systemd "${service%%-clone}" started 600
done
"""
The problem with the above code is that it is open to the following race
condition:
1) Bootstrap node is busy checking the galera status via cluster check
2) Non-bootstrap node has already reached: start_or_enable_service
rabbitmq and later lines. These lines will be skipped because
start_or_enable_service is a noop on non-bootstrap nodes and
check_resource rabbitmq only checks that pcs status |grep rabbitmq
returns true.
3) Non-bootstrap node can then reach the manage_systemd_service start
and it will fail with stuff like:
"Job for openstack-nova-scheduler.service failed because the control
process exited with error code. See \"systemctl status
openstack-nova-scheduler.service\" and \"journalctl -xe\" for
details.\n" (because the db tables are not migrated yet)
This happens because 3) was started on non-bootstrap nodes before the
db-sync statements are complete on the bootstrap node. I did not feel
like changing the semantics of check_resource and remove the noop on
non-bootstrap nodes as other parts of the tree might rely on this
behaviour.
Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
Closes-Bug: #1627965
|
|
We call gnocchi-upgrade to make sure we update all the needed schemas
during the major-upgrade-pacemaker step.
We also make sure that redis is started before we call gnocchi-upgrade
otherwise the command will be stuck in a loop trying to contact redis.
Closes-Bug: #1626592
Change-Id: Ia016264b51f485b97fa150ebd357b109581342ed
|
|
Currently we do the following in the migration path:
pcs property set maintenance-mode=true
if ! timeout -k 10 300 crm_resource --wait; then
echo_error "ERROR: cluster remained unstable after setting maintenance-mode for more than 300 seconds, exiting."
exit 1
fi
crm_resource --wait can actually take forever under certain conditions.
The property will be set atomically across the cluster nodes so we should be good
without this.
Change-Id: I8f531d63479b81d65b572c4431c9db6f610f7e04
Closes-Bug: #1628393
|
|
After a successful upgrade to Newton, I ran the tripleo.sh
--overcloud-pingtest and it failed with the following:
resources.test_flavor: Not all flavors have been migrated to the API database (HTTP 409)
The issue is the fact that some tables have migrated to the
nova_api db and we need to migrate the data as well.
Currently we do:
nova-manage db sync
nova-manage api_db sync
We want to add:
nova-manage db online_data_migrations
After launching this command the overcloud-pingtest works correctly:
tripleo.sh -- Overcloud pingtest SUCCEEDED
Change-Id: Id2d5b28b5d4ade7dff6c5e760be0f509b4fe5096
Closes-Bug: #1628450
|
|
|
|
|