Age | Commit message (Collapse) | Author | Files | Lines |
|
|
|
|
|
|
|
Before this change we checked the cluster for any failed actions and
we stopped the upgrade process if there were any.
This is likely eccessive as a failed action could have happened in the
past and the cluster is now fully functional.
Better to check if any of the resources are in Stopped state and break
the upgrade process if any of them are.
We also need to restrict this check to the bootstrap node because
otherwise the following might happen:
1) Bootstrap node does the check, it is successful and it starts
the full HA -> HA NG migration which *will* create failed actions
and will start stopping resources
2) If the check now starts on a non-bootstrap node while 1) is ongoing,
it will find either failed actions or stopped resources so it will
fail.
Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f
Closes-Bug: #1628653
|
|
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
has the following code:
...
check_resource mongod started 600
if [[ -n $(is_bootstrap_node) ]]; then
...
tstart=$(date +%s)
while ! clustercheck; do
sleep 5
tnow=$(date +%s)
if (( tnow-tstart > galera_sync_timeout )) ; then
echo_error "ERROR galera sync timed out"
exit 1
fi
done
# Run all the db syncs
cinder-manage db sync
...
fi
start_or_enable_service rabbitmq
check_resource rabbitmq started 600
start_or_enable_service redis
check_resource redis started 600
start_or_enable_service openstack-cinder-volume
check_resource openstack-cinder-volume started 600
systemctl_swift start
for service in $(services_to_migrate); do
manage_systemd_service start "${service%%-clone}"
check_resource_systemd "${service%%-clone}" started 600
done
"""
The problem with the above code is that it is open to the following race
condition:
1) Bootstrap node is busy checking the galera status via cluster check
2) Non-bootstrap node has already reached: start_or_enable_service
rabbitmq and later lines. These lines will be skipped because
start_or_enable_service is a noop on non-bootstrap nodes and
check_resource rabbitmq only checks that pcs status |grep rabbitmq
returns true.
3) Non-bootstrap node can then reach the manage_systemd_service start
and it will fail with stuff like:
"Job for openstack-nova-scheduler.service failed because the control
process exited with error code. See \"systemctl status
openstack-nova-scheduler.service\" and \"journalctl -xe\" for
details.\n" (because the db tables are not migrated yet)
This happens because 3) was started on non-bootstrap nodes before the
db-sync statements are complete on the bootstrap node. I did not feel
like changing the semantics of check_resource and remove the noop on
non-bootstrap nodes as other parts of the tree might rely on this
behaviour.
Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
Closes-Bug: #1627965
|
|
We call gnocchi-upgrade to make sure we update all the needed schemas
during the major-upgrade-pacemaker step.
We also make sure that redis is started before we call gnocchi-upgrade
otherwise the command will be stuck in a loop trying to contact redis.
Closes-Bug: #1626592
Change-Id: Ia016264b51f485b97fa150ebd357b109581342ed
|
|
Currently we do the following in the migration path:
pcs property set maintenance-mode=true
if ! timeout -k 10 300 crm_resource --wait; then
echo_error "ERROR: cluster remained unstable after setting maintenance-mode for more than 300 seconds, exiting."
exit 1
fi
crm_resource --wait can actually take forever under certain conditions.
The property will be set atomically across the cluster nodes so we should be good
without this.
Change-Id: I8f531d63479b81d65b572c4431c9db6f610f7e04
Closes-Bug: #1628393
|
|
After a successful upgrade to Newton, I ran the tripleo.sh
--overcloud-pingtest and it failed with the following:
resources.test_flavor: Not all flavors have been migrated to the API database (HTTP 409)
The issue is the fact that some tables have migrated to the
nova_api db and we need to migrate the data as well.
Currently we do:
nova-manage db sync
nova-manage api_db sync
We want to add:
nova-manage db online_data_migrations
After launching this command the overcloud-pingtest works correctly:
tripleo.sh -- Overcloud pingtest SUCCEEDED
Change-Id: Id2d5b28b5d4ade7dff6c5e760be0f509b4fe5096
Closes-Bug: #1628450
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The paramater IgnoreCephUpgradeWarnings is type cast into a boolean
which is rendered as 'True' or 'False' as a string not 'true' or
'false'. This fix the check.
Change-Id: I8840c384d07f9d185a72bde5f91a3872a321f623
Closes-Bug: 1627736
|
|
This issue was spotted during major upgrade where we had calls like
this:
servers: {get_param: servers, Controller}
These get_param calls are hanging indefinitely and make the whole
upgrade end in a timeout. We need to put brackets around the get_param
function when there are multiple arguments:
http://docs.openstack.org/developer/heat/template_guide/hot_spec.html#get-param
This is already done in most of the tree, and the few places where this
was not happening were parts not under CI. After this change the
following grep returns only one false positive:
grep -ir get_param: |grep -v -- '\[' |grep ','
Change-Id: I65b23bb44f37b93e017dd15a5212939ffac76614
Closes-Bug: #1626628
|
|
This commit does the following:
1. We now explicitly disable/stop and then remove the resources that are
moving to systemd. We do this because we want to make sure they are all
stopped before doing a yum upgrade, which otherwise would take ages due
to rabbitmq and galera being down. It is best if we do this via pcs
while we do the HA Full -> HA NG migration because it is simpler to make
sure all the services are stopped at that stage. For extra safety we can
still do a check by hand. By doing it via pacemaker we have the
guarantee that all the migrated services are down already when we stop
the cluster (which happens to be a syncronization point between all
controller nodes). That way we can be certain that they are all down on
all nodes before starting the yum upgrade process.
2. We actually need to start the systemd services in
major_upgrade_controller_pacemaker_2.sh and not stop them.
3. We need to use the proper bash variable name
4. Use is_bootstrap_node everywhere to make the code more consistent
Change-Id: Ic565c781b80357bed9483df45a4a94ec0423487c
Closes-Bug: #1627490
|
|
Currently we do not disable openstack-cinder-volume during our
major-upgrade-pacemaker step. This leads to the following scenario. In
major_upgrade_controller_pacemaker_2.sh we do:
start_or_enable_service galera
check_resource galera started 600
....
if [[ -n $(is_bootstrap_node) ]]; then
...
cinder-manage db sync
...
What happens here is that since openstack-cinder-volume was never
disabled it will already be started by pacemaker before we call
cinder-manage and this will give us the following errors during the
start:
06:05:21.861 19482 ERROR cinder.cmd.volume DBError:
(pymysql.err.InternalError) (1054, u"Unknown column 'services.cluster_name' in 'field list'")
Change-Id: I01b2daf956c30b9a4985ea62cbf4c941ec66dcdf
Closes-Bug: #1627470
|
|
Currently we in major_upgrade_controller_pacemaker_2.sh we are calling
ceilometer-dbsync before mongod is actually started (only galera is
started at this point). This will make the dbsync hang indefinitely
until the heat stack times out.
Now this approach should be okay, but do note that when we start mongod
via systemctl we are not guaranteed that it will be up on all nodes
before we call ceilometer-dbsync. This *should* be okay because
ceilometer-dbsync keeps retrying and eventually one of the nodes will
be available. A completely clean fix here would be to add another
step in heat to have the guarantee that all mongo servers are up and
running before the dbsync call.
Change-Id: I10c960b1e0efdeb1e55d77c25aebf1e3e67f17ca
Closes-Bug: #1627453
|
|
In bug https://bugs.launchpad.net/tripleo/+bug/1615035 we fixed the
scheduler_host setting which got deprecated in newton. It seems also the
scheduler_driver settings needs tweaking:
systemctl status openstack-nova-scheduler.service:
2016-09-24 20:24:54.337 15278 WARNING stevedore.named [-] Could not load nova.scheduler.filter_scheduler.FilterScheduler
2016-09-24 20:24:54.338 15278 CRITICAL nova [-] RuntimeError: (u'Cannot load scheduler driver from configuration %(conf)s.',
{'conf': 'nova.scheduler.filter_scheduler.FilterScheduler'})
Let's set this to default during the upgrade step. From newton's nova.conf:
The class of the driver used by the scheduler. This should be chosen
from one of the entrypoints under the namespace 'nova.scheduler.driver'
of file 'setup.cfg'. If nothing is specified in this option, the
'filter_scheduler' is used.
This option also supports deprecated full Python path to the class to
be used. For example, "nova.scheduler.filter_scheduler.FilterScheduler".
But note: this support will be dropped in the N Release.
Change-Id: Ic384292ad05a57757158995ec4c1a269fe4b00f1
Depends-On: I89124ead8928ff33e6b6907a7c2178169e91f4e6
Closes-Bug: #1627450
|
|
With commit fb25385d34e604d2f670cebe3e03fd57c14fa6be
"Rework the pacemaker_common_functions for M..N upgrades" we
accidentally removed some lines that fixed M/N upgrade issues.
Namely:
extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh
-# https://bugzilla.redhat.com/show_bug.cgi?id=1284047
-# Change-Id: Ib3f6c12ff5471e1f017f28b16b1e6496a4a4b435
-crudini --set /etc/ceilometer/ceilometer.conf DEFAULT rpc_backend rabbit
-# https://bugzilla.redhat.com/show_bug.cgi?id=1284058
-# Ifd1861e3df46fad0e44ff9b5cbd58711bbc87c97 Swift Ceilometer middleware no longer exists
-crudini --set /etc/swift/proxy-server.conf pipeline:main pipeline "catch_errors healthcheck cache ratelimit tempurl formpost authtoken keystone staticweb proxy-logging proxy-server"
-# LP: 1615035, required only for M/N upgrade.
-crudini --set /etc/nova/nova.conf DEFAULT scheduler_host_manager host_manager
extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
nova-manage db sync
- nova-manage api_db sync
This patch simply puts that code back without reverting the
whole commit that broke things, because that is needed.
Closes-Bug: #1627448
Change-Id: I89124ead8928ff33e6b6907a7c2178169e91f4e6
|
|
Running upgrade-non-controller.sh against compute and object storage did
not fail if the /root/tripleo_upgrade_node.sh failed.
This make it harder to detect error in CI system for instance.
Change-Id: I12b7d640547d3b8ec1f70104d159d6052b7638ff
Closes-Bug: 1620973
|
|
This is the initial work to have a function that migrates a full HA
architecture as deployed in Mitaka to the HA architecture as deployed in
Newton where only a few resources are managed by pacemaker.
The sequence is the following:
1) We remove the desired services from pacemaker's control. The services
at this point are still running normally via the systemd service as
invoked by pacemaker
2) We do a "systemctl stop <service>" on all controllers for all the
services that were removed from pacemaker's control. We do this to make
sure that during the yum upgrade, the %post sections that call
"systemctl try-restart" do not take ages, because at this point during
the upgrade rabbit is down. The only exceptions are "openstack-core"
and "delay" which are dummy pacemaker resources that do not exist on
the system
3) We do a "systemctl start <service>" on all nodes for all the services
mentioned above.
We should probably merge this patch only when newton has branched as it
is very specific to the M/N upgrade.
Closes-Bug: 1617520
Change-Id: I4c409ce58c1a57b6e0decc3cf168b62698b32e39
|
|
Change-Id: I7a041dab8b1b1edc9c80248e1eef3ce7ab272292
Closes-Bug: 1615056
|
|
For N we cannot assume services are managed by pacemaker.
This adds functions to check if a service is systemd or
pcmk managed and start/stops it accordingly. For pcmk,
only stop/disable on bootstrap node for example, whereas
systemd should stop/start on all controllers.
There is also an equivalent change to the check_resource
which has been reworked to allow both pcmk and systemd.
Implements: blueprint overcloud-upgrades-workflow-mitaka-to-newton
Change-Id: Ic8252736781dc906b3aef8fc756eb8b2f3bb1f02
|
|
|
|
|
|
|
|
The batch_create and rolling_update keys were incorrectly defined
as properties of the resource instead of update policies.
Change-Id: I19261adc78e4cdc3616f16221e85490a6b48d47b
Closes-Bug: 1623506
|
|
The Ceph upgrade scripts was failing on the following:
1. a syntax error in an if condition
2. an attempt to read a possibly unbound variable
3. an attempt to chown a directory which might not exist
this change aims at fixing all of the above.
Closes-Bug: 1623942
Change-Id: I9e9d63d4ab7626893aaf2a25dccfcafbb97ccbdf
|
|
We need to remove the hard-coded roles from overcloud.j2.yaml
as now it's valid to e.g remove BlockStorage completely.
The previous behavior for the per-role upgrade scripts is maintained
but we'll need to rework this for newton->ocata upgrades where we
can no longer be sure the servers mapping will contain all roles.
Change-Id: I25e6c84757e3c00fba2aae834cd8206c62e44acf
Partially-Implements: blueprint custom-roles
|
|
We make it clear that recoverable checks happen before starting the
upgrade to be able to run the upgrade after the offending error has been
manually corrected.
Add new check for the pcsd cluster status.
Add new check for galera password file: BZ 1357112
Closes-Bug: 1614907
Change-Id: If736c79121e1ffe0eaeb814bdb73ccbc0b64edcd
|
|
|
|
|
|
|
|
With new pacemaker architecture, Puppet handles restarts of most of the
services. There are several still managed by pacemaker which need
special restart handling utilizing pacemaker and its resource
agents.
The counterpart in puppet-tripleo requests restarts for individual
pacemaker-managed services by writing out "restart flag" files, and the
pacemaker_resource_restart.sh script then performs the restarts.
Change-Id: Ia4e6a9f88181f1981993f046cf415dbbcdc9570e
Closes-Bug: #1614967
Closes-Bug: #1587015
Depends-On: I6369ab0c82dbf3c8f21043f8aa9ab810744ddc12
|
|
|
|
This will prevent the Ceph Mon upgrade script from starting if the
Ceph cluster is in error state.
It also adds a parameter to ignore warning states, useful when
performing an upgrade of a cluster where the number of healthy
OSDs does not guarantee the desired replica size.
Closes-Bug: 1618533
Change-Id: I1beb8ad0812f19b1018ba19b5a9fc85fa132d7f7
|
|
Upgrades the ceph-osd daemon from Hammer to Jewel
Change-Id: Idfa90fdc0052c53f448401c85c5d13a2ba68acd1
|
|
|
|
|
|
Adds a pre-requisite software deployment to the pacemaker scenario
upgrade which, before the openstack services are upgraded,
upgrades the ceph-mon daemon from Hammer to Jewel.
Change-Id: I9855d80a6ae156b4a9e0409c3c927068b9db95a0
|
|
|
|
|
|
We have to recreate the /var/lib/mysql directory on all controller node,
not just the boostrap node for the cluster to be able to restart.
Adding a warning on the fact that those script are local and know
nothing about the good upgrade state of the other nodes.
Closes-Bug: 1612642
Change-Id: I48e2812d7df80bbf2db53a8b71dc434d4209a160
|
|
There is a typo in the code, making this test always successful.
Closes-Bug: 1614437
Change-Id: Ia6b0b156294de9fcb8f66fc46aa8801555775a56
|
|
scheduler_host_manager doesn't take
nova.scheduler.host_manager.HostManager as a value anymore. This fix it
before restarting the service.
Change-Id: Ia9adcfd5a898f0c712b4a37ae33db88a44630f0d
Closes-Bug: 1615035
|
|
|
|
|
|
These are not needed any more as they were specific to
mitaka upgrades.
Change-Id: I0d421b942e620403f88374e1c82105747d8d84c9
|