Age | Commit message | Author | Files | Lines |
|
It turns out that reducing the number of RabbitMQ queues in the cluster
significantly improves cluster performance, especially failover recovery
time. Right now the cluster uses the ha-all mode for RabbitMQ queues.
It is best to change this to "ha-exactly" mode and reduce the number
of queue copies to ceil(N/2), where N is the number of controllers in
the cluster - so in the typical scenario of 3 controllers it would be 2
by default.
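As a minimal sketch (the CONTROLLER_COUNT variable and the pcs invocation
below are illustrative, not taken from this change), the replica count and
policy could be derived like this:
# ceil(N/2) via integer arithmetic: (N + 1) / 2
CONTROLLER_COUNT=3
REPLICAS=$(( (CONTROLLER_COUNT + 1) / 2 ))
# switch the mirroring policy to "exactly" with that many copies
pcs resource update rabbitmq set_policy="ha-all ^(?!amq\.).* {\"ha-mode\":\"exactly\",\"ha-params\":${REPLICAS}}"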
It does not make much sense to keep copies of the queues across the whole
cluster, since if the quorum of nodes is lost the remaining cluster nodes
will be stopped anyway. We let the user override this with a parameter.
E.g. for a 3-node control plane cluster we will go from this:
pcs resource show rabbitmq
Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"
To this:
pcs resource show rabbitmq
Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"exactly","ha-params":2}"
According to Marin Krcmarik's testing, recovery time from failure was
reduced significantly.
Partial-Bug: #1628998
Change-Id: Iace6daf27a76cb8ef1050ada0de7ff1f530916c6
|
This removes the (nearly empty) per-role manifests and replaces them
with a generic manifest, where we use str_replace to substitute the role
name at runtime (or in some cases a subset of the name, for backwards
compatibility).
Change-Id: I79da0f523189959b783bbcbb3b0f37be778e02fe
Partial-Bug: #1626976
|
They are now normalized and set in puppet-tripleo.
Change-Id: I197481c577b85894178e7899a55869da47847755
Closes-Bug: #1629279
Depends-On: Ic6de09acf0d36ca90cc2041c0add1bc2b4a369a5
|
Add a redis_password parameter in Hiera so we can re-use it from
puppet-tripleo later for Aodh, Ceilometer and Gnocchi.
Change-Id: I038e2bac22e3bfa5047d2e76e23cff664546464d
Partial-Bug: #1629279
|
This will be used for internal (or even public) TLS, for when
certmonger is generating the certificates. This same setting is used
for the undercloud with the generate_service_certificate option.
Change-Id: Ic54fe512b9ed5c71417a66491b7954e653f660b6
|
Move the rest of the static resource registry entries to j2; this
allows the content of the template to be extended based on the
roles_list. Also move the templates to correspond with the role names.
Partial-Bug: #1626976
Change-Id: I1cbe101eb4ce5a89cba5f2cc45cace43d3380f22
|
Change-Id: I44451a280dd928cd694dd6845d5d83040ad1f482
Related-Bug: #1626592
|
Previously the chown command wasn't traversing symlinks, causing
the new ownership not to be set for some needed files.
This change also ensures the crush map tunables are set to the 'default'
profile after the upgrade.
Finally, it redirects the output of a pidof call to /dev/null to avoid
spurious logging.
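A rough sketch of the three fixes (the paths and daemon name below are
illustrative assumptions, not copied from the change):
# traverse symlinks while recursing so link targets get the new ownership too
chown -R -L ceph:ceph /var/lib/ceph /var/log/ceph
# reset the crush map tunables to the default profile after the upgrade
ceph osd crush tunables default
# silence the pidof call used to detect a running daemon
if pidof ceph-mon > /dev/null 2>&1; then
    echo "ceph-mon is still running"
fi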
Change-Id: Id4865ffff207edfc727d729f9cc04e6e81ad19d8
|
The default resource-registry file contains a bunch of per-role
entries, which means you need to cut/paste them into a custom environment
file for custom roles, even if you only want the same defaults as the
built-in roles. Using j2 we can template these just like in
overcloud.j2.yaml and other files.
Change-Id: I52a9bffd043ca8fb0f05077c8a401a68def82926
Partial-Bug: #1626976
|
Before this change we checked the cluster for any failed actions and
stopped the upgrade process if there were any.
This is likely excessive, as a failed action could have happened in the
past while the cluster is now fully functional.
It is better to check whether any of the resources are in Stopped state
and abort the upgrade process if any of them are (a sketch of such a
check follows the list below).
We also need to restrict this check to the bootstrap node because
otherwise the following might happen:
1) The bootstrap node does the check successfully and starts
the full HA -> HA NG migration, which *will* create failed actions
and will start stopping resources.
2) If the check now starts on a non-bootstrap node while 1) is ongoing,
it will find either failed actions or stopped resources, so it will
fail.
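A minimal sketch of that kind of check (the grep pattern is an assumption
for illustration; is_bootstrap_node and echo_error are the helpers quoted
elsewhere in this log, not necessarily the exact code of this change):
if [[ -n $(is_bootstrap_node) ]]; then
    # abort only when a resource is actually reported as Stopped,
    # instead of failing on any historical failed action
    if pcs status | grep -qE '\bStopped\b'; then
        echo_error "ERROR: some pacemaker resources are Stopped, aborting upgrade"
        exit 1
    fi
fi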
Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f
Closes-Bug: #1628653
|
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
has the following code:
...
check_resource mongod started 600
if [[ -n $(is_bootstrap_node) ]]; then
    ...
    tstart=$(date +%s)
    while ! clustercheck; do
        sleep 5
        tnow=$(date +%s)
        if (( tnow-tstart > galera_sync_timeout )) ; then
            echo_error "ERROR galera sync timed out"
            exit 1
        fi
    done
    # Run all the db syncs
    cinder-manage db sync
    ...
fi
start_or_enable_service rabbitmq
check_resource rabbitmq started 600
start_or_enable_service redis
check_resource redis started 600
start_or_enable_service openstack-cinder-volume
check_resource openstack-cinder-volume started 600
systemctl_swift start
for service in $(services_to_migrate); do
    manage_systemd_service start "${service%%-clone}"
    check_resource_systemd "${service%%-clone}" started 600
done
"""
The problem with the above code is that it is open to the following race
condition:
1) The bootstrap node is busy checking the galera status via clustercheck.
2) A non-bootstrap node has already reached start_or_enable_service
rabbitmq and the later lines. These lines will be skipped because
start_or_enable_service is a noop on non-bootstrap nodes and
check_resource rabbitmq only checks that pcs status | grep rabbitmq
returns true.
3) The non-bootstrap node can then reach manage_systemd_service start,
which will fail with errors like:
"Job for openstack-nova-scheduler.service failed because the control
process exited with error code. See \"systemctl status
openstack-nova-scheduler.service\" and \"journalctl -xe\" for
details.\n" (because the db tables are not migrated yet)
This happens because step 3) starts on non-bootstrap nodes before the
db-sync statements have completed on the bootstrap node. I did not feel
like changing the semantics of check_resource and removing the noop on
non-bootstrap nodes, as other parts of the tree might rely on this
behaviour.
Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
Closes-Bug: #1627965
|
We call gnocchi-upgrade to make sure we update all the needed schemas
during the major-upgrade-pacemaker step.
We also make sure that redis is started before we call gnocchi-upgrade;
otherwise the command will be stuck in a loop trying to contact redis.
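A minimal sketch of the intended ordering, reusing the helper names quoted
earlier in this log (the exact invocation is an assumption, not copied from
the change):
# make sure redis is up before touching the gnocchi schemas
start_or_enable_service redis
check_resource redis started 600
# then upgrade the gnocchi database schemas
gnocchi-upgrade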
Closes-Bug: #1626592
Change-Id: Ia016264b51f485b97fa150ebd357b109581342ed
|
This patch moves the various db::mysql hiera settings into a
'mysql'-specific service_config_settings section for each
service, so that these will only get applied on the MySQL service
node. This follows a similar puppet-tripleo change where we
create the actual databases for all services locally on
the MySQL service node to avoid permission issues.
Change-Id: Ic0692b1f7aa8409699630ef3924c4be98ca6ffb2
Closes-bug: #1620595
Depends-On: I05cc0afa9373429a3197c194c3e8f784ae96de5f
Depends-On: I5e1ef2dc6de6f67d7c509e299855baec371f614d
|
Currently we do the following in the migration path:
pcs property set maintenance-mode=true
if ! timeout -k 10 300 crm_resource --wait; then
    echo_error "ERROR: cluster remained unstable after setting maintenance-mode for more than 300 seconds, exiting."
    exit 1
fi
crm_resource --wait can actually take forever under certain conditions.
The property will be set atomically across the cluster nodes, so we
should be fine without this wait.
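With the wait dropped, the step reduces to just setting the property (a
sketch of the remaining command):
# pacemaker propagates the property to all nodes atomically,
# so there is no need to block on crm_resource --wait afterwards
pcs property set maintenance-mode=true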
Change-Id: I8f531d63479b81d65b572c4431c9db6f610f7e04
Closes-Bug: #1628393
|
After a successful upgrade to Newton, I ran the tripleo.sh
--overcloud-pingtest and it failed with the following:
resources.test_flavor: Not all flavors have been migrated to the API database (HTTP 409)
The issue is that some tables have moved to the nova_api db and we
need to migrate the data as well.
Currently we do:
nova-manage db sync
nova-manage api_db sync
We want to add:
nova-manage db online_data_migrations
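As a sketch of the resulting ordering (the commands are the ones quoted
above; the comments are illustrative):
nova-manage db sync                     # schema for the nova database
nova-manage api_db sync                 # schema for the nova_api database
nova-manage db online_data_migrations   # move data (e.g. flavors) into the nova_api tables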
After launching this command the overcloud-pingtest works correctly:
tripleo.sh -- Overcloud pingtest SUCCEEDED
Change-Id: Id2d5b28b5d4ade7dff6c5e760be0f509b4fe5096
Closes-Bug: #1628450
|
This patch enables correctly setting the NTP server passed via
--ntp-server in the overcloud nodes' /etc/ntp.conf.
Change-Id: Iff644b9da51fb8cd1946ad9d297ba0e94d3d782b
|
Without setting this parameter, overcloud deploy fails and
'openstack stack failures list overcloud' reveals the
following error:
Error: Puppet::Type::Keystone_user_role::ProviderOpenstack: Could
not find project with name [services] and domain [Default]
Error:
/Stage[main]/Manila::Keystone::Auth/Keystone::Resource::Service_identity[manilav2]/Keystone_user_role[manilav2@services]:
Could not evaluate: undefined method `[]' for nil:NilClass
When we set manila::keystone::auth::tenant to 'service', analogous
to cinder, nova, etc., the overcloud deploy completes successfully.
Change-Id: I996ac2ff602c632a9f9ea9c293472a6f2f92fd72
|
As noted in the bug, predictable placement is broken right now
because the %index% in the scheduler hint isn't being interpolated.
This is because the parameter was moved from overcloud.yaml to the
service-specific files, which do not provide the index value.
Because the Compute role's parameter is named NovaCompute... we also
have to include some backwards-compatibility logic to handle the
mismatch.
Change-Id: Ibee2949fe4c6c707203d7250e2ce169c769b1dcd
Closes-Bug: 1627858
|
The parameter IgnoreCephUpgradeWarnings is type-cast into a boolean,
which is rendered as the string 'True' or 'False', not 'true' or
'false'. This fixes the check.
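For illustration only (the variable name and surrounding logic are
assumptions, not the exact script content), the comparison has to use the
capitalised form of the rendered boolean:
# the heat boolean is rendered as the string 'True' or 'False',
# so the check must compare against that exact form
if [[ "${ignore_ceph_upgrade_warnings:-False}" != "True" ]]; then
    echo "Ceph upgrade warnings will be treated as fatal"
fi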
Change-Id: I8840c384d07f9d185a72bde5f91a3872a321f623
Closes-Bug: 1627736
|