aboutsummaryrefslogtreecommitdiffstats
path: root/extraconfig/tasks
AgeCommit message (Collapse)AuthorFilesLines
2016-01-05Bump the pacemaker service op_params to 200s for start and stopmarios1-2/+2
Based on observed timeouts during updates bump the stop and start timeouts for pacemaker service resources (via op_params) to 200. This is based on the reasoning that the full timeout may be as long as two elapsed timeout intervals. After an initial timeout, the sigterm that follows is then allowed another DefaultTimeoutStopSec seconds. The 200s is produced by allowing this 2xDefaultTimeoutStopSec (@90s for systemd) and some scheduling delta. Many thanks to Michele Baldessari. Closes-Bug: 1531204 Change-Id: If6b43982c958f63bc78ad997400bf1279c23df7e
2016-01-05Merge "Wait for cluster to settle in yum_update.sh"Jenkins1-1/+11
2016-01-04Wait for cluster to settle in yum_update.shJiri Stransky1-1/+11
Occasionally we hit "Error: unable to push cib" during update. This is probably due to the fact that when we try to replace cib in yum_update.sh, services on the previous updated controller are still coming up and changing cib, and racing/conflicting with the cib push from yum_update.sh. This commit adds waiting for the cluster to settle before exiting from yum_update.sh, to avoid this kind of conflict. Also a check for cib-push success is added, to make the update fail properly instead of hanging indefinitely as we've observed with this issue. Change-Id: I953087e0e565474ac553fd57bea2459d2e3a6081 Closes-Bug: #1527644
2015-12-15Add fixup for pcs order constraints after update to new templatesmarios1-0/+6
In https://review.openstack.org/#/c/248572/ yum_update.sh sets the pcs constraints before restarting the cluster. However after post-update pacemaker run, the previous constraint of neutron-server...neutron-ovs-cleanup is re-added. Explicitly remove this before the post-update restart of certain services Change-Id: I84dd650dcc66ce3f48926cf369b7d691014c2254
2015-12-14Pacemaker maintenance mode for the duration of Puppet run on updateSteven Hardy4-0/+147
This enables pacemaker maintenantce mode when running Puppet on stack update. Puppet can try to restart some overcloud services, which pacemaker tries to prevent, and this can result in a failed Puppet run. At the end of the puppet run, certain pacemaker resources are restarted in an additional SoftwareDeployment to make sure that any config changes have been fully applied. This is only done on stack updates (when UpdateIdentifier is set to something), because the assumption is that on stack create services already come up with the correct config. (Change I9556085424fa3008d7f596578b58e7c33a336f75 has been squashed into this one.) Change-Id: I4d40358c511fc1f95b78a859e943082aaea17899 Co-Authored-By: Jiri Stransky <jistr@redhat.com> Co-Authored-By: James Slagle <jslagle@redhat.com>
2015-12-03Merge "Add pcmk constraints against haproxy-clone only if applicable"Jenkins1-32/+34
2015-12-03Merge "Apply mongod timeout via cib-push"Jenkins1-1/+1
2015-12-02Add pcmk constraints against haproxy-clone only if applicableGiulio Fidente1-32/+34
When the Overcloud does not host an instance of haproxy, pcmk will not have any resource named haproxy-clone so we should not add any constraint relying on it. Change-Id: I801f07b7570f3805aa71c22998fec6b6f192b350
2015-11-25Merge "Update: clean keepalived and radvd instances after pcs cluster stop"Jenkins1-0/+7
2015-11-25Apply mongod timeout via cib-pushGiulio Fidente1-1/+1
We forgot to apply the mongod timeout in the cib dump first, to apply it later in a single cib-push step. Change-Id: Ib104e51782c6d3f646907cdb06c74fd4cbf9028c
2015-11-24Update: clean keepalived and radvd instances after pcs cluster stopJiri Stransky1-0/+7
Older neutron versions have a bug which makes them leave keepalived and radvd running even after all neutron services are stopped, preventing neutron router failover from happening. Router can then get stuck on the inactive node, like this: [stack@instack ~]$ neutron l3-agent-list-hosting-router default_router +--------------------------------------+------------------------------------+----------------+-------+----------+ | id | host | admin_state_up | alive | ha_state | +--------------------------------------+------------------------------------+----------------+-------+----------+ | 48ca9477-b93b-4305-9e6d-9f1c5d3388f0 | overcloud-controller-1.localdomain | True | :-) | standby | | eba0575c-654f-4da6-b1cd-f7fdf1cd3726 | overcloud-controller-2.localdomain | True | :-) | standby | | 68815390-251f-4425-a5f8-38bdbf3bdb90 | overcloud-controller-0.localdomain | True | xxx | active | +--------------------------------------+------------------------------------+----------------+-------+----------+ We need to kill the leftover processes manually to prevent the state described above from happening. See https://review.gerrithub.io/#/c/248931 Change-Id: I2deaa176222983daa0c33ab52a6aa5dbe7365302
2015-11-23Fixup neutron constraints in older overclouds before updatingmarios1-0/+10
The neutron pcs constraints were reworked in https://review.openstack.org/#/c/229466/ For overclouds deployed with older tripleo-heat-templates the current pcs ordering constraints will not have those changes, meaning that the behaviour discussed at https://bugs.launchpad.net/tripleo/+bug/1501378 is likely given we will stop and restart all services. This review applies those, in short, remove the ovs-cleanup after neutron-server and add openvswitch-agent instead. Detail in the bug report and linked BZ. Change-Id: I45822c5fe9029f11635400b7fbd386880ac80a4e Related-Bug: 1501378
2015-11-19Add constraints and timeouts from file in single stepGiulio Fidente1-78/+50
To avoid pcmk reconfiguring the resources on each config change, we want to apply the constraints and timeouts from file. We also *do not* want to alter the timeouts for a few ocf resources which are rabbitmq, neutron-netns-cleanup and neutron-ovs-cleanup Change-Id: I6875f19e1f34f0fdcf0928421f49b61d857ca7c8 Co-Authored-By: Andrew Beekhof <abeekhof@redhat.com>
2015-11-17Verify galera is sync'd in yum_update.shJames Slagle1-0/+12
When the cluster is brought back online after a yum update in yum_update.sh, we should verify that galera is fully sync'd before moving on. This ensures the sync is complete before moving on to update any other nodes in the cluster. Change-Id: Ie8fc2c5d5214deacea94ca658ac75359b318ced1
2015-11-13Set start/stop pacemaker resource timeouts for updatesJiri Stransky1-0/+72
This matches change I6fc18f1ad876c5a25723710a3b20d8ec9519dcba, but we need it to set it before attempting the cluster stop - yum update - cluster start cycle, to make sure this cycle doesn't hit the low timeout limits. This can be removed once updates from deployments made prior to I6fc18f1ad876c5a25723710a3b20d8ec9519dcba are no longer supported. Change-Id: I587136d8d045d213875c657ea5a405074f80c8ad
2015-11-11Add missing constraints in yum_update.shJames Slagle1-0/+30
Some missing pacemaker constraints were added in the following commits: https://review.openstack.org/#/c/219770/ https://review.openstack.org/#/c/219665/ https://review.openstack.org/#/c/218931/ https://review.openstack.org/#/c/218930/ Overclouds that were deployed prior to these constraints being added to tripleo-heat-templates still have the constraints missing. During an update, stopping and starting the cluster can fail without these constraints in place. As a workaround, conditionally add these contraints in yum_update.sh so that we're sure they're always present before updating. Change-Id: Id46c85dbbe5e85d362279661091b17ce1b697fe0
2015-10-01Force stop a single node pacemaker on yum updateSteve Baker1-1/+7
Currently package updates won't occur on a single node non-HA pacemaker managed Controller because stopping the node loses the quorum of 1. This change gets the count of current nodes in the cluster and if the count is 1 then specify --force when doing a pcs cluster stop. Change-Id: I0de2488e24f1ef53a935dbc90ec6de6142bb4264
2015-10-01Make package upgrade pacemaker-awareSteve Baker1-7/+45
This change adds alternative logic for handling package updates on a pacemaker managed node. "yum list updates" is now run and this script exits early if there are no packages to update. If the pacemaker service is not running then the previous puppet logic remains, so a package update is performed which excludes packages managed by puppet, and a flag is set to indicate that puppet should perform an ensure=>latest on all packages it manages. However if the pacemaker service is running, the following occurs: - pcs cluster stop is run for this node - a full yum update is performed - pcs cluster start is run for this node - pcs status is run until the hostname for this node appears in the Online list This means that puppet is not involved in the package update process when the node is managed by pacemaker. Change-Id: I5ad118552d053dbda280978751167d9fd9da9874
2015-10-01Ensure present/latest for puppet driven package updatesSteve Baker2-1/+13
This change updates yum_update.sh so that we set set a boolean output when "managed" packages should get updated. The output is named 'update_managed_packages' and for the puppet implementation it is wired up so that it directly sets tripleo::packages::enable_upgrade to control whether packages are updated. It also modifies yum_update.sh to build a yum update excludes list for packages managed by puppet. The exclude lists are being generated via puppet-tripleo as well via the new 'write_package_names' function that is now wired into all the role manifests. This change does not actually trigger the puppet apply. The fix for Related-Bug: #1463092 will be used to trigger the puppet run when the hiera changes. As a minor tweak to this logic we append the UpdateIdentifier to the config_identifier so that we ensure puppet gets executed on an update where other (non-related) hiera changes also occur. Co-Authored-By: Dan Prince <dprince@redhat.com> Change-Id: I343c3959517eae38bbcd43648ed56f610272864d
2015-06-08Config & deployments to update overcloud packagesSteve Baker2-0/+67
This change adds config and deployment resources to trigger package updates on nodes. The deployments are triggered by doing a stack-update and setting one of the parameters to a unique value. The intent is that rolling update will be controlled by setting breakpoints on all of the UpdateDeployment resources inside the role resource groups. Change-Id: I56bbf944ecd6cbdbf116021b8a53f9f9111c134f