Diffstat (limited to 'networking-odl/doc/source/specs/journal-recovery.rst')
-rw-r--r-- | networking-odl/doc/source/specs/journal-recovery.rst | 152 |
1 files changed, 152 insertions, 0 deletions
diff --git a/networking-odl/doc/source/specs/journal-recovery.rst b/networking-odl/doc/source/specs/journal-recovery.rst
new file mode 100644
index 0000000..1132485
--- /dev/null
+++ b/networking-odl/doc/source/specs/journal-recovery.rst
@@ -0,0 +1,152 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+================
+Journal Recovery
+================
+
+https://blueprints.launchpad.net/networking-odl/+spec/journal-recovery
+
+Journal entries in the failed state need to be handled somehow. This spec will
+try to address the issue and propose a solution.
+
+Problem Description
+===================
+
+Currently there is no handling for Journal entries that reach the failed state.
+A journal entry can reach the failed state for several reasons, some of which
+are:
+
+* Reached maximum failed attempts for retrying the operation.
+
+* Inconsistency between ODL and the Neutron DB.
+
+  * For example: An update fails because the resource doesn't exist in ODL.
+
+* Bugs that can lead to failure to sync up.
+
+These entries will be left in the journal table forever which is a bit wasteful
+since they take up some space on the DB storage and also affect the performance
+of the journal table.
+Although each entry has a negligible effect on its own, the impact of a large
+number of such entries can become quite significant.
+
+Proposed Change
+===============
+
+A "journal recovery" routine will run as part of the current journal
+maintenance process.
+This routine will scan the journal table for rows in the "failed" state and
+will try to sync the resource for that entry.
+
+The procedure can be best described by the following flow chart:
+
+asciiflow::
+
+ +-----------------+
+ | For each entry  |
+ | in failed state |
+ +-------+---------+
+         |
+ +-------v--------+
+ | Query resource |
+ | on ODL (REST)  |
+ +-----+-----+----+
+       |     |                          +-----------+
+   Resource  |                          | Determine |
+    exists   +--Resource doesn't exist--> operation |
+       |                                | type      |
+ +-----v-----+                          +-----+-----+
+ | Determine |                                |
+ | operation |                                |
+ | type      |                                |
+ +-----+-----+                                |
+       |              +------------+          |
+       +--Create------> Mark entry <--Delete--+
+       |              | completed  |          |
+       |              +----------^-+       Create/
+       |                         |         Update
+       |                         |            |
+       |          +------------+ |      +-----v-----+
+       +--Delete--> Mark entry | |      | Determine |
+       |          | pending    | |      | parent    |
+       |          +---------^--+ |      | relation  |
+       |                    |    |      +-----+-----+
+ +-----v------+             |    |            |
+ | Compare to +--Different--+    |            |
+ | resource   |                  |            |
+ | in DB      +--Same------------+            |
+ +------------+                               |
+                                              |
+ +-------------------+                        |
+ | Create entry for  <-----Has no parent------+
+ | resource creation |                        |
+ +--------^----------+                  Has a parent
+          |                                   |
+          |                         +---------v-----+
+          +------Parent exists------+ Query parent  |
+                                    | on ODL (REST) |
+                                    +---------+-----+
+ +------------------+                         |
+ | Create entry for <---Parent doesn't exist--+
+ | parent creation  |
+ +------------------+
+
+For every error during the process, the entry will remain in the failed state,
+but the error shouldn't stop processing of further entries.
+
+
+The implementation could be done in two phases where the parent handling is
+done in a second phase.
+For the first phase, if we detect a failed entry for a create/update operation
+and the resource doesn't exist on ODL, we create a new "create resource"
+journal entry for the resource.
+
+This proposal utilises the journal mechanism for its operation while the only
+part that deviates from the standard mode of operation is when it queries ODL
+directly. This direct query has to be done to get ODL's representation of the
+resource.
+
+Performance Impact
+------------------
+
+The maintenance thread will have another task to handle. This can lead to
+longer processing time and even cause the thread to skip an iteration.
+This is not an issue since the maintenance thread runs in parallel and doesn't
+directly impact the responsiveness of the system.
+
+Since most operations here involve I/O, CPU usage probably won't be impacted.
+
+Network traffic would be impacted slightly since we will attempt to fetch the
+resource each time from ODL and we might attempt to fetch its parent.
+This is however negligible as we do this only for failed entries, which are
+expected to appear rarely.
+
+
+Alternatives
+------------
+
+The partial sync process could make this process obsolete (along with full
+sync), but it's a far more complicated and problematic process.
+It's better to start with this process which is more lightweight and doable
+and consider partial sync in the future.
+
+
+Assignee(s)
+===========
+
+Primary assignee:
+  mkolesni <mkolesni@redhat.com>
+
+Other contributors:
+  None
+
+
+References
+==========
+
+https://goo.gl/IOMpzJ
+
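
The decision flow in the spec's "Proposed Change" section could be sketched roughly as follows. This is only an illustration of the flow chart, not code from networking-odl: the `Action` enum, `decide`, `recover`, the dict-shaped entries and the two fetch callbacks are all hypothetical names introduced here. It also assumes that the unlabelled path into "Compare to resource in DB" is the update case, and that `parent_exists_on_odl=None` means the resource has no parent.

```python
from enum import Enum


class Action(Enum):
    """Outcomes named in the flow chart (enum itself is hypothetical)."""
    MARK_COMPLETED = "mark entry completed"
    MARK_PENDING = "mark entry pending"
    CREATE_RESOURCE_ENTRY = "create entry for resource creation"
    CREATE_PARENT_ENTRY = "create entry for parent creation"


def decide(operation, odl_resource, db_resource, parent_exists_on_odl=None):
    """Pick a recovery action for one failed journal entry.

    odl_resource is None when the resource was not found on ODL.
    parent_exists_on_odl is None when the resource has no parent,
    otherwise the result of querying the parent on ODL.
    """
    if odl_resource is not None:
        if operation == "create":
            return Action.MARK_COMPLETED   # already present on ODL
        if operation == "delete":
            return Action.MARK_PENDING     # still present, retry the delete
        # Assumed to be the update case: compare ODL's view with the DB.
        if odl_resource == db_resource:
            return Action.MARK_COMPLETED
        return Action.MARK_PENDING

    # Resource does not exist on ODL.
    if operation == "delete":
        return Action.MARK_COMPLETED       # nothing left to delete
    if parent_exists_on_odl is False:
        return Action.CREATE_PARENT_ENTRY
    return Action.CREATE_RESOURCE_ENTRY    # no parent, or parent exists


def recover(failed_entries, fetch_from_odl, fetch_from_db):
    """Process every failed entry; an error leaves that entry failed (None)."""
    results = []
    for entry in failed_entries:
        try:
            odl_resource = fetch_from_odl(entry)
            db_resource = fetch_from_db(entry)
            action = decide(entry["operation"], odl_resource, db_resource)
        except Exception:
            action = None                  # entry stays failed, keep going
        results.append((entry["id"], action))
    return results
```

Calling `recover` without parent information corresponds to the first phase described in the spec: a failed create/update whose resource is missing on ODL simply yields `Action.CREATE_RESOURCE_ENTRY`.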