1 files changed, 152 insertions, 0 deletions
diff --git a/networking-odl/doc/source/specs/journal-recovery.rst b/networking-odl/doc/source/specs/journal-recovery.rst
new file mode 100644
index 0000000..1132485
--- /dev/null
+++ b/networking-odl/doc/source/specs/journal-recovery.rst
@@ -0,0 +1,152 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+================
+Journal Recovery
+================
+
+https://blueprints.launchpad.net/networking-odl/+spec/journal-recovery
+
+Journal entries in the failed state are currently never handled. This spec
+describes the issue and proposes a recovery routine for such entries.
+
+Problem Description
+===================
+
+Currently there is no handling for Journal entries that reach the failed state.
+A journal entry can reach the failed state for several reasons, some of which
+are:
+
+* Reached maximum failed attempts for retrying the operation.
+
+* Inconsistency between ODL and the Neutron DB.
+
+  * For example: An update fails because the resource doesn't exist in ODL.
+
+* Bugs that can lead to failure to sync up.
+
+These entries are left in the journal table forever, which is wasteful: they
+take up space in DB storage and also degrade the performance of the journal
+table.
+Although each entry has a negligible effect on its own, the impact of a large
+number of such entries can become quite significant.
+
+Proposed Change
+===============
+
+A "journal recovery" routine will run as part of the current journal
+maintenance process.
+This routine will scan the journal table for rows in the "failed" state and
+will try to sync the resource for that entry.
+
+The procedure can be best described by the following flow chart:
+
+asciiflow::
+
+  +-----------------+
+  | For each entry  |
+  | in failed state |
+  +-------+---------+
+          |
+  +-------v--------+
+  | Query resource |
+  | on ODL (REST)  |
+  +-----+-----+----+
+        |     |                          +-----------+
+     Resource |                          | Determine |
+     exists   +--Resource doesn't exist--> operation |
+        |                                | type      |
+  +-----v-----+                          +-----+-----+
+  | Determine |                                |
+  | operation |                                |
+  | type      |                                |
+  +-----+-----+                                |
+        |              +------------+          |
+        +--Create------> Mark entry <--Delete--+
+        |              | completed  |          |
+        |              +----------^-+       Create/
+        |                         |         Update
+        |                         |            |
+        |          +------------+ |      +-----v-----+
+        +--Delete--> Mark entry | |      | Determine |
+        |          | pending    | |      | parent    |
+        |          +---------^--+ |      | relation  |
+        |                    |    |      +-----+-----+
+  +-----v------+             |    |            |
+  | Compare to +--Different--+    |            |
+  | resource   |                  |            |
+  | in DB      +--Same------------+            |
+  +------------+                               |
+                                               |
+  +-------------------+                        |
+  | Create entry for  <-----Has no parent------+
+  | resource creation |                        |
+  +--------^----------+                  Has a parent
+           |                                   |
+           |                         +---------v-----+
+           +------Parent exists------+ Query parent  |
+                                     | on ODL (REST) |
+                                     +---------+-----+
+  +------------------+                         |
+  | Create entry for <---Parent doesn't exist--+
+  | parent creation  |
+  +------------------+
+
+If an error occurs while processing an entry, that entry remains in the failed
+state, but the error shouldn't stop the processing of further entries.
+
+
+The implementation could be done in two phases, with the parent handling
+done in a second phase.
+In the first phase, if we detect an entry in the failed state for a
+create/update operation and the resource doesn't exist on ODL, we create a new
+"create resource" journal entry for the resource.
+
+This proposal utilises the journal mechanism for its operation; the only
+part that deviates from the standard mode of operation is when it queries ODL
+directly. This direct query has to be done to get ODL's representation of the
+resource.
+
+Performance Impact
+------------------
+
+The maintenance thread will have another task to handle. This can lead to
+longer processing time and even cause the thread to skip an iteration.
+This is not an issue since the maintenance thread runs in parallel and doesn't
+directly impact the responsiveness of the system.
+
+Since most operations here involve I/O, CPU usage probably won't be impacted.
+
+Network traffic would be impacted slightly, since we will attempt to fetch the
+resource from ODL each time and we might also attempt to fetch its parent.
+This is, however, negligible, as we do this only for failed entries, which are
+expected to appear rarely.
+
+
+Alternatives
+------------
+
+The partial sync process could make this process obsolete (along with full
+sync), but it's a far more complicated and problematic process.
+It's better to start with this process, which is more lightweight and doable,
+and to consider partial sync in the future.
+
+
+Assignee(s)
+===========
+
+Primary assignee:
+  mkolesni <mkolesni@redhat.com>
+
+Other contributors:
+  None
+
+
+References
+==========
+
+https://goo.gl/IOMpzJ
+
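
As a reading aid for the flow chart in the spec above, its decision logic could be sketched in Python roughly as follows. This is only an illustrative sketch and not part of the patch: the `Action` enum, the `decide_recovery_action` function and all of its parameters are hypothetical names, and the ODL and DB lookups are assumed to have been done by the caller and passed in as booleans.

```python
from enum import Enum


class Action(Enum):
    """Outcomes named in the flow chart (hypothetical enum, not project code)."""
    MARK_COMPLETED = "mark entry completed"
    MARK_PENDING = "mark entry pending"
    CREATE_RESOURCE_ENTRY = "create entry for resource creation"
    CREATE_PARENT_ENTRY = "create entry for parent creation"


def decide_recovery_action(operation, exists_on_odl, matches_db=False,
                           has_parent=False, parent_exists_on_odl=False):
    """Pick the recovery action for one failed journal entry.

    All arguments are assumptions standing in for real lookups:
    operation            -- "create", "update" or "delete"
    exists_on_odl        -- result of querying the resource on ODL
    matches_db           -- ODL resource equals the Neutron DB resource
    has_parent           -- the resource has a parent resource
    parent_exists_on_odl -- result of querying the parent on ODL
    """
    if operation not in ("create", "update", "delete"):
        raise ValueError("unknown operation: %r" % (operation,))

    if exists_on_odl:
        if operation == "create":
            # The resource is already there, nothing left to do.
            return Action.MARK_COMPLETED
        if operation == "delete":
            # The resource still exists, so the delete must be retried.
            return Action.MARK_PENDING
        # Update: retry only if ODL differs from the Neutron DB.
        return Action.MARK_COMPLETED if matches_db else Action.MARK_PENDING

    # The resource does not exist on ODL.
    if operation == "delete":
        return Action.MARK_COMPLETED
    # Create/update: the resource has to be (re)created, parent first if needed.
    if not has_parent or parent_exists_on_odl:
        return Action.CREATE_RESOURCE_ENTRY
    return Action.CREATE_PARENT_ENTRY
```

For example, a failed update whose resource and parent are both missing on ODL maps to `Action.CREATE_PARENT_ENTRY`, which corresponds to the second-phase parent handling described in the spec.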