
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
     Cyborg Agent Proposal
==========================================

https://blueprints.launchpad.net/openstack-cyborg/+spec/cyborg-agent

This spec proposes the responsibilities and initial design of the
Cyborg Agent.

Problem description
===================

Cyborg requires an agent on the compute hosts to manage several
responsibilities, including locating accelerators, monitoring their
status, and orchestrating driver operations.

Use Cases
---------

Use of accelerators attached to virtual machine instances in OpenStack.

Proposed change
===============

Cyborg Agent resides on various compute hosts and monitors them for accelerators.
On its first run, Cyborg Agent will run the detect-accelerator functions of all
its installed drivers. The resulting list of accelerators available on the host
will be reported to the Conductor, where it will be stored in the database and
listed during API requests. By default, accelerators will be inserted into the
database in an inactive state. It will be up to the operators to manually set
an accelerator to 'ready', at which point Cyborg Agent will be responsible for
calling the driver's install function and ensuring that the accelerator is ready
for use.
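
The flow described above can be sketched roughly as follows. This is only an
illustration: the ``Accelerator`` record, the ``FakeDriver`` class, and the
``detect_accelerators`` and ``install`` method names are assumptions made for
the example and are not defined by this spec or by any existing Cyborg driver.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass
    class Accelerator:
        """Hypothetical record for one detected accelerator."""
        name: str
        driver: str
        state: str = "inactive"  # new accelerators start out inactive


    class FakeDriver:
        """Hypothetical driver exposing detect and install hooks."""

        def __init__(self, name, devices):
            self.name = name
            self._devices = devices
            self.installed = []

        def detect_accelerators(self):
            # A real driver would probe the host; here we return canned names.
            return list(self._devices)

        def install(self, accelerator):
            self.installed.append(accelerator.name)


    def first_run(drivers):
        """Run every driver's detect function and collect inactive records."""
        found = []
        for driver in drivers:
            for device in driver.detect_accelerators():
                found.append(Accelerator(name=device, driver=driver.name))
        return found


    def on_state_change(accelerator, new_state, drivers_by_name):
        """Call the driver's install hook when an operator sets 'ready'."""
        if new_state == "ready" and accelerator.state != "ready":
            drivers_by_name[accelerator.driver].install(accelerator)
        accelerator.state = new_state
        return accelerator

In this sketch the list returned by ``first_run`` is what the Agent would
report to the Conductor, and ``on_state_change`` stands in for whatever
mechanism eventually tells the Agent that an operator changed the state.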

In order to mirror the current Nova model of using the placement API, each Agent
will send updates on its resources directly to the placement API endpoint as well
as to the Conductor for usage aggregation. This should keep the placement API up
to date on accelerators and their usage.
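
A minimal sketch of this dual reporting is shown below. The two ``send_*``
callables are placeholders for whatever transport is eventually chosen; they
are assumptions made for the example and do not correspond to real placement
API or Conductor client calls. The payload layout is likewise hypothetical.

.. code-block:: python

    def report_resources(inventory, send_to_placement, send_to_conductor):
        """Send the same resource update to both consumers.

        A failure to reach one consumer does not prevent the update from
        reaching the other; the result records which deliveries succeeded.
        """
        targets = (
            ("placement", send_to_placement),
            ("conductor", send_to_conductor),
        )
        results = {}
        for name, send in targets:
            # Hypothetical payload; each consumer gets its own copy.
            payload = {"accelerators": dict(inventory)}
            try:
                send(payload)
                results[name] = True
            except OSError:
                # Connection problems are reported rather than raised.
                results[name] = False
        return results

Reporting failures per consumer lets the Agent retry only the update that
did not arrive, which is one possible way to keep the two views consistent.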

Alternatives
------------

There are many alternative ways to lay out the communication between the Agent
and the API endpoint or the driver. Almost all of them involve exactly where we
draw the line between the driver, Conductor, and Agent. I've written my proposal
with the goal of having the Agent act mostly as a monitoring tool, reporting to
the cloud operator or other Cyborg components so that they can take action. A
more active role for Cyborg Agent is possible but either requires significant
synchronization with the Conductor or potentially steps on the toes of operators.

Data model impact
-----------------

Cyborg Agent will create new entries in the database for the accelerators it
detects. It will also update those entries with the current high-level status of
each accelerator. More temporary data, like the current usage of a given
accelerator, will be broadcast via a message passing system and won't be stored.

Cyborg Agent will retain a local cache of this data with the goal of not losing accelerator
state on system interruption or loss of connection.
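
One way such a cache could survive an interruption is to persist it to a local
file. The sketch below assumes a JSON file on disk and an atomic
write-then-rename; neither the file format nor the function names are
prescribed by this spec.

.. code-block:: python

    import json
    import os
    import tempfile


    def save_cache(path, state):
        """Write the cache atomically so a crash never leaves a partial file."""
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as handle:
                json.dump(state, handle)
            # os.replace swaps the finished file into place in one step.
            os.replace(tmp_path, path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise


    def load_cache(path):
        """Return the cached state, or an empty dict if no cache exists yet."""
        try:
            with open(path) as handle:
                return json.load(handle)
        except FileNotFoundError:
            return {}

On restart the Agent would call ``load_cache`` before contacting the
Conductor, so the last known accelerator state is available even while the
connection is down.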


REST API impact
---------------

TODO once we firm up who's responsible for what.

Security impact
---------------

Monitoring capability might be useful to an attacker, but without root
access this is a fairly minor concern.

Notifications impact
--------------------

Notifying users that their accelerators are ready?

Other end user impact
---------------------

Details of the interaction around adding, removing, and setting up
accelerators are TBD.

Performance Impact
------------------

Agent heartbeats carrying updated accelerator performance stats might make
scaling to many accelerator hosts a challenge for the Cyborg endpoint
and database. Perhaps we should consider doing an active 'load census'
before scheduling instances? But that just moves the problem from constant
load to issues with a boot storm.
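
The trade-off can be made concrete with a back-of-the-envelope estimate. The
functions below are only an illustration, and they assume that each host sends
one message per heartbeat and that a census polls every host once per
scheduling request.

.. code-block:: python

    def heartbeat_messages_per_second(hosts, interval_seconds):
        """Steady message rate when every host reports once per interval."""
        return hosts / interval_seconds


    def census_messages_per_request(hosts):
        """Messages triggered by one scheduling request that polls every host."""
        return hosts

For example, with a hypothetical 1000 hosts and a 10-second heartbeat, the
endpoint would see 100 messages per second continuously, while a census would
cost 1000 messages for each scheduling request, which multiplies quickly when
many instances boot at once.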


Other deployer impact
---------------------

By not placing the drivers with the Agent, we keep the deployment footprint
pretty small. We do, however, add development complexity and security concerns
by sending them over the wire.

Developer impact
----------------

TBD

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  <jkilpatr>

Other contributors:
  <launchpad-id or None>

Work Items
----------

* Agent implementation

Dependencies
============

* Cyborg Driver Spec
* Cyborg API Spec
* Cyborg Conductor Spec

Testing
=======

CI infrastructure with a set of accelerators, drivers, and hardware will be
required for testing the Agent installation and operation regularly.

Documentation Impact
====================

Little to none. Perhaps an on-compute config file may need to be
documented, but I think it's best to avoid local configuration where possible.

References
==========

Other Cyborg Specs

History
=======


.. list-table:: Revisions
   :header-rows: 1

   * - Release
     - Description
   * - Pike
     - Introduced