.. This work is licensed under a Creative Commons Attribution 4.0 International
.. License.
.. http://creativecommons.org/licenses/by/4.0
.. (c) OPNFV, China Mobile and others.

==========================================
OpenStack Services HA test specification
==========================================

.. toctree::
   :maxdepth: 2

Scope
=====

The HA test area evaluates the ability of the System Under Test to support service
continuity and recovery from component failures on the part of OpenStack controller
services ("nova-api", "neutron-server", "keystone", "glance-api", "cinder-api") and
of the "load balancer" service.

The tests in this test area emulate component failures by killing the
processes of the above target services, stressing the CPU load or blocking
disk I/O on the selected controller node. They then check whether the impacted
services are still available and whether the killed processes are recovered on
the selected controller node within a given time interval.


References
================

This test area references the following specifications:

- ETSI GS NFV-REL 001

  - http://www.etsi.org/deliver/etsi_gs/NFV-REL/001_099/001/01.01.01_60/gs_nfv-rel001v010101p.pdf

- OpenStack High Availability Guide

  - https://docs.openstack.org/ha-guide/


Definitions and abbreviations
=============================

The following terms and abbreviations are used in conjunction with this test area:

- SUT - system under test
- Monitor - tools used to measure the service outage time and the process
  outage time
- Service outage time - the outage time (seconds) of the specific OpenStack
  service
- Process outage time - the outage time (seconds) from the moment the specific
  processes are killed until they are recovered


System Under Test (SUT)
=======================

The system under test is assumed to be the NFVi and VIM in operation on a
Pharos compliant infrastructure.

The SUT is assumed to be in a high availability configuration, which typically
means that there is more than one controller node in the System Under Test.

Test Area Structure
====================

The HA test area is structured with the following test cases in a sequential
manner.

Each test case is able to run independently. The failure of a preceding test
case will not affect the subsequent test cases.

Preconditions of each test case will be described in the following test
descriptions.


Test Descriptions
=================

---------------------------------------------------------------
Test Case 1 - Controller node OpenStack service down - nova-api
---------------------------------------------------------------

Short name
----------

dovetail.ha.tc001.nova-api_service_down

Use case specification
----------------------

This test case verifies the service continuity capability in the face of a
software process failure. It kills the processes of the OpenStack "nova-api"
service on the selected controller node, then checks whether the "nova-api"
service is still available during the failure, by creating a VM then deleting
the VM, and checks whether the killed processes are recovered within a given
time interval.


Test preconditions
------------------

There is more than one controller node providing the "nova-api" service for
the API end-point.
One of these controller nodes is denoted as Node1 in the following description.


Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for verifying service continuity and recovery
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''

The service continuity and process recovery capabilities of the "nova-api" service
are evaluated by monitoring service outage time, process outage time, and results
of nova operations.

Service outage time is measured by continuously executing the "openstack server list"
command in a loop and checking if the response of the command request is returned
with no failure.
When the response fails, the "nova-api" service is considered to be in outage.
The time between the first response failure and the last response failure is
considered as service outage time.
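
The definition above can be sketched as a small computation. This is a minimal
Python illustration, not the monitor used by the test tooling; the function name
``service_outage_time`` and the sample probe log are assumptions made for the
example.

```python
from typing import Iterable, Tuple


def service_outage_time(samples: Iterable[Tuple[float, bool]]) -> float:
    """Seconds between the first and the last failed probe; 0.0 if none failed."""
    failure_times = [timestamp for timestamp, succeeded in samples if not succeeded]
    if not failure_times:
        return 0.0
    return max(failure_times) - min(failure_times)


# Hypothetical probe log: (timestamp in seconds, did the command succeed?)
probe_log = [(0.0, True), (1.0, False), (2.0, False), (3.5, False), (4.0, True)]
print(service_outage_time(probe_log))  # prints 2.5
```

Under this definition a single failed probe yields an outage time of zero, so
the probing interval limits how fine-grained the measurement can be.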

Process outage time is measured by checking the status of "nova-api" processes on
the selected controller node. The process outage time is the time from the
"nova-api" processes being killed until they are recovered.
Process recovery is verified by checking the existence of "nova-api" processes.
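
A process monitor of this kind can be sketched as a polling loop. The following
Python snippet is an illustration under stated assumptions: ``is_running`` is a
hypothetical callable supplied by the caller (for example a check over SSH), the
30 second default mirrors the threshold used in this test case, and a simulated
clock is used so the example finishes instantly.

```python
import time
from typing import Callable, Optional


def process_outage_time(
    is_running: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until the process is back; return elapsed seconds, or None on timeout.

    Call this right after killing the process: the clock starts on entry.
    """
    killed_at = clock()
    while True:
        if is_running():
            return clock() - killed_at
        if clock() - killed_at > timeout:
            return None
        sleep(interval)


class FakeClock:
    """Simulated time source so the example runs without real waiting."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
# Simulated process that comes back 2 seconds after being killed.
recovered_after = process_outage_time(
    lambda: fake.now >= 2.0, clock=fake.time, sleep=fake.sleep
)
print(recovered_after)  # prints 2.0
```

Returning ``None`` on timeout keeps "never recovered" distinct from any
measured duration.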

All nova operations being carried out correctly within a given time interval
suggests that the "nova-api" service is continuously available.

Test execution
''''''''''''''

* Test action 1: Connect to Node1 through SSH, and check that "nova-api"
  processes are running on Node1
* Test action 2: Create an image with "openstack image create test-cirros
  --file cirros-0.3.5-x86_64-disk.img --disk-format qcow2 --container-format bare"
* Test action 3: Execute "openstack flavor create m1.test --id auto --ram 512
  --disk 1 --vcpus 1" to create flavor "m1.test"
* Test action 4: Start two monitors: one for "nova-api" processes and the other
  for the "openstack server list" command.
  Each monitor will run as an independent process
* Test action 5: Connect to Node1 through SSH, and then kill the "nova-api"
  processes
* Test action 6: When "openstack server list" returns with no error, calculate
  the service outage time, and execute command "openstack server create
  --flavor m1.test --image test-cirros test-instance"
* Test action 7: Continuously execute "openstack server show test-instance"
  to check if the status of VM "test-instance" is "ACTIVE"
* Test action 8: If VM "test-instance" is "ACTIVE", execute "openstack server
  delete test-instance", then execute "openstack server list" to check that the
  VM is not in the list
* Test action 9: Continuously measure process outage time from the monitor until
  the processes are recovered or the process outage time is more than 30s
Pass / fail criteria
''''''''''''''''''''

The process outage time is less than 30s.

The service outage time is less than 5s.

The nova operations are carried out in the above order and no errors occur.

A negative result will be generated if the above criteria are not fully met.
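
The criteria above combine into a single verdict. The sketch below is only an
illustration of that combination; the function name ``verdict`` and the
convention of passing ``None`` for processes that never recovered are
assumptions, while the two limits are taken from the criteria.

```python
PROCESS_OUTAGE_LIMIT_S = 30.0  # process outage limit from the criteria
SERVICE_OUTAGE_LIMIT_S = 5.0   # service outage limit from the criteria


def verdict(process_outage_s, service_outage_s, operations_ok):
    """Return "PASS" only if every criterion holds, else "FAIL".

    process_outage_s may be None when the processes never recovered.
    """
    process_ok = (
        process_outage_s is not None and process_outage_s < PROCESS_OUTAGE_LIMIT_S
    )
    service_ok = service_outage_s < SERVICE_OUTAGE_LIMIT_S
    return "PASS" if (process_ok and service_ok and operations_ok) else "FAIL"


print(verdict(12.0, 1.5, True))  # all criteria met
print(verdict(12.0, 5.0, True))  # service outage is not strictly below 5s
print(verdict(None, 0.0, True))  # processes never recovered
```

Both comparisons are strict, matching the "less than" wording of the criteria.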

Post conditions
---------------

Restart the processes of "nova-api" if they are not running.

Delete image with "openstack image delete test-cirros".

Delete flavor with "openstack flavor delete m1.test".


---------------------------------------------------------------------
Test Case 2 - Controller node OpenStack service down - neutron-server
---------------------------------------------------------------------

Short name
----------

dovetail.ha.tc002.neutron-server_service_down

Use case specification
----------------------

This test verifies the high availability of the "neutron-server" service
provided by OpenStack controller nodes. It kills the processes of OpenStack
"neutron-server" service on the selected controller node, then checks whether
the "neutron-server" service is still available, by creating a network and
deleting the network, and checks whether the killed processes are recovered.

Test preconditions
------------------

There is more than one controller node providing the "neutron-server" service
for the API end-point.
One of these controller nodes is denoted as Node1 in the following description.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of "neutron-server" service is evaluated by monitoring
service outage time, process outage time, and results of neutron operations.

Service outage time is measured by continuously executing the "openstack router list"
command in a loop and checking if the response of the command request is returned
with no failure.
When the response fails, the "neutron-server" service is considered to be in outage.
The time between the first response failure and the last response failure is
considered as service outage time.
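
The probing loop described above might look like the following Python sketch.
It is not the monitor shipped with the test tooling: the helper names ``probe``
and ``monitor`` are assumptions, and stand-in commands replace the OpenStack
client so that the snippet runs on any machine with Python installed.

```python
import subprocess
import sys
import time
from typing import List, Sequence, Tuple


def probe(command: Sequence[str], timeout: float = 10.0) -> bool:
    """Run the command once; True if it exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing binary or a hung command counts as a failed response.
        return False
    return result.returncode == 0


def monitor(
    command: Sequence[str], rounds: int, interval: float = 1.0
) -> List[Tuple[float, bool]]:
    """Probe `rounds` times and return (timestamp, succeeded) samples."""
    samples = []
    for _ in range(rounds):
        samples.append((time.monotonic(), probe(command)))
        time.sleep(interval)
    return samples


# Stand-in commands; in this test case the probed command would be
# ["openstack", "router", "list"].
always_ok = [sys.executable, "-c", "raise SystemExit(0)"]
always_failing = [sys.executable, "-c", "raise SystemExit(1)"]
print(probe(always_ok), probe(always_failing))  # prints True False
```

The samples returned by ``monitor`` carry the timestamps of failed responses,
from which the first and last failure can be read off.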

Process outage time is measured by checking the status of "neutron-server"
processes on the selected controller node. The process outage time is the time
from the "neutron-server" processes being killed until they are recovered.
Process recovery is verified by checking the existence of "neutron-server"
processes.

Test execution
''''''''''''''

* Test action 1: Connect to Node1 through SSH, and check that "neutron-server"
  processes are running on Node1
* Test action 2: Start two monitors: one for "neutron-server" processes and the
  other for the "openstack router list" command.
  Each monitor will run as an independent process
* Test action 3: Connect to Node1 through SSH, and then kill the
  "neutron-server" processes
* Test action 4: When "openstack router list" returns with no error, calculate
  the service outage time, and execute "openstack network create test-network"
* Test action 5: Continuously execute "openstack network show test-network"
  to check if the status of "test-network" is "ACTIVE"
* Test action 6: If "test-network" is "ACTIVE", execute "openstack network
  delete test-network", then execute "openstack network list" to check that
  "test-network" is not in the list
* Test action 7: Continuously measure process outage time from the monitor until
  the processes are recovered or the process outage time is more than 30s

Pass / fail criteria
''''''''''''''''''''

The process outage time is less than 30s.

The service outage time is less than 5s.

The neutron operations are carried out in the above order and no errors occur.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Restart the processes of "neutron-server" if they are not running.


---------------------------------------------------------------
Test Case 3 - Controller node OpenStack service down - keystone
---------------------------------------------------------------

Short name
----------

dovetail.ha.tc003.keystone_service_down

Use case specification
----------------------

This test verifies the high availability of the "keystone" service provided by
OpenStack controller nodes. It kills the processes of OpenStack "keystone"
service on the selected controller node, then checks whether the "keystone"
service is still available by executing command "openstack user list" and
whether the killed processes are recovered.

Test preconditions
------------------

There is more than one controller node providing the "keystone" service for
the API end-point.
One of these controller nodes is denoted as Node1 in the following description.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of the "keystone" service is evaluated by monitoring
service outage time and process outage time.

Service outage time is measured by continuously executing the "openstack user list"
command in a loop and checking if the response of the command request is returned
with no failure.
When the response fails, the "keystone" service is considered to be in outage.
The time between the first response failure and the last response failure is
considered as service outage time.

Process outage time is measured by checking the status of "keystone" processes on
the selected controller node. The process outage time is the time from the
"keystone" processes being killed until they are recovered.
Process recovery is verified by checking the existence of "keystone"
processes.

Test execution
''''''''''''''

* Test action 1: Connect to Node1 through SSH, and check that "keystone"
  processes are running on Node1
* Test action 2: Start two monitors: one for "keystone" processes and the other
  for the "openstack user list" command.
  Each monitor will run as an independent process
* Test action 3: Connect to Node1 through SSH, and then kill the "keystone"
  processes
* Test action 4: Calculate the service outage time and process outage time
* Test action 5: The test passes if process outage time is less than 30s and
  service outage time is less than 5s
* Test action 6: Continuously measure process outage time from the monitor until
  the processes are recovered or the process outage time is more than 30s

Pass / fail criteria
''''''''''''''''''''

The process outage time is less than 30s.

The service outage time is less than 5s.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Restart the processes of "keystone" if they are not running.


-----------------------------------------------------------------
Test Case 4 - Controller node OpenStack service down - glance-api
-----------------------------------------------------------------

Short name
----------

dovetail.ha.tc004.glance-api_service_down

Use case specification
----------------------

This test verifies the high availability of the "glance-api" service provided
by OpenStack controller nodes. It kills the processes of OpenStack "glance-api"
service on the selected controller node, then checks whether the "glance-api"
service is still available, by creating an image and deleting it, and checks
whether the killed processes are recovered.

Test preconditions
------------------

There is more than one controller node providing the "glance-api" service for
the API end-point.
One of these controller nodes is denoted as Node1 in the following description.


Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of "glance-api" service is evaluated by monitoring
service outage time, process outage time, and results of glance operations.

Service outage time is measured by continuously executing the "openstack image list"
command in a loop and checking if the response of the command request is returned
with no failure.
When the response fails, the "glance-api" service is considered to be in outage.
The time between the first response failure and the last response failure is
considered as service outage time.

Process outage time is measured by checking the status of "glance-api" processes
on the selected controller node. The process outage time is the time from the
"glance-api" processes being killed until they are recovered.
Process recovery is verified by checking the existence of "glance-api"
processes.

Test execution
''''''''''''''

* Test action 1: Connect to Node1 through SSH, and check that "glance-api"
  processes are running on Node1
* Test action 2: Start two monitors: one for "glance-api" processes and the other
  for the "openstack image list" command.
  Each monitor will run as an independent process
* Test action 3: Connect to Node1 through SSH, and then kill the "glance-api"
  processes
* Test action 4: When "openstack image list" returns with no error, calculate
  the service outage time, and execute "openstack image create test-image
  --file cirros-0.3.5-x86_64-disk.img --disk-format qcow2 --container-format bare"
* Test action 5: Continuously execute "openstack image show test-image" to check
  if the status of "test-image" is "active"
* Test action 6: If "test-image" is "active", execute "openstack image delete
  test-image". Then execute "openstack image list" to check that "test-image" is
  not in the list
* Test action 7: Continuously measure process outage time from the monitor until
  the processes are recovered or the process outage time is more than 30s
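
The status check in test actions 5 and 6 can be automated by asking the client
for JSON output (``openstack image show test-image -f json``) and reading the
"status" field. The Python sketch below only shows the parsing step; the helper
names are assumptions and the sample JSON is hand-written for illustration, not
captured from a real deployment.

```python
import json


def resource_status(show_output: str) -> str:
    """Extract the "status" field from the JSON text of a "show" command."""
    return json.loads(show_output)["status"]


def is_status(show_output: str, expected: str) -> bool:
    """Compare case-insensitively, since services differ in capitalisation."""
    return resource_status(show_output).lower() == expected.lower()


# Hand-written sample resembling the JSON form of an image record.
sample_output = '{"name": "test-image", "status": "active", "disk_format": "qcow2"}'
print(is_status(sample_output, "active"))  # prints True
```

The case-insensitive comparison is a convenience of this sketch, so the same
helper can be reused for resources whose status is reported in upper case.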

Pass / fail criteria
''''''''''''''''''''

The process outage time is less than 30s.

The service outage time is less than 5s.

The glance operations are carried out in the above order and no errors occur.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Restart the processes of "glance-api" if they are not running.

Delete image with "openstack image delete test-image".


-----------------------------------------------------------------
Test Case 5 - Controller node OpenStack service down - cinder-api
-----------------------------------------------------------------

Short name
----------

dovetail.ha.tc005.cinder-api_service_down

Use case specification
----------------------

This test verifies the high availability of the "cinder-api" service provided
by OpenStack controller nodes. It kills the processes of OpenStack "cinder-api"
service on the selected controller node, then checks whether the "cinder-api"
service is still available by executing command "openstack volume list" and
whether the killed processes are recovered.

Test preconditions
------------------

There is more than one controller node providing the "cinder-api" service for
the API end-point.
One of these controller nodes is denoted as Node1 in the following description.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of the "cinder-api" service is evaluated by monitoring
service outage time and process outage time.

Service outage time is measured by continuously executing the "openstack volume list"
command in a loop and checking if the response of the command request is returned
with no failure.
When the response fails, the "cinder-api" service is considered to be in outage.
The time between the first response failure and the last response failure is
considered as service outage time.

Process outage time is measured by checking the status of "cinder-api" processes
on the selected controller node. The process outage time is the time from the
"cinder-api" processes being killed until they are recovered.
Process recovery is verified by checking the existence of "cinder-api"
processes.

Test execution
''''''''''''''

* Test action 1: Connect to Node1 through SSH, and check that "cinder-api"
  processes are running on Node1
* Test action 2: Start two monitors: one for "cinder-api" processes and the other
  for the "openstack volume list" command.
  Each monitor will run as an independent process
* Test action 3: Connect to Node1 through SSH, and then kill the
  "cinder-api" processes
* Test action 4: Continuously measure service outage time from the monitor until
  the service is available again or the service outage time is more than 5s
* Test action 5: Continuously measure process outage time from the monitor until
  the processes are recovered or the process outage time is more than 30s

Pass / fail criteria
''''''''''''''''''''

The process outage time is less than 30s.

The service outage time is less than 5s.

The cinder operations are carried out in the above order and no errors occur.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Restart the processes of "cinder-api" if they are not running.


------------------------------------------------------------
Test Case 6 - Controller Node CPU Overload High Availability
------------------------------------------------------------

Short name
----------

dovetail.ha.tc006.cpu_overload

Use case specification
----------------------

This test verifies the availability of services when one of the controller nodes
suffers from heavy CPU overload. When the CPU usage of the specified controller
node reaches 100%, which breaks down the OpenStack services on this node,
the OpenStack services should continue to be available. This test case stresses
the CPU usage of a specific controller node to 100%, then checks whether all
services provided by the SUT are still available with the monitor tools.

Test preconditions
------------------

There is more than one controller node providing the "cinder-api",
"neutron-server", "glance-api" and "keystone" services for the API end-point.
One of these controller nodes is denoted as Node1 in the following description.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of the related OpenStack services is evaluated by
monitoring service outage time.

Service outage time is measured by continuously executing the "openstack router list",
"openstack stack list", "openstack volume list" and "openstack image list" commands
in a loop and checking if the response of the command request is returned with no
failure.
When the response fails, the related service is considered to be in outage. The time
between the first response failure and the last response failure is considered
as service outage time.


Methodology for stressing CPU usage
'''''''''''''''''''''''''''''''''''

To evaluate the high availability of the target OpenStack services under heavy CPU
load, the test case first gets the number of logical CPU cores on the
target controller node with a shell command, then starts that many 'dd'
commands, each continuously copying from /dev/zero to /dev/null in a loop.
The 'dd' operation only uses CPU and performs no disk I/O, which makes it well
suited to stressing the CPU usage.

Since the 'dd' commands are continuously executed and each one keeps a CPU core
busy, the scheduler will schedule each 'dd' command to be
processed on a different logical CPU core. Eventually the usage rate of all
logical CPU cores reaches 100%.
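
The approach can be sketched as follows. This Python snippet is an illustration
only: it counts the cores of the local machine rather than a remote controller
node, the helper names are assumptions, and the example merely prints the
command lines instead of starting the load.

```python
import os
import subprocess
from typing import List


def build_stress_commands(logical_cores: int) -> List[List[str]]:
    """One CPU-bound 'dd' command line per logical core."""
    return [["dd", "if=/dev/zero", "of=/dev/null"] for _ in range(logical_cores)]


def start_stress(logical_cores: int) -> List[subprocess.Popen]:
    """Launch the 'dd' workers; the caller is responsible for stopping them."""
    return [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for cmd in build_stress_commands(logical_cores)
    ]


def stop_stress(workers: List[subprocess.Popen]) -> None:
    """Terminate every worker and wait for it to exit."""
    for worker in workers:
        worker.terminate()
    for worker in workers:
        worker.wait()


cores = os.cpu_count() or 1  # os.cpu_count() may return None
for command in build_stress_commands(cores):
    print(" ".join(command))
```

Without a ``count`` argument, 'dd' keeps copying until it is terminated, so in
this sketch no outer shell loop is needed to keep a core busy.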

Test execution
''''''''''''''

* Test action 1: Start four monitors: one for the "openstack image list" command,
  one for the "openstack router list" command, one for the "openstack stack list"
  command and the last one for the "openstack volume list" command. Each monitor
  will run as an independent process
* Test action 2: Connect to Node1 through SSH, and then stress the usage rate of
  all logical CPU cores to 100%
* Test action 3: Continuously measure all the service outage times, stopping
  once any of them is more than 5s
* Test action 4: Kill the processes that stress the CPU usage

Pass / fail criteria
''''''''''''''''''''

All the service outage times are less than 5s.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

No impact on the SUT.


-----------------------------------------------------------------
Test Case 7 - Controller Node Disk I/O Overload High Availability
-----------------------------------------------------------------

Short name
----------

dovetail.ha.tc007.disk_I/O_overload

Use case specification
----------------------

This test verifies the high availability of the controller node. When the disk
I/O of a specific disk is overloaded, which breaks down the OpenStack services on
this node, the read and write services should continue to be available. This test
case blocks the disk I/O of the specific controller node, then checks whether
the services that need to read or write the disk of the controller node are
available with some monitor tools.

Test preconditions
------------------

There is more than one controller node.
One of these controller nodes is denoted as Node1 in the following description.
This controller node has at least 20GB of free disk space.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of the nova service is evaluated by monitoring
service outage time.

Service availability is tested by continuously executing the
"openstack flavor list" command in a loop and checking if the response of the
command request is returned with no failure.
When the response fails, the related service is considered to be in outage.


Methodology for stressing disk I/O
''''''''''''''''''''''''''''''''''

To evaluate the high availability of the target OpenStack service under heavy I/O
load, the test case executes a shell command on the selected controller node
to continuously write 8 KB blocks to /test.dbf.
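
One way to generate such a write load is with 'dd'. The Python sketch below only
builds the command line and does not run it; the use of 'dd', the helper names
and the 1 GiB budget per pass are assumptions made for illustration, chosen to
stay well under the 20GB of free space required by the preconditions.

```python
from typing import List

BLOCK_SIZE_BYTES = 8 * 1024  # the 8 KB block size used by this test case


def blocks_within_budget(budget_bytes: int, block_bytes: int = BLOCK_SIZE_BYTES) -> int:
    """Largest number of whole blocks that fits in the given byte budget."""
    return budget_bytes // block_bytes


def build_write_command(path: str, count: int) -> List[str]:
    """'dd' command line that writes `count` zero-filled 8 KB blocks to `path`."""
    return ["dd", "if=/dev/zero", f"of={path}", "bs=8k", f"count={count}"]


# Hypothetical budget of 1 GiB per pass.
count = blocks_within_budget(1024 ** 3)
print(" ".join(build_write_command("/test.dbf", count)))
```

Repeating the generated command in a loop would keep the disk busy for as long
as the monitor runs.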

Test execution
''''''''''''''

* Test action 1: Connect to Node1 through SSH, and then stress disk I/O by
  continuously writing 8 KB blocks to /test.dbf
* Test action 2: Start a monitor for the "openstack flavor list" command
* Test action 3: Create a flavor called "test-001"
* Test action 4: Check whether the flavor "test-001" is created
* Test action 5: Continuously measure service outage time from the monitor
  until the service is available again or the service outage time is more than 5s
* Test action 6: Stop writing to /test.dbf and delete file /test.dbf

Pass / fail criteria
''''''''''''''''''''

The service outage time is less than 5s.

The nova operations are carried out in the above order and no errors occur.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Delete flavor with "openstack flavor delete test-001".

--------------------------------------------------------------------
Test Case 8 - Controller Load Balance as a Service High Availability
--------------------------------------------------------------------

Short name
----------

dovetail.ha.tc008.load_balance_service_down

Use case specification
----------------------

This test verifies the high availability of the "load balancer" service. When
the "load balancer" service on a specified controller node is killed, the test
checks whether the "load balancer" service on the other controller nodes keeps
working and whether the controller node restarts its "load balancer" service.
This test case kills the processes of the "load balancer" service on the selected
controller node, then checks whether the request of the related OpenStack
command is processed with no failure and whether the killed processes are
recovered.

Test preconditions
------------------

There is more than one controller node providing the "load balancer" service
for the REST API. One of these controller nodes is denoted as Node1 in the
following description.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of the "load balancer" service is evaluated by monitoring
service outage time and process outage time.

Service outage time is measured by continuously executing the "openstack image list"
command in a loop and checking if the response of the command request is returned
with no failure.
When the response fails, the "load balancer" service is considered to be in outage.
The time between the first response failure and the last response failure is
considered as service outage time.

Process outage time is measured by checking the status of the processes of the
"load balancer" service on the selected controller node. The process outage time
is the time from those processes being killed until they are recovered.
Process recovery is verified by checking the existence of the processes of the
"load balancer" service.

Test execution
''''''''''''''

* Test action 1: Connect to Node1 through SSH, and check that processes of the
  "load balancer" service are running on Node1
* Test action 2: Start two monitors: one for processes of the "load balancer"
  service and the other for the "openstack image list" command. Each monitor will
  run as an independent process
* Test action 3: Connect to Node1 through SSH, and then kill the processes of the
  "load balancer" service
* Test action 4: Continuously measure service outage time from the monitor until
  the service is available again or the service outage time is more than 5s
* Test action 5: Continuously measure process outage time from the monitor until
  the processes are recovered or the process outage time is more than 30s

Pass / fail criteria
''''''''''''''''''''

The process outage time is less than 30s.

The service outage time is less than 5s.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Restart the processes of the "load balancer" service if they are not running.