.. This work is licensed under a Creative Commons Attribution 4.0 International
.. License.
.. http://creativecommons.org/licenses/by/4.0
.. (c) OPNFV, China Mobile and others.

==========================================
OpenStack Services HA test specification
==========================================

.. toctree::
   :maxdepth: 2

Scope
=====

The HA test area evaluates the ability of the System Under Test to support
service continuity and recovery from component failures of a subset of the
OpenStack controller services ("nova-api", "neutron-server", "keystone",
"glance-api", "cinder-api") and of the "load balancer" service.

The tests in this test area emulate component failures by killing the processes
of the above target services, stressing the CPU load, or blocking disk I/O on
the selected controller node, and then check whether the impacted services are
still available and whether the killed processes are recovered on the selected
controller node within a given time interval.


References
================

This test area references the following specifications:

- ETSI GS NFV-REL 001

  - http://www.etsi.org/deliver/etsi_gs/NFV-REL/001_099/001/01.01.01_60/gs_nfv-rel001v010101p.pdf

- OpenStack High Availability Guide

  - https://docs.openstack.org/ha-guide/


Definitions and abbreviations
=============================

The following terms and abbreviations are used in conjunction with this test area:

- SUT - system under test
- Monitor - tools used to measure the service outage time and the process
  outage time
- Service outage time - the outage time (seconds) of the specific OpenStack
  service
- Process outage time - the outage time (seconds) from the specific processes
  being killed to their being recovered


System Under Test (SUT)
=======================

The system under test is assumed to be the NFVI and VIM in operation on a
Pharos compliant infrastructure.

The SUT is assumed to be in a high availability configuration, which typically
means that the System Under Test contains more than one controller node.

Test Area Structure
====================

The HA test area is structured with the following test cases in a sequential
manner.

Each test case can run independently. The failure of a preceding test case does
not affect the subsequent test cases.

Preconditions of each test case will be described in the following test
descriptions.


Test Descriptions
=================

---------------------------------------------------------------
Test Case 1 - Controller node OpenStack service down - nova-api
---------------------------------------------------------------

Short name
----------

yardstick.ha.nova_api

Yardstick test case: opnfv_yardstick_tc019.yaml

Use case specification
----------------------

This test case verifies the service continuity capability in the face of
software process failure. It kills the processes of the OpenStack "nova-api"
service on the selected controller node, then checks whether the "nova-api"
service is still available during the failure, by creating and then deleting
a VM, and checks whether the killed processes are recovered within a given
time interval.


Test preconditions
------------------

There is more than one controller node, which is providing the "nova-api"
service for API end-point.

In the following description, one controller node is denoted as Node1.


Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for verifying service continuity and recovery
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''

The service continuity and process recovery capabilities of the "nova-api"
service are evaluated by monitoring service outage time, process outage time,
and the results of nova operations.

Service outage time is measured by continuously executing the "openstack server
list" command in a loop and checking if the response of the command request is
returned with no failure.
When the response fails, the "nova-api" service is considered to be in outage.
The time between the first response failure and the last response failure is
taken as the service outage time.
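
The following is a minimal sketch of such a service monitor, assuming a shell
with admin credentials loaded and the OpenStack client installed; the polling
interval and the use of ``bc`` are illustrative choices, not part of the
specification.

.. code-block:: bash

   #!/bin/bash
   # Service-outage monitor sketch: poll "openstack server list" in a loop and
   # record the timestamps of failed responses. The service outage time is the
   # interval between the first and the last failed response.
   first_fail=""; last_fail=""
   report() {
       if [ -n "$first_fail" ]; then
           echo "service outage time: $(echo "$last_fail - $first_fail" | bc) s"
       else
           echo "no outage observed"
       fi
   }
   trap report EXIT
   while true; do
       if ! openstack server list > /dev/null 2>&1; then
           now=$(date +%s.%N)
           [ -z "$first_fail" ] && first_fail="$now"
           last_fail="$now"
       fi
       sleep 0.1   # polling interval (illustrative)
   done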

Process outage time is measured by checking the status of the "nova-api"
processes on the selected controller node. The time from the "nova-api"
processes being killed to the time of the "nova-api" processes being recovered
is the process outage time. Process recovery is verified by checking the
existence of the "nova-api" processes.
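
A corresponding process monitor, run on the selected controller node, can be
sketched as follows; the use of ``pgrep`` and the one-second polling interval
are assumptions for illustration, not requirements of the test case.

.. code-block:: bash

   #!/bin/bash
   # Process-outage monitor sketch: watch for "nova-api" processes on the
   # controller node and report how long they were missing.
   kill_time=""; recover_time=""
   while true; do
       if pgrep -f nova-api > /dev/null; then
           if [ -n "$kill_time" ] && [ -z "$recover_time" ]; then
               recover_time=$(date +%s)
               echo "process outage time: $((recover_time - kill_time)) s"
               exit 0
           fi
       else
           [ -z "$kill_time" ] && kill_time=$(date +%s)
       fi
       sleep 1   # polling interval (illustrative)
   done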

All nova operations being carried out correctly within a given time interval
indicates that the "nova-api" service is continuously available.

Test execution
''''''''''''''
* Test action 1: Connect to Node1 through SSH, and check that "nova-api"
  processes are running on Node1
* Test action 2: Create an image with "openstack image create test-cirros
  --file cirros-0.3.5-x86_64-disk.img --disk-format qcow2 --container-format bare"
* Test action 3: Execute "openstack flavor create m1.test --id auto --ram 512
  --disk 1 --vcpus 1" to create the flavor "m1.test".
* Test action 4: Start two monitors: one for "nova-api" processes and the other
  for "openstack server list" command.
  Each monitor will run as an independent process
* Test action 5: Connect to Node1 through SSH, and then kill the "nova-api"
  processes
* Test action 6: When "openstack server list" returns with no error, calculate
  the service outage time, and execute command "openstack server create
  --flavor m1.test --image test-cirros test-instance"
* Test action 7: Continuously execute "openstack server show test-instance"
  to check if the status of the VM "test-instance" is "Active"
* Test action 8: If VM "test-instance" is "Active", execute "openstack server
  delete test-instance", then execute "openstack server list" to check if the
  VM is not in the list
* Test action 9: Continuously measure process outage time from the monitor until
  the process outage time is more than 30s

Pass / fail criteria
''''''''''''''''''''

The process outage time is less than 30s.

The service outage time is less than 5s.

The nova operations are carried out in the above order and no errors occur.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Restart the processes of "nova-api" if they are not running.

Delete image with "openstack image delete test-cirros".

Delete flavor with "openstack flavor delete m1.test".


---------------------------------------------------------------------
Test Case 2 - Controller node OpenStack service down - neutron-server
---------------------------------------------------------------------

Short name
----------

yardstick.ha.neutron_server

Yardstick test case: opnfv_yardstick_tc045.yaml

Use case specification
----------------------

This test verifies the high availability of the "neutron-server" service
provided by OpenStack controller nodes. It kills the processes of OpenStack
"neutron-server" service on the selected controller node, then checks whether
the "neutron-server" service is still available, by creating a network and
deleting the network, and checks whether the killed processes are recovered.

Test preconditions
------------------

There is more than one controller node, which is providing the "neutron-server"
service for API end-point.

In the following description, one controller node is denoted as Node1.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of "neutron-server" service is evaluated by monitoring
service outage time, process outage time, and results of neutron operations.

Service outage time is tested by continuously executing "openstack router list"
command in loop and checking if the response of the command request is returned
with no failure.
When the response fails, the "neutron-server" service is considered in outage.
The time between the first response failure and the last response failure is
considered as service outage time.

Process outage time is tested by checking the status of "neutron-server"
processes on the selected controller node. The time from the "neutron-server"
processes being killed to the time of the "neutron-server" processes being
recovered is the process outage time. Process recovery is verified by checking
the existence of "neutron-server" processes.

Test execution
''''''''''''''

* Test action 1: Connect to Node1 through SSH, and check that "neutron-server"
  processes are running on Node1
* Test action 2: Start two monitors: one for "neutron-server" process and the
  other for "openstack router list" command.
  Each monitor will run as an independent process.
* Test action 3: Connect to Node1 through SSH, and then kill the
  "neutron-server" processes
* Test action 4: When "openstack router list" returns with no error, calculate
  the service outage time, and execute "openstack network create test-network"
* Test action 5: Continuously execute "openstack network show test-network" to
  check if the status of "test-network" is "Active"
* Test action 6: If "test-network" is "Active", execute "openstack network
  delete test-network", then execute "openstack network list" to check if the
  "test-network" is not in the list
* Test action 7: Continuously measure process outage time from the monitor until
  the process outage time is more than 30s

Pass / fail criteria
''''''''''''''''''''

The process outage time is less than 30s.

The service outage time is less than 5s.

The neutron operations are carried out in the above order and no errors occur.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Restart the processes of "neutron-server" if they are not running.


---------------------------------------------------------------
Test Case 3 - Controller node OpenStack service down - keystone
---------------------------------------------------------------

Short name
----------

yardstick.ha.keystone

Yardstick test case: opnfv_yardstick_tc046.yaml

Use case specification
----------------------

This test verifies the high availability of the "keystone" service provided by
OpenStack controller nodes. It kills the processes of OpenStack "keystone"
service on the selected controller node, then checks whether the "keystone"
service is still available by executing command "openstack user list" and
whether the killed processes are recovered.

Test preconditions
------------------

There is more than one controller node, which is providing the "keystone"
service for API end-point.

In the following description, one controller node is denoted as Node1.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of "keystone" service is evaluated by monitoring service
outage time and process outage time.

Service outage time is tested by continuously executing "openstack user list"
command in loop and checking if the response of the command request is reutrned
with no failure.
When the response fails, the "keystone" service is considered in outage.
The time between the first response failure and the last response failure is
considered as service outage time.

Process outage time is tested by checking the status of "keystone" processes on
the selected controller node. The time from the "keystone" processes being killed to
the time of the "keystone" processes being recovered is the process outage
time. Process recovery is verified by checking the existence of "keystone"
processes.

Test execution
''''''''''''''

* Test action 1: Connect to Node1 through SSH, and check that "keystone"
  processes are running on Node1
* Test action 2: Start two monitors: one for "keystone" process and the other
  for "openstack user list" command.
  Each monitor will run as an independent process.
* Test action 3: Connect to Node1 through SSH, and then kill the "keystone"
  processes
* Test action 4: Calculate the service outage time and process outage time
* Test action 5: The test passes if the process outage time is less than 30s and
  the service outage time is less than 5s
* Test action 6: Continuously measure process outage time from the monitor until
  the process outage time is more than 30s

Pass / fail criteria
''''''''''''''''''''

The process outage time is less than 30s.

The service outage time is less than 5s.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Restart the processes of "keystone" if they are not running.


-----------------------------------------------------------------
Test Case 4 - Controller node OpenStack service down - glance-api
-----------------------------------------------------------------

Short name
----------

yardstick.ha.glance_api

Yardstick test case: opnfv_yardstick_tc047.yaml

Use case specification
----------------------

This test verifies the high availability of the "glance-api" service provided
by OpenStack controller nodes. It kills the processes of OpenStack "glance-api"
service on the selected controller node, then checks whether the "glance-api"
service is still available, by creating and deleting an image, and checks
whether the killed processes are recovered.

Test preconditions
------------------

There is more than one controller node, which is providing the "glance-api"
service for API end-point.

In the following description, one controller node is denoted as Node1.


Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of "glance-api" service is evaluated by monitoring
service outage time, process outage time, and results of glance operations.

Service outage time is tested by continuously executing "openstack image list"
command in loop and checking if the response of the command request is returned
with no failure.
When the response fails, the "glance-api" service is considered in outage.
The time between the first response failure and the last response failure is
considered as service outage time.

Process outage time is tested by checking the status of "glance-api" processes
on the selected controller node. The time of "glance-api" processes being
killed to the time of the "glance-api" processes being recovered is the process
outage time. Process recovery is verified by checking the existence of
"glance-api" processes.

Test execution
''''''''''''''

* Test action 1: Connect to Node1 through SSH, and check that "glance-api"
  processes are running on Node1
* Test action 2: Start two monitors: one for "glance-api" process and the other
  for "openstack image list" command.
  Each monitor will run as an independent process.
* Test action 3: Connect to Node1 through SSH, and then kill the "glance-api"
  processes
* Test action 4: When "openstack image list" returns with no error, calculate
  the service outage time, and execute "openstack image create test-image
  --file cirros-0.3.5-x86_64-disk.img --disk-format qcow2 --container-format bare"
* Test action 5: Continuously execute "openstack image show test-image", check
  if status of "test-image" is "active"
* Test action 6: If "test-image" is "active", execute "openstack image delete
  test-image". Then execute "openstack image list" to check if "test-image" is
  not in the list
* Test action 7: Continuously measure process outage time from the monitor until
  the process outage time is more than 30s

Pass / fail criteria
''''''''''''''''''''

The process outage time is less than 30s.

The service outage time is less than 5s.

The glance operations are carried out in the above order and no errors occur.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Restart the processes of "glance-api" if they are not running.

Delete image with "openstack image delete test-image".


-----------------------------------------------------------------
Test Case 5 - Controller node OpenStack service down - cinder-api
-----------------------------------------------------------------

Short name
----------

yardstick.ha.cinder_api

Yardstick test case: opnfv_yardstick_tc048.yaml

Use case specification
----------------------

This test verifies the high availability of the "cinder-api" service provided
by OpenStack controller nodes. It kills the processes of OpenStack "cinder-api"
service on the selected controller node, then checks whether the "cinder-api"
service is still available by executing command "openstack volume list" and
whether the killed processes are recovered.

Test preconditions
------------------

There is more than one controller node, which is providing the "cinder-api"
service for API end-point.

In the following description, one controller node is denoted as Node1.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of "cinder-api" service is evaluated by monitoring
service outage time and process outage time.

Service outage time is tested by continuously executing "openstack volume list"
command in loop and checking if the response of the command request is returned
with no failure.
When the response fails, the "cinder-api" service is considered in outage.
The time between the first response failure and the last response failure is
considered as service outage time.

Process outage time is tested by checking the status of "cinder-api" processes
on the selected controller node. The time of "cinder-api" processes being
killed to the time of the "cinder-api" processes being recovered is the process
outage time. Process recovery is verified by checking the existence of
"cinder-api" processes.

Test execution
''''''''''''''

* Test action 1: Connect to Node1 through SSH, and check that "cinder-api"
  processes are running on Node1
* Test action 2: Start two monitors: one for "cinder-api" process and the other
  for "openstack volume list" command.
  Each monitor will run as an independent process.
* Test action 3: Connect to Node1 through SSH, and then kill the
  "cinder-api" processes
* Test action 4: Continuously measure service outage time from the monitor until
  the service outage time is more than 5s
* Test action 5: Continuously measure process outage time from the monitor until
  the process outage time is more than 30s

Pass / fail criteria
''''''''''''''''''''

The process outage time is less than 30s.

The service outage time is less than 5s.

The cinder operations are carried out in the above order and no errors occur.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Restart the processes of "cinder-api" if they are not running.


------------------------------------------------------------
Test Case 6 - Controller Node CPU Overload High Availability
------------------------------------------------------------

Short name
----------

yardstick.ha.cpu_load

Yardstick test case: opnfv_yardstick_tc051.yaml

Use case specification
----------------------

This test verifies the availability of services when one of the controller nodes
suffers from heavy CPU overload. When the CPU usage of the specified controller
node reaches 100%, which may break down the OpenStack services on this node,
the OpenStack services should continue to be available. This test case stresses
the CPU usage of a specific controller node to 100%, then checks whether all
services provided by the SUT are still available, using the monitor tools.

Test preconditions
------------------

There is more than one controller node, which is providing the "cinder-api",
"neutron-server", "glance-api" and "keystone" services for API end-point.

In the following description, one controller node is denoted as Node1.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of the related OpenStack services is evaluated by
monitoring service outage time.

Service outage time is tested by continuously executing "openstack router list",
"openstack stack list", "openstack volume list", "openstack image list" commands
in a loop and checking if the response of each command request is returned with no
failure.
When the response fails, the related service is considered in outage. The time
between the first response failure and the last response failure is considered
as service outage time.


Methodology for stressing CPU usage
'''''''''''''''''''''''''''''''''''

To evaluate the high availability of the target OpenStack services under heavy
CPU load, the test case first gets the number of logical CPU cores on the
target controller node via a shell command, then starts that number of 'dd'
commands, each continuously copying from /dev/zero to /dev/null in a loop.
The 'dd' operation only consumes CPU and performs no disk I/O, which makes it
ideal for stressing the CPU usage.

Since the 'dd' commands run continuously and each is CPU-bound, the scheduler
will place each 'dd' command on a different logical CPU core, eventually
driving the usage rate of all logical CPU cores to 100%.
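
A minimal sketch of this CPU stress, assuming ``nproc`` and ``dd`` are available
on the controller node, is shown below; the cleanup command is only indicated in
a comment and is left to the test framework.

.. code-block:: bash

   #!/bin/bash
   # Stress all logical CPU cores to ~100% by running one busy 'dd' per core,
   # copying from /dev/zero to /dev/null (pure CPU load, no disk I/O).
   cores=$(nproc)
   for i in $(seq 1 "$cores"); do
       dd if=/dev/zero of=/dev/null &
   done
   echo "started $cores dd workers: $(jobs -p | tr '\n' ' ')"
   # To stop the load afterwards (test action 4): kill $(jobs -p)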

Test execution
''''''''''''''

* Test action 1: Start four monitors: one for "openstack image list" command,
  one for "openstack router list" command, one for "openstack stack list"
  command and the last one for "openstack volume list" command. Each monitor
  will run as an independent process.
* Test action 2: Connect to Node1 through SSH, and then stress all logical CPU
  cores usage rate to 100%
* Test action 3: Continuously measure all the service outage times until they are
  more than 5s
* Test action 4: Kill the process that stresses the CPU usage

Pass / fail criteria
''''''''''''''''''''

All the service outage times are less than 5s.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

No impact on the SUT.


-----------------------------------------------------------------
Test Case 7 - Controller Node Disk I/O Overload High Availability
-----------------------------------------------------------------

Short name
----------

yardstick.ha.disk_load

Yardstick test case: opnfv_yardstick_tc052.yaml

Use case specification
----------------------

This test verifies the high availability of the controller node. When the disk
I/O of a specific disk is overloaded, which may break down the OpenStack
services on this node, the read and write services should continue to be
available. This test case blocks the disk I/O of the specific controller node,
then checks whether the services that need to read from or write to the disk of
the controller node are still available, using some monitor tools.

Test preconditions
------------------

There is more than one controller node.
In the following description, one controller node is denoted as Node1.
The controller node has at least 20GB free disk space.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of the nova service is evaluated by monitoring
service outage time.

Service availability is tested by continuously executing the
"openstack flavor list" command in a loop and checking if the response of the
command request is returned with no failure.
When the response fails, the related service is considered in outage.


Methodology for stressing disk I/O
''''''''''''''''''''''''''''''''''

To evaluate the high availability of the target OpenStack service under heavy
I/O load, the test case executes a shell command on the selected controller
node to continuously write 8KB blocks to /test.dbf.
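
As a rough sketch, and assuming ``dd`` is available on the controller node, the
disk stress could be produced as follows; the block count and the use of
``conv=fsync`` are illustrative assumptions.

.. code-block:: bash

   #!/bin/bash
   # Stress disk I/O on the controller node by continuously writing 8KB blocks
   # to /test.dbf until the script is stopped (see test action 6 for cleanup).
   while true; do
       dd if=/dev/zero of=/test.dbf bs=8k count=100000 conv=fsync > /dev/null 2>&1
   done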

Test execution
''''''''''''''

* Test action 1: Connect to Node1 through SSH, and then stress disk I/O by
  continuously writing 8KB blocks to /test.dbf
* Test action 2: Start a monitor: for "openstack flavor list" command
* Test action 3: Create a flavor called "test-001"
* Test action 4: Check whether the flavor "test-001" is created
* Test action 5: Continuously measure service outage time from the monitor
  until the service outage time is more than 5s
* Test action 6: Stop writing to /test.dbf and delete file /test.dbf

Pass / fail criteria
''''''''''''''''''''

The service outage time is less than 5s.

The nova operations are carried out in the above order and no errors occur.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Delete flavor with "openstack flavor delete test-001".

--------------------------------------------------------------------
Test Case 8 - Controller Load Balance as a Service High Availability
--------------------------------------------------------------------

Short name
----------

yardstick.ha.haproxy

Yardstick test case: opnfv_yardstick_tc053.yaml

Use case specification
----------------------

This test verifies the high availability of "haproxy" service. When
the "haproxy" service of a specified controller node is killed, whether
"haproxy" service on other controller nodes will work, and whether the
controller node will restart the "haproxy" service are checked. This
test case kills the processes of "haproxy" service on the selected
controller node, then checks whether the request of the related OpenStack
command is processed with no failure and whether the killed processes are
recovered.

Test preconditions
------------------

There is more than one controller node, which is providing the "haproxy"
service for rest-api.

Denoted as Node1 in the following configuration.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for monitoring high availability
''''''''''''''''''''''''''''''''''''''''''''

The high availability of "haproxy" service is evaluated by monitoring
service outage time and process outage time.

Service outage time is tested by continuously executing "openstack image list"
command in loop and checking if the response of the command request is returned
with no failure.
When the response fails, the "haproxy" service is considered in outage.
The time between the first response failure and the last response failure is
considered as service outage time.

Process outage time is tested by checking the status of processes of "haproxy"
service on the selected controller node. The time of those processes
being killed to the time of those processes being recovered is the process
outage time.
Process recovery is verified by checking the existence of processes of "haproxy" service.

Test execution
''''''''''''''

* Test action 1: Connect to Node1 through SSH, and check that processes of
  "haproxy" service are running on Node1
* Test action 2: Start two monitors: one for processes of "haproxy"
  service and the other for "openstack image list" command. Each monitor will
  run as an independent process
* Test action 3: Connect to Node1 through SSH, and then kill the processes of
  "haproxy" service
* Test action 4: Continuously measure service outage time from the monitor until
  the service outage time is more than 5s
* Test action 5: Continuously measure process outage time from the monitor until
  the process outage time is more than 30s

Pass / fail criteria
''''''''''''''''''''

The process outage time is less than 30s.

The service outage time is less than 5s.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Restart the processes of "haproxy" if they are not running.

----------------------------------------------------------------
Test Case 9 - Controller node OpenStack service down - Database
----------------------------------------------------------------

Short name
----------

yardstick.ha.database

Yardstick test case: opnfv_yardstick_tc090.yaml

Use case specification
----------------------

This test case verifies that the high availability of the database instances
used by OpenStack (MySQL) on the controller node is working properly.
Specifically, this test case kills the processes of the database service on a
selected controller node, then checks whether the requests of the related
OpenStack commands succeed and whether the killed processes are recovered.

Test preconditions
------------------

In this test case, an attacker called "kill-process" is needed.
This attacker includes three parameters: fault_type, process_name and host.

The purpose of this attacker is to kill any process with a specific process
name that runs on the host node. If multiple processes use the same name on the
host node, all of them will be killed by this attacker.
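
A simplified shell equivalent of this "kill-process" attacker might look like
the following sketch; the real attacker is implemented in the Yardstick
framework, and the default process name used here is only an example.

.. code-block:: bash

   #!/bin/bash
   # Simplified "kill-process" attacker sketch: kill every process whose
   # command line matches the given process name on the host node.
   process_name=${1:-mysql}   # example value; supplied by the test configuration
   pkill -9 -f "$process_name"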

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for verifying service continuity and recovery
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''

In order to verify this service, two different monitors are used.

The first monitor executes OpenStack commands and acts as a watcher for the
database connections of the different OpenStack components.

The second monitor is a process monitor whose main purpose is to watch whether
the database processes on the host node are killed and recovered properly.

Therefore, in this test case, there are two metrics:

* service_outage_time, which indicates the maximum outage time (seconds)
  of the specified OpenStack command request
* process_recover_time, which indicates the maximum time (seconds) from the
  process being killed to being recovered

Test execution
''''''''''''''
* Test action 1: Connect to Node1 through SSH, and check that "database"
  processes are running on Node1
* Test action 2: Start two monitors: one for "database" processes on the host
  node and the other for connection toward database from OpenStack
  components, verifying the results of openstack image list, openstack router list,
  openstack stack list and openstack volume list.
  Each monitor will run as an independent process
* Test action 3: Connect to Node1 through SSH, and then kill the "mysql"
  process(es)
* Test action 4: Stop monitors after a period of time specified by "waiting_time".
  The monitor info will be aggregated.
* Test action 5: Verify the SLA and set the verdict of the test case to pass or fail.


Pass / fail criteria
''''''''''''''''''''

Check whether the SLA is passed:

- The process outage time is less than 30s.
- The service outage time is less than 5s.

The database operations are carried out in the above order and no errors occur.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

The database service is up and running again.
If the database service did not recover successfully by itself,
the test explicitly restarts the database service.

------------------------------------------------------------------------
Test Case 10 - Controller Messaging Queue as a Service High Availability
------------------------------------------------------------------------

Short name
----------

yardstick.ha.rabbitmq

Yardstick test case: opnfv_yardstick_tc056.yaml

Use case specification
----------------------

This test case verifies the high availability of the messaging queue
service (RabbitMQ) that supports OpenStack on the controller node. This
test case expects that the message bus service implementation is RabbitMQ.
If the SUT uses a different message bus implementation, the Dovetail
configuration (pod.yaml) can be changed accordingly. When the messaging
queue service (which is active) of a specified controller node
is killed, the test case will check whether the messaging queue services
(which are standby) on the other controller nodes are switched to active,
and whether the cluster manager on the attacked controller node will
restart the stopped messaging queue.

Test preconditions
------------------

There is more than one controller node, which is providing the "messaging queue"
service. Denoted as Node1 in the following configuration.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for verifying service continuity and recovery
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''

The high availability of "messaging queue" service is evaluated by monitoring
service outage time and process outage time.

Service outage time is tested by continuously executing "openstack image list",
"openstack network list", "openstack volume list" and "openstack stack list"
commands in a loop and checking if the responses of the command requests are
returned with no failure.
When the response fails, the "messaging queue" service is considered in outage.
The time between the first response failure and the last response failure is
considered as service outage time.

Process outage time is tested by checking the status of processes of "messaging
queue" service on the selected controller node. The time of those processes
being killed to the time of those processes being recovered is the process
outage time.
Process recovery is verified by checking the existence of processes of
"messaging queue" service.

Test execution
''''''''''''''

* Test action 1: Start five monitors: one for the processes of the "messaging queue"
  service and the others for the "openstack image list", "openstack network list",
  "openstack stack list" and "openstack volume list" commands. Each monitor
  will run as an independent process
* Test action 2: Connect to Node1 through SSH, and then kill all the processes of
  "messaging queue" service
* Test action 3: Continuously measure service outage time from the monitors until
  the service outage time is more than 5s
* Test action 4: Continuously measure process outage time from the monitor until
  the process outage time is more than 30s

Pass / fail criteria
''''''''''''''''''''

Test passes if the process outage time is no more than 30s and
the service outage time is no more than 5s.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------
Restart the processes of "messaging queue" if they are not running.

---------------------------------------------------------------------------
Test Case 11 - Controller node OpenStack service down - Controller Restart
---------------------------------------------------------------------------

Short name
----------

yardstick.ha.controller_restart

Yardstick test case: opnfv_yardstick_tc025.yaml

Use case specification
----------------------

This test case verifies that the high availability of the controller node is
working properly.
Specifically, this test case shuts down a specified controller node via IPMI,
then checks whether all services provided by the controller node are still
available, using some monitor tools.

Test preconditions
------------------

In this test case, an attacker called "host-shutdown" is needed.
This attacker includes two parameters: fault_type and host.

The purpose of this attacker is to shut down a controller node and check whether
the services handled by this controller are still working normally.
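
A simplified shell equivalent of this "host-shutdown" attacker is sketched
below; the BMC address and credentials are placeholders, and the actual attacker
is implemented in the Yardstick framework.

.. code-block:: bash

   #!/bin/bash
   # Simplified "host-shutdown" attacker sketch: power off the target controller
   # node remotely through its BMC using IPMI.
   BMC_IP=${1:?usage: $0 <bmc_ip> <user> <password>}
   BMC_USER=$2
   BMC_PASS=$3
   ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -P "$BMC_PASS" chassis power off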

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for verifying service continuity and recovery
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''

In order to verify this service, one monitor is used.

This monitor executes the OpenStack command of the OpenStack component whose
service we want to verify is still running normally.

In this test case, there is one metric: service_outage_time, which indicates
the maximum outage time (seconds) of the specified OpenStack command request.

Test execution
''''''''''''''
* Test action 1: Connect to Node1 through SSH, and check that controller services
  are running normally
* Test action 2: Start monitors: each monitor will run as an independent
  process, monitoring the image list, router list, stack list and volume list
  respectively.
  The monitor info will be collected.
* Test action 3: Using IPMI, Node1 is shut down remotely.
* Test action 4: Stop monitors after a period of time specified by "waiting_time".
  The monitor info will be aggregated.
* Test action 5: Verify the SLA and set the verdict of the test case to pass or fail.


Pass / fail criteria
''''''''''''''''''''

Check whether the SLA is passed:

- The process outage time is less than 30s.
- The service outage time is less than 5s.

The controller operations are carried out in the above order and no errors occur.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

The controller node has been restarted.

----------------------------------------------------------------------------
Test Case 12 - OpenStack Controller Virtual Router Service High Availability
----------------------------------------------------------------------------

Short name
----------

yardstick.ha.neutron_l3_agent

Yardstick test case: opnfv_yardstick_tc058.yaml

Use case specification
----------------------

This test case verifies the high availability of virtual routers (L3 agent)
on the controller node. When a virtual router service on a specified controller
node is shut down, this test case checks whether the network connectivity of
the virtual machines is affected, and whether the attacked virtual router
service is recovered.

Test preconditions
------------------

There is more than one controller node providing the Neutron API extension
called "neutron-l3-agent", the virtual router service API.

In the following description, one controller node is denoted as Node1.

Basic test flow execution description and pass/fail criteria
------------------------------------------------------------

Methodology for verifying service continuity and recovery
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''

The high availability of "neutrol-l3-agent" virtual router service is evaluated
by monitoring service outage time and process outage time.

Service outage is tested by pinging virtual machines; the ping checks that the
network routing of the virtual machines is working.
When the ping response fails, the virtual router service is considered to be in
outage. The time between the first response failure and the last response
failure is taken as the service outage time.
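
A sketch of such a ping-based monitor is shown below; the target VM address,
the one-second interval and the ``ping`` options are illustrative assumptions.

.. code-block:: bash

   #!/bin/bash
   # Ping-based service monitor sketch: ping a VM reachable through the virtual
   # router and record for how long ping responses are missing.
   target_ip=${1:?usage: $0 <vm_ip>}
   first_fail=""; last_fail=""
   trap 'echo "service outage time: $(( ${last_fail:-0} - ${first_fail:-0} )) s"' EXIT
   while true; do
       if ! ping -c 1 -W 1 "$target_ip" > /dev/null 2>&1; then
           now=$(date +%s)
           [ -z "$first_fail" ] && first_fail=$now
           last_fail=$now
       fi
       sleep 1   # polling interval (illustrative)
   done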

Process outage time is tested by checking the status of processes of "neutron-l3-agent"
service on the selected controller node. The time from those processes being
killed to the time of those processes being recovered is the process outage time.

Process recovery is verified by checking the existence of processes of
"neutron-l3-agent" service.

Test execution
''''''''''''''
* Test action 1: Boot two host VMs. These two hosts are in two different
  networks, and the networks are connected by a virtual router.
* Test action 2: Start monitors: each monitor will run as an independent process.
  The monitor info will be collected.
* Test action 3: Run the attacker: connect to the host through SSH, and then
  execute the kill process script with the param value specified by "process_name"
* Test action 4: Stop monitors after a period of time specified by "waiting_time".
  The monitor info will be aggregated.
* Test action 5: Verify the SLA and set the verdict of the test case to pass or fail.

Pass / fail criteria
''''''''''''''''''''

Check whether the SLA is passed:

- The process outage time is less than 30s.
- The service outage time is less than 5s.

A negative result will be generated if the above criteria are not fully met.

Post conditions
---------------

Delete image with "openstack image delete neutron-l3-agent_ha_image".

Delete flavor with "openstack flavor delete neutron-l3-agent_ha_flavor".