Requirement
===========

1. When deploying a large OPNFV/OpenStack cluster, we would like to take
   advantage of UDP multicast to avoid the network bottleneck that arises
   when the Kolla container images are distributed from one Installer Server
   to all target hosts over unicast.

2. When it comes to auto scaling (extension) of compute nodes, using unicast
   is acceptable, since the number of nodes in that situation is usually
   small.

The basic steps to introduce multicast into deployment are:

a. Still set up the monopolistic docker registry server on the Daisy server
   as a failsafe.

b. The Daisy server, as the multicast server, prepares the image file to be
   transmitted and counts how many target hosts (the multicast clients)
   should receive the image file simultaneously.

c. The multicast clients tell the multicast server that they are ready to
   receive the image.

d. The multicast server transmits the image over the UDP multicast channel.

e. The multicast clients report success after receiving the whole image.

f. Set up a docker registry server on each target host based upon the
   received docker image.

g. Set up Kolla ansible to use 127.0.0.1 as the registry server IP so that
   the actual docker container retrieval only takes place inside the target
   hosts.


Design
======

Methods to achieve
------------------

TIPC
++++

TIPC, or a wrapper around it such as ZeroMQ, is good at multicast, but it is
not suitable for an installer:

1. The default TIPC kernel module shipped with CentOS 7 (kernel version 3.10)
   is NOT stable, especially for L3 multicast (we could use L2 multicast
   instead, but the network would then be limited to L2). If errors happen,
   it is hard to recover a node from a kernel panic.

2. TIPC's design assumes a stable node cluster environment, especially a
   lossless Ethernet, but real environments are generally not like that.
   When multicast is broken, the installer should switch to unicast, but
   TIPC currently does not have such a capability.

Top level design
----------------

1. There are two kinds of threads on the server side: one UDP multicast
   thread and one or more TCP sync/retransmit threads. There can be more
   than one TCP thread because a single TCP thread can only serve a limited
   number of clients (say 64~128) in order to limit the CPU load and the
   unicast retransmit network usage.

2. There is only one thread on the client side.

3. All the packets that a client lost during UDP multicast are requested by
   the client from the TCP thread and resent over TCP unicast. If unicast
   still cannot deliver the packets successfully, the client falls back to
   using the monopolistic docker registry server on the Daisy server as a
   failsafe option.

4. Each packet needs a checksum (see the header sketch below).
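The on-wire packet layout is not fixed by this document. As a minimal sketch,
assuming each multicast packet carries a buffer id, a packet id within the
buffer, the payload length and a CRC32 checksum over the payload, the header
could look like this (struct and field names are illustrative only; crc32()
is taken from zlib, which is assumed to be available)::

    #include <stdint.h>
    #include <zlib.h>   /* crc32(); assumes zlib is available */

    /* Hypothetical per-packet header prepended to every multicast payload.
     * All fields are sent in network byte order. */
    struct packet_header {
        uint32_t buffer_id;   /* buffer this packet belongs to           */
        uint32_t packet_id;   /* index of the packet inside the buffer   */
        uint32_t data_len;    /* payload bytes following this header     */
        uint32_t checksum;    /* CRC32 over the payload                  */
    };

    /* Checksum the payload; the receiver recomputes and compares it. */
    static uint32_t packet_checksum(const uint8_t *data, uint32_t len)
    {
        return (uint32_t)crc32(0L, data, len);
    }

With such a header, a client can detect both corrupted packets (checksum
mismatch) and missing packets (gaps in packet_id) before asking its TCP
thread for retransmission.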
UDP Server Design (runs on Daisy Server)
----------------------------------------

1. The multicast group IP and port should be configurable, as well as the
   interface that will be used as the egress of the multicast packets. The
   user passes the interface's IP as the handle to find the egress.

2. The image data to be sent is passed to the server through stdin.

3. Considering that the image is large (several GB), the server cannot
   pre-allocate one buffer to hold the whole image at once. Besides, since
   the data comes from stdin, its actual length is unpredictable. So the
   server splits the data into small buffers and sends them to the clients
   one by one. Furthermore, each buffer is divided into packets whose size
   is the MTU, including the UDP/IP header. The buffer size can then be,
   for example, 1024 * MTU (including the UDP/IP headers).

4. After sending one buffer to the clients, the server stops and collects
   feedback from the clients to see whether all of them have got all the
   packets in that buffer. If a client lost any packets, it requests the
   server to resend them in a more reliable way (TCP).

5. When it gets EOF from stdin, the server sends a buffer whose size is 0 as
   an EOF signal to the clients to let them know that sending has ended.

TCP Server Design (runs on Daisy Server)
----------------------------------------

1. All TCP server threads and the only UDP thread share one process. The UDP
   thread is the parent thread, the first TCP thread is its child, the
   second TCP thread is the grandchild, and so on. Thus each TCP thread has
   exactly one parent and at most one child.

2. A TCP thread accepts connect requests from clients. The number of clients
   is predefined by a server cmdline parameter. Each TCP thread connects
   with at most, say, 64 clients; if there are more clients to connect, a
   child TCP thread is spawned by the parent.

3. Before the UDP thread sends any buffer to the clients, all TCP threads
   send the UDP multicast IP/port information to their clients.

4. During each buffer sending cycle, the TCP threads send a special protocol
   message to tell the clients about the size/id of the buffer and the id of
   each packet in it (see the message sketch after the Client Design
   section). After getting acknowledgements from all clients, the TCP
   threads signal the UDP thread to start multicasting the buffer over UDP.
   After multicasting has finished, the TCP threads notify the clients that
   the multicast is done, and wait for acknowledgements from the clients
   again. If a client requests retransmission, it is the responsibility of
   the TCP threads to resend the packets over unicast. If no retransmission
   is needed, the clients signal the TCP threads that they are ready for the
   next buffer.

5. Repeat step 4 if the buffer size was not 0 in the last round; otherwise,
   the TCP server shuts down the connection and exits.

Server cmdline usage example
----------------------------

./server <local_ip> <client_count> [port] < kolla_image.tgz

<local_ip> is used here to specify the multicast egress interface. Which
interface will be used by TCP is left to the route table to decide.

<client_count> indicates the number of clients, thus the number of target
hosts which need to receive the image.

[port] is the port that will be used by both UDP and TCP. A default value is
used if the user does not provide it.

Client Design (Target Host side)
--------------------------------

1. Each target host has only one client process.

2. The client connects to the TCP server according to the cmdline parameters
   right after start up.

3. After connecting to the TCP server, the client first reads the multicast
   group information from it, which is then used to create the multicast
   receive socket.

4. During each buffer receiving cycle, the client first reads the buffer
   info from the TCP server, prepares the receive buffer, and acknowledges
   to the TCP server that it is ready to receive. Then the client receives
   the buffer from the multicast socket until the TCP server notifies it of
   the end of multicast. By comparing the buffer info with the received
   packets, the client knows whether it has to send a retransmission request
   and whether it has to wait for retransmitted packets. After all packets
   have been received over UDP/TCP, the client flushes the buffer to stdout
   and tells the TCP server that it is ready to receive the next buffer.

5. Repeat step 4 if the buffer size was not 0 in the last round; otherwise,
   the client shuts down the connection and exits.
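The encoding of the TCP control messages (mcast group info, buffer info,
CLIENT_READY, SERVER_SENT, CLIENT_REQUEST, CLIENT_DONE) is not specified in
this document. A minimal sketch of how they might be declared in C, with
hypothetical type and field names, is::

    #include <stdint.h>

    /* Hypothetical message types for the TCP control channel. */
    enum msg_type {
        MSG_MCAST_GROUP_INFO = 1,  /* server -> client: multicast IP/port    */
        MSG_BUFFER_INFO      = 2,  /* server -> client: size/id of buffer    */
        MSG_CLIENT_READY     = 3,  /* client -> server: ready to receive     */
        MSG_SERVER_SENT      = 4,  /* server -> client: multicast finished   */
        MSG_CLIENT_REQUEST   = 5,  /* client -> server: packet ids to resend */
        MSG_CLIENT_DONE      = 6,  /* client -> server: buffer received OK   */
    };

    /* Fixed-size message head; a variable part (e.g. the list of lost
     * packet ids in MSG_CLIENT_REQUEST) follows it on the wire. */
    struct msg_head {
        uint32_t type;        /* one of enum msg_type                     */
        uint32_t buffer_id;   /* buffer this message refers to            */
        uint32_t buffer_len;  /* MSG_BUFFER_INFO: bytes in the buffer,
                                 0 signals EOF                            */
        uint32_t payload_len; /* bytes of variable data after this head   */
    };

Keeping every message a fixed-size head plus an optional variable part makes
it easy for both sides to read a complete message from the TCP stream before
acting on it.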
Client cmdline usage example
----------------------------

./client <local_ip> <server_ip> [port] > kolla_image.tgz

<local_ip> is used here to specify the multicast ingress interface. Which
interface will be used by TCP is left to the route table to decide.

<server_ip> indicates the TCP server IP to connect to.

[port] is the port that will be used both to connect to the TCP server and
to receive the multicast data.

Collaboration diagram among the UDP Server, the TCP Server (only one TCP
thread illustrated) and the Client:

  UDP Server                TCP Server                          Client
      |                         |                                 |
  init mcast group
  init mcast send socket
      --------------------------> accept clients
                                  <-------------connect------------
                                  -------send mcast group info----->
      <-------------------------- state = PREP
  do {
    read data from stdin
    prepare one buffer
      --------------------------> state = SYNC
                                  -------send buffer info---------->
                                  <-------send CLIENT_READY---------
      <-------------------------- state = SEND
      ===================send buffer over UDP multicast============>
      -------------------------->
                                  -------send SERVER_SENT---------->
                                 [<-------send CLIENT_REQUEST-------]
                                 [---send buffer over TCP unicast-->]
                                                            flush buffer to stdout
                                  <-------send CLIENT_DONE----------
      <-------------------------- state = PREP
  } while (buffer.len != 0)
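Both ends select the multicast interface by its local IP address, as
described in the cmdline sections above. A minimal sketch of the
corresponding socket setup (standard IP_MULTICAST_IF on the sender and
IP_ADD_MEMBERSHIP on the receiver; the function names are placeholders and
error handling is omitted for brevity) is::

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Server side: route multicast out of the interface owning local_ip. */
    static int open_mcast_send_socket(const char *local_ip)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct in_addr egress;

        inet_pton(AF_INET, local_ip, &egress);
        setsockopt(sock, IPPROTO_IP, IP_MULTICAST_IF, &egress, sizeof(egress));
        return sock;
    }

    /* Client side: join the group on the interface owning local_ip. */
    static int open_mcast_recv_socket(const char *group_ip, uint16_t port,
                                      const char *local_ip)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr;
        struct ip_mreq mreq;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        bind(sock, (struct sockaddr *)&addr, sizeof(addr));

        inet_pton(AF_INET, group_ip, &mreq.imr_multiaddr);
        inet_pton(AF_INET, local_ip, &mreq.imr_interface);
        setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
        return sock;
    }

Only the UDP multicast path is pinned to the interface given on the command
line; which interface carries the TCP control traffic is still decided by
the route table, exactly as stated above.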