TC10.13 - Back office Cluster Container Scalability and Monitoring Scalable Infrastructure with Elastic Management of applications and processes.

Test Case Title

TC10.13 - Back office Cluster Container Scalability and Monitoring Scalable Infrastructure with Elastic Management of applications and processes.

Goal

Manage resource allocation for VMs and Containers: CPU, Storage, Memory.

Manage all components of the Snap4City solution in terms of VMs, Hosts and Containers.

Manage Containers in the back office.

Monitor when the system identifies that a scale-up or scale-down is needed, and how it is applied in terms of VMs (allocation, disposal), Containers (allocation, disposal) and compaction. The scaling is quasi-linear with the number of users/containers.

Prerequisites

A PC or mobile device with a web browser. The Snap4City parallel, distributed, robust and scalable architecture is installed, up and running. DISCES + EM (Elastic Management) is active and operative. Resource monitoring is operative.

The following functionalities are available only to Snap4City users with specific privileges.

Expected successful result

Snap4City scalability allows resources to be managed dynamically, scaling up and down according to the workload in terms of actions to be managed and Containers to be served.

Steps

 

 

Please note that some of the following links may be accessible only to registered users.

From the technical point of view, the solution can pass from small to large cities very easily, remaining efficient and convenient in both cases, according to the number of users, processes, and functionalities activated. The solution can also be hosted on multiple Datacenters, allowing it to cover much larger deployments than a single large city or region. The Snap4City platform is able to scale to big cities like London or Berlin, and can equally be deployed for small cities. The solution is modular and scalable, as is the Km4City solution on which it is based, whose scalability was demonstrated in [FGCS] and which currently covers the whole Tuscany Region. The passage from small to large deployments is made possible by the distributed and scalable solutions in the backend (Marathon, Mesos, Docker, HDFS, DISCES, ...). The scalability is demonstrated in the following pages by assessing the scalability of IOT Applications and their integration with Marathon/Mesos/Docker in the context of IOT processes.

The Snap4City architecture has been described in TC10.2.

The Snap4City solution runs in HA/DRS/FT and supports both vertical and horizontal scaling, as described in TC10.3.

The Snap4City solution allows one to:

  • Monitor the resource status on cloud, Hosts and VMs, via vSphere and also via ResDash;
  • Monitor the resource consumption and status of Containers via the Marathon and Docker APIs. Please note that Containers may include: IOT Applications in NodeRED with their corresponding user interface and editor, ETL processes, and Data Analytics processes (for example in R Studio, Java, Python, etc.):
    • Verifying their resource consumption and, if needed, increasing the memory and CPU of the VM; the VMs work in overbooking, and the movement of a VM to a different host with more resources is performed at the level of HA/DRS by the cloud manager;
    • Assessing their status (healthy or unhealthy, as usually monitored) and controlling how many Containers are delayed, waiting, running, etc.
  • Schedule and plan periodic processes, such as those managing ETL data ingestion from city operators and external services. A trade-off can be found by delegating repetitive processing to batch processes, thus avoiding their repeated execution on NodeRED. The AMMA tool can identify which parts of the IOT Applications can be more profitably executed as batch processing. Batch processing can be performed by scheduled ETL and/or NodeRED.
  • Control the resource status on cloud, Hosts and VMs, via the vSphere API, acting on them for allocating/deallocating VMs on Hosts (an indicative sketch of the vertical scaling action via the vSphere API is reported below, after this list):
    • VM: creation from templates, boot and shutdown, cloning and deploying, migration, taking snapshots, etc., and also turning network connections on/off, Horizontal Scaling;
    • VM: changing the limit for Memory and CPU resources, Vertical Scaling;
    • Control the Containers on the VMs: move them to a different VM, kill and restart them if they are unhealthy, and/or provide them with more resources to resolve their condition. When a Container is reallocated, a different amount of memory and CPU can be assigned.
  • Perform elastic management on the basis of the processes that are nominally active 24/7, such as IOT applications, but are in reality active only on demand of the user, or data driven. They are therefore almost always allocated, but the real consumption of CPU resources is sporadic for most of them, so a certain overbooking can be performed, for example of 15%. When requests arrive, they have to be ready to react. This also means that a strong optimization of resources is needed, since a static allocation would be too expensive for the provider/city.
     

     
Assessment of the real exploitation of resources, adaptation of the workload per VM/Host, and balancing are used to manage the elasticity in terms of the number of back office Hosts/VMs active. We have large experience in smart city and cloud performance in exploiting resources, as assessed in papers [FGCS], [ICARO] and [Cloud]. The scaling of ETL, IOT Applications and Data Analytics from a few instances to a large number of them has been realized with a quasi-linear increment of costs and without technical problems, for example passing from 5000 processes per day to 50.000, 500.000, ... See for example publication [FGCS], in which the performance assessment of the Km4City smart city solution, on which the proposed solution is based, has been conducted. Please note that DISCES and the elastic scheduler have direct access to the resource consumption of Hosts, VMs and Containers.

This means that on the back office we have a linear scalability with respect to the number of Nodes/Hosts for managing NodeRED Applications deployed as Containers. Elastic computing is performed at the level of Containers and at the level of VMs on cloud. Balancing is performed at the level of Containers and on the front end. Direct tests have demonstrated that a commercial host on cloud can host hundreds/thousands of Containers (for example 400 processes) with NodeRED, each of which can manage at least 120 in + 120 out messages per second, thus realizing about 100.000 events per second in total (400 × 240 ≈ 100.000; one MQTT message and one REST call per process), with a linear scalability provided by Mesos/Marathon, the balancer and the elastic management developed by the Snap4City team.
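As a purely indicative example of the vertical scaling action mentioned in the list above, the following sketch shows how the Memory/CPU limits of a cluster VM could be changed through the vSphere API using the pyVmomi library; the vCenter address, credentials and VM name are placeholders and not the actual Snap4City configuration. In the platform, actions of this kind are triggered by DISCES + EM rather than invoked manually.

    # Sketch of vertical scaling (changing CPU/Memory of a VM) via the vSphere
    # API with pyVmomi. Host, credentials and VM name are placeholders only.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.org", user="admin",
                      pwd="secret", sslContext=ctx)
    content = si.RetrieveContent()

    def find_vm(content, name):
        """Look up a VM by name in the vCenter inventory."""
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        try:
            return next(vm for vm in view.view if vm.name == name)
        finally:
            view.DestroyView()

    vm = find_vm(content, "iotapp-node-01")                    # hypothetical VM name
    spec = vim.vm.ConfigSpec(numCPUs=6, memoryMB=17 * 1024)    # new CPU/Memory limits
    task = vm.ReconfigVM_Task(spec=spec)                       # vertical scaling request
    Disconnect(si)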
     

New Monitoring Cluster

The new interface for monitoring the Cluster can be accessed by:

Main: Management --> Container Cluster Monitoring

Also accessible as:

In the user interface of the Container Cluster Monitoring status we have the following (an indicative sketch of how the main gauges can be derived from the Marathon API is reported after the list):

  • #Containers: a gauge reporting the number of Containers/IOT Applications
  • #Tasks: a gauge reporting the number of Containers/IOT Applications which are managed by Marathon. When a Container is moved, two instances are temporarily allocated, so the #Tasks can be higher than #Containers
  • #Healthy: number of Containers that are in the Healthy status
  • #Unhealthy: number of Containers that are in the Unhealthy status
  • CPU Trend: the trend of the total CPU computed for the cluster, in GHz
  • #VMs up: number of VMs which have been requested to be running by the scaling manager
  • #VMs running: number of VMs which are up and running for the container manager
  • CPU: mean amount of CPU consumed in the last few minutes, in GHz
  • Memory: mean amount of Memory consumed in the last few minutes, in GBytes
  • Memory Trend: the value of the total Memory used by the VMs activated in the cluster, in GBytes
  • Memory usage % graph: a time trend of the percentage of Memory usage for each VM that has been activated. The list is dynamic; the ID of the VM corresponds to its IP on the 1.X cluster.
  • CPU usage % graph: a time trend of the percentage of CPU usage for each VM that has been activated. The list is dynamic; the ID of the VM corresponds to its IP on the 1.X cluster.
  • #Containers graph: the time trend of the #Containers: Total, Healthy and Unhealthy
  • The two buttons RESET CPU/MEM USAGE and RESET TASKS are used only for resetting the graphics.
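As an indicative sketch (the Marathon endpoint below is a placeholder), gauges of this kind can be derived from the standard Marathon REST API, which exposes per-application instance and health counters:

    # Sketch: derive #Containers/#Tasks/#Healthy/#Unhealthy style gauges from the
    # Marathon REST API (placeholder URL; fields follow the public /v2/apps schema).
    import requests

    MARATHON = "http://marathon.example.org:8080"   # placeholder endpoint

    apps = requests.get(f"{MARATHON}/v2/apps").json()["apps"]

    containers = sum(app["instances"] for app in apps)          # requested instances
    tasks      = sum(app["tasksRunning"] + app["tasksStaged"] for app in apps)
    healthy    = sum(app["tasksHealthy"] for app in apps)
    unhealthy  = sum(app["tasksUnhealthy"] for app in apps)
    # CPU/Memory requested by the apps; the dashboard converts/aggregates these
    # into the GHz/GBytes figures shown in the gauges (assumed mapping).
    cpu_cores  = sum(app["cpus"] * app["instances"] for app in apps)
    mem_gb     = sum(app["mem"] * app["instances"] for app in apps) / 1024.0

    print(f"#Containers={containers} #Tasks={tasks} "
          f"#Healthy={healthy} #Unhealthy={unhealthy} "
          f"CPU~{cpu_cores:.1f} cores Mem~{mem_gb:.1f} GB")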

This view allows monitoring the status of the Cluster for Containers and may inform you if some critical situation occurs. Notifications may reach you via email, Telegram, SMS, Twitter, etc.

In the status presented in the figure, we have 850 Containers, consuming about 77 GHz and 86 GBytes of memory. These data have been collected synchronously with the data reported in the next figure regarding the status of the VMs and the traffic of data among Containers, amounting to more than 1 million messages per minute in input and 1 million in output.

Please note that, according to the graph reported in the bottom right corner, the system presented three situations in which a small number of unhealthy Containers appeared. Those critical cases were provoked by us to stress the reaction to failure and the scalability, in terms of allocating/deallocating VMs when needed. The evidence and the test cases continue in the next section.


New solution for Stress Testing the Cluster

In addition, the system can be stressed for testing at https://www.snap4city.org/scalabilitytest2/ui/#/1 by:

  • adding/removing a number of Applications/Containers,
  • generating a number of MQTT messages toward them (a sketch of such a generator is reported after this list),
  • and observing additional values about those Containers/IOT Applications, in particular:
    • #msg received/min: the number of messages received in the last minute
    • #msg sent/min: the number of messages sent in the last minute
    • #msg in output/input: the number of messages in output in the last minute
    • Total MSG: total number of messages since the beginning of the experiment, i.e., from the last restart of the counting
    • Reset Total: resets the counters of the testing processes only
    • Reset Chart: resets the graphs and views of this panel
    • Start 100 Test Apps: starts 100 Testing Containers with IOT Applications
    • #Apps: reports the total number of Containers
    • Stop 100 Test Apps: stops 100 Testing Containers with IOT Applications, among those put in execution for testing only, without affecting those created by the users
    • Start MQTT MSG: starts about 44.000 MQTT messages per minute toward the IOT Applications allocated for testing at that moment (when we have 800 Containers)
    • Start MQTT MSG/2: starts about 22.000 MQTT messages per minute toward the IOT Applications allocated for testing at that moment (when we have 800 Containers)
    • #mqtt gens: number of MQTT messages generated
    • Stop last MQTT Gen: stops the last process activated for sending messages
    • Stop all MQTT MSGS: stops all the processes for sending messages
  • In addition, the panel reports several graphs for monitoring the status:
    • #VMs Graph: the trend of the number of VMs active over time
    • #mqtt msgs: the trend of the number of MQTT messages over time: sent and received
    • Tasks per VM Graph: the trend of the number of Tasks per VM over time
    • Healthy Containers per VM Graph: the trend of the number of Healthy Containers per VM over time
    • Unhealthy Containers per VM Graph: the trend of the number of Unhealthy Containers per VM over time
    • Docker Containers per VM Graph: the trend of the number of Docker Containers per VM over time
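As a hedged sketch of what a message generator such as Start MQTT MSG could look like, the following uses the paho-mqtt client; the broker address, topic scheme and rate are illustrative assumptions and not the actual Snap4City test scripts.

    # Sketch of an MQTT load generator toward the testing IOT Applications.
    # Broker, topics and message rate are illustrative assumptions only.
    import json
    import time
    import paho.mqtt.client as mqtt

    BROKER = "mqtt.example.org"          # placeholder broker
    N_APPS = 800                         # number of testing containers addressed
    RATE_PER_MIN = 44_000                # ~44.000 messages per minute overall

    client = mqtt.Client()               # paho-mqtt 1.x constructor; 2.x also needs a CallbackAPIVersion
    client.connect(BROKER, 1883)
    client.loop_start()

    interval = 60.0 / RATE_PER_MIN       # pause between messages
    i = 0
    while True:
        topic = f"testapp/{i % N_APPS}/in"                 # hypothetical topic scheme
        payload = json.dumps({"seq": i, "ts": time.time()})
        client.publish(topic, payload, qos=0)
        i += 1
        time.sleep(interval)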

The user can stress and test the system by activating a number of IOT applications and sending them a huge number of messages at a high rate.

Moreover, in the above figure, the stable condition of about 950 Containers/IOT Applications on the cluster is presented. They exchange 1.0 million messages per minute (in reality 2 million, since the same messages are both received and sent), running on 15 VMs.

In order to demonstrate the capability of elastic management, we have perturbed the system by disconnecting some VMs over time; in the above figure this has been done three times. See the last graph of the figure: at each disconnection a number of Containers become unhealthy and the system has to recover.

According to the above figure, the last two cases have been highlighted. As soon as some of the Containers become unhealthy, the scaling solution compensates by allocating more resources, which in this case corresponds to deploying and booting a new VM, passing from 15 to 16. This allows the healthy status to be recovered, as depicted in the last graph, and the number of VMs then returns to the original value of 15, as depicted in the #VMs graph.

The consumption of resources on cloud can also be monitored by using https://resdash.snap4city.org, which maintains the history of all measures.

In the following figure, the status of the vCenter cloud in the same time frame is reported, to show the condition of the VMs and give evidence that the VMs are effectively turned off and on when needed.


Script of the Cluster Monitoring Applications

You can access the new scripts via: https://www.snap4city.org/scalabilitytest2

Former Scalability on back office execution of Applications

In order to demonstrate the above functionalities, we have set up the real prototype in which:

  • IOT Applications can be requested by each single user. They are automatically deployed and put into a cluster of VMs with Docker Containers, managed by Marathon/Mesos
  • IOT Testing Applications can be instantiated in chunks of 100 by the administrator to test the solution; both the Containers requested by users and those for testing are allocated in the same cluster (an indicative sketch of this scaling step via the Marathon API is reported after this list)
  • IOT Testing Applications can be removed by performing a -100 from the allocation tool; this process respects those requested by the users
  • MQTT messages can be sent to the IOT Testing Applications, from a few messages per minute to millions of messages in the platform. This generates a high workload on the IOT Testing Applications, which run together with those of the users.
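As an indicative sketch of the "chunks of 100" mechanism (the application id and Marathon endpoint are placeholders, not the actual Snap4City deployment), the instance count of a testing application can be raised or lowered through the Marathon REST API:

    # Sketch: add/remove testing containers in chunks of 100 by changing the
    # "instances" count of a Marathon application (placeholder app id and URL).
    import requests

    MARATHON = "http://marathon.example.org:8080"   # placeholder endpoint
    APP_ID = "/testing/iotapp-stress"               # hypothetical testing app id

    def scale_testing_apps(delta):
        """Increase (delta=+100) or decrease (delta=-100) the testing instances."""
        app = requests.get(f"{MARATHON}/v2/apps{APP_ID}").json()["app"]
        target = max(0, app["instances"] + delta)
        # PUT with the new instance count asks Marathon to converge to 'target'
        r = requests.put(f"{MARATHON}/v2/apps{APP_ID}", json={"instances": target})
        r.raise_for_status()
        return target

    scale_testing_apps(+100)   # start 100 test apps
    scale_testing_apps(-100)   # stop 100 test apps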

This approach allows us to test the robustness and the scalability of the solution.

Each VM of the cluster executing Containers has 17 GBytes of RAM and 6 CPUs. A scalable number of these VMs, from 1 to 20, are ready to be used in the Cloud. The VMs are all generated from the same template.

The previous experiment, now not accessible, demonstrated the quasi-linear scalability.

 

This approach allows us to create workload on a number of VMs and reach their limit of resource consumption in terms of CPU and Memory, thus activating the DISCES + Elastic Management, running on the 130, which decides to turn the VMs on/off.
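For illustration only, the following sketch shows the kind of threshold logic an elastic manager could apply when deciding to turn VMs on or off; the thresholds and the decision function are hypothetical and do not reproduce the actual DISCES + Elastic Management rules.

    # Hypothetical threshold-based decision for turning cluster VMs on/off.
    # Thresholds and helper values are illustrative, not the DISCES + EM rules.
    CPU_HIGH, CPU_LOW = 0.80, 0.30      # fraction of total cluster CPU in use
    MEM_HIGH, MEM_LOW = 0.85, 0.35      # fraction of total cluster memory in use
    MIN_VMS, MAX_VMS = 1, 20            # the cluster template allows 1..20 VMs

    def decide(cpu_usage, mem_usage, vms_running):
        """Return +1 to boot a VM, -1 to shut one down, 0 to do nothing."""
        if (cpu_usage > CPU_HIGH or mem_usage > MEM_HIGH) and vms_running < MAX_VMS:
            return +1                    # scale out: deploy/boot one more VM
        if cpu_usage < CPU_LOW and mem_usage < MEM_LOW and vms_running > MIN_VMS:
            return -1                    # scale in: shut down one VM
        return 0                         # keep the current allocation

    # Example: 78% CPU, 88% memory on 15 running VMs -> request one more VM
    print(decide(0.78, 0.88, 15))        # prints 1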