TC7.5 - Developing Data Analytics Processes


Test Case Title

TC7.5 - Developing Data Analytics Processes

Goal

As AreaManager or higher level user, I can:

Develop and/or modify data analytics processes.

Develop a new Data Analytics Process in R, Java, Python, etc., using external services, direct access to the data store, or the Advanced Smart City API. This can be performed (i) by downloading the VM provided, putting it in execution and developing on it, or (ii) by accessing R Studio via the web.

Derive the correlations among data by activating, on the collected data sets, a set of algorithms that can identify correlations, anomalies, etc., and thus perform auto-tuning across the families of machine learning and statistical algorithms, in order to apply those that provide the best performance in terms of precision, AIC, ELBO, etc., according to the methods and context.

Analyse data for correlation, etc.

Get results from the process executed.

Prerequisites

Use a PC or mobile device with a web browser. Acquire a minimal skill in writing R programs. Get the R example provided; upload, modify and run it. Access R Studio remotely with your credentials. From R Studio it is possible to access the Data Store directly and/or to use the Smart City API.

Create the final package and upload it on the ProcessLoader for execution.

The following functionalities are available only for specific Snap4city users with specific privileges.

Expected successful result

Collect data from the Data Stores (more than one data set). Compute the correlations, creating correlation matrices; estimate the descriptive statistics; produce predictions based on ARIMA, using AUTO-ARIMA for the identification of the best model; and produce the graphics for data trends, trend comparisons, etc.

Steps

 

 

Please note that to correctly perform this Test Case you need access to the R Studio Virtual Machine as described below. To obtain access to a Virtual Machine running R Studio, please contact snap4city@disit.org.

 

  1. Go to the R-Studio portal following the link: https://rstudio.snap4city.org

     
  2. Sign in to R-Studio

     
  3. Access the directories inside R-Studio:
  • Click on the ‘Snap4City’ directory to access the directory that contains the R scripts
  • Click on the ‘Snap4CityStatistics’ directory to visualize all the R scripts

     



Fig:  Directories on R-Studio

Please note that the above picture shows a clean interface, since we left it in a clean state. If you enter the account after a colleague, you may find the status of their previous operations. In that case, follow the steps as reported, ignoring the state of the windows you may find; the procedure will work anyway.

 

  4. Click on the ‘Function.R’ script to open it in the Source pane, in the top-right corner of the window



Fig: R Scripts inside the ‘Snap4CityStatistics’ directory.

 

 



Fig: ‘Function.R’ Script opened on the Source pane.

 

  5. Run the R code lines - the steps to be performed are reported in the Code panel, on the top left of the window:
  • Select the ‘STEP 1’ code lines and click on Run: with STEP 1, all the required libraries are loaded into R



Fig: ‘Function.R’ Script opened on the Source pane: STEP 1

 

  • Select the ‘STEP 2’ code lines and click on Run: with STEP 2, a SPARQL query is executed to retrieve traffic flow data into R

  • Select the ‘STEP 3’ code lines and click on Run: with STEP 3, a SPARQL query is executed to retrieve car park data into R

  • Select the ‘STEP 4’ code lines and click on Run: with STEP 4, the data retrieved before are integrated and joined into a single dataset

  • Select the ‘STEP 5’ code lines and click on Run: with STEP 5, all the statistical analyses and predictions are executed. Note that the analysis is complete when “STEP 5 COMPLETED - STATISTICAL ANALYSIS PERFORMED” is displayed on the Code panel (a minimal R sketch of these five steps is reported after the figures below)

 

 



Fig: ‘Function.R’ Script opened on the Source pane: STEP 2 to STEP 5

 



Fig: ‘STEP 5 COMPLETED - STATISTICAL ANALYSIS PERFORMED’ message on the Code pane
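
For orientation, the following is a minimal sketch of what the five steps may look like in R. The SPARQL package, the endpoint URL, the query bodies and the column names are assumptions used only for illustration; the actual ‘Function.R’ script in the Virtual Machine remains the reference.

    # STEP 1 - load the required libraries (an assumed minimal set)
    library(SPARQL)    # to run SPARQL queries against the knowledge base
    library(forecast)  # ARIMA / auto.arima models used in STEP 5

    # STEP 2 - hypothetical SPARQL query retrieving traffic flow observations
    endpoint  <- "https://servicemap.disit.org/WebAppGrafo/sparql"              # assumed endpoint
    traffic_q <- "SELECT ?time ?sensor ?avgSpeed ?vehicleFlow WHERE { ... }"    # query body omitted
    traffic   <- SPARQL(endpoint, traffic_q)$results

    # STEP 3 - hypothetical SPARQL query retrieving car park observations
    parks_q <- "SELECT ?time ?carPark ?freeParkingLots WHERE { ... }"           # query body omitted
    parks   <- SPARQL(endpoint, parks_q)$results

    # STEP 4 - integrate the two result sets into a single dataset on the timestamp
    sensors <- merge(traffic, parks, by = "time")

    # STEP 5 - descriptive statistics, correlations and predictions
    print(summary(sensors))
    num  <- sensors[sapply(sensors, is.numeric)]
    corr <- cor(num, use = "pairwise.complete.obs")
    print(round(corr, 2))
    one_park <- parks[parks$carPark == parks$carPark[1], ]             # a single garage
    fit  <- auto.arima(ts(one_park$freeParkingLots, frequency = 96))   # 15-minute slots
    pred <- forecast(fit, h = 4)                                       # one hour ahead
    cat("STEP 5 COMPLETED - STATISTICAL ANALYSIS PERFORMED\n")
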

 

  6. Statistical Analysis Results visualization:
  • Click on ‘Snap4City’ to go back to the main directory

               



Fig: From the ‘Snap4CityStatistics’ directory to the ‘Snap4City’ directory

 

  • Click on the ‘StatisticsOutput’ directory, inside the ‘Snap4City’ directory, to visualize statistical analysis results and trend graphs as .png files



Fig:  ‘StatisticsOutput’ directory with the statistical analysis results

 

  • Click on each .png file contained in the ‘StatisticsOutput’ directory to visualize the statistical analysis results and trend graphs: a new tab is opened for each file



Note that the ‘StatisticsOutput’ directory also contains the results in .csv format, which are not shown in the figures. It is possible to export and save them by checking the respective file’s box.

 



Fig:  .png files visualization in a new tab

 

  7. Statistical Analysis Results saving and exporting:
  • Select all the .png files by checking each file’s box
  • Click on ‘More’ and then on ‘Export…’ to save all the statistics in .zip format: save the .zip on your PC

 

Fig:  .png files saving and exporting in .zip format

 

  8. Comments on the Statistical Analysis Results
  • AverageSpeedDailyTrend.png - shows the daily trend of the average speed, measured by five different traffic sensors in different areas of the city of Florence. Trends are computed considering the day before the R script execution (a plotting sketch is reported after the figure below)

 



Fig:  Average speed daily trend measured by five different sensors
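
As an illustration only, a daily trend plot of this kind can be produced in base R roughly as follows; the data frame and column names (traffic, time, sensor, avgSpeed) are assumptions and not necessarily those used by the actual script.

    # Hypothetical input: traffic with columns time (timestamp), sensor (id), avgSpeed
    traffic$time <- as.POSIXct(traffic$time)
    ids <- head(unique(traffic$sensor), 5)   # five sensors, as in the figure

    png("AverageSpeedDailyTrend.png", width = 900, height = 600)
    plot(NULL, xlim = range(traffic$time), ylim = range(traffic$avgSpeed, na.rm = TRUE),
         xlab = "Time of day", ylab = "Average speed",
         main = "Average speed daily trend - five traffic sensors")
    for (i in seq_along(ids)) {
      s <- traffic[traffic$sensor == ids[i], ]
      s <- s[order(s$time), ]
      lines(s$time, s$avgSpeed, col = i)
    }
    legend("topright", legend = ids, col = seq_along(ids), lty = 1)
    dev.off()
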

 

  • VehicleFlowDailyTrend.png - shows the daily trend of the vehicle flow, measured by five different traffic sensors in different areas of the city of Florence (the same sensors considered for the average speed). Trends are computed considering the day before the R script execution

 



Fig:  Vehicle flow daily trend measured by five different sensors

  • CarParksDailyTrend.png - shows the daily trend of the number of free parking lots, measured in five different car parking garages in different areas of the city of Florence. Trends are computed considering the day before the R script execution



Fig:  Free parking lots daily trend measured in five different car parking garages

  • CorrelationMatrix.png – shows the correlations between each traffic sensor variable (related to the average speed and vehicle flow, measured every 10 minutes by five different sensors) and each car parking variable (related to the number of free slots in five car parking garages, measured every 15 minutes). Note that a correlation close to 1 or to -1 is a strong correlation (blue and red colours respectively); a computation sketch is reported after the figure below.



Fig:  Correlation Matrix
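
For reference, a correlation matrix of this kind can be computed and plotted, for example, with base R and the corrplot package; ‘sensors’ is the hypothetical joined dataset used in the earlier sketch.

    library(corrplot)  # assumed plotting package; any heat-map style plot would work

    # Keep only the numeric traffic and car park variables of the joined dataset
    num  <- sensors[sapply(sensors, is.numeric)]
    corr <- cor(num, use = "pairwise.complete.obs")

    # Values close to 1 or -1 indicate strong positive / negative correlation
    png("CorrelationMatrix.png", width = 800, height = 800)
    corrplot(corr, method = "color", tl.cex = 0.7)
    dev.off()
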

 

  • SensorsMeanPerDayMoment.png – shows the mean for each sensor, computed for the morning (from 6:00 to 13:00), afternoon (from 14:00 to 18:00), evening (from 19:00 to 23:00) and night (from 00:00 to 05:00); an aggregation sketch is reported after the figure below



Fig:  Table with the mean of each sensor, computed per day moment (afternoon, evening, morning, night)
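
Such a table can be obtained by binning the timestamps into day moments and aggregating; below is a minimal sketch with hypothetical column names (the same grouping can be reused for the per-day-moment statistics reported further below).

    # Hypothetical input: sensors with a POSIXct time column and numeric sensor columns
    hour   <- as.integer(format(sensors$time, "%H"))
    moment <- cut(hour,
                  breaks = c(-1, 5, 13, 18, 23),
                  labels = c("night", "morning", "afternoon", "evening"))

    # Mean of every numeric column, grouped by day moment
    num <- sensors[sapply(sensors, is.numeric)]
    means_per_moment <- aggregate(num, by = list(dayMoment = moment), FUN = mean, na.rm = TRUE)
    print(means_per_moment)
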

 

  • StatisticsBySensors.png – shows the main statistics for each sensor, considering the measurements collected over 24 hours



Fig: Table with the main statistics computed for each sensor

 

  • StatisticsBySensors.png – shows the main statistics for each sensor, computed for the morning (from 6:00 to 13:00), afternoon (from 14:00 to 18:00), evening (from 19:00 to 23:00) and night (from 00:00 to 05:00)



Fig:  Table with the main statistics computed for each sensor per day moment

 

  • PREDICTION ANALYSIS RESULTS:

PredictedFreeParkingLots.png - shows the results of the predictions for a car parking garage (S. Maria Novella Station in Florence) one hour in advance, in slots of 15 minutes, starting from the moment of the R script execution. For example, if today’s date is 2018-02-16 and 12:25 is the last collected observation before the execution of the R script, the predictions computed by the model are those reported in the figure below.



Fig: Table with the predictions for the ‘S.M.N station’ car parking garage in Florence

The prediction model used to compute the predictions is the statistical ARIMA (Auto-Regressive Integrated Moving Average) model. The Auto-Regressive (AR) part of the model forms the basis of the prediction, which can be improved by a Moving Average (MA) modelling of the errors made at previous prediction instants. Note that a two-week period of data was considered for the forecast.

A specific function has been used to automatically decide the best ARIMA model for the time series to analyse. The results of the specific ARIMA model are printed on the code pane and saved into the ‘StatisticsOutput’ directory in two formats: ‘ARIMAcoefficients.png’ and ‘ARIMAcoefficients.csv’.

The best ARIMA model is selected by using the minimum AIC rule.
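
In R this can be done, for instance, with the auto.arima() function of the forecast package, which searches over the ARIMA orders and selects the model minimising an information criterion (AIC/AICc). A minimal sketch, assuming the column name of the two-week dataset:

    library(forecast)

    # Two weeks of free-parking-lot observations, one every 15 minutes (96 per day)
    smn <- read.csv("CarParksSMN2weeks.csv")
    y   <- ts(smn$freeParkingLots, frequency = 96)   # column name is an assumption

    fit <- auto.arima(y)          # best ARIMA orders chosen by minimum information criterion
    print(summary(fit))           # the ARIMA coefficients, as in ARIMAcoefficients.png/.csv

    # One hour ahead = four 15-minute slots
    pred <- forecast(fit, h = 4)
    print(pred)
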

Note that, to make predictions, the model can be replaced with a predictive machine learning model in order to automatically take into account the possible effects of other variables. 
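
As a purely illustrative example of such a replacement (not part of Function.R), a regression model such as a random forest could be trained on the joined dataset, using the traffic variables as additional predictors of the free parking lots; all names below are assumptions.

    library(randomForest)  # assumed alternative model, not used by the actual script

    # sensors: hypothetical joined dataset with freeParkingLots, avgSpeed, vehicleFlow
    train <- na.omit(sensors[, c("freeParkingLots", "avgSpeed", "vehicleFlow")])
    rf <- randomForest(freeParkingLots ~ avgSpeed + vehicleFlow, data = train, ntree = 200)

    # Predict free parking lots from new traffic observations (here, a few training rows)
    predict(rf, newdata = train[1:4, ])
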



Fig: Table with coefficients of the ARIMA model

  9. Datasets: all the datasets used to compute the statistical analysis and predictions are saved in the ‘Sensors Data’ directory, contained in the ‘Snap4City’ directory (see the figure below; a loading sketch is reported after it).

     The subdirectories ‘CarParkCSVFiles’ and ‘TrafficCSVFiles’ contain the data of each single sensor, collected over a period of 24 hours.

     ‘TrafficSensorsDataset.csv’ and ‘CarParkDataset.csv’ are the datasets related to all the traffic flow sensors and all the car park sensors respectively.

     The integrated dataset that contains both traffic sensor and car park data is ‘SensorsDataset.csv’, and it is used to compute the statistical analysis.
  • ‘CarParksSMN2weeks.csv’ is the dataset used to compute the predictions.

 



Fig:  Sensors Data directory
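
These files can be reloaded in R in the usual way, for example as follows (paths are relative to the ‘Snap4City’ directory and should be adjusted to your working directory).

    # Per-family datasets, 24 hours of observations
    traffic <- read.csv("Sensors Data/TrafficSensorsDataset.csv")
    parks   <- read.csv("Sensors Data/CarParkDataset.csv")

    # Integrated dataset used for the statistical analysis
    sensors <- read.csv("Sensors Data/SensorsDataset.csv")

    # Two-week dataset used for the predictions
    smn <- read.csv("Sensors Data/CarParksSMN2weeks.csv")

    str(sensors)   # inspect the joined traffic + car park variables
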

  • Comparing by using statistical tools. Once the data are selected, the system can save the query to be performed periodically or on demand. All the queries performed on the Developer Dashboard provide a compatible output. The outputs can be composed and sent to a preformed R tool to perform a statistical analysis including descriptive statistics, PCA, correlation, and automated identification of the most correlated trends. Similarly, other tools can be developed to perform daily, weekly and monthly analyses, predictions, anomaly detection, etc. Please see the example located into
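
Such a preformed R tool could, for instance, combine descriptive statistics, PCA, correlation and the automated identification of the most correlated pair of variables as sketched below; the data frame name ‘dataset’ and its content are assumptions about the Developer Dashboard output.

    # Hypothetical compatible output already loaded into a data frame 'dataset'
    num <- na.omit(dataset[sapply(dataset, is.numeric)])

    print(summary(num))                       # descriptive statistics
    pca <- prcomp(num, scale. = TRUE)         # principal component analysis
    print(summary(pca))

    corr <- cor(num)                          # correlation among variables
    corr_no_diag <- corr
    diag(corr_no_diag) <- 0
    # Automated identification of the most correlated pair of trends
    idx <- which(abs(corr_no_diag) == max(abs(corr_no_diag)), arr.ind = TRUE)[1, ]
    cat("Most correlated variables:", colnames(num)[idx[1]], "and", colnames(num)[idx[2]], "\n")
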