TC6.8 - ETL processes for data transformation, and exploiting MicroServices/API/RestCall


Test Case Title

TC6.8 - ETL processes for data transformation, and exploiting MicroServices/API/RestCall

Goal

Snap4City provides support for real-time analytics, which is performed by a set of services. Among them is the possibility to:

Create ETL processes capable of transforming data.

Get data from a SigFox gateway.

Ingest data sets with real-time data, connecting them to the Knowledge Base.

Prerequisites

A PC or mobile device with a web browser

The following functionalities are available only to Snap4City users with specific privileges.

Expected successful result

Create an ETL process for data transformation directly in the snap4city context

Steps

 

 

Please note that to correctly perform this Test Case you need access to the ETL Virtual Machine as described below. To get access to a Virtual Machine for ETL, please contact snap4city@disit.org.

 

User roles for which this test case is available: AreaManager, ToolAdmin, RootAdmin
User roles for which the test case is not available: Manager
User used to describe the example: snap4city

Examples:

  • Login on the snap4city platform
  • Open the VM from the snap4city home page
  • Create the ETL processes DIRECTLY from the Virtual Machine in the snap4city portal

User roles for which this test case is available: All
User used to describe this example: ‘snap4city’ (role: AreaManager)

  • Go to the snap4city home page
  • Log in
  • The snap4city home page appears
  • Now you can select one of the functionalities in the left panel
    • Note that the list of functionalities depends on the user role (and on the related permissions).

 


Example 1: Open the VM from the snap4city home page

User roles for which this test case is available: AreaManager, ToolAdmin, RootAdmin
User used to describe this example: ‘snap4city’ (role: AreaManager)

  • Go to the snap4city home page
  • Log in
  • If the user has developer permissions (as provided for the ‘AreaManager’ role), he/she finds the menu ‘Development Tools > ETL development’
  • Click on the menu and insert the correct password (the same credentials two times)
     

    Fig.: VNC connection.

     

    Fig.: Virtual machine connection.

 

 


Example 2: Create the ETL processes DIRECTLY from the Virtual Machine in the snap4city portal

User roles for which this test case is available: AreaManager, ToolAdmin, RootAdmin
User used to describe this example: ‘snap4city’ (role: AreaManager)

  • Log in on the snap4city portal and open the Virtual Machine (Example 1)
  • To create an ETL process, it is necessary to use the Pentaho Kettle tool Spoon (http://www.pentaho.com), installed on the Virtual Machine available in the snap4city portal

To create an ETL process, follow these steps:

  • Locate a dataset of interest, which can be static (e.g., ‘http://www.nk.hel.fi/avoindata/avoin_data-avustukset.xls’) or real-time; then you can create the ETL (Extract, Transform and Load) process to ingest the data.
  • Open Spoon via a Terminal Emulator (search for the icon at the bottom of the screen or click on the ‘Application’ menu > ‘Terminal Emulator’), then write the command ‘spoon.sh’ and press Enter.


Fig.: Virtual Machine (VM) Terminal Emulator (command line).

 

  • The Spoon interface will open; now it is possible to create/load ETL processes.


Fig.: Spoon Interface.

 

  • To create a new ETL, double-click on the ‘Jobs’ tab (or right-click and select ‘New’) in the left menu.


Fig.: Spoon: searching the menu for a new job.

 

  • Then a new job is created


Fig.: Spoon new job.

 

  • Now you can create your ETL by connecting the steps (or blocks) available in the left menu (the connection is realized with hops, i.e. links).
    • Kettle is based on two key concepts (from an operating point of view):
      • Job (with extension “.kjb”);
      • Transformation (with extension “.ktr”), composed of several steps.

In Kettle, the data are rows that flow from one step to another. A job manages the sequential execution of lower-level entities: transformations or other jobs.

  • Transformations and jobs are realized using steps, which are classified into three main categories:
    • Input: process some kind of 'raw' resource (a file, a database query or system variables) and create an output stream of records from it.
    • Output (the reverse of Input steps): accept records and store them in some external resource (a file, a database table, etc.).
    • Transforming: process input streams and perform particular actions on them (adding new fields/new records); these actions produce one or more output streams.


Fig.: some available Spoon steps.
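As a plain illustration (outside Spoon, not the Kettle API), the three step categories can be sketched in a few lines of Python, with rows flowing from an Input step through a Transforming step to an Output step. The sample rows, field names and output file name below are purely illustrative assumptions:

```python
# A minimal sketch of the three step categories: an Input step emits a
# stream of records, a Transforming step adds a new field to each record,
# and an Output step stores the records in an external resource (a file).
import csv
import io

def input_step():
    """Input: read a 'raw' resource and create an output stream of records."""
    raw = "name,amount\nClub A,1000\nClub B,2500\n"  # illustrative data
    yield from csv.DictReader(io.StringIO(raw))

def transform_step(rows):
    """Transforming: add a new field to each record in the stream."""
    for row in rows:
        row["amount_eur"] = int(row["amount"]) / 100  # hypothetical conversion
        yield row

def output_step(rows, target):
    """Output: accept records and store them in an external resource."""
    rows = list(rows)
    with open(target, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

count = output_step(transform_step(input_step()), "result.csv")
print(count)  # → 2
```

The generators mimic Kettle's row-by-row streaming: each record is pulled through the chain of steps rather than materialized between steps.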

 

  • To put a step in a job or in a transformation, use drag and drop. The connection is done thanks to the hops (links).
  • The first step is ‘Start’ and usually the last is ‘Success’.


Fig.: Spoon job and first step: ‘Start’. 

 


Fig.: Spoon Job, Transformation and hops.

  • To save a job or a transformation:
    • Right-click on the related icon
    • Change the job/transformation name
    • Select the directory into which the file is saved (it is an XML file)
    • Then select the file name


Fig.: Spoon job saving. 

  • To connect to a web server, a dedicated Spoon step is available:
    • In a job, it is present in the menu ‘File management > HTTP’, or can be retrieved using the ‘Entries’ text search filter.


Fig.: Spoon HTTP step in a job.

 

  • In a transformation, it is present in the menu ‘Lookup > HTTP Client’ (or ‘HTTP Post’), or can be retrieved using the ‘Entries’ text search filter.


Fig.: Spoon HTTP step in a transformation.
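As a rough analogy (this is not the Kettle implementation), the essence of the HTTP job entry — fetch a URL and write the response body to a target file — can be sketched in Python; the function name and paths are illustrative assumptions:

```python
# Sketch of what the Spoon 'HTTP' job entry does: perform a GET request
# on a URL and write the response body to a target file on disk.
from pathlib import Path
from urllib.request import urlopen

def http_step(url, target_file):
    """Download `url` and save the body to `target_file` (illustrative name)."""
    target = Path(target_file)
    target.parent.mkdir(parents=True, exist_ok=True)  # ensure folder exists
    with urlopen(url) as response, open(target, "wb") as out:
        out.write(response.read())
    return target
```

In the real Spoon step the URL and target file are configured in the step's dialog (and may come from job parameters), as shown in the ETL execution details later in this section.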

SAMPLE: Helsinki_youth_subsidies (static: the results are always the same, or the same for a long period)

  • A set of samples is already present in the VM; for example, we can choose ‘Helsinki_youth_subsidies’, which is capable of: i) downloading a dataset; ii) transforming it; iii) saving the new data in a file (in the VM file system). To verify the ETL functionalities, follow these steps:
    • Click on the menu, select ‘File > Open’ and choose the file ‘Ubuntu/Desktop/snap4cityETL/Helsinki_youth_subsidies_XLS/Ingestion/Maink.kjb’
    • The main Spoon job appears in the Spoon interface.

Fig.: Spoon Interface: open the ‘Helsinki_youth_subsidies_XLS’ ETL.

 

Fig.: Spoon Interface:  the ‘Helsinki_youth_subsidies_XLS’ ETL opened. 

  • To run the ETL process, click on the main ‘Run’ button; a popup will open with a set of default values (including the URL from which to download the dataset), then click on the second ‘Run’ button.


Fig.: Spoon Interface: launch an ETL (‘Run button’). 
 

  • Once the process is launched, a set of windows will open. A relevant one is the ‘Logging’ window, from which it is possible to read the logs. When the logging window reports the sentence ‘Spoon – job has ended’, the job has ended with no errors.


Fig.:  Spoon Interface: ETL executes, logging details.

 

  • Open the ‘HTTP’ step to view how the connection with the web server is realized:
    • Right-click on the ‘HTTP’ icon and select ‘Edit’


Fig.:  ETL execution: edit the HTTP step.

  • The URL field contains the direct URL to download the dataset (or a parameter that can come from the job settings, from the database, from a configuration file, etc.)
  • The Target file field contains the file path into which the data coming from the web server are written


Fig.:  ETL execution: HTTP step details.

  • What actions are performed by this process, and how is it possible to verify them? Some blocks can be expanded with a double left click or a single right click of the mouse.
    • Activities:
      • Block ‘Start’: starts the process (see its details with ‘right click > Edit’);
      • Block ‘Create original data folder’:
        • if it does not exist, creates a new folder for the downloaded file: ‘/home/ubuntu/Desktop/ETL_Output/Helsinki_youth_subsidies_XLS/Original_data’;
      • Block ‘HTTP’: makes the HTTP REST request (to ‘http://www.nk.hel.fi/avoindata/avoin_data-avustukset.xls’) and saves the XLS file (named ‘avoin_data-avustukset.xls’) in the folder created in the previous block;
      • Block ‘File Exists’: checks whether the downloaded file is present;
      • Block ‘Create Elaborated data folder’: creates the folder ‘/home/ubuntu/Desktop/ETL_Output/Helsinki_youth_subsidies_XLS/Elaborated_data’;
      • Block ‘DUMMY’: does nothing;
      • Block ‘process data’: reads the downloaded file, performs data analysis and saves the result in a new file: ‘/home/ubuntu/Desktop/ETL_Output/Helsinki_youth_subsidies_XLS/Elaborated_data/Result_data.csv’;
      • Block ‘Success’: ends the process.
    • How to verify:
      • Search the folders in the file system and verify the presence of the files; you can also delete them and launch the ETL process again.
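The sequence of blocks described above can be sketched as a plain Python script — a simplified, hypothetical stand-in for the Kettle job, with relative paths and a dummy ‘process data’ step instead of the real analysis:

```python
# Simplified stand-in for the 'Helsinki_youth_subsidies' job: each function
# mirrors one block of the Kettle job. The 'process data' step here is a
# dummy (it only records the input size), not the real analysis.
from pathlib import Path
from urllib.request import urlopen

ORIGINAL = Path("ETL_Output/Helsinki_youth_subsidies_XLS/Original_data")
ELABORATED = Path("ETL_Output/Helsinki_youth_subsidies_XLS/Elaborated_data")

def create_folder(path):                 # 'Create ... data folder' blocks
    path.mkdir(parents=True, exist_ok=True)

def http(url, target):                   # 'HTTP' block: download the dataset
    with urlopen(url) as response, open(target, "wb") as out:
        out.write(response.read())

def process_data(source, result):        # 'process data' block (dummy analysis)
    result.write_text("processed %d bytes\n" % source.stat().st_size)

def run_job(url):                        # 'Start' ... 'Success' sequence
    create_folder(ORIGINAL)
    target = ORIGINAL / "avoin_data-avustukset.xls"
    http(url, target)
    if not target.exists():              # 'File Exists' block
        raise RuntimeError("download failed")
    create_folder(ELABORATED)
    process_data(target, ELABORATED / "Result_data.csv")
    return True                          # 'Success' block
```

As with the real job, verification amounts to checking that ‘Original_data’ contains the downloaded file and ‘Elaborated_data’ contains ‘Result_data.csv’; deleting them and re-running the job should recreate both.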