TC6.3 - Creating ETL processes for automated data ingestion and data transformation

×

Warning message

  • You can't delete this newsletter because it has not been sent to all its subscribers.
  • You can't delete this newsletter because it has not been sent to all its subscribers.

Test Case Title

TC6.3 - Creating ETL processes for automated data ingestion and data transformation

Goal

Collect, enrich and integrate referral and/or real-time data coming from external services via Data transformation processes in ETL

Access to a large library of ETL process for smart city, from ProcessLoader or GitHUB

Modify ETL processes for data transformation

Develop a new ETL processes for data transformation with Penthao Kettle. Using: external services, direct access to data store, Advanced Smart City API. This can be performed: (i) using the VM provided, downloading it, putting in execution and developing; (ii) accessing with VNC to a ready to use VM for direct development

Put them in execution, on demand or periodically, on the back office via DISCES

Getting results from the process executed.

Prerequisites

Using a PC or Mobile with a web browser. Conquer a minimal skill on producing ETL processes. See provided user manuals (Pentaho Kettle tool).

The following functionalities are available only for specific Snap4city users with specific privileges.

Expected successful result

Create an ETL process for data transformation. Put it in execution and monitor the execution remotely, etc. Access to data set data from ETL.

Publishing produced ETL on data storage and on DataGate automatically.

Steps

 

 

Please note that to correctly perform this Test Case you need and access to the ETL Virtual Machine as described below. To have access to a Virtual Machine to perform ETL please contact snap4city@disit.org.

Users roles for which this test case is available: Area Manager, ToolAdmin, RootAdmin

Users roles for which the test case is not available: Manager

User used to describe the Example: ‘snap4city’

Data processes developed in ETL or with other data analytics tools (R, Java, Phython, etc.) can be used to automatically:

  • Collect data from external sources, including external services of third parties.
  • Perform data transformation and analytics
  • Produce data to external location.

The referential data could be produced from:

  • external sources via ETL, for example: traffic flows, parking status, etc.;
  • real-time data arriving at the infrastructure, via IOT Brokers for example, sensors, actuators, etc.;
  • results from Data Analytics processes in: ETL, C/C++, java, Python, etc.;
  • collective traces from mobile applications, via Smart City API
  • crowd sourcing from the city users: votes, ranks, images, etc.;
  • social media data: for example: Facebook, tweet.com;
  • events in the city: entertainment, traffic, accidents, etc.;
  • Knowledge Base of the smart city;
  • DataGate: producing data from those collected;
  • Advanced Smart City API: producing data from mobile usage;
  • results from Snap4City Applications;

Log collected by the EventLogger
 
See the following Video to learn on how to work on the development environment

 


ETL process builder, creation of ETL, execution and debug

  • VIDEO: 05-ETL-process-builder-VM.mp4, in Google Drive and FTP
  • Open the Data Processing Development Environment into the VM
  • Open the ETL process editor as Penthao Kettle
    • Load an ETL process
    • Execute the ETL, see the resulting data
  • Open the DISCES Scheduler,
    • load a new process, and set execution to periodic
  • Open the R stat tool

 


Example 1: To collect, enrich and integrate referral and/or real-time data coming from external services via Data transformation processes in ETL,

  • To use the Virtual Machine to develop ETLs. Two methodologies can be adopted (using the credentials reported in Section 3.4 Development Environment):
    • CASE 1) Open the VM from the snap4city home page 
    • CASE 2) Open the VM directly from Teamviewer (deprecated)

 


CASE 1) Open the VM from the snap4city home page

  • Go to the snap4city home page 
  • Make the login
  • If you have the developer permissions, you finds the menu: ‘Development Tools > ETL development’
  • Click on the menu and insert the correct password (the same credentials two times, Fig a1, b)

Fig.a1: vnc connection.

 

Fig.b: Virtual machine connection.

For the snap4city public users or for the snap4city users with the role Manager, the Development tools are not available. For example, if a Manager (user name: ‘finaluser1’) goes to the snap4city home page, he/she cannot view the ‘Development’ menu. In fact the Managers actually are users with no permissions related to the development activites.

 

Fig.a2: snap4city home page for Manger and public users.

 


Following steps available both for CASE 1) and for CASE2)

  • Locate the dataset of interest, than it can be static (e.g., the ‘http://www.nk.hel.fi/avoindata/avoin_data-avustukset.xls ’) or real-time, then you can create the ETL (Extract Transform and Load) process to ingest the data. The ETL are realized thanks to Pentaho Kettle tool. It is possible to realize an ETL (i) using the VM provided, downloading it, putting in execution and developing; or (ii) accessing with VNC to a ready to use VM for direct development. Now it is descripted in detail the procedure ii) using the Virtual Machine (VM) via VNC.
  • Open a Terminal emulator (search the Icon  at the bottom of the screen or click on ‘Application’ menu > ‘Terminal Emulator’), then write the command ‘spoon.sh’ and click ‘enter’, Fig. d.

Fig. d –Virtual Machine (VM) Terminal Emulator (command line).

  • The Spoon Interface will open (Fig. e), now it is possible to create/load ETL processes.

Fig. e – Spoon Interface.

 

  • In the case of the dataset ‘Helsinki_youth_subsidies’ we have already created and ETL capable to: i) download the dataset; ii) transform it; iii) Save the new data in a file (into the VM). To verify the ETL functionalities follow the steps:
    • Click on the menu and select: ‘File > Open’ and select the file from: ‘Ubuntu/Desktop/snap4cityETL/Helsinki_youth_subsidies_XLS/Ingestion/Maink.kjb’, Fig. f
    • In the Spoon Interface the Main Spoon Job appears, Fig. g.

Fig. f – Spoon Interface: open the ‘Helsinki_youth_subsidies_XLS’ ETL.

 

Fig. g – Spoon Interface:  the ‘Helsinki_youth_subsidies_XLS’ ETL opened.

  • To run the ETL process, click on the main ‘Run’ button (), a popup will open with a set of default values (including the url from which to download the dataset: ‘’), click on the Second ‘Run’ button, Fig. h.

Fig. h – Spoon Interface: launch an ETL (‘Run button’).

 

  • Once launched the process, a set of windows will open. One relevant is the ‘Logging window’ from which it is possible to read the logs. When the logging windows report the sentence ‘‘Spoon – job has ended’, the job is ended with no errors, Fig. i.

 

Fig. i – Spoon Interface: ETL executes, logging details.

 

  • What are the actions done by this process and how it is possible to verify them? Some block visible in Fig. I, can be expanded, with double left click or 1 right click of the mouse.
    • Activities:
      • block ‘Start’: start the process (see its details with: ‘right click > edit’),
      • block ‘Create original data folder’:
        • if not exists, creates a new folder for the downloaded file: ‘‘/home/ubuntu/Desktop/ETL_Output/Helsinki_youth_subsidies_XLS/Original_data’
      • block ‘HTTP’: makes the HTTP REST request (to ‘http://www.nk.hel.fi/avoindata/avoin_data-avustukset.xls’) and saves the xls file (named ‘avoin_data-avustukset.xls’) in the folder created in the previous block
      • block ‘File Exists’: checks if the uploaded file is already present
      • block ‘Create Elaborated data folder’: ‘‘/home/ubuntu/Desktop/ETL_Output/Helsinki_youth_subsidies_XLS/Elaborated_data’’
      • block ‘DUMMY’: do nothing
      • block ‘process data’: writes the downloaded file, made data analysis and save the result in a new file in the folder ‘‘/home/ubuntu/Desktop/ETL_Output/Helsinki_youth_subsidies_XLS/Elaborated_data/Result_data.csv’
      • block ‘Success’: ends the process.
    • How Verify, Fig. l:
      • Search the folder into the file system and verify the presence of the files, you can also delete them, and launch again the ETL process, Fig. l.

Fig. l – How verify the Helsinki_youth_subsidies_XLS Results.

 


Example 3: To Access to a large library of ETL process for smart city, from GITHUB:


Fig. n – Process Loader: List of Public Processes.

 

Example 4: Modify ETL processes for data transformation

  • To modify a process, it is possible to follow the steps explained in the previous point 1 (from Fig. a to Fig. l). For example, it is possible to open one of the ETL, that are present in the virtual Machine and it. It is possible to open the ETL called ‘Florence_Pharmacies_csv’, capable to: (i) download a dataset; ii) transform it; iii) save the data transformed in a file (into the VM):
    • Launch the spoon tool (see point[1]) and Open the ETL (Click on the menu and select: ‘File > Open’ > search file: ‘Desktop/snap4cityETL/Florence_Pharmacies_csv/Ingestion/Main.kjb’), Fig. o.


Fig. o – ETL opened (file: ‘Desktop/snap4cityETL/Florence_Pharmacies_csv/Ingestion/Main.kjb’).

 

  • Run the process and go to the Folder ‘home/Desktop/ETL_Output/Florence_Pharmacies_CSV/Elaborated_data’, and open the file ‘Result_data.csv’



Fig. p – ETL opened (file: ‘Desktop/snap4cityETL/Florence_Pharmacies_csv/Ingestion/Main.kjb’).

 

  • To modify it, click on the ‘process data’ block (right mouse click over the icon > Open Referenced Object > Transformation), Fig. p. A new tab ‘Process_Data’ is opened.



Fig. q – Modify an ETL: open a Transformation block.

 
•          Click on the tab ‘Process_Data’, from the tab ‘Design’ (left side of the user interface) choose the block ‘Transform > Replace in String’ and connect the block to the flow (after the ‘Select value’ block and before the ‘Text file output’ block), Fig. r.


Fig. r – Modify an ETL: open a Transformation block and use the Drag & Drop functionality.

 

  • Click on the ‘Replace in string’ block and: i) select the field ‘Denomination’; ii) search in this field the string ‘Farmacia’ (present in the previous output file, Fig. p); iii) Replace the string ‘Farmacia’ with its English translation ‘Pharmacy’. Click on the ‘Ok’ button.


Fig. s – Modify an ETL: Replace in String block (from ‘Farmacia’ to ‘Pharmacy’).

 

  • Save all the files. Run again the ETL, go again to the Folder ‘home/Desktop/ETL_Output/Florence_Pharmacies_CSV/Elaborated_data’, and open the file ‘Result_data.csv’: you finds the string ‘Pharmacy’ instead of the previous ‘Farmacia’, Fig. t.



Fig. t – Modify an ETL: Replace in String block (from ‘Farmacia’ to ‘Pharmacy’).

 


Example 5: Develop a new ETL process for data transformation with Penthao Kettle.
Using: external services, direct access to data store, Advanced Smart City API. This can be performed: (i) using the VM provided, downloading it, putting in execution and developing; (ii) accessing with VNC to a ready to use VM for direct development:

  • Open spoon, create a new ETL. A new job can contain: other jobs, transformations, single blocks. Save the ETL in the filesystem.


Fig. u – New ETL.

 

  • Start the job Flow connecting the blocks. It is possible to use the blocks visible in the left panel (some blocks are for jobs, some for transformations)



Fig. v – New ETL: blocks.