TC6.6 - Producing data-sets in Bundle via ETL

Test Case Title

TC6.6 - Producing data-sets in Bundle via ETL

Goal

Collected referential data from ETL processes

Cumulated data into a storage

Make the data accessible for DataAnalytics

Create ETL for generating data set in bulk and post them on DataGate/CKAN or saved the data in some FTP, or in some file system solution.

Prerequisites

Using a PC or Mobile with a web browser. Conquer a minimal skill on producing ETL processes. See provided user manuals (Pentaho Kettle tool).

The following functionalities are available only for specific Snap4city users with specific privileges.

Expected successful result

Data collected in bulk and available in DataGate/CKAN, in some FTP, or in some file system solution.

Steps

 

 

Please note that to correctly perform this Test Case you need and access to the ETL Virtual Machine as described below. To have access to a Virtual Machine to perform ETL please contact snap4city@disit.org. In alternative you can download  the VM and use on your premise.

 

Users roles for which this test case is available: Area Manager, ToolAdmin, RootAdmin

Users roles for which the test case is not available: Manager

User used to describe the Example: ‘snap4city’

 


Example 1: To automatically publish datasets in Bulk by using ETL

To automatically producing and publishing in Bulk Dataset, they are produced by using ETL applications and posted/published on DataGate.

To demonstrate these features:

  1. How to use/open the Virtual Machine
  2. How to create an ETL process for producing dataset in bulk taking the data from the data stores (Filesystem and DataGate/CKAN)
  3. How to put in execution the ETL process manually
  4. How to verify the publication performed (on Filesystem and DataGate/CKAN )
  5. How to put into in the back office DISCES scheduler this process and set its periodic execution

1.1: How to use/open the Virtual Machine

  • To use the Virtual Machine to develop ETLs. Two methodologies can be adopted (using the credentials reported in Section 3.4 Development Environment):
    • CASE 1) Open the VM from the snap4city home page (https://www.snap4city.org)
    • CASE 2) Open the VM directly from Teamviewer (deprecated)

 
CASE 1) Open the VM from the snap4city home page

  • Go to the snap4city home page (https://www.snap4city.org)
  • Make the login
  • If you have the developer permissions, you finds the menu: ‘Development Tools > ETL development’
  • Click on the menu and insert the correct password for the VM (the same credentials two times, Fig a1, b)

 

Fig.a1: vnc connection: login.

 


Fig.b: Virtual Machine connection.

For the snap4city snap4city users with the role Manager, the Development tools are not available. For example, if a Manager (user name: ‘finaluser1’) goes to the snap4city home page (https://www.snap4city.org), he/she cannot view the ‘Development’ menu. In fact the Managers actually are users with no permissions related to the development activites.


Fig.a2: snap4city home page for Manger and public users.

 

1.2: How to create an ETL for producing dataset in bulk taking the data from the data stores (Filesystem and DataGate/CKAN)

  • Open a Terminal emulator (search the Icon  at the bottom of the screen or click on ‘Application’ menu > ‘Terminal Emulator’), then write the command ‘spoon.sh’ and click ‘enter’, Fig. d.

 


Fig. d –Virtual Machine (VM) Terminal Emulator (command line).

 

 

 

 

  • The Spoon Interface will open (Fig. d), now it is possible to create/load ETL processes.


Fig. e – Spoon Interface.

 

Click on the menu and select: ‘File > Open’ and select the file from: ‘Ubuntu/Desktop/snap4cityETL/from_KM4cityKB_to_Datagate_generic/Ingestion/Main.kjb’, Fig. f


Fig. f Spoon Interface: open the ‘from_KM4cityKB_To_Datagate_generic’  ETL.

In the Spoon Interface the Main Spoon Job appears, Fig. g. 


Fig. g – Spoon Interface:  the ‘from_KM4cityKB_To_Datagate_generic ‘ ETL opened.

 

1.3: How to put in execution the ETL process manually

  • To run the ETL process, click on the main ‘Run’ button (), a popup will open with a set of default values (including the url from which to download the dataset: ‘’), click on the Second ‘Run’ button, Fig. h.


Fig. h – Spoon Interface: launch an ETL (‘Run button’).

 

 

  • Once launched the process, a set of windows will open. One relevant is the ‘Logging window’ from which it is possible to read the logs. When the logging windows report the sentence ‘‘Spoon – job has ended’, the job is ended with no errors, Fig. i.


Fig. i – Spoon Interface: ETL executes, logging details

1.4: How to verify the publication performed (on Filesystem and DataGate/CKAN)

  • What are the actions done by this process and how it is possible to verify them? Some block visible in Fig. I, can be expanded, with double left click or 1 right click of the mouse.
    • Activities:
      • block ‘Start’: start the process (see its details with: ‘right click > edit’),
      • block ‘Create original data folder’:
        • if not exists, creates a new folder for the downloaded file: ‘‘/home/ubuntu/Desktop/ETL_Output/ from_KM4cityKB_to_Datagate_generic /Original_data’
      • Transformation ‘Download from HBASE’: Read the data from the table stored on Apache Phoenix, enrich them with latitude and longitude gathered by querying the service map by the serviceUri and generate the relative csv file in ‘‘/home/ubuntu/Desktop/ETL_Output/ from_KM4cityKB_to_Datagate_generic
      • transformation ‘Create dataset’: enrich the generated csv such as creation date, organization name need for Data Gate
      • block ‘Create Resource: prepare and send the http request to DataGate with the csv as payload
      • block ‘Success’: ends the process.

 

  • How Verify, Fig. l:
    • See the log directly on the Virtual machine (on the Pentaho tab ‘Logging’)
    • See the Log on the DataGate and verify the last update, Fig. l.
      • Go to the ‘Data Set Manager: Data Gate’ menu
      • Go to the snap4city user page
      • See the logs

 


Fig. l – Logs on Datagate.

 

1.5 How to put into in the back office DISCES scheduler this process and set its periodic execution

To put a process into the scheduler DISCES, it can be done using the ProcessLoader tool following the steps:

Make the Login

Fig. : web browser from the VM to directly upload an ETL on the Resource Loader.

  • Click on the ‘Uploaded Resources’ tab, upload your resource. Then the resource (your ETL) is available directly clicking in the ‘Resource Manager: Process Loader > Resources’ menu