TC6.6 - Producing data-sets in Bundle via ETL

Test Case Title

TC6.6 - Producing data-sets in Bundle via ETL

Goal

Collect referential data from ETL processes

Accumulate the data into a storage

Make the data accessible for DataAnalytics

Create an ETL for generating data sets in bulk and post them on DataGate/CKAN, or save the data on some FTP or in some file system solution.

Prerequisites

Use a PC or mobile device with a web browser. Acquire minimal skills in producing ETL processes; see the provided user manuals (Pentaho Kettle tool).

The following functionalities are available only for specific Snap4city users with specific privileges.

Expected successful result

Data collected in bulk and available on DataGate/CKAN, on some FTP, or in some file system solution.

Steps

Please note that to correctly perform this Test Case you need access to the ETL Virtual Machine as described below. To have access to a Virtual Machine to perform ETL, please contact snap4city@disit.org. Alternatively, you can download the VM and use it on your premises.

 

User roles for which this test case is available: Area Manager, ToolAdmin, RootAdmin

User roles for which the test case is not available: Manager

User used to describe the Example: ‘snap4city’

 


Example 1: How to automatically publish datasets in bulk by using ETL

Datasets are automatically produced in bulk by using ETL applications and then posted/published on DataGate.

To demonstrate these features:

  1. How to use/open the Virtual Machine
  2. How to create an ETL process for producing datasets in bulk, taking the data from the data stores (Filesystem and DataGate/CKAN)
  3. How to put in execution the ETL process manually
  4. How to verify the publication performed (on Filesystem and DataGate/CKAN)
  5. How to put this process into the back office DISCES scheduler and set its periodic execution

1.1: How to use/open the Virtual Machine

  • To use the Virtual Machine to develop ETLs, two methodologies can be adopted (using the credentials reported in Section 3.4 Development Environment):
    • CASE 1) Open the VM from the snap4city home page (https://www.snap4city.org)
    • CASE 2) Open the VM directly from TeamViewer (deprecated)

 
CASE 1) Open the VM from the snap4city home page

  • Go to the snap4city home page (https://www.snap4city.org)
  • Log in
  • If you have the developer permissions, you find the menu: ‘Development Tools > ETL development’
  • Click on the menu and insert the correct password for the VM (the same credentials two times, Fig. a1, b)

 

Fig. a1: VNC connection: login.

 


Fig.b: Virtual Machine connection.

For the snap4city users with the role Manager, the Development tools are not available. For example, if a Manager (user name: ‘finaluser1’) goes to the snap4city home page (https://www.snap4city.org), he/she cannot view the ‘Development’ menu. In fact, Managers are users with no permissions related to development activities.


Fig. a2: snap4city home page for Manager and public users.

 

1.2: How to create an ETL for producing datasets in bulk, taking the data from the data stores (Filesystem and DataGate/CKAN)

  • Open a Terminal Emulator (search for the icon at the bottom of the screen or click on the ‘Application’ menu > ‘Terminal Emulator’), then write the command ‘spoon.sh’ and press ‘Enter’, Fig. d.

 


Fig. d – Virtual Machine (VM) Terminal Emulator (command line).

  • The Spoon Interface will open (Fig. e); now it is possible to create/load ETL processes.


Fig. e – Spoon Interface.

 

Click on the menu, select ‘File > Open’, and select the file from ‘Ubuntu/Desktop/snap4cityETL/from_KM4cityKB_to_Datagate_generic/Ingestion/Main.kjb’, Fig. f.


Fig. f – Spoon Interface: open the ‘from_KM4cityKB_To_Datagate_generic’ ETL.

In the Spoon Interface the Main Spoon Job appears, Fig. g. 


Fig. g – Spoon Interface: the ‘from_KM4cityKB_To_Datagate_generic’ ETL opened.

 

1.3: How to put in execution the ETL process manually

  • To run the ETL process, click on the main ‘Run’ button (), a popup will open with a set of default values (including the URL from which to download the dataset: ‘’), then click on the second ‘Run’ button, Fig. h. A command-line alternative is sketched after Fig. h.


Fig. h – Spoon Interface: launch an ETL (‘Run button’).
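
For reference, Kettle jobs can also be launched without the Spoon GUI by using the kitchen.sh command-line tool shipped with Pentaho Kettle. The following Python sketch is only illustrative: the job path follows the one opened above, and the parameter name and value are assumptions to be replaced with the names shown in the Run popup.

    import subprocess

    # Minimal sketch: run the Kettle job headlessly with kitchen.sh.
    # The parameter name 'url' and its value are illustrative; use the
    # parameter names shown in the Spoon 'Run' popup.
    result = subprocess.run(
        [
            "kitchen.sh",
            "-file=/home/ubuntu/Desktop/snap4cityETL/from_KM4cityKB_to_Datagate_generic/Ingestion/Main.kjb",
            "-param:url=https://example.org/source.csv",
            "-level=Basic",
        ],
        capture_output=True,
        text=True,
    )

    # kitchen.sh returns 0 on success; the log ends with 'job has ended',
    # as seen in the Spoon 'Logging' window.
    print(result.stdout)
    if result.returncode != 0:
        raise RuntimeError("Kettle job failed, see the log above")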

 

 

  • Once the process is launched, a set of windows will open. A relevant one is the ‘Logging’ window, from which it is possible to read the logs. When the logging window reports the sentence ‘Spoon – job has ended’, the job has ended with no errors, Fig. i.


Fig. i – Spoon Interface: ETL execution, logging details.

1.4: How to verify the publication performed (on Filesystem and DataGate/CKAN)

  • What are the actions done by this process and how is it possible to verify them? Some blocks visible in Fig. i can be expanded with a double left click or a single right click of the mouse.
    • Activities:
      • block ‘Start’: starts the process (see its details with ‘right click > edit’),
      • block ‘Create original data folder’:
        • if it does not exist, creates a new folder for the downloaded file: ‘/home/ubuntu/Desktop/ETL_Output/from_KM4cityKB_to_Datagate_generic/Original_data’
      • transformation ‘Download from HBASE’: reads the data from the table stored on Apache Phoenix, enriches them with latitude and longitude gathered by querying the Service Map by serviceUri, and generates the corresponding CSV file in ‘/home/ubuntu/Desktop/ETL_Output/from_KM4cityKB_to_Datagate_generic’
      • transformation ‘Create dataset’: enriches the generated CSV with the metadata needed for DataGate, such as the creation date and the organization name
      • block ‘Create Resource’: prepares and sends the HTTP request to DataGate with the CSV as payload (see the sketch after this list)
      • block ‘Success’: ends the process.
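
As a reference for what the ‘Download from HBASE’ enrichment and the ‘Create Resource’ block do, the following Python sketch reproduces the two operations in a simplified form. It is only a sketch: the DataGate endpoint, API key, and dataset id are illustrative, and the Service Map response layout may differ; DataGate is CKAN-based, so the standard CKAN resource_create action is assumed.

    import datetime
    import requests

    SERVICEMAP = "https://servicemap.disit.org/WebAppGrafo/api/v1/"  # Snap4city Service Map
    DATAGATE = "https://datagate.example.org"   # illustrative DataGate/CKAN endpoint
    API_KEY = "your-datagate-api-key"           # illustrative API key

    def enrich_with_coordinates(service_uri):
        # Query the Service Map by serviceUri to gather latitude/longitude,
        # as the 'Download from HBASE' transformation does. The JSON layout
        # used here is an assumption and may differ in practice.
        r = requests.get(SERVICEMAP, params={"serviceUri": service_uri, "format": "json"})
        r.raise_for_status()
        lon, lat = r.json()["Service"]["features"][0]["geometry"]["coordinates"]
        return lat, lon

    def create_resource(csv_path, package_id):
        # Post the generated CSV to DataGate with the CSV as payload,
        # as the 'Create Resource' block does (CKAN resource_create action).
        with open(csv_path, "rb") as payload:
            r = requests.post(
                DATAGATE + "/api/3/action/resource_create",
                headers={"Authorization": API_KEY},
                data={
                    "package_id": package_id,
                    "name": "bulk dataset " + datetime.date.today().isoformat(),
                },
                files={"upload": payload},
            )
        r.raise_for_status()
        return r.json()["result"]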

 

  • How to verify, Fig. l:
    • See the log directly on the Virtual Machine (on the Pentaho tab ‘Logging’)
    • See the log on DataGate and verify the last update, Fig. l (a programmatic check is sketched after Fig. l):
      • Go to the ‘Data Set Manager: Data Gate’ menu
      • Go to the snap4city user page
      • See the logs

 


Fig. l – Logs on Datagate.
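
Besides reading the logs, the last update can also be checked programmatically. Since DataGate is CKAN-based, a minimal sketch using the standard CKAN package_show action could look as follows (the endpoint and dataset id are illustrative):

    import requests

    DATAGATE = "https://datagate.example.org"  # illustrative DataGate/CKAN endpoint

    # Ask DataGate for the dataset and print when each resource was last modified.
    r = requests.get(
        DATAGATE + "/api/3/action/package_show",
        params={"id": "electric-vehicle-charging"},  # illustrative dataset id
    )
    r.raise_for_status()
    for resource in r.json()["result"]["resources"]:
        print(resource["name"], resource.get("last_modified"))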

 

1.5: How to put this process into the back office DISCES scheduler and set its periodic execution

A process can be put into the DISCES scheduler by using the ProcessLoader tool, following these steps:

Log in

Fig. : web browser from the VM to directly upload an ETL on the Resource Loader.

  • Click on the ‘Uploaded Resources’ tab and upload your resource. Then the resource (your ETL) is available directly by clicking on the ‘Resource Manager: Process Loader > Resources’ menu


Fig.: upload a new resource, from the VM file system.

 

  • Create a Process Model associated to the ETL:


Fig.  – Upload a new Resource on the Resource Manager.

  1. Now you are in the menu ‘Resource Manager: Process Loader > Resources’, where you can search for the ETL you want to schedule:
  • search, for example, for the ‘Electric vehicle charging’ ETL, using the filter at the top right of the web page
  • select a Resource (for example: ‘Electric vehicle charging’)
  • click on the ‘NEW’ button (in the ‘Process Model’ column of the table) and enter the metadata. For the ‘Electric vehicle charging’ ETL, it is possible to enter:
    • ‘Process Parameters’ tab:
      • Name= ’Electric vehicle charging’
      • Description = ‘info on Electric vehicle charging in Florence’
      • Group = ‘Services’
    • ‘Trigger’ tab:
      • Name= ’Electric vehicle charging_trigger’
      • Description = ‘description…’
      • Group = ‘Services_trigger’
    • ‘Advanced Parameters’ tab:
      • Needed if it is necessary to execute two different process instances in series (typically left empty).


Fig.  – Upload a new Resource on the Resource Manager.


Fig. : New Process Model: Parameters


Fig.: New Process Model: Advanced parameters


Fig: New Process Model: Trigger

 

  • Create a New Instance of the Process Model (related to your ETL) and schedule it:

To put a Process in execution, at least one Process Instance (associated to the Process Model created in the previous paragraph) must be created. The instance can then be executed by a scheduler. To do this, follow these steps:

  • Click on the ‘Process Model’ menu and search for a specific Process Model (e.g. the Process Model associated to the ‘Electric vehicle charging’ Resource).


Fig. : Process Models and new Instances.

    • Click on the ‘NEW’ Process Instance button (column ‘New Instance’ of the table), enter the necessary metadata, and create it (click the ‘Confirm’ button).


Fig.: Create a new Instance of a Process Model (related for example to an ETL).

 

  • Click on the ‘VIEW’ button (‘Show Instances’ column of the table) to verify its presence


Fig.: Verify the instance creation.

  • Now the process is in execution and you can find it in the list (tab ‘Processes in Execution’)

 


Fig.: Processes in Execution list.

  • Once your process is in execution, you can monitor its status on the SCHEDULER (DISCES):
    • From the ‘Processes in Execution’ list page, it is possible (see the figure above) to:
      • Do actions on its execution:
        • See the execution logs
 

Fig.: Processes in Execution details.

 

  • Start (or restart) the process execution
  • Stop the process execution
  • Delete the process execution
  • See the process execution from the scheduler (called DISCES) view: click on the link ‘Test Scheduler Node’ and the following view appears


Fig.: Processes details.

  • It is possible to monitor the job from the home page:
    • Click on the Filter button (at the bottom of the page)
    • Make textual searches on each column

 
Fig.: Processes in Execution search.

Note that the "Status" value of a newly created job is set to "CREATED", and it changes each time the user starts or stops a job from the commands on the right side of the row. The content of the "Status" column is continually updated by a function that, every two minutes, queries the various schedulers in which the processes have been entered to know their execution state; depending on the response received, the value is updated accordingly.

The values that the Status can assume are:

  • CREATED: the default value of a process right after it has been created, before its status is updated for the first time.
  • NORMAL: the process is running correctly.
  • NONE: the trigger associated with the process does not exist (for example, the execution time interval has ended).
  • BLOCKED: the execution of the process was blocked by the user.
  • PAUSED: the execution of the process has been paused.
  • ERROR SERVER COMMUNICATION: the request was successfully sent, but due to internal issues of the scheduler server the response could not be received.
  • NOT FOUND: the application sent a request to the scheduler for that process, but the scheduler did not find any process corresponding to the information sent.
  • RUNNING: the process assumes this value immediately after a start execution request has been sent.
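
The periodic refresh described above can be pictured as a simple polling loop. The following Python sketch is purely illustrative: the scheduler endpoint, its response format, and the process id are assumptions, since the actual DISCES API is internal to the platform.

    import time
    import requests

    SCHEDULER = "https://disces.example.org/status"  # illustrative scheduler endpoint
    KNOWN_STATUSES = {
        "CREATED", "NORMAL", "NONE", "BLOCKED", "PAUSED",
        "ERROR SERVER COMMUNICATION", "NOT FOUND", "RUNNING",
    }

    def poll_status(process_id):
        # Ask the scheduler for the execution state of one process.
        try:
            r = requests.get(SCHEDULER, params={"id": process_id}, timeout=10)
            r.raise_for_status()
            status = r.json().get("status", "NOT FOUND")
        except requests.RequestException:
            # The request was sent, but no usable answer was received.
            status = "ERROR SERVER COMMUNICATION"
        return status if status in KNOWN_STATUSES else "NOT FOUND"

    while True:
        print(poll_status("example-process-id"))  # illustrative process id
        time.sleep(120)  # the Status column is refreshed every two minutes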