TC6.5 - Managing Heterogeneous File Ingestion via ETL processes

Primary tabs

Test Case Title

TC6.5 - Managing Heterogeneous File Ingestion via ETL processes

Goal

I can:

Ingest any kind of file format from ETL processes

Prerequisites

Using a PC or Mobile with a web browser. Using ETL processes.

The following functionalities are available only for specific Snap4city users with specific privileges.

Expected successful result

Generating an ETL process or listing the capabilities and putting in execution the examples provide into the development kit accessible via VNC or in the VM installed on premise.

Steps

 

 

Please note that to correctly perform this Test Case you need and access to the ETL Virtual Machine as described below. To have access to a Virtual Machine to perform ETL please contact snap4city@disit.org.



Snap4City platform gathers information of the city from several sources, via data driven, stream, sporadic and/or periodic processes (ETL, Java or any other kind). Moreover, it is capable to convert the information for the Knowledge Base making it reconciled for geospatial query via ETL processes. In this case, once a file is imported (downloaded, for example) it has to be ingested and its data are mined and stored into the Hbase for versioning, reconciliation and quality improvement. Several different formats and structure can be addressed by creating specific ETL processes for each data source family. Several examples of the usage of ETL processes are accessible on DISIT lab GITHUB, that have been developed and are in place now to manage the data ingestion for the Smart City of Florence and Tuscany. 

See ETL Smart City examples: https://github.com/disit/smart-city-etl

In effect, the ETL processes are used for data gathering of both static and real-time data collecting files from HTTP/FTP protocols, such as: data from traffic sensors, parking lots, weather forecast, cost of fuels, environmental data, etc. Then the mined data are stored into noSQL data base such as: Hbase/Phoenix or Mongo storage and/or RDF storage for the Knowledge Base in triples. This allows exploiting the data for data analytic, dashboards, etc. You can start testing this requirement by following the instructions described for the ProcessLoader.

The snap4city ETL process can upload, transform and manage data: such as downloading a file from an external data source, extracting its contents and save in a database or in the file system, etc.

Examples of sources and data ingestion processes by ETL and Snap4City Applications are:

  1. OpenStreetMap: from OSM to Knowledge Base, also integrating civic number location from any Buyer data base, see also the process described into manual: From the Open Street Map to the Km4City street graph.pdf, loaded into the Google Drive.
    1. To test click on the link to see live examples:
  1. GTFS data about Public Transportation schedule, stops, paths, etc.;
    1. To test click on the link to see a live example: see ETL table below
  2. DATEX information about the ITS of the city, events;
    1. see ETL table below
  3. Crawling public web pages for collecting additional information;
    1. For example: the crawling of Hospital Triage. see ETL table below
  4. Open Data of the city;
    1. To test click on the link to see a live example: see ETL table below
  5. Parking status;
    1. To test click on the link to see a live example: see ETL table below
  6. Twitter data from Twitter.com directly and/or from TwitterVigilance;
    1. To test click on the link to see a live example:
  1. XML, HTML, JSON; CSV, WSDL, XLS formats
  1. See the table below containing links to live examples
  1. data accessible as External Services registered on the platform on the MicroService Directory; (not yet available). Presently they are directly ingested by calling the services with their protocols.
  2. SigFox gateways with Node.js and wrappers, or ETL, etc.
  3. Lora gateways with Node.js and wrappers, or ETL, etc.
  4. Any other format and protocol can be easily added

The following ETL files are included in the ETL zip file (see link above):

ETL_name, file format

Type

Protocol/ Standard

Data License

Link on https://www.snap4city.org/download

Florence_firstAid_accesses_HTML

HTML

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/Florence_firstAid_accesses_HTML/

">

Florence_Parking_JSON (static & realTime)

Json

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/Florence_Parking_JSON/

Florence_Weather_XML

XML

HTTP

Arpat, Tuscany region

https://www.snap4city.org/download/snap4cityETL/Florence_Weather_XML/

Florence_Pharmacies_CSV

Csv

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/Florence_Pharmacies_CSV/

Helsinki_youth_subsidies_XLS

XLS

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/Helsinki_youth_subsidies_XLS/

Electric_vehicle_charging_kmz

Kmz

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/Electric_vehicle_charging_kmz/

Electric_vehicle_charging_kmz_phoenix

Kmz

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/Electric_vehicle_charging_kmz_phoneix/

Bike_Sharing_Areas_Shp

Shape

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/Bike_Sharing_Areas_Shp/

Tpl_bus_gtfs

gtfs

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/TPL_bus_gtfs/

Smartbench

OneM2M

M2M

Under Comune

di Firenze authorization

https://www.snap4city.org/download/snap4cityETL/Smartbench_M2M/

LinkedData

N3

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/LinkedData/

Florence_School_canteen

Shape

FTP

Open data

https://www.snap4city.org/download/snap4cityETL/Florence_School_Canteen_FTP/

Tuscany_parking

Datex II

SOAP,

Rest API

Under MIIC

authorization

https://www.snap4city.org/download/snap4cityETL/Tuscany_parking_Datex%20II/

via_francigena_farmhouse_GeoJson

GeoJson

Rest API, JDBC

Under Regione

Toscana authorization

https://www.snap4city.org/download/snap4cityETL/via_francigena_farmhouse_GeoJson/

sigFOX

SigFOX

https

DISIT lab

https://www.snap4city.org/download/snap4cityETL/SigfoxSM/

From_KM4cityKB_to_Datagate

From HBASE to csv

HTTP

DISIT lab

https://www.snap4city.org/download/snap4cityETL/from_KM4cityKB_to_Datagate/

 

 

ETL

Source:

ETL Description

Florence_firstAid_accesses_HTML

http://www.asf.toscana.it/estar/accessi-internet.php

This ETL (static)

  • Read data coming from a web page (html)
  • Write the data as they are downloaded in the VM file system
  • Transform the data from html to a csv file with only the interesting information

Florence_Parking_JSON (static & realTime)

http://opendata.comune.fi.it/od/ParkInfo_Firenze_SMN.json

This ETL is composed of two phases:

STATIC phase:

  • Read data coming from a web server in a JSON format
  • Write the data as they are downloaded in the VM file system
  • Transform the data from html to a csv file with only the STATIC data (Phone number, fax, address, city, country, zipcode, Opening hours, parking fee, available places, latitude, longitude)

REAL TIME phase:

  • Read data coming from a web server in a JSON format
  • Write the data as they are downloaded in the VM file system
  • Transform the data from html to a csv file with only the REAL TIME data (free spot, update date)

 

Florence_Weather_XML

Arpat, Tuscany region

This ETL (REAL TIME):

  • Read data coming from a web server in an xml format
  • Write the data as they are downloaded in the VM file system
  • Transform the data from html to a csv file with only the REAL TIME data (sunrise, sunset, sun_altitude, min_temperature, max_temperature, etc.) and add some information (date)

Florence_Pharmacies_CSV

http://opendata.comune.fi.it/od/turno_farmacie.csv

This ETL (static)

  • Read data coming from a web server in a csv format
  • Write the data as they are downloaded in the VM file system
  • Transform the data and write the results on a csv file with only some relevant data (e.g. Pharmacy name, street, civic number, opening times, etc.)

Helsinki_youth_subsidies_XLS

http://nk.hel.fi/avoindata/avoin_data-avustukset.xls

This ETL (static)

  • Read data coming from a web server in a XLS format
  • Write the data as they are downloaded in the VM file system
  • Transform the data and write the results on a csv file containing (e.g. Beneficiary name, Office, task, etc.)

Electric_vehicle_charging_kmz

http://datigis.comune.fi.it/kml/ColonnineRicarica.kmz

This ETL is composed of two phases:

REAL TIME phase:

  • Read data coming from a web server in a kmz format
  • Write the data as they are downloaded in the VM file system
  • Transform the data from kmz to a csv file with only the REAL TIME data (charging station status) 

Electric_vehicle_charging_kmz_phoenix

http://datigis.comune.fi.it/kml/ColonnineRicarica.kmz

This ETL is composed of two phases:

Static phase:

  • Read static data coming from a web server in a kmz format
  • Write the data as they are downloaded in the VM file system
  • Transform the data from kmz to a set of triples (connection with the km4city multi ontology)

REAL TIME phase:

  • Read real time data coming from a web server in a kmz format
  • Write the data in HBase (via Phoenix)

Bike_Sharing_Areas_Shp

http://datigis.comune.fi.it/shp/bike_sharing.zip

This ETL (static)

  • Read data coming from a web server in a Shape (SHP) format
  • Write the data as they are downloaded in the VM file system
  • Transform the data and write the results on a csv file containing static data (e.g. identifier, address, station name, etc.)

Tpl_bus_gtfs

http://dati.toscana.it/dataset/8bb8f8fe-fe7d-41d0-90dc-49f2456180d1/resource/954f2767-b803-4a41-baaa-37ccc8beb163/download/amvbus.gtfs

This ETL (periodic)

  • Read data coming from a web server in a GTFS format
  • Write the (unzipped) data as they are downloaded in the VM file system (gtfs standard files: agency.txt, calendar_dates.txt, routes.txt, shapes.txt, stops.txt, stop_times.txt, trips.txt)

Smartbench

https://icon-lab.tim.it/onem2m/firenze/smartbench/bench1/la

This ETL (REAL TIME):

  • Read data coming from a web server in a json format
  • Write the data as they are downloaded in the VM file system
  • Transform the data from json to a csv file with only the REAL TIME data (creation time, temperature, humidity, pressure, transits, sittings, pollution, etc.)

LinkedData

http://linkeddata.comune.fi.it:8080/all

This ETL (static)

  • Read data coming from a web server in rdf
  • Write the data as they are downloaded in the VM file system (triples in rdf)

Florence_School_canteen

Disit FTP

This ETL (static)

  • Read data coming from a web page in shape
  • Write the data as they are downloaded in the VM file system (unzipped: dbf, fix, prj, qix, shp, shx files)

Tuscany_parking

http://www501.regione.toscana.it/osservatoriotrasporti/

This ETL (static)

  • Read data coming from a web server in shape
  • Write the data as they are downloaded in the VM file system
  • Transform the data and write them in an n3 file (triples connecting the dataset to the KM4city Ontology)

via_francigena_farmhouse_GeoJson

http://www306.regione.toscana.it/

This ETL (static)

  • Read data coming from a web page (geojson)
  • Write the data as they are downloaded in the VM file system
  • Transform the data and write the results on a csv file containing static data (e.g. identifier, name, province, municipality, type, coordinates, etc.)

sigFOX

https://backend.sigfox.com/api/devices/

This ETL (REAL TIME):

  • Read data coming from a web server in a json format
  • Write the data as they are downloaded in the VM file system (data, device id, temperature, coordinates, etc.)

From_KM4cityKB_to_Datagate

Disit knowledge BASE

This ETL (REAL TIME):

  • Make a query on the Disit database (HBase)
  • Download a set of data and: i) write them in the Virtual Machine file system; ii) use the DataGate API and publish the data as a new new file in a dataset on the Datagate Portal