Snap4City Harmonized Data Ingestion process



In general, data ingestion for a smart city is a very articulated process, since several kinds of data sources have to be addressed.

The Snap4City Data Ingestion Flow Diagram is reported in the figure.

It includes all the aspects necessary to ingest the data in a manner that allows them to be semantically aggregated and registered in the Knowledge Base (KB), in compliance with the Km4City multi-ontology. The first action of the Snap4City data ingestion process is the Road Graph Setup. This is usually based on collecting and integrating data related to streets coming from public administrations and city governments, as well as open datasets such as OpenStreetMap or open data of local governments. This phase provides geo-localization and connects all the datasets to the road graph. In this way, Points of Interest (POI), sensors, citizens, etc., can be located in a specific place of the city in addition to their coordinates, which are not always present in the datasets.
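As an illustration of the geo-localization step, the sketch below snaps a POI to the nearest segment of a toy road graph. The segment names, planar coordinates, and distance heuristic are illustrative assumptions only, not the actual Snap4City road-graph model (which is built from OSM and municipal data).

```python
from math import hypot

# Hypothetical road graph: each edge is a straight segment between two
# (x, y) nodes. Purely illustrative; real road graphs use geographic
# coordinates and many segments per street.
ROAD_SEGMENTS = {
    "via_roma":  ((0.0, 0.0), (10.0, 0.0)),
    "via_verdi": ((0.0, 5.0), (0.0, 15.0)),
}

def distance_to_segment(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return hypot(px - ax, py - ay)
    # Project p onto the segment, clamping to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def snap_to_road(point):
    """Return the road-graph segment closest to the given point."""
    return min(ROAD_SEGMENTS, key=lambda s: distance_to_segment(point, *ROAD_SEGMENTS[s]))

# A POI at (4, 1) lies closest to via_roma.
print(snap_to_road((4.0, 1.0)))  # via_roma
```

Once a POI or sensor is attached to a road segment this way, it can be reached through the graph even when its source dataset lacks coordinates.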

After the Road Graph Setup, for each dataset it is necessary to understand whether: i) it is only static; or ii) it also has some dynamic fields that can change in the future (and at which rate).
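One way to support this decision is to look at the observed update timestamps of a field. The helper below is a hypothetical sketch (the threshold and the median heuristic are assumptions, not a Snap4City rule) that labels a field static or dynamic and, if dynamic, estimates its change rate.

```python
# Hypothetical classifier: given observed update timestamps (epoch seconds)
# for a dataset field, decide whether it behaves as static or dynamic and,
# if dynamic, estimate its typical update interval. The one-month threshold
# is an illustrative assumption.
def classify_field(timestamps, static_threshold_s=30 * 24 * 3600):
    """Return ('static', None) or ('dynamic', median_interval_seconds)."""
    if len(timestamps) < 2:
        return ("static", None)
    intervals = sorted(b - a for a, b in zip(timestamps, timestamps[1:]))
    median = intervals[len(intervals) // 2]
    if median >= static_threshold_s:
        return ("static", None)
    return ("dynamic", median)

# A car-park free-slot counter updated roughly every minute:
print(classify_field([0, 61, 118, 180, 241]))  # ('dynamic', 61)
```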

If the data are all static, such as a collection of POI data, proceed as follows:

The aim is to produce a file of triples that has to be loaded into the KB (on the cloud, this operation has to be performed by the RootAdmin; in the on-premise version, the owner can do it autonomously).
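A minimal sketch of producing such a triples file in N-Triples format is shown below. The URIs and the `km4c:Service` class follow the general Km4City style but are illustrative assumptions, not the exact ontology terms; a real mapping would use the proper Km4City classes and properties.

```python
# Illustrative static POI dataset to be turned into RDF triples.
POIS = [
    {"id": "poi_001", "name": "Museo del Bargello", "lat": 43.7706, "lon": 11.2589},
]

# Assumed namespaces, for illustration only.
KM4C = "http://www.disit.org/km4city/schema#"
BASE = "http://www.disit.org/km4city/resource/"

def poi_to_ntriples(poi):
    """Serialize one POI as N-Triples lines (type, name, lat, long)."""
    s = f"<{BASE}{poi['id']}>"
    return "\n".join([
        f"{s} <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <{KM4C}Service> .",
        f"{s} <http://xmlns.com/foaf/0.1/name> \"{poi['name']}\" .",
        f"{s} <http://www.w3.org/2003/01/geo/wgs84_pos#lat> \"{poi['lat']}\" .",
        f"{s} <http://www.w3.org/2003/01/geo/wgs84_pos#long> \"{poi['lon']}\" .",
    ])

# Write the file of triples to be loaded into the KB.
with open("poi_triples.nt", "w", encoding="utf-8") as f:
    for poi in POIS:
        f.write(poi_to_ntriples(poi) + "\n")
```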

If the dataset cannot be regularized, an ad-hoc solution has to be developed, as follows:

The second case is more interesting, since a non-regular dataset can have both static and dynamic information. Many examples are present in smart cities: a car park monitoring system has a fixed location and registers different kinds of data every minute (e.g., the number of free slots, the amount of time each slot is free or occupied); a sensor counting the number of people entering a museum probably has a fixed location, but counts people every second; an air quality sensor placed on a bus continuously moves with the bus and takes measurements in real time; etc. Thus, if the data are not regular, different methodologies to classify/manage/exchange data (e.g., Push or Pull) have to be adopted. In general, one has to choose between the two main methods of data ingestion (ETL or IoT App) and the related tools.

  • A) ETL process ingestion: In this case, the data are ingested by developing an ETL process. For the development of ETL processes, the Pentaho Kettle open-source tool has been adopted, integrated into the system and provided to developers. For each dataset, two ETL processes/scripts are typically created: (1) a static ETL for addressing static aspects and ingesting them into the KB, also creating relationships with the other city entities in the KB; and (2) a periodic ETL, put in execution periodically by the DISCES scheduler to collect the real-time changing data and copy them into the data store. The period adopted by the scheduler is determined using the data change frequency or the licences/rules specified by the data provider. The data storage can be implemented by using two different methods: a) a Big Data cluster (based on HDFS, HBase, Phoenix); or b) an indexing and aggregating tool (e.g., based on Elasticsearch). Each solution has its pros and cons, but in both cases replicas and federations can be set up, with vertical and horizontal scaling, thus creating a large data store with some indices. In both cases, queries are performed by using NoSQL approaches via API, and many constructs of classical SQL cannot be used. In this paper, we do not perform comparisons of these two solutions, even if the ingestion processes have been tested and assessed using both.
  • B) IoT App process ingestion: IoT data is typically sent in push mode using a publisher/subscriber protocol. IoT devices are registered in an IoT Broker which is registered on the Snap4City IoT Directory, or vice versa. When a new IoT Device is connected to the Snap4City IoT Directory, it: i) registers the static data (the IoT Device description and data model) on the KB; and then ii) sends a command to the storage system to make a subscription to the corresponding IoT Broker, so as to receive all the new messages for storage. In the case of HBase, a specific process is set up for writing into the storage; it is implemented by using ETL or an IoT App, and thus it is performed for each new entity. On the contrary, in the case of Elasticsearch, a scalable Apache NiFi ingestion process has been implemented to automatically subscribe the IoT Brokers on all their devices and feed the Elasticsearch engine, thus creating the data shadow for IoT data. The IoT Applications can be used for data ingestion of different kinds of protocols, using both pull and push modes. In our system, we have adopted an IoT App flow for registering the data model as an IoT Device on the IoT Broker and IoT Directory by using the Snap4City APIs; then a second IoT App flow is dedicated to registering all the metadata and descriptors for modeling the new entry into the KB, which cannot be passed by the IoT Directory and Broker to the KB. When a new dataset needs to be ingested, whether the methodology adopted is ETL or IoT App, it is possible to create an ad-hoc semantic mapping to connect each sensor/IoT Device, POI, etc. to the Km4City ontology. This way, we can put the dataset and its related metadata/data in the Snap4City Knowledge Base (realized in Virtuoso). The mapping creates a set of RDF triples based on the Km4City classes and properties and then adds them to the Snap4City KB.
If the ingestion methodology adopted is IoT App, the most relevant triples are automatically created by the system and added to the Snap4City ontology when each sensor is registered (the registration can be done with an easy-to-use web tool, also in bulk); then developers are free to add other specific RDF triples via IoT App. Developers have the freedom to describe new sensors with as many triples as needed in order to make the sensor data more reusable and more connected with all the types of resources and entities defined in Km4City. Developers are supported at all times thanks to the Living Lab and co-creation activities available on the Snap4City platform.
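The two-ETL pattern of case A) can be sketched as follows. The KB and data store are stand-in dictionaries and the loop is a toy stand-in for the DISCES scheduler; the dataset name, period, and run count are illustrative assumptions chosen so the sketch runs quickly.

```python
import time

# Stand-ins for the Knowledge Base and the data store.
KB, DATA_STORE = {}, []

def static_etl(dataset_id, description):
    """Run once: register the static metadata (KB stand-in)."""
    KB[dataset_id] = description

def periodic_etl(dataset_id, fetch):
    """Run at each scheduler tick: pull the current value into the store."""
    DATA_STORE.append({"dataset": dataset_id, "value": fetch(), "t": time.time()})

def run_scheduler(dataset_id, fetch, period_s, runs):
    """Very small stand-in for the DISCES scheduler loop."""
    for _ in range(runs):
        periodic_etl(dataset_id, fetch)
        time.sleep(period_s)

# Hypothetical car-park dataset: static description once, then periodic pulls.
static_etl("carpark_01", {"name": "Garage S. Lorenzo", "kind": "car park"})
free_slots = iter([42, 41, 43])
run_scheduler("carpark_01", lambda: next(free_slots), period_s=0.01, runs=3)
print(len(DATA_STORE))  # 3
```

In the real platform the period would come from the data change frequency or from the provider's licence terms, as described above.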
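The push-mode flow of case B) can likewise be sketched with a minimal in-memory publisher/subscriber broker. The `Broker` class, device identifiers, and the dictionaries standing in for the KB and the data shadow are all illustrative assumptions, not the Snap4City IoT Broker or NiFi implementation.

```python
from collections import defaultdict

class Broker:
    """Toy in-memory publish/subscribe broker."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, device_id, callback):
        self.subscribers[device_id].append(callback)

    def publish(self, device_id, message):
        for cb in self.subscribers[device_id]:
            cb(device_id, message)

# Stand-ins for the KB (static descriptions) and the per-device data shadow.
KB_METADATA, DATA_SHADOW = {}, defaultdict(list)

def register_device(broker, device_id, model):
    # i) register the static description / data model (KB stand-in)
    KB_METADATA[device_id] = model
    # ii) subscribe the storage so every pushed message is persisted
    broker.subscribe(device_id, lambda d, m: DATA_SHADOW[d].append(m))

broker = Broker()
register_device(broker, "airq_bus_07", {"type": "air quality", "mobile": True})
broker.publish("airq_bus_07", {"no2": 38.1})
broker.publish("airq_bus_07", {"no2": 36.4})
print(len(DATA_SHADOW["airq_bus_07"]))  # 2
```

The key point the sketch mirrors is that registration does two things at once: it stores the static description (destined for the KB) and sets up the subscription, so that every subsequent push message lands in the device's data shadow without further developer intervention.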