In general, data ingestion for a smart city is a very articulated process, since several kinds of data sources have to be addressed:
- full view of possible data ingestion processes: HOW TO: add data sources to the Platform
The Snap4City Data Ingestion Flow Diagram is reported in the figure.
It includes all the aspects necessary to ingest the data in a manner that allows them to be semantically aggregated and registered in the Knowledge Base (KB), compliant with the Km4City multi-ontology. The first action of the Snap4City data ingestion process is the Road Graph Setup. This is usually based on collecting and integrating data related to streets coming from public administrations and city governments, as well as open datasets such as OpenStreetMap or open data of the local government. This phase provides geo-localization and connects all the datasets to the road graph. In this way, Points of Interest (POI), sensors, citizens, etc., can be located in a specific place of the city in addition to their coordinates, which are not always present in the datasets.
After the Road Graph Setup, for each dataset it is necessary to understand whether: i) it is only static; or ii) it also has some dynamic fields that can change in the future (and at which rate).
If the data are all static, such as a collection of POI data, proceed as follows:
The aim is to produce a file of triples that has to be loaded into the KB (on the cloud this operation has to be done by the RootAdmin; in the on-premise version the owner can do it autonomously).
- A) CKAN path: the most frequent condition for cities is to collect open data with POIs on specific tools such as CKAN, which is the most widely adopted tool for creating official Open Data Portals in Europe. For this reason, the simplest solution is to automate the process from CKAN to the KB using a plugin module of CKAN called DataGate. It regularizes the open data following a specific template to fit them into a standard model that can be processed by a transformation process implemented as an Extract, Transform and Load (ETL). The ETL process is put in execution by the DISCES distributed scheduler. Each DataGate ETL process has to integrate the data with the KB by reconciling the new information with the city entities already in place. For example, the position of each new restaurant has to be connected with the GPS location of the civic number of its corresponding address. Therefore, the restaurant has coordinates, location (city, province, street name, civic number, etc.), name/e-mail of the responsible data provider, etc. [14], [15].
- Web page, Web Service, FTP service, REST API call, etc.: in this case, the best solution can be to create a periodic or sporadic script to ingest the data. You have two possibilities:
- CKAN service of your city: you can set the Snap4City DataGate to access that CKAN and get the data automatically. See:
- B) IOT App path (upper part): the open data file describing the POIs has to be regularized into a standard format for the ingestion, and passed to an ETL or IOT App. So far we have used ETL processes; we are now moving to an IOT App that can be used for the same purpose, that is, to produce a file of triples to be loaded into the KB. The standard format expected for the POI ingestion is described in:
- CSV data ingestion format for POIs https://www.snap4city.org/589
- you can generate the triples (the data for the ingestion) to be uploaded into the Knowledge Base by loading the above-mentioned CSV file on the service accessible from: https://iot-app.snap4city.org/nodered/nr9xtwc/simple
- the procedure directly generates the triples, sending you an email
- the procedure for the triple upload is described in: HOW TO Upload data into Knowledge Base ServiceMap (triple upload).
- an IOT App for the generation of the triples is accessible from: Example: an IOT App for generating RDF triples, loading POI into Knowledge Base. A minimal sketch of this kind of triple generation is shown below.
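The following is a minimal sketch, not the official Snap4City IOT App flow, of how such RDF triples can be generated from a regularized POI CSV file using Python and rdflib. The Km4City namespace URI, the class/property names, the CSV column names and the base URI for the POIs are assumptions for illustration; the actual CSV format is the one described at https://www.snap4city.org/589.

```python
# Illustrative sketch only: generate N-Triples from a regularized POI CSV.
# The Km4City namespace URI, class/property names, CSV columns and base URI
# below are assumptions for illustration, not the official mapping.
import csv
from rdflib import Graph, Literal, Namespace, RDF, URIRef

KM4C = Namespace("http://www.disit.org/km4city/schema#")      # assumed namespace
WGS84 = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")  # standard WGS84 vocab

g = Graph()
g.bind("km4c", KM4C)
g.bind("geo", WGS84)

with open("pois.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        poi = URIRef(f"http://example.org/poi/{row['id']}")   # placeholder base URI
        g.add((poi, RDF.type, KM4C.Service))                  # assumed class
        g.add((poi, KM4C.name, Literal(row["name"])))         # assumed property
        g.add((poi, WGS84.lat, Literal(float(row["latitude"]))))
        g.add((poi, WGS84.long, Literal(float(row["longitude"]))))

# The resulting file of triples is what gets uploaded into the KB
g.serialize(destination="pois.nt", format="nt")
```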
If the dataset cannot be regularized, an ad-hoc solution has to be developed as follows:
The second case is more interesting, since a non-regular dataset can have both static and dynamic info. Many examples are present in smart cities: a car park monitoring system has a fixed location and registers different kinds of data every minute (i.e., the number of free slots, the amount of time each slot is free or occupied); a sensor for counting the people entering a museum probably has a fixed location, but counts the people every second; an air quality sensor placed on a bus continuously moves as the bus moves and takes measurements in real time; etc. Thus, if the data are not regular, different methodologies to classify/manage/exchange the data (e.g., push or pull) have to be adopted. In general, one has to choose between the two main methods of data ingestion (ETL or IOT App) and the related tools.
- A) ETL process ingestion: in this case, the data are ingested by developing an ETL process. For the development of ETL processes, the Pentaho Kettle open source tool has been adopted, integrated in the system and provided to developers. For each dataset, two ETL processes/scripts are typically created: (1) a static ETL addressing the static aspects and ingesting them into the KB, also creating relationships with the other city entities in the KB; and (2) a periodic ETL put in execution by the DISCES scheduler to collect the real-time changing data and copy them into the data store (a sketch of this periodic pattern is shown after the next item). The period adopted by the scheduler is determined by the data change frequency or by the licences/rules specified by the data provider. The data storage can be implemented using two different methods: a) a Big Data cluster (based on HDFS, HBase, Phoenix); or b) an indexing and aggregating tool (e.g., based on Elastic Search). Each solution has its pros and cons, but in both cases replicas and federations can be set up, with vertical and horizontal scaling, thus creating a large data store with some indices. In both cases, queries are performed using NoSQL approaches via API, and many constructs of classical SQL cannot be used. We do not compare the two solutions here, even if the ingestion processes have been tested and assessed using both.
- an ETL process to ingest them: TC6.3. Creating ETL processes for automated data ingestion and data transformation
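In Snap4City the periodic ETL at point (2) is implemented as a Pentaho Kettle transformation scheduled by DISCES; the following Python sketch only illustrates the same extract/transform/load pattern for a hypothetical car park feed. The provider URL, field names and index name are placeholders, and Elastic Search is used as the example data store.

```python
# Illustrative sketch only: the real Snap4City periodic ETLs are Pentaho
# Kettle transformations scheduled by DISCES. This shows the shape of a
# periodic "pull" ETL: fetch, transform, load. Endpoint URL, field names
# and index name are placeholders.
from datetime import datetime, timezone

import requests
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

def run_once():
    # Extract: poll the data provider at the rate allowed by its licence
    resp = requests.get("https://provider.example.org/carpark/status", timeout=30)
    resp.raise_for_status()
    for record in resp.json():
        # Transform: keep only the dynamic fields; static ones are in the KB
        doc = {
            "serviceUri": record["id"],  # link back to the KB entity
            "freeSlots": int(record["free"]),
            "observedAt": datetime.now(timezone.utc).isoformat(),
        }
        # Load: index the measurement into the real-time data store
        es.index(index="carpark-status", document=doc)

if __name__ == "__main__":
    run_once()  # the scheduler (e.g., DISCES or cron) handles periodicity
```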
- B) IOT App process ingestion:
- IoT data are typically sent in push mode using a publish/subscribe protocol. IoT devices are registered in an IoT Broker which is registered on the Snap4City IoT Directory, or vice-versa. When a new IoT Device is connected to the Snap4City IoT Directory, it: i) registers the static data (the IoT Device description and data model) on the KB; and then ii) sends a command to the storage system to make a subscription to the corresponding IoT Broker so as to receive all the new messages for the storage (a minimal subscription sketch is shown below). In the case of HBase, a specific process is set up for writing into the storage; it is implemented using ETL or IoT App, and is thus performed for each new entity. On the contrary, in the case of Elastic Search, a scalable Apache NiFi ingestion process has been implemented to automatically subscribe to the IoT Brokers on all their devices and feed the Elastic Search engine, thus creating the data shadow for IoT data. The IoT Applications can be used for data ingestion with different kinds of protocols, using both pull and push modes. In our system, we have adopted an IoT App flow for registering the data model as an IoT Device on the IoT Broker and IoT Directory by using the Snap4City APIs; a second IoT App flow is dedicated to registering all the metadata and descriptors for modeling the new entry into the KB, which cannot be passed by the IoT Directory and Broker to the KB. When a new dataset needs to be ingested, whether the methodology adopted is ETL or IoT App, it is possible to create an ad-hoc semantic mapping to connect each sensor/IoT Device, POI, etc. to the Km4City ontology. This way, we can put the dataset and its related metadata/data in the Snap4City Knowledge Base (realized in Virtuoso). The mapping creates a set of RDF triples based on the Km4City classes and properties and then adds them to the Snap4City KB. If the ingestion methodology adopted is IoT App, the most relevant triples are automatically created by the system and added to the Snap4City ontology when each sensor is registered (the registration can be done with an easy-to-use web tool, also in bulk); developers are then free to add other specific RDF triples via IoT App. Developers have the freedom to describe new sensors with as many triples as possible in order to make the sensor data more reusable and more connected with all the types of resources and entities defined in Km4City. Developers are supported all the time thanks to the Living Lab and co-creation activities available on the Snap4City platform.
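As an illustration of the push mode described above, the following sketch subscribes to an MQTT broker (one possible publish/subscribe protocol; this is not the only one the platform brokers speak) and receives device messages as they are published. The broker address, topic layout and JSON payload format are placeholder assumptions.

```python
# Minimal sketch of the subscriber side of push-mode ingestion, using MQTT
# as an example publish/subscribe protocol. Broker host, topic and payload
# format are placeholders.
import json

import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, reason_code, properties):
    # Subscribe to all messages of the devices registered on this broker
    client.subscribe("devices/#")

def on_message(client, userdata, msg):
    # Each new message is a measurement to be written into the data shadow
    payload = json.loads(msg.payload)
    print(f"{msg.topic}: {payload}")  # replace with a write to the storage

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.on_connect = on_connect
client.on_message = on_message
client.connect("broker.example.org", 1883)  # placeholder broker address
client.loop_forever()
```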
- IOT Device data via some IOT Broker (see HOW TO: add a device to the Platform). The IOT Device can be added to an:
- internal Snap4City IOT Broker (a broker managed by Snap4City for security aspects). In this case, the data are immediately accessible, and you can find them in the list of your data in the Data Inspector view, for Dashboards, etc. Go to the Data Inspector to search your data by GPS location, name, or nature, as you like.
- external Broker (a broker managed by a third-party organization for security aspects). This means that the IOT Device can be accessed only with specific authentication mechanisms. See HOW TO: add IOT Device data source from external broker to the platform.
- HOW TO Create an IOT Device Model: https://www.snap4city.org/591
- HOW TO: Create an IOT Device Instance from IOT Directory tool: https://www.snap4city.org/590
- HOW TO Create a set of Devices with BulkProcessing
- additional example: https://www.snap4city.org/592
- HOW TO Develop an IOT Application for Data Ingestion https://www.snap4city.org/593
- HOW TO Upload data into Knowledge Base ServiceMap (triple upload). Please note that this is valid only for Snap4City solutions installed on premise and not on this cloud. A hypothetical sketch of such an upload is shown below.
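For completeness, here is a hypothetical sketch of a triple upload over HTTP. The ServiceMap upload endpoint, form fields and authentication shown below are assumptions, not the documented API; refer to the HOW TO above for the real procedure.

```python
# Hypothetical sketch of uploading a file of triples over HTTP. The URL,
# form field names and authentication are assumptions for illustration;
# see the triple-upload HOW TO for the actual procedure.
import requests

with open("pois.nt", "rb") as f:
    resp = requests.post(
        "https://servicemap.example.org/api/upload",          # assumed endpoint
        files={"file": ("pois.nt", f, "application/n-triples")},
        headers={"Authorization": "Bearer <access-token>"},   # assumed auth
        timeout=60,
    )
resp.raise_for_status()
print(resp.status_code)
```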