How To Dev - Design: Data Discovery


An important aspect to consider during the design phase is Data Discovery. This activity identifies the data, entities, and HLT (historical or real-time) to be collected from the field, from the context, from external or internal databases (national or local), from satellite data, from Open Data (CKAN), or from other networks or Gov portals.

The data to be discovered are:

  • not those generated internally by the devices/entities of the solution;
  • those needed for the implementation of the solution and that should be accessible from outside the solution. They can be historical data or real-time data connectors: for example, contextual maps, the distribution of vehicles, the profiles of the users to be involved in the experiments, etc. They also include open data such as points of interest (POI), maps, road graphs from OpenStreetMap or Gov databases, the cost of gasoline, the cost of energy, etc.
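
As an illustration of how open data could be searched during discovery, the sketch below builds a query URL for the `package_search` action of the CKAN Action API (version 3). The portal address and the helper name `ckan_search_url` are assumptions made for this example only; the actual portal to query depends on the context of the solution.

```python
from urllib.parse import urlencode


def ckan_search_url(portal_base: str, query: str, rows: int = 10) -> str:
    """Build a CKAN Action API `package_search` URL for a keyword query."""
    # urlencode escapes the keyword (spaces become '+') and joins the parameters.
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_base.rstrip('/')}/api/3/action/package_search?{params}"


# Hypothetical portal address, used only for illustration.
url = ckan_search_url("https://opendata.example.org/", "fuel price", rows=5)
print(url)
```

Fetching that URL from a real CKAN portal returns a JSON document listing the matching data sets, which can then be inspected for license, format, and update frequency.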

This phase has to answer questions such as:

  • Where can I get the data I need?
  • Is there any kind of data that can be a good surrogate for the data required?
  • How can they be accessed?
  • Which kind of license do they have? Are the data private or public?
  • Is the license compatible with the purpose of the solution?
  • Which information is present in these data, and does it fit the purpose?
  • Are those data ethically compliant?
  • Do I need to create a DPIA for GDPR?
  • Do I need to establish and sign an agreement with some data provider?
  • etc.
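
A minimal sketch of how the answers to these questions could be tracked for each candidate data source is given below. The class `DiscoveryAssessment` and all of its field names are hypothetical and chosen only for illustration; they are not part of the Snap4City platform.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DiscoveryAssessment:
    """Answers to the discovery questions for one candidate data source.

    A value of None means "not yet answered". Field names are illustrative.
    """
    source_name: str
    access_method: Optional[str] = None          # e.g. "REST API", "file download"
    license_name: Optional[str] = None
    license_fits_purpose: Optional[bool] = None
    content_fits_purpose: Optional[bool] = None
    ethically_compliant: Optional[bool] = None
    dpia_required: Optional[bool] = None
    agreement_required: Optional[bool] = None

    def open_questions(self) -> List[str]:
        """Return the names of the questions that still have no answer."""
        return [name for name, value in vars(self).items() if value is None]

    def blockers(self) -> List[str]:
        """Return the checks that were answered negatively."""
        checks = ("license_fits_purpose", "content_fits_purpose", "ethically_compliant")
        return [name for name in checks if getattr(self, name) is False]


# Example: a road graph taken from OpenStreetMap, partially assessed.
assessment = DiscoveryAssessment(
    source_name="road graph",
    access_method="file download",
    license_name="ODbL",
    license_fits_purpose=True,
)
print(assessment.open_questions())
print(assessment.blockers())
```

A source with an empty list of open questions and no blockers is a reasonable candidate to move forward to the next phase.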

How to proceed:

  1. Data Identification is performed on the basis of the Entities identified and the needs of Data Analytics / Transformations. The discovered data may have their own data model, in which case that model is adopted in the corresponding design phase.
  2. Developers should verify whether the needed data are already available on the (Snap4City) platform, or whether they are accessible elsewhere and can be integrated, and how.
  3. If the needed data are missing, the Data Ingestion Development phase has to be addressed; otherwise one can pass directly to IoT App or Dashboard Development.
  4. Before Data Ingestion, the data agreements have to be developed and signed (on the basis of the data licensing), Data Ethics has to be verified, and GDPR-compliant procedures such as the DPIA have to be developed. According to the data agreement, the rules can be enforced in the Snap4City platform and business logic, if needed.
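
The decision flow of steps 2 to 4 can be sketched as a small function. This is only one possible reading of the procedure; the function name `plan_next_steps`, its parameters, and the step labels are assumptions introduced for this example.

```python
from typing import List


def plan_next_steps(data_available: bool,
                    agreement_needed: bool,
                    ethics_verified: bool,
                    dpia_needed: bool) -> List[str]:
    """Return the ordered activities that follow Data Discovery."""
    # Step 3: data already on the platform means ingestion can be skipped.
    if data_available:
        return ["IoT App or Dashboard Development"]

    steps: List[str] = []
    # Step 4: legal and ethical prerequisites come before any ingestion.
    if agreement_needed:
        steps.append("Develop and sign data agreement")
    if not ethics_verified:
        steps.append("Verify Data Ethics")
    if dpia_needed:
        steps.append("Prepare DPIA")
    steps.append("Data Ingestion Development")
    return steps


# Example: missing data that needs an agreement and a DPIA.
for step in plan_next_steps(data_available=False, agreement_needed=True,
                            ethics_verified=True, dpia_needed=True):
    print(step)
```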

Please note that in most cases the needed data can be surrogated with other kinds of data. For example, sensor data for environmental monitoring can be surrogated with satellite data, with some connections; CO2 and NO2 data with traffic flow data; map data with OSM; and for Orthomaps there are multiple providers. Open data are accessible, and most of those describing the contextual situation can be recovered from different providers in local, regional, national, and international institutions. As a last resort, one could even buy some of the data from mobile operators, Google, TomTom, Here, insurance companies, municipalities, public transportation operators, bike sharing operators, etc.

In most cases, the data discovery process collects information associated with whole data sets and not only with single Entities. For each data set, this information covers the owner, acquisition model and protocol, format, volume, rate, etc. A table or Excel file is typically used for this purpose, even though the information can also be saved in the Digital Twin, data inspector of the platform.

Examples are provided per column; thus, the resulting rows may not make sense.
The status refers to the ingestion process.
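
A minimal sketch of such a data set inventory, written as CSV so that it can be opened in Excel, is shown below. The record type `DatasetRecord`, its column names, and the sample values are hypothetical placeholders chosen for illustration; they do not describe a real data set.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable


@dataclass
class DatasetRecord:
    """One row of the data discovery inventory (illustrative columns)."""
    name: str
    owner: str
    acquisition_model: str   # e.g. "push" or "pull"
    protocol: str
    data_format: str
    volume: str
    rate: str
    status: str              # state of the ingestion process


def to_csv(records: Iterable[DatasetRecord]) -> str:
    """Serialize the inventory to CSV text with one header row."""
    buffer = io.StringIO()
    columns = [f.name for f in fields(DatasetRecord)]
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder values only, not taken from a real deployment.
inventory = [
    DatasetRecord("traffic flow", "municipality", "pull", "HTTP REST",
                  "JSON", "500 sensors", "every 10 minutes", "to be ingested"),
]
csv_text = to_csv(inventory)
print(csv_text)
```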