Test Case Title
|TC5.10 - Open Street Map ingestion process (OSM2SM)|
As administrator, I can
Filling the Knowledge Base with a street graph obtained from the Open Street Map.
PostgreSQL with PostGIS extension
Osmosis simple schema database
Expected successful result
The Knowledge Base contains an up-to-date street graph imported from Open Street Map.
The following functionalities are available only for specific Snap4city users with specific privileges.
Identify and get the latest version of the Open Street Map extract of your interest.
Fill an Osmosis simple schema database reading from the Open Street Map extract.
Launch appropriate SQL scripts to prepare the data for an efficient triplification.
Launch the Sparqlify configured through an appropriate SML script to generate the triples.
Improve the resulting triple files removing the heading and duplicate lines.
Load the generated triple files to the Knowledge Base.
The process is proved to complete successfully on both the Ubuntu and the Debian Linux distributions. Anyway, it does not rely on distribution-specific features, so feel free to adopt the distribution you prefer.
The Osmosis distribution, documented at https://wiki.openstreetmap.org/wiki/Osmosis, is leveraged to read from the Open Street Map source files and write their content to a properly shaped relational database.
The Sparqlify generates the RDF triples reading from a relational database, based on a SML configuration file. Documentation and installation artefacts and instructions can be found in the main page of the project at http://aksw.org/Projects/Sparqlify.html, and on GitHub at https://github.com/SmartDataAnalytics/Sparqlify.
PostgreSQL is the recommended relational database engine for storing the Open Street Map data. Documentation and installation artefacts can be found at https://www.postgresql.org/.
PostGIS is an extension that allows an efficient and effective management of geometric and geographic data in PostgreSQL, and it is required to be installed for effectively working with the Open Street Map data. Documentation and installation artefacts can be found at https://postgis.net/.
The relational database schema that we have identified as the most appropriate is the Osmosis simple schema. The SQL scripts for the shaping of the database are part of the Osmosis. Detailed instructions can be found at https://wiki.openstreetmap.org/wiki/Osmosis/Detailed_Usage_0.43#PostGIS_Tasks_.28Simple_Schema.29.
Identify and get the latest version of the Open Street Map extract of your interest
We have identified the Geofabrik as the best source of Open Street Map extracts. It is a specialized portal where Open Street Map source files (and their compressed versions) can be found that address specific continents, countries, and regions. Also, the extracts of the OSC files are available, where the variations that occur to the Open Street Map are represented on a daily basis. You can get the Geofabrik Open Street Map extracts from download section of the portal, that can be reached at https://download.geofabrik.de/.
Fill an Osmosis simple schema database reading from the Open Street Map extract
The Osmosis distribution includes a command-line tool and a set of scripts that allow to perform a wide set of manipulations over the Open Street Map data, such as extractions, transformations, comparisons, and so on. In this use case, the tool is leveraged to fill a properly shaped relational database with the data extracted from the Open Street Map source files. Detailed instructions of how it could be performed can be found at https://wiki.openstreetmap.org/wiki/Osmosis/Detailed_Usage_0.46#--write-pgsimp_.28--ws.29.
Launch appropriate SQL scripts to prepare the data for an efficient triplification
After that the relational database has been filled with the Open Street Map data, a set of auxiliary tables must be created to speed up the following steps of the ingestion process. The creation of such tables is performed through the execution of a SQL script that can be found in our (DISIT Lab) GitHub repository at https://github.com/disit/osm2km4c/blob/master/sparqlify/install/performance_optimization.sql. The script is typically needed to be executed only once for each database, since it accesses data that tend to be immutable over the time, such as the listing of the Public Administrations that govern the territories of interest, their borders, and some country-dependent configurations.
After that, and every time that the Open Street Map data are updated in the relational database, some other optimizations are necessary. Such optimizations are to be performed executing the SQL script that can be found at https://github.com/disit/osm2km4c/blob/master/sparqlify/install/irdbcmap.sql.txt, in our (DISIT Lab) GitHub repository. At the beginning of such SQL script, a configuration section can be found. We reasonably expect that the only parameter of your real interest locates at line 25, and it is the Open Street Map unique identifier of the geographic boundary of the triplification. Indeed, even if your relational database contains, suppose, the Open Street Map data of a whole country, we do not recommend producing the triples for the whole country in a single execution. Indeed, we recommend producing the triples for one province at a time. This way, you are granted each execution to complete in a reasonable time.
Figure: The Open Street Map unique identifier for the Municipality of Helsinki is 34914
Launch the Sparqlify configured through an appropriate SML script to generate the triples
After that the data preparing outlined above has been performed, the Sparqlify can be launched to produce the triple files.
A sample invocation of the Sparqlify follows:
./sparqlify.sh -m ~/script.sml -h 192.168.0.110 -d pgsimple_fin
-U pgsimple_fin_reader -W pgsimple_fin_reader -o ntriples --dump > ~/triples.n3
A short description of the command arguments follows:
- m, the full path and file name of the SML configuration script where it is described how the data stored in the relational database should be used for generating the RDF triples. A ready-to-use script can be found at https://github.com/disit/osm2km4c/blob/master/sparqlify/install/irdbcmap.sml.txt, in our (DISIT Lab) GitHub repository;
- h, the name (or IP address) of the host where the relational database that contains the Open Street Map data (and that constitutes the Sparqlify source of data) can be found;
- d, the name of the relational database where the Open Street Map data can be found;
- U, the username to be used for authenticating to the relational database;
- W, the password to be used for authenticating to the relational database;
- o, whether the output file should include (nquads) or not to include (ntriples) the RDF graph URI;
- --dump, mandatory flag for that the generated triples could be produced in output;
- > ~/triples.n3, the full path of the file where the produced triples is stored.
Improve the resulting triple files removing the heading and duplicate lines
The first two lines of the output file that the Sparqlify generates are to be stripped away since they are heading lines and they could lead to errors at the time of loading the triple file to the graph database. Also, duplicate lines should be stripped away since they cause a useless performance degradation.
A way this can be achieved in Linux is proposed below here:
tail -n +3 sparqlify-output.n3 > no-headers.n3
sort no-headers.n3 | uniq > ready-to-use.n3
Load the generated triple files to the Knowledge Base
The bulk loading functionality of the graph database should be leveraged for loading the newly generated triples so that they could be part of the Knowledge Base. We recommend loading each province to a separate graph. We recommend removing the existing triples before loading the new ones. Specific precautions could be necessary for some graph databases. Refer to the documentation of the graph database for further details.
An even more comprehensive description of the process can be found in the article
- From the Open Street Map to the Km4City street graph