TC5.10 - Open Street Map ingestion process

×

Warning message

  • You can't delete this newsletter because it has not been sent to all its subscribers.
  • You can't delete this newsletter because it has not been sent to all its subscribers.

Test Case Title

TC5.10 - Open Street Map ingestion process  (OSM2SM)

Goal

As administrator, I can

Filling the Knowledge Base with a street graph obtained from the Open Street Map.

Prerequisites

Installed Linux

Installed Osmosis

Installed Sparqlify

PostgreSQL with PostGIS extension

Osmosis simple schema database

You can ask to disit lab to get an Appliance or a service.

Expected successful result

The Knowledge Base contains an up-to-date street graph imported from Open Street Map.

The following functionalities are available only for specific Snap4city users with specific privileges.

Steps

Identify and get the latest version of the Open Street Map extract of your interest.

Fill an Osmosis simple schema database reading from the Open Street Map extract.

Launch appropriate SQL scripts to prepare the data for an efficient triplification.

Launch the Sparqlify configured through an appropriate SML script to generate the triples.

Improve the resulting triple files removing the heading and duplicate lines.

Load the generated triple files to the Knowledge Base.

 


Prerequisites

The process is proved to complete successfully on both the Ubuntu and the Debian Linux distributions. Anyway, it does not rely on distribution-specific features, so feel free to adopt the distribution you prefer.

The Osmosis distribution, documented at https://wiki.openstreetmap.org/wiki/Osmosis, is leveraged to read from the Open Street Map source files and write their content to a properly shaped relational database.

The Sparqlify generates the RDF triples reading from a relational database, based on a SML configuration file. Documentation and installation artefacts and instructions can be found in the main page of the project at http://aksw.org/Projects/Sparqlify.html, and on GitHub at https://github.com/SmartDataAnalytics/Sparqlify.

PostgreSQL is the recommended relational database engine for storing the Open Street Map data. Documentation and installation artefacts can be found at https://www.postgresql.org/.

PostGIS is an extension that allows an efficient and effective management of geometric and geographic data in PostgreSQL, and it is required to be installed for effectively working with the Open Street Map data. Documentation and installation artefacts can be found at https://postgis.net/

The relational database schema that we have identified as the most appropriate is the Osmosis simple schema. The SQL scripts for the shaping of the database are part of the Osmosis. Detailed instructions can be found at https://wiki.openstreetmap.org/wiki/Osmosis/Detailed_Usage_0.43#PostGIS_Tasks_.28Simple_Schema.29.

 


Identify and get the latest version of the Open Street Map extract of your interest

We have identified the Geofabrik as the best source of Open Street Map extracts. It is a specialized portal where Open Street Map source files (and their compressed versions) can be found that address specific continents, countries, and regions. Also, the extracts of the OSC files are available, where the variations that occur to the Open Street Map are represented on a daily basis. You can get the Geofabrik Open Street Map extracts from download section of the portal, that can be reached at https://download.geofabrik.de/.

 


Fill an Osmosis simple schema database reading from the Open Street Map extract

The Osmosis distribution includes a command-line tool and a set of scripts that allow to perform a wide set of manipulations over the Open Street Map data, such as extractions, transformations, comparisons, and so on. In this use case, the tool is leveraged to fill a properly shaped relational database with the data extracted from the Open Street Map source files. Detailed instructions of how it could be performed can be found at https://wiki.openstreetmap.org/wiki/Osmosis/Detailed_Usage_0.46#--write-pgsimp_.28--ws.29.

 


Launch appropriate SQL scripts to prepare the data for an efficient triplification

After that the relational database has been filled with the Open Street Map data, a set of auxiliary tables must be created to speed up the following steps of the ingestion process. The creation of such tables is performed through the execution of a SQL script that can be found in our (DISIT Lab) GitHub repository at https://github.com/disit/osm2km4c/blob/master/sparqlify/install/performance_optimization.sql. The script is typically needed to be executed only once for each database, since it accesses data that tend to be immutable over the time, such as the listing of the Public Administrations that govern the territories of interest, their borders, and some country-dependent configurations.

After that, and every time that the Open Street Map data are updated in the relational database, some other optimizations are necessary. Such optimizations are to be performed executing the SQL script that can be found at https://github.com/disit/osm2km4c/blob/master/sparqlify/install/irdbcmap.sql.txt, in our (DISIT Lab) GitHub repository. At the beginning of such SQL script, a configuration section can be found. We reasonably expect that the only parameter of your real interest locates at line 25, and it is the Open Street Map unique identifier of the geographic boundary of the triplification. Indeed, even if your relational database contains, suppose, the Open Street Map data of a whole country, we do not recommend producing the triples for the whole country in a single execution. Indeed, we recommend producing the triples for one province at a time. This way, you are granted each execution to complete in a reasonable time.

Figure: The Open Street Map unique identifier for the Municipality of Helsinki is 34914

Launch the Sparqlify configured through an appropriate SML script to generate the triples

After that the data preparing outlined above has been performed, the Sparqlify can be launched to produce the triple files.

A sample invocation of the Sparqlify follows:

./sparqlify.sh -m ~/script.sml -h 192.168.0.110 -d pgsimple_fin
-U pgsimple_fin_reader -W pgsimple_fin_reader -o ntriples --dump > ~/triples.n3

A short description of the command arguments follows:

  • m, the full path and file name of the SML configuration script where it is described how the data stored in the relational database should be used for generating the RDF triples. A ready-to-use script can be found at https://github.com/disit/osm2km4c/blob/master/sparqlify/install/irdbcmap.sml.txt, in our (DISIT Lab) GitHub repository;
  • h, the name (or IP address) of the host where the relational database that contains the Open Street Map data (and that constitutes the Sparqlify source of data) can be found;
  • d, the name of the relational database where the Open Street Map data can be found;
  • U, the username to be used for authenticating to the relational database;
  • W, the password to be used for authenticating to the relational database;
  • o, whether the output file should include (nquads) or not to include (ntriples) the RDF graph URI;
  • --dump, mandatory flag for that the generated triples could be produced in output;
  • > ~/triples.n3, the full path of the file where the produced triples is stored.

 


Improve the resulting triple files removing the heading and duplicate lines

The first two lines of the output file that the Sparqlify generates are to be stripped away since they are heading lines and they could lead to errors at the time of loading the triple file to the graph database. Also, duplicate lines should be stripped away since they cause a useless performance degradation.

A way this can be achieved in Linux is proposed below here:

tail -n +3 sparqlify-output.n3 > no-headers.n3
sort no-headers.n3 | uniq > ready-to-use.n3

 


Load the generated triple files to the Knowledge Base

The bulk loading functionality of the graph database should be leveraged for loading the newly generated triples so that they could be part of the Knowledge Base. We recommend loading each province to a separate graph. We recommend removing the existing triples before loading the new ones. Specific precautions could be necessary for some graph databases. Refer to the documentation of the graph database for further details.

 


Read more

An even more comprehensive description of the process can be found in the article 

 

Comments

Hello

I'm having an issue at "Launch appropriate SQL scripts to prepare the data for an efficient triplification" step.

To be more specific, i renamed the SQL script https://github.com/disit/osm2km4c/blob/master/sparqlify/install/irdbcmap.sql.txt to irdbcmap.sql and replaced the OSM id inside of it. After that, i executed the SQL script with the command # psql -d pgsimple_gre -f osm2km4c/sparqlify/install/irdbcmap.sql and got the following 3 errors:
'''
psql:osm2km4c/sparqlify/install/irdbcmap.sql:2448: NOTICE:  table "node_oneway" does not exist, skipping
DROP TABLE
psql:osm2km4c/sparqlify/install/irdbcmap.sql:2473: ERROR:  set-returning functions are not allowed in CASE
LINE 11: else unnest(array['forward','backward'])
              ^
HINT:  You might be able to move the set-returning function into a LATERAL FROM item.
psql:osm2km4c/sparqlify/install/irdbcmap.sql:2477: NOTICE:  table "way_oneway" does not exist, skipping
DROP TABLE
psql:osm2km4c/sparqlify/install/irdbcmap.sql:2522: ERROR:  set-returning functions are not allowed in CASE
LINE 11: else unnest(array['forward','backward'])
              ^
HINT:  You might be able to move the set-returning function into a LATERAL FROM item.
psql:osm2km4c/sparqlify/install/irdbcmap.sql:2526: NOTICE:  table "relation_oneway" does not exist, skipping
DROP TABLE
psql:osm2km4c/sparqlify/install/irdbcmap.sql:2569: ERROR:  set-returning functions are not allowed in CASE
LINE 11: else unnest(array['forward','backward'])
              ^
HINT:  You might be able to move the set-returning function into a LATERAL FROM item.
'''

Please, provide a solution since i'm not able to generate the triple file (.n3) in the next process step since the tables doesn't exist.

Thanks

Thanks for your help, i had installed Postgres 11 which i completly uninstalled.

Then i installed postgres-9.6 since 9.8.5 wasn't available in apt-get package for debian as you suggested. At the end i managed to execute the sql script without any errors.

However, on the next step, where i have to generate the triple file, i encounter a problem as a result my triple file remains empty.

Output (few of the last lines):

           select * from NodeStreetNumberRoad) a_235
      WHERE ("graph_uri" IS NOT NULL) AND ("en_id" IS NOT NULL)
    UNION ALL
      SELECT NULL::text "C_14", NULL::text "C_58", NULL::integer "C_13", NULL::text "C_57", NULL::text "C_12", NULL::text "C_56", NULL::text "C_11", NULL::text "C_55", NULL::text "C_10", NULL::text "C_54", 'http://www.disit.org/km4city/schema#Restriction'::text "C_53", NULL::integer "C_52", NULL::text "C_51", NULL::text "C_50", NULL::text "C_49", NULL::text "C_48", NULL::text "C_25", NULL::text "C_24", NULL::text "C_23", NULL::integer "C_22", NULL::integer "C_66", NULL::text "C_21", NULL::text "C_65", NULL::text "C_20", NULL::text "C_64", NULL::double precision "C_63", NULL::integer "C_62", NULL::text "C_61", NULL::text "C_60", NULL::text "C_3", NULL::text "C_5", NULL::text "C_4", NULL::text "C_7", NULL::text "C_6", NULL::text "C_19", NULL::text "C_9", NULL::text "C_18", NULL::integer "C_8", NULL::text "C_17", NULL::text "C_16", NULL::text "C_15", NULL::text "C_59", NULL::text "C_36", 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'::text "C_35", NULL::text "C_34", "graph_uri" "C_33", NULL::integer "C_32", NULL::text "C_31", NULL::text "C_30", "to_uri" "C_29", "from_uri" "C_28", NULL::text "C_27", NULL::text "C_26", NULL::text "C_47", NULL::double precision "C_46", NULL::text "C_45", NULL::integer "C_44", NULL::geometry "C_43", NULL::integer "C_42", NULL::text "C_41", NULL::text "C_40", NULL::text "C_39", NULL::text "C_38", NULL::text "C_37"
      FROM
        (
        select * from turn_restrictions) a_236
      WHERE ("from_uri" IS NOT NULL) AND ("graph_uri" IS NOT NULL) AND ("to_uri" IS NOT NULL)
) "a_237"

2021-09-08 14:02:44,438 TRACE org.aksw.sparqlify.core.sparql.QueryExecutionSparqlify: Closed connection: [HikariProxyConnection@1758876146 wrapping org.postgresql.jdbc.PgConnection@1654a892]
2021-09-08 14:02:44,441 DEBUG com.zaxxer.hikari.pool.ProxyConnection: HikariPool-1 - Executed rollback on connection org.postgresql.jdbc.PgConnection@1654a892 due to dirty commit state on close().
2021-09-08 14:02:44,441 DEBUG com.zaxxer.hikari.pool.PoolBase: HikariPool-1 - Reset (autoCommit) on connection org.postgresql.jdbc.PgConnection@1654a892

Could that be another version compatibility issue?

roottooladmin1's picture

it seems to us that you partially posted  the query, or there is some other mixed segments at the beginning or the query you reported.
I suggest you to use the KBSSM VM to start using instead of reinstalling all.

see from https://www.snap4city.org/drupal/node/471
I do not understood which kind of data you fred into and the context of your database .

it is very difficult to help you on data and context dependent problems, without knowing the details.

snap4city support

roottooladmin1's picture

See also https://www.snap4city.org/drupal/node/535

at the end:

Triplify

Loading maps in a local database and keeping them up to date is functional to the generation of RDF triples, that is what we perform in this third step. See the triplify.sh script to learn how it can be done. Note that it is the only script that you will need to customize to generate your own triples. Indeed, both the triplify.sql and triplify.sml scripts are thought to be left unaltered. It is a three-step process:

  1. Prepare data in your Postgresql database executing the triplify.sql script. In this step you must provide a parameter, again named boundary, that is the OSM ID of the OSM Relation that defines the boundary for the generation of triples. This boundary must be fully contained within the boundary that we have set during initialization;
  2. Generate RDF triples from the data that you have prepared in your Postgresql database. Use the sparqlify tool to do that;
  3. Perform some clean-up operations on the n3 file that the sparqlify produces.

Below here are some Web resources where you can learn more about tools and projects that we have met in this section of the guide:

Hello

I'm having an issue at "Launch appropriate SQL scripts to prepare the data for an efficient triplification" step.

To be more specific, i renamed the SQL script: https://github.com/disit/osm2km4c/blob/master/sparqlify/install/irdbcmap.sql.txt to irdbcmap.sql and replaced the OSM id. After that, i executed the SQL script and got the following errors:
'''
psql:osm2km4c/sparqlify/install/irdbcmap.sql:2448: NOTICE:  table "node_oneway" does not exist, skipping
DROP TABLE
psql:osm2km4c/sparqlify/install/irdbcmap.sql:2473: ERROR:  set-returning functions are not allowed in CASE
LINE 11: else unnest(array['forward','backward'])
              ^
HINT:  You might be able to move the set-returning function into a LATERAL FROM item.
psql:osm2km4c/sparqlify/install/irdbcmap.sql:2477: NOTICE:  table "way_oneway" does not exist, skipping
DROP TABLE
psql:osm2km4c/sparqlify/install/irdbcmap.sql:2522: ERROR:  set-returning functions are not allowed in CASE
LINE 11: else unnest(array['forward','backward'])
              ^
HINT:  You might be able to move the set-returning function into a LATERAL FROM item.
psql:osm2km4c/sparqlify/install/irdbcmap.sql:2526: NOTICE:  table "relation_oneway" does not exist, skipping
DROP TABLE
psql:osm2km4c/sparqlify/install/irdbcmap.sql:2569: ERROR:  set-returning functions are not allowed in CASE
LINE 11: else unnest(array['forward','backward'])
              ^
HINT:  You might be able to move the set-returning function into a LATERAL FROM item.
'''

Please, provide a solution since i'm not able to generate the triple file (.n3) in the next process step because table "node_oneway" doesn't exist.

Thanks