2023-06-02
Table of Contents
rdf2rml.sh model.ttl > model.r2rml.ttl
rdf2rml converts an RDF example with embedded tables (or SQL queries) column names into an R2RML script. R2RML is the W3C standard for RDBMS->RDF conversion. It is quite verbose and requires semantic experience to write.
rdf2rml generates R2RML transformations from examples, which saves about 15x in complexity and ensures compliance of the actual conversion to the model.
Typically the example is an rdfpuml model that uses embedded column names rather than actual attribute values.
The RDF model is a normal Turtle file with two additions.
puml:label
is used to provide the SQL source for each node
It can be one of the following:
rr:tableName
select * from
is optional and is prepended if missing.
Emitted as rr:sqlQuery
.ONCE
: replaced with select 1 from dual
,
which causes a single copy of the node to be generated
(no iteration from its “parent” node).
Use this for fixed data like skos:ConceptScheme
infopuml:label
is omitted, the node “inherits” the same query as its “parent” node.
This works only if the node has a single incoming or outgoing relation.
Inlined nodes and nodes with no properties don’t use this inheritance mechanism.rr:template
.That’s all we need to be able to generate R2RML!
Consider the following example:
<exhibition/(exhibitionid)>
puml:label """
exhibitions left join conxrefs
on id=exhibitionid
where tableid=47
and roleid=286
and exhdepartment in (53,54)
""";
a crm:E7_Activity;
crm:P2_has_type aat:300054766; # exhibition
crm:P14_carried_out_by <agent/(constituentid)>;
crm:P1_is_identified_by <exhibition/(exhibitionid)/title>;
crm:P4_has_time-span <exhibition/(exhibitionid)/date>.
<exhibition/(exhibitionid)/title> a crm:E41_Appellation;
crm:P3_has_note "(exhtitle)".
<exhibition/(exhibitionid)/date> a crm:E52_Time-Span;
crm:P3_has_note "(displaydate)";
crm:P82a_begin_of_the_begin "(beginisodate)"^^xsd:date;
crm:P82b_end_of_the_end "(endisodate)"^^xsd:date.
This creates 3 connected nodes.
All of them use the same query, inherited from the top node.
The column (exhibitionid)
is used in the URLs of all these nodes,
to ensure that correlated (linked) nodes are created for each row in the resultset.
Constant values like a crm:E7_Activity
and crm:P2_has_type aat:300054766
are emitted as is.
Variables like (displaydate)
are substituted.
As a hybrid example, (endisodate)
uses a variable value but constant datatype xsd:date
.
Now let’s see the full example test/exhibitions/exhibitions.ttl
:
The generated R2RML script is test/exhibitions/exhibitions.r2rml.ttl
.
The single node circled in red on the diagram above (and its 3 outgoing connections)
generates the following R2RML nodes: a saving of about 15x complexity!
Now let’s assume that we have the following relational data about Exhibitions (test/exhibitions/exhibitions.sql
):
conaddress
(constitutent addresses)conaddressid | constituentid | address |
---|---|---|
101 | 1 | ‘Getty Drive’ |
102 | 2 | ‘MoMA Street’ |
103 | 3 | ‘LACMA County’ |
conxrefs
(constitutent cross-references)tableid | roleid | id | constituentid |
---|---|---|---|
47 | 286 | 123 | 1 |
constituents
(constituents, in this case museums)constituentid | constituent |
---|---|
1 | ‘Getty Museum’ |
2 | ‘MoMA’ |
3 | ‘LACMA’ |
exhibitions
(exhibitions)exhibitionid | exhdepartment | exhtitle | displaydate | beginisodate | endisodate |
---|---|---|---|---|---|
123 | 53 | ‘Getty through the ages’ | ‘October 2016’ | ‘2016-10-01’ | ‘2016-10-30’ |
exhvenuesxrefs
(exhibitions at venues)exhvenxref | exhid | conid | conaddrid | approved | dispord | displaydate | beginisodate | endisodate |
---|---|---|---|---|---|---|---|---|
202 | 123 | 2 | 102 | 1 | 1 | ‘Early Oct 2016’ | ‘2016-10-01’ | ‘2016-10-15’ |
203 | 123 | 3 | 103 | 1 | 2 | ‘Late Oct 2016’ | ‘2016-10-16’ | ‘2016-10-30’ |
exhvenobjxrefs
(objects exibited at particular venue as part of an exhibition)exhvenuexrefid | objectid | catalognumber | begindispldateiso | enddispldateiso | displayed |
---|---|---|---|---|---|
202 | 1001 | ‘cat 1001’ | ‘2016-10-01’ | ‘2016-10-15’ | 1 |
203 | 1001 | ‘cat 1001’ | ‘2016-10-16’ | ‘2016-10-30’ | 1 |
202 | 1002 | ‘cat 1002’ | ‘2016-10-01’ | ‘2016-10-15’ | 1 |
If we apply the generated R2RML script on the relational data, we get this output: test/exhibitions/exhibitions-out.ttl
.
As you can see, it has the same structure as the example test/exhibitions/exhibitions.ttl
but more nodes:
riot, update
. Tested with version 3.1.0 of 2016-05-10.cat, grep, rm
This tool consists of two scripts.
rdf2rml.ru
is a SPARQL UPDATE script that transforms the model to R2RML.
It uses rr:graph
for all output, so it can be separated from the input.
It makes deterministic R2RML URLs by taking the source URLs and using this convention:
<{s}!map> for rr:TriplesMap
<{s}!subj> for rr:SubjectMap
<{s}!{p}!{o}> for rr:PredicateObjectMap
Here {s}
is the full subject URL, but {p} {o}
are stripped to their local names.
The SPARQL UPDATE has the following steps:
rr:logicalTable
for each subject node, using rr:tableName
or rr:sqlQuery
.rr:constant
rr:template
.rr:class
for each subject that has one.rr:predicateObjectMap
for each pair subject-objectrr:object
rr:objectMap
and rr:template
.rr:termType, rr:datatype, rr:language
are emitted for each object.clear default
, to preserve only the output graph rr:graph
rdf2rml.sh
is a simple shell script that
takes file prefixes.ttl
in the current folder,
prepends it to the model file,
converts it to SPARQL form, and prepends it to the SPARQL script.
Then it runs the SPARQL script,
converts the output to turtle,
removes prefixes,
and sorts the output by subject.
See test/*/Makefile
for examples how to set up make.
rdf2rml.sh
hard-codes the TMP directory and the location of Apache Jena
It also unconditionally looks for a file prefixes.ttl
in the current folder
that it prepends to the model file and the SPARQL script.
Use parentheses to designate column names in literals and URLs, instead of curly brackets as the R2RML standard (rr:template
) uses.
This makes the URLs valid, so the RDF can be validated and diagrammed.
Parentheses in literals are shown as brackets on generated diagrams, due to a limitation of rdfpuml.
Resource classes, literal datatypes and languages are limited to constant values: you cannot specify they should come from a SQL column.
You may get a Jena RIOT warning for literals with numeric or date/time datatype, because a parenthesized string is not a valid lexical form for such literals. But the conversion will work fine despite of the warning.