Multisensor Validation Log
Table of Contents
- Intro
- Validation
- Context Extraction Service
- Entity Linking Service
- Entity Alignment Service
- Summarization Service
- Content Alignment
- DONE Content Translation
Github Source, HTML Rendered version
Intro
From now until the end of the project I'll keep a detailed log of what I validated and defects I found. I'll use Org-mode mechanisms to track the defects (tags TODO, DONE, CANCELED) and try to monitor email conversation to keep this up to date. The numbers in brackets after a section name show the resolved vs total defects. Whenever a defect is posted in Jira, I'll track the issue number.
Please contact me on skype://valexiev1 for corrections/additions. Even better, you could edit this file on github (it's a plain text file!) and send me a pull request.
- Mistakes or imprecise info (eg the scope of a particular service is described wrong, or a particular issue is in fact a non-issue)
- When an issue is resolved
- When an issue should be canceled
I'll also attend weekly calls related to RDF data and validation, and update this.
IMPORTANT: Just because an issue is listed in the section for a particular service doesn't necessarily mean that that service created the defect!
Queries
Below I also track questions related to queries. But once agreed, they should be moved to the gdoc Multisensor SPARQL Queries.
Or we could simply use the doc: if you need a new query, or a query needs optimization, please write a comment in red, and notify me in a gdoc comment (add +vladimir@sirma.bg to the comment).
Validation
Context Extraction Service
TODO Crawler to decode HTML entities
It would be good if the crawler decoded HTML entities before storing values in dc:subject (possibly also dc:title, dc:description). Eg:
dc:subject "Europäische Union, ISIN_FR0003500008, Brexit, Erholungskurs, Europa, Paris, Großbritannien" ,
  "Economy, Business & Finance" ;
DONE Keywords vs Category
Victor: both of the following fields extracted by the crawler are mapped to dc:subject:
- category: eg "Economy, Business & Finance"
- keywords: eg "Europaische Union, ISIN_FR0003500008, Brexit, Erholungskurs, Europa, Paris, Grossbritannien"
Should we separate them in different properties?
Vladimir:
IPTC is the Global Standards Body of the News Media industry. IPTC Media Topics is a list of 1100 Media Topics developed as an extension of the earlier IPTC Subject Codes. You can explore them interactively and will see that Multisensor categories are similar to these topics/subjects, so we'll continue to map them to dc:subject.
On the other hand, Multisensor keywords are free keywords that describe much more specific things. We'll map them to schema:keywords, defined as "Keywords or tags used to describe the content. Multiple entries in a keywords list are typically delimited by commas":
dc:subject "Economy, Business & Finance";
schema:keywords "Europaische Union, ISIN_FR0003500008, Brexit, Erholungskurs, Europa, Paris, Grossbritannien";
It would be nice to (desired output sketched below):
- Split keywords on ", " and emit them as separate values
- (don't split categories, since the 3 words really represent one category)
- Map our categories to IPTC Media Topics. This is considerably harder
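For illustration, after splitting, the example above would come out like this (same values, just separated; the category stays intact):
dc:subject "Economy, Business & Finance";
schema:keywords "Europaische Union", "ISIN_FR0003500008", "Brexit",
  "Erholungskurs", "Europa", "Paris", "Grossbritannien";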
DONE Ingest Timestamp
Almost done, only the datatype is missing.
Victor: introduce into the RDF the date when the article was processed, in order to keep track of which "curated selected data" were processed and when, and to match them with the current version of the CEP service.
Vladimir: we can use dct:issued
for this purpose:
ms-content:b3f35
  dc:date "2016-06-20T18:45:07.000+02:00"^^xsd:dateTime ;    # date crawled
  dct:issued "2016-06-30T12:34:56.000+02:00"^^xsd:dateTime . # date processed by pipeline and ingested to GDB
Boyan will add this second timestamp in the POST method.
DONE SIMMO Quality
Status:
- DONE: initial implementation
- DONE: make dqv:value nominal (eg ms:accuracy-low) instead of numeric (eg 1)
- DONE: use QualityAnnotation instead of QualityMeasurement
Victor: the field "c_quality" is sent now. Values can be:
- 0 = no quality assigned
- 1 = high quality
- 2 = medium quality
- 3 = low quality
- 5 = curated
Vladimir:
- Instead of a numeric scale (which is not monotonically increasing), let's use a nominal (symbolic) scale.
- Instead of 0, we should omit the statement
- There is no value 4
- About value 5: do we have metadata about who curated it and when? Should we record in RDF something more than the number? The selected quality ontology (DQV, see below) allows recording a lot of detail: who, when, according to what methodology…
- Leszek (skype:letschke): we do not provide any additional metadata.
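For ingestion, the numeric codes would map to the nominal values like this (a hypothetical VALUES fragment for an ingestion query; the value names are defined in the ontology below):
values (?code ?quality) {
  (1 ms:accuracy-high)
  (2 ms:accuracy-medium)
  (3 ms:accuracy-low)
  (5 ms:accuracy-curated)
}
# 0 = no quality: omit the statement; 4 is unused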
Vladimir: I searched for a quality property on LOV, couldn't find anything really appropriate:
- http://www.w3.org/ns/dcat#dataQuality: this is about datasets, but is deprecated: "This should not be used to describe the data collection characteristics, other more specialized statistical properties can be used instead". But I don't see such statistical properties
- http://def.seegrid.csiro.au/isotc211/iso19115/2003/metadata#dataQualityInfo: this is about ISO 19115 "Geographic information — metadata". http://def.seegrid.csiro.au/isotc211/iso19115/2003/dataquality is a whole separate module on Quality
- http://purl.oclc.org/NET/ssnx/ssn#qualityOfObservation: this is about Semantic Sensor Networks. It makes reference to resultQuality in ISO 19156 "Geographic information — Observations and measurements"
Finally, from a link at Europeana DQC, I found the W3C Data Quality Vocabulary dqv:. We'll use that vocabulary, and the Linked Data Quality Dimensions ldqd: by Zaveri.
CANCEL Represent as QualityMeasurement
Initially I tried this representation. But after discussion at mailto:public-dwbp-comments@w3.org, it was clarified that QualityMeasurement can only be used with literal values, so this is WRONG. See next section for the correct representation.
First we add a dqv:Metric to the Multisensor ontology:
@prefix dqv: <http://www.w3.org/ns/dqv#> .
@prefix ldqd: <http://www.w3.org/2016/05/ldqd#> .
ms:accuracy a dqv:Metric;
  skos:prefLabel "Accuracy"@en;
  skos:definition "Degree to which SIMMO data correctly represents real world facts."@en;
  dqv:inDimension ldqd:semanticAccuracy;
  dqv:expectedDataType ms:Accuracy.
ms:Accuracy a owl:Class, skos:ConceptScheme;
  rdfs:label "Accuracy values"@en.
ms:accuracy-low a ms:Accuracy, skos:Concept;
  skos:inScheme ms:Accuracy;
  skos:prefLabel "Low accuracy"@en.
ms:accuracy-medium a ms:Accuracy, skos:Concept;
  skos:inScheme ms:Accuracy;
  skos:prefLabel "Medium accuracy"@en.
ms:accuracy-high a ms:Accuracy, skos:Concept;
  skos:inScheme ms:Accuracy;
  skos:prefLabel "High accuracy"@en.
ms:accuracy-curated a ms:Accuracy, skos:Concept;
  skos:inScheme ms:Accuracy;
  skos:prefLabel "Manually curated"@en;
  skos:note "Highest accuracy"@en.
Then for each SIMMO that has a quality rating (SIMMOs without a rating get no extra statements):
ms-content:b3f35 dqv:hasQualityMeasurement ms-content:b3f35-quality.
ms-content:b3f35-quality a dqv:QualityMeasurement;
  dqv:isMeasurementOf ms:accuracy;
  dqv:value ms:accuracy-curated.
DONE Represent as QualityAnnotation
The correct way to use nominal values is to use QualityAnnotation instead of QualityMeasurement.
First we define the nominal values in the ontology:
@prefix dqv: <http://www.w3.org/ns/dqv#> .
@prefix ldqd: <http://www.w3.org/2016/05/ldqd#> .
ms:Accuracy a skos:ConceptScheme;
  rdfs:label "Accuracy values"@en.
ms:accuracy-low a skos:Concept;
  skos:inScheme ms:Accuracy;
  skos:prefLabel "Low accuracy"@en.
ms:accuracy-medium a skos:Concept;
  skos:inScheme ms:Accuracy;
  skos:prefLabel "Medium accuracy"@en.
ms:accuracy-high a skos:Concept;
  skos:inScheme ms:Accuracy;
  skos:prefLabel "High accuracy"@en.
ms:accuracy-curated a skos:Concept;
  skos:inScheme ms:Accuracy;
  skos:prefLabel "Manually curated"@en;
  skos:note "Highest accuracy"@en.
Then for every SIMMO with a quality rating:
ms-content:b3f35 dqv:hasQualityAnnotation ms-content:b3f35-quality.
ms-content:b3f35-quality a dqv:QualityAnnotation;
  dqv:inDimension ldqd:semanticAccuracy;
  oa:motivatedBy dqv:qualityAssessment;
  oa:hasTarget ms-content:b3f35;
  oa:hasBody ms:accuracy-curated.
Quality Queries
Querying for SIMMOs with quality is easy:
select * {
  ?simmo a foaf:Document;
    dqv:hasQualityAnnotation/oa:hasBody ?quality
}
An update query to migrate old QualityMeasurement data to the new QualityAnnotation representation:
delete { graph ?graph {
  ?simmo dqv:hasQualityMeasurement ?quality.
  ?quality ?p ?v }}
insert { graph ?graph {
  ?simmo dqv:hasQualityAnnotation ?quality.
  ?quality a dqv:QualityAnnotation;
    dqv:inDimension ldqd:semanticAccuracy;
    oa:motivatedBy dqv:qualityAssessment;
    oa:hasTarget ?simmo;
    oa:hasBody ?value }}
where { graph ?graph {
  ?simmo dqv:hasQualityMeasurement ?quality.
  ?quality dqv:value ?value }}
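After the migration, a quick sanity check that no old-style data remains (should return 0):
select (count(*) as ?leftover) {
  ?simmo dqv:hasQualityMeasurement ?quality
}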
CANCEL Missing Authors
select * {?x a foaf:Document}                # 112k SIMMOs
select * {?x a foaf:Document; dc:creator ?y} # 10.5k authors, only 9.4%
- Can we get more authors?
Discussion MULTISENSO-186:
- Andrey: Interest in the "genre", “author” feature if available (not always retrievable by the context extraction service)
- Ioannis: The genre and author information can only be extracted when they are available in the HTML content of the scraped page; otherwise we cannot infer it, so there is not much we can do. It was quite obvious at the planning stage that not all articles carry the mentioned fields in HTML tags. It could perhaps be improved with additional parsing methods, since the HTML tags don't always have this information while news articles normally do mention eg the author in the body. So this part can be closed, as it is not going to happen.
CANCEL Genre (Type)
Vladimir & Ioannis: by Genre, do you mean dc:type? We assume so below.
select * {?x a foaf:Document; dc:type ?y} # 20k, that's 17.9%
- Q: Can we get more genres?
- A: Same comments as in the previous section apply.
Distribution of Genre:
select ?y (count(*) as ?c) {?x a foaf:Document; dc:type ?y}
group by ?y order by desc(?c)
Genre/Type | Count | Notes |
---|---|---|
article | 14768 | |
music | 2886 | |
website | 1087 | |
speech | 813 | |
sound | 407 | |
food | 83 | ?? Maybe "recipe" |
video | 25 | |
Article | 25 | normalize to "article" |
single | 11 | |
song | 11 | |
Speech | 11 | |
Ogg | 11 | normalize to "sound" |
video.other | 7 | normalize to "video" |
news | 6 | |
ARTICLE | 4 | normalize to "article" |
media | 2 | |
blog | 1 | |
slideshow | 1 | |
video.movie | 1 | normalize to "video" |
tumblr-feed:entry | 1 | |
- Vladimir: it would be nice to normalize some values and reduce this from 20 rows to, say, 10
- Ioannis: the code to extract "type" was written over a year ago, so this would not be so simple
- Vladimir and Ioannis: the first 5 types catch 90%, so only the "long tail" would need normalization… This is not so important (but see the sketch below)
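If normalization is ever done in the repo, a minimal sketch for the case-variants alone (the other mappings, eg Ogg to "sound", would each need an explicit rule):
delete {?x dc:type ?t}
insert {?x dc:type ?t1}
where {
  ?x a foaf:Document; dc:type ?t
  filter(?t in ("Article", "ARTICLE", "Speech"))
  bind(lcase(?t) as ?t1)
}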
DONE Wrong prefix for Text Characteristics
Advanced Context Extraction adds new Text Characteristics properties (ms:technicality, ms:fluency, ms:richness) to the context (property definitions omitted for brevity). Example (this particular text is fluent but neither rich nor technical, so we set values 5, 1, 1 respectively):
@base <http://data.multisensorproject.eu/content/9e9c304>.
<#char=0,2000> a nif:Context;
  nif:isString "This is the whole text of the SIMMO.\n It should continue for 2000 chars but I'll stop here"@en;
  ms:fluency 5.0;
  ms:richness 1.0;
  ms:technicality 1.0.
0e6b24-CONTEXT_EXTRACTION-22-6-2016.ttl and Text Characteristics (technicality, fluency, richness): checked.
- currently they use http://data.multisensor.eu/ontology#
- but the correct prefix is http://data.multisensorproject.eu/ontology#
- Victor: updated the prefix
DONE Refresh Prefixes
I've added http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-annotation# to ./img/prefixes.ttl (committed). Please refresh from ./img/prefixes.ttl, so new validation files use this prefix.
Gerard: what exactly are we supposed to do with the prefixes file? Should we load it into Sesame somehow so that triples are generated with prefixes? If so, could you give us some code showing how to do it?
- Once loaded in the repo (Boyan's job), we can make queries without mentioning the prefixes.
- I think the validation files use prefixes because prefixes.ttl is prepended and then passed through RIOT. I think Victor made that script. The other prefixes are already there, so it's just a matter of refreshing.
Entity Linking Service
TODO Underscores to Spaces
The EL service emits Babelnet entity labels in up to 4 languages, eg
bn:s00088614v skos:prefLabel "zu_befriedigen"@de , "satisfacer"@es , "satisfaire"@fr , "задоволи"@bg .
bn:s00014609n skos:prefLabel "Kuchen"@de , "Pastel_(gastronomía)"@es , "Gâteau"@fr , "Торта"@bg .
bn:s01718102n skos:prefLabel "I_do_not_want_what_I_haven't_got"@es , "I_Don't_Want_What_I_Haven't_Got"@en , "I_Do_Not_Want_What_I_Haven't_Got"@fr .
bn:s02229586n skos:prefLabel "UHC_Hamburg"@en , "Uhlenhorster_HC"@fr .
For reasons unknown, Babelnet uses underscores (eg see UHC_Hamburg_n_EN). I think we should convert the underscores to spaces to make the label more natural.
- Can be fixed in the repo with a query like the one below
- However, that doesn't find the Babelnet broader entities imported by ONTO. We have a list of all Babelnet entities; maybe it's better to use that somehow
delete {?x skos:prefLabel ?lab}
insert {?x skos:prefLabel ?lab1}
where {
  ?x skos:prefLabel ?lab
  filter exists {[its:taIdentRef ?x; nif-ann:taIdentProv <http://babelfy.org/>]}
  filter(regex(?lab,"_"))
  bind(replace(?lab,"_"," ") as ?lab1)
}
- UPF code that brings in new Babelnet enrichments should be fixed too. Gerard: DONE. Entity Linking: translations obtained from BabelNet are now emitted as skos:prefLabel literals without underscores.
- UPF code that creates nif:lemma should be fixed too, eg this node has lemma "basic_data": http://multisensor.ontotext.com/resource/ms-content/00a17bdb91543c45349f42378caeecd434c1a8f4#char=281,291. Gerard: DONE. Dependency parsing: lemmas are also emitted as nif:lemma literals without underscores.
- IMPORTANT: now we have two labels per BN concept per language, and must remove the superfluous ones. Eg:
  - bn:s03113558n: "Royal Ordnance Factories F.C."@en and "Royal_Ordnance_Factories_F.C."@en
  - bn:s00124949n: "Prefijo del Segmento de Programa"@es and "Prefijo_del_Segmento_de_Programa"@es
QUE Remove Disambiguation Labels?
Should we also remove disambiguations, which are trailing parenthesized parts, eg "Pastel_(gastronomía)" -> "Pastel"? Since these labels are not used for NLP tasks, and the disambiguations are very useful for understanding what the entity is, I vote to leave them.
Entity Alignment Service
0181e1-ENTITY_ALIGNMENT-21-6-2016.ttl and alignment.log (by email): checked.
The log has 90 actions. I checked these 4 actions:
2016-06-21 16:22:18 INFO Alignment:42 - Comparing <#char=1453,1461> and <#char=1444,1461>
2016-06-21 16:22:18 INFO Alignment:138 - Removed: (#char=1453,1461, rdf:type, nif:Phrase)
2016-06-21 16:22:18 INFO Alignment:152 - Removed: (#char=1453,1461, itsrdf:taClassRef, null)
2016-06-21 16:22:18 INFO Alignment:156 - Removed: (#char=1453,1461, itsrdf:taIdentRef, null)
This corresponds to two annotations:
- <#char=1444,1461> found by Named Entity Recognition: "Margaret Thatcher" detected as the politician, with link to DBpedia (longer; correct)
- <#char=1453,1461> found by Entity Linking: "Thatcher" detected as a "roof builder", with link to Babelnet (shorter; incorrect)
The Entity Alignment service prefers the longer annotation, and removes 3 properties from the shorter annotation. What is left in the RDF is this:
<#char=1453,1461> a nif:Word ;
  <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-annotation#taIdentConf> "0.0"^^xsd:double ;
  <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-annotation#taIdentProv> <http://babelfy.org/> ;
  nif:anchorOf "Thatcher" ;
  nif:beginIndex "1453"^^xsd:nonNegativeInteger ;
  nif:endIndex "1461"^^xsd:nonNegativeInteger ;
  nif:referenceContext <#char=0,2898> .
<#char=1444,1461> a nif:Phrase ;
  nif:anchorOf "Margaret Thatcher" ;
  nif:beginIndex "1444"^^xsd:nonNegativeInteger ;
  nif:endIndex "1461"^^xsd:nonNegativeInteger ;
  nif:referenceContext <#char=0,2898> ;
  its:taClassRef nerd:Person ;
  its:taIdentRef dbr:Margaret_Thatcher .
dbr:Margaret_Thatcher a foaf:Person , dbo:Person , nerd:Person ;
  foaf:name "Margaret Thatcher" .
DONE Also remove taIdentConf, taIdentProv
In the example above, taClassRef
and taIdentRef
were removed.
This makes the other two props nif-ann:taIdentConf
and nif-ann:taIdentProv
useless.
Remove them too.
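A minimal repo-side cleanup sketch, assuming only alignment-stripped annotations lack its:taIdentRef:
delete {?ann nif-ann:taIdentConf ?conf. ?ann nif-ann:taIdentProv ?prov}
where {
  ?ann nif-ann:taIdentConf ?conf; nif-ann:taIdentProv ?prov
  filter not exists {?ann its:taIdentRef ?ent}
}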
DONE Leave Dependency Links
Entity Alignment also seems to remove the dependency links, eg:
<#char=1444,1452> nif:dependency <#char=1453,1461> .
<#char=1444,1452> upf-deep:deepDependency <#char=1453,1461> .
However, this can make the dependency and FrameNet graphs disconnected. So leave the dependencies alone.
TODO Use Prefixes in alignment.log
I shortened the excerpt from alignment.log above to improve readability: I substituted the defined prefixes and used the SIMMO URL as base (ie relative URLs starting with a hash). It would be very useful if alignment.log used the same shortenings.
This is a completely cosmetic issue; we can cancel it.
Summarization Service
2c9d5c-CONCEPT_EXTRACTION-16-6-2016.ttl (concept_with_scores): looked at.
<#char=0,11> a nif:Phrase ;
  <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-annotation#taIdentConf> "0.0"^^xsd:double ;
  <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-annotation#taIdentProv> <http://babelfy.org/> ;
  nif:beginIndex "0"^^xsd:nonNegativeInteger ;
  nif:dependency <#char=29,38> ;
  nif:endIndex "11"^^xsd:nonNegativeInteger ;
  nif:lemma "open_source" ;
  nif:literalAnnotation "surf=spos=NN" ,
    "rel==dpos=NN|end_string=11|start_string=0|id0=1|number=SG|word=open_source|connect_check=OK|vn=open_source" ,
    "deep=spos=NN" ;
  nif:oliaLink upf-dep-syn:NAME , upf-deep:NAME , <#char=0,11_fe> , penn:NNP ;
  nif:referenceContext <#char=0,5625> ;
  upf-deep:deepDependency <#char=29,38> ;
  its:taClassRef ms:GenericConcept ;
  its:taIdentRef bn:s01157392n .
DONE nif:anchorOf
I've been saying all along to skip nif:anchorOf, so as not to create too many literals. But with the number of SIMMOs loaded, it has not been too taxing for GDB, and nif:anchorOf has been instrumental in debugging, eg of the UTF-8 and offset mismatch issues. nif:literalAnnotation and nif:lemma provide sufficient info about the phrase, so maybe we don't need nif:anchorOf. We could cancel this issue.
- Gerard: If they can be sustained by GraphDB, I vote in favor of keeping them as they help a lot when debugging.
- Vladimir: so decided: if the Entity Lookup makes a new node, add nif:anchorOf to it. Some nodes are missing nif:anchorOf, eg http://multisensor.ontotext.com/resource/ms-content/00a17bdb91543c45349f42378caeecd434c1a8f4#char=281,291 has nif:lemma "basic_data" but no nif:anchorOf (a query to find such nodes follows).
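A sketch to find annotations that have a lemma but no anchor:
select * {
  ?node nif:lemma ?lemma
  filter not exists {?node nif:anchorOf ?anchor}
}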
Gerard: NIFUtils: new annotations created by services using this library will now emit anchors. This affects EL mostly.
TODO Why is nif-ann:taIdentConf 0?
In the above example, nif-ann:taIdentConf is 0. In many other examples it's a good number, eg see below. Is 0 some sort of bug, or does Babelfy actually return 0 confidence for some concepts?
- Gerard: I think it's an error, I'll get back to you as soon as I've figured out what's causing it.
- Vladimir: still occurs, eg:
<3c361-de#char=8190,8198> a nif:Phrase , nif:Word ;
  nif-ann:taIdentConf "0.0"^^xsd:double ;
  nif-ann:taIdentProv <http://babelfy.org/> ;
  nif:anchorOf "Zugleich" ;
For comparison, an example with a good confidence value, from bf6fe4-CONCEPT_EXTRACTION-16-6-2016.ttl:
@base <http://data.multisensorproject.eu/content/bf6fe48b8d88c1d11d5086863f4c3ad26286bda9>.
<#char=1814,1822> a nif:Word ;
  nif-anno:taIdentConf "0.7619547411890493"^^xsd:double ;
  nif-anno:taIdentProv <http://babelfy.org/> ;
  nif:anchorOf "pastries" ;
  nif:beginIndex "1814"^^xsd:nonNegativeInteger ;
  nif:dependency <#char=1806,1812> ;
  nif:endIndex "1822"^^xsd:nonNegativeInteger ;
  nif:lemma "pastry" ;
  nif:literalAnnotation "deep=spos=NN" ,
    "rel==member=A2|dpos=NN|end_string=1822|start_string=1814|id0=29|word=pastry|number=PL|connect_check=OK|fn=Food" ,
    "surf=spos=NN" ;
  nif:oliaLink upf-deep:COORD , penn:NNS , <#char=1814,1822_fe> , upf-dep-syn:COORD ;
  nif:referenceContext <#char=0,12793> ;
  upf-deep:deepDependency <#char=1806,1812> ;
  its:taClassRef ms:GenericConcept ;
  its:taIdentRef bn:s00060957n .
CANCEL ms:GenericConcept vs ms:SpecificConcept
- Gerard (about the last example): a 'generic' concept produced by Babelfy. But annotations of concepts produced by the concept extraction service should contain triples pointing to ms:SpecificConcept.
- Vladimir: it also seems to me that concepts like "open source" and "pastry" are ms:SpecificConcept.
- Gerard: problems regarding the quality of the annotations shouldn't be part of the RDF validation.
- Vladimir: agreed, but this log is for the project as a whole, not just syntactic validity. (Which doesn't mean I'm determining priorities!)
- Gerard: we'll be releasing updates to the concept extraction service, so expect improvements in what is considered a specific concept.
- How is this used in the UI? Gerard thinks that only Specific concepts are (or should be) shown in the SIMMO's tag cloud
Gerard: this should become a non-issue after recent changes are incorporated into the concept service.
DONE Optimize Summarization Queries
Gerard wrote that some of the Summarization queries are slow. Please mark which ones need optimization, and provide $graph values for testing.
- Used the standard notation $param to indicate an input parameter, rather than __PARAM__
- Moved FILTER inside GRAPH, and a few more minor changes
- The problem was that the prop path p1?/p2 is slow, since p1? connects any node to itself. Replaced it with p1/p2|p2, which is fast (illustrated below)
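To illustrate the rewrite with hypothetical properties :p1 and :p2:
prefix : <http://example.org/>  # hypothetical properties, for illustration only
# Slow: :p1? also matches the empty path, so it connects every node to itself
# and the engine considers every node as a potential starting point.
select * { ?x :p1?/:p2 ?y }
# Fast equivalent: enumerate the two cases explicitly.
select * { ?x (:p1/:p2)|:p2 ?y }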
Content Alignment
The Content Alignment Pipeline (CAP) is a service that executes on KB data and finds articles that are similar or contradictory to the source article. It is not executed as part of the SIMMO pipeline, but periodically.
Selected repo multisensor-test and checked:
- Everything's done except "ms:score instead of fise:confidence" (Babis) and "add to Ontology" (Vladimir).
- Checked that there are motivations of both kinds:
select ?mot (count(*) as ?c)
where { graph <http://data.multisensorproject.eu/CAP> {?x oa:motivatedBy ?mot} }
group by ?mot
ms:linking-similar: 828, ms:linking-contradictory: 860
- one contradictory CAP annotation is CAP/007e1c1e-85b4-481a-a838-0e242c2afb8c. It talks about these two:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
select ?desc ?text {
  values ?x {
    <http://data.multisensorproject.eu/content/7d76ef5e787e830b081d149d05359c21cc5a9835>
    <http://data.multisensorproject.eu/content/d2b116e9c5422fda256da2913738ac000ba7b30c> }
  ?x dc:description ?desc.
  ?y nif:sourceUrl ?x; nif:isString ?text
}
- One is about "How an Apple Watch can ruin your life"
- The other about "Employees and executives win mobility and flexibility with the SH10 TaskBook of SOREDI Touch Systems GmbH"
Guess this is sort of contradictory: one hates one IT product, the other one praises another IT product :-)
CAP Old Model
CAP:_Specification_of_the_service: checked. It proposes the following model:
<http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304#similarArticles> a oa:Annotation ;
  oa:hasTarget <http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304> ;
  oa:hasBody <http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304#similarArticle-1> ,
    <http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304#similarArticle-2> ;
  oa:motivatedBy oa:tagging ;
  oa:annotatedBy <http://data.multisensorproject.eu/agent/CAPAgent> ;
  oa:annotatedAt "2016-01-11T12:00:00"^^xsd:dateTime .
<http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304#similarArticle-1> a oa:SemanticTag ;
  skos:related <http://data.multisensorproject.eu/content/ca34bb35770bfa55434a0689d64e1e6a60611047> ;
  fise:confidence 0.862 .
<http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304#similarArticle-2> a oa:SemanticTag ;
  skos:related <http://data.multisensorproject.eu/content/57e07befbda355c2eca2ee521926071ee9f5c719> ;
  fise:confidence 0.795 .
<http://data.multisensorproject.eu/agent/CAPAgent> a prov:SoftwareAgent ;
  foaf:name "Content Alignment Pipeline v1.0" .
Each annotation is symmetric, so it's written twice: in the SIMMO graphs of each of the two SIMMOs. This complicates data management, because both of these graphs need to be updated.
DONE One Annotation Per Pair Babis
After consultation with Babis, we decided to change the representation as follows:
- Write annotations in their own graph http://data.multisensorproject.eu/CAP, outside of any SIMMO graph. The CAP service will be called periodically, search globally in the SIMMO DB, and overwrite the similarity graph.
- Write one annotation per pair
- Use a custom oa:motivatedBy (ms:linking-similar vs ms:linking-contradictory) to express similarity vs contradiction
In the previous example, assume that the first related article is similar but the second is contradictory. We restructure it as follows, where similarity/123 and similarity/124 are GUIDs or some other way to generate unique URLs. Please note that the representation is completely symmetric regarding the two SIMMOs being linked, so there's no need to repeat it for the other SIMMO.
graph <http://data.multisensorproject.eu/CAP> {
  <http://data.multisensorproject.eu/CAP/123> a oa:Annotation;
    oa:hasBody <http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304>,
      <http://data.multisensorproject.eu/content/ca34bb35770bfa55434a0689d64e1e6a60611047>;
    fise:confidence 0.862;
    oa:motivatedBy ms:linking-similar;
    oa:annotatedBy <http://data.multisensorproject.eu/agent/CAP>;
    oa:annotatedAt "2016-01-11T12:00:00"^^xsd:dateTime .
  <http://data.multisensorproject.eu/CAP/124> a oa:Annotation;
    oa:hasBody <http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304>,
      <http://data.multisensorproject.eu/content/57e07befbda355c2eca2ee521926071ee9f5c719>;
    fise:confidence 0.795;
    oa:motivatedBy ms:linking-contradictory;
    oa:annotatedBy <http://data.multisensorproject.eu/agent/CAP>;
    oa:annotatedAt "2016-01-12T12:00:00"^^xsd:dateTime .
}
TODO Use ms:score not fise:confidence for CAP
In an example sent by Babis, I see fise:confidence=1.6439653807554948. But confidence is the probability that something is true, so it should be <=1. I guess this is some other sort of score, so maybe it's better to use our own property?
Decided with Babis to use a custom property ms:score (see next).
TODO Add to Ontology Vladimir
The following will be in ./img/ontology.ttl, so they don't need to be repeated by CAP for every similarity link.
<http://data.multisensorproject.eu/agent/CAP> a prov:SoftwareAgent;
  foaf:name "Content Alignment Pipeline v1.0".
ms:linking-similar a owl:NamedIndividual, oa:Motivation;
  skos:inScheme oa:motivationScheme;
  skos:broader oa:linking;
  skos:prefLabel "linking-similar"@en;
  rdfs:comment "Motivation that represents a symmetric link between two *similar* articles"@en;
  rdfs:isDefinedBy ms: .
ms:linking-contradictory a owl:NamedIndividual, oa:Motivation;
  skos:inScheme oa:motivationScheme;
  skos:broader oa:linking;
  skos:prefLabel "linking-contradictory"@en;
  rdfs:comment "Motivation that represents a symmetric link between two *contradictory* articles"@en;
  rdfs:isDefinedBy ms: .
ms:score a owl:DatatypeProperty;
  rdfs:domain oa:Annotation;
  rdfs:range xsd:decimal;
  rdfs:label "score"@en;
  rdfs:comment "Strength of an Annotation, eg the link between two entities"@en;
  rdfs:isDefinedBy ms: .
CAP Query
Given a $simmo, find similar or contradictory articles, and their similarity/contradiction scores.
select ?article ?motivation ?score {
  [a oa:Annotation;
   oa:annotatedBy <http://data.multisensorproject.eu/agent/CAP>;
   oa:hasBody $simmo, ?article;
   ms:score ?score;
   oa:motivatedBy ?motivation]
  filter ($simmo != ?article)
}
CANCEL Other CAP Queries
The gdoc maybe has 2 queries related to CAP. Not sure I'm looking at the right section. Maybe we should just delete them.
- 2.8 "Retrieve the concepts in the SIMMO (Select)": wrote something simple
- 2.9 "Retrieve the concepts in the SIMMO (Construct)": don't know what is needed
DONE Content Translation
Scenario: we have a SIMMO in original language ES that is machine-translated to EN & DE.
- All textual elements are translated: title, description, body.
- The example below also shows subject & keywords in different languages. If you can only produce them in EN, that's fine
- However, video ASR text is not translated.
- Both original and translations are annotated with NIF.
We want to record all NIF information against original and translated separately, so there's no confusion. If the article includes multimedia, we want to attach it only to the original, to avoid data duplication.
Solution: we need separate roots (foaf:Document), so we store the original and translation(s) in separate named graphs.
- the translated-content graph has a language-specific suffix; the original-content graph has no such suffix
- the translated content has a bibo:translationOf link to the original
Root:
# ES original graph
ms-content:156e0d {
  ms-content:156e0d a foaf:Document ;
    dbp:countryCode "ES" ;
    dc:creator "Alberto Iglesias Fraga" ;
    dc:date "2016-07-28T23:45:07.000+02:00"^^xsd:dateTime ;
    dc:description "SONY ha iniciado negociaciones con Murata Manufacturing para la venta de su negocio de baterías. La operación culminará en marzo de 2017..." ;
    dc:language "es" ;
    dc:source "cloud.ticbeat.com" ;
    dc:subject "Economía, Negocios y Finanzas" ;
    dc:title "SONY se desprenderá de su negocio de baterías" ;
    dc:type "article" ;
    schema:keywords "Sony, baterías, Murata Manufacturing" ;
    dct:source <http://feedproxy.google.com/~r/rwwes/~3/z2KuGYx6FiY/> .
}
# EN translation graph
ms-content:156e0d-en {
  ms-content:156e0d-en a foaf:Document ;
    bibo:translationOf ms-content:156e0d ; # IMPORTANT!
    dbp:countryCode "ES" ;
    dc:creator "Alberto Iglesias Fraga" ;
    dc:date "2016-07-28T23:45:07.000+02:00"^^xsd:dateTime ;
    dc:description "SONY has begun negotiations with Murata Manufacturing for the sale of its battery business. The operation will culminate in March 2017 ..." ;
    dc:language "en" ;
    dc:source "cloud.ticbeat.com" ;
    dc:subject "Economy, Business & Finance" ;
    dc:title "SONY is clear from its battery business" ;
    dc:type "article" ;
    schema:keywords "Sony, batteries, Murata Manufacturing" ;
    dct:source <http://feedproxy.google.com/~r/rwwes/~3/z2KuGYx6FiY/> .
}
# DE translation graph
ms-content:156e0d-de {
  ms-content:156e0d-de a foaf:Document ;
    bibo:translationOf ms-content:156e0d ; # IMPORTANT!
    dbp:countryCode "ES" ;
    dc:creator "Alberto Iglesias Fraga" ;
    dc:date "2016-07-28T23:45:07.000+02:00"^^xsd:dateTime ;
    dc:description "SONY hat die Verhandlungen mit Murata Manufacturing für den Verkauf seiner Batterie-Geschäft begonnen. Die Operation wird März 2017 gipfeln ..." ;
    dc:language "de" ;
    dc:source "cloud.ticbeat.com" ;
    dc:subject "Economy, Business & Finanzen" ;
    dc:title "SONY ist klar von seiner Batteriegeschäft" ;
    dc:type "article" ;
    schema:keywords "Sony, Batterien, Murata Manufacturing" ;
    dct:source <http://feedproxy.google.com/~r/rwwes/~3/z2KuGYx6FiY/> .
}
Context:
# ES original graph
ms-content:156e0d {
  <http://data.multisensorproject.eu/content/156e0d#char=0,2131> a nif:Context ;
    ms:fluency "1.22"^^xsd:double ;
    ms:richness "1.86"^^xsd:double ;
    ms:technicality "2.78"^^xsd:double ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "2131"^^xsd:nonNegativeInteger ;
    nif:isString "SONY se desprenderá de su negocio de baterías\n\nSONY sigue inmersa en la profunda reestructuración..." ;
    nif:sourceUrl ms-content:b156e0d .
}
# EN translation graph
ms-content:156e0d-en {
  <http://data.multisensorproject.eu/content/156e0d-en#char=0,1800> a nif:Context ;
    ms:fluency "1.25"^^xsd:double ; # hopefully similar to the original, but won't be identical
    ms:richness "1.81"^^xsd:double ;
    ms:technicality "2.70"^^xsd:double ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "1800"^^xsd:nonNegativeInteger ; # assuming EN comes out shorter than ES
    nif:isString "SONY is clear from its battery business\n\nSONY still immersed in deep restructuring ..." ;
    nif:sourceUrl ms-content:b156e0d-en .
}
# DE translation graph
ms-content:156e0d-de {
  <http://data.multisensorproject.eu/content/156e0d-de#char=0,2200> a nif:Context ;
    ms:fluency "1.12"^^xsd:double ; # hopefully similar to the original, but won't be identical
    ms:richness "1.56"^^xsd:double ;
    ms:technicality "2.41"^^xsd:double ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "2200"^^xsd:nonNegativeInteger ; # assuming DE comes out longer than ES
    nif:isString "SONY ist von seiner Batterie-Geschäft\n\nSONY klar immer noch in einer tiefen Umstrukturierung getaucht ..." ;
    nif:sourceUrl ms-content:b156e0d-de .
}
Some NIF annotations:
# ES original graph
ms-content:156e0d {
  <http://data.multisensorproject.eu/content/156e0d#char=1199,1224> a nif:Phrase ;
    nif:anchorOf "batería de iones de litio" ;
    nif:beginIndex "1199"^^xsd:nonNegativeInteger ;
    nif:endIndex "1224"^^xsd:nonNegativeInteger ;
    nif:referenceContext <http://data.multisensorproject.eu/content/b156e0d#char=0,2131> ;
    nif-ann:taIdentConf "1.0"^^xsd:double ;
    nif-ann:taIdentProv <http://babelfy.org/> ;
    its:taClassRef ms:GenericConcept ;
    its:taIdentRef bn:s01289274n .
}
# The BN labels are submitted in the SIMMO graph but stored in the default graph, thus the same for all languages
bn:s01289274n skos:prefLabel "LiIon"@de, "Li-ion cell"@en, "Batteries lithium-ion"@fr, "Литиево-йонна батерия"@bg .
# EN translation graph
ms-content:156e0d-en {
  <http://data.multisensorproject.eu/content/156e0d-en#char=1100,1119> a nif:Phrase ;
    nif:anchorOf "lithium ion battery" ;
    nif:beginIndex "1100"^^xsd:nonNegativeInteger ;
    nif:endIndex "1119"^^xsd:nonNegativeInteger ;
    nif:referenceContext <http://data.multisensorproject.eu/content/b156e0d-en#char=0,1800> ;
    nif-ann:taIdentConf "1.0"^^xsd:double ;
    nif-ann:taIdentProv <http://babelfy.org/> ;
    its:taClassRef ms:GenericConcept ;
    its:taIdentRef bn:s01289274n .
}
# DE translation graph
ms-content:156e0d-de {
  <http://data.multisensorproject.eu/content/156e0d-de#char=1200,1218> a nif:Phrase ;
    nif:anchorOf "Lithium-Ionen-Akku" ;
    nif:beginIndex "1200"^^xsd:nonNegativeInteger ;
    nif:endIndex "1218"^^xsd:nonNegativeInteger ;
    nif:referenceContext <http://data.multisensorproject.eu/content/b156e0d-de#char=0,2200> ;
    nif-ann:taIdentConf "1.0"^^xsd:double ;
    nif-ann:taIdentProv <http://babelfy.org/> ;
    its:taClassRef ms:GenericConcept ;
    its:taIdentRef bn:s01289274n .
}
Multimedia is only present in the original-content graph:
graph ms-content:156e0d {
  ms-content:156e0d dct:hasPart
    <http://cloud.ticbeat.com/2016/07/sony-baterías-explosión.mp4>,
    <http://cloud.ticbeat.com/2016/07/sony-batería.jpg> .
  <http://cloud.ticbeat.com/2016/07/sony-baterías-explosión.mp4> a dctype:MovingImage;
    dc:format "video/mp4".
  <http://cloud.ticbeat.com/2016/07/sony-batería.jpg> a dctype:StillImage;
    dc:format "image/jpeg".
}
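Since multimedia hangs only off the original, a lookup for any SIMMO (original or translation) can follow bibo:translationOf back to the original; a sketch:
select ?media ?format {
  optional {$simmo bibo:translationOf ?orig}
  bind(coalesce(?orig, $simmo) as ?root)
  ?root dct:hasPart ?media.
  ?media dc:format ?format
}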
Files:
- old: http://grinder1.multisensorproject.eu/cepfiles/multiTest/validated/
- new: http://grinder1.multisensorproject.eu/cepfiles/multiTest/41d28d-GERMAN_SIMMMO-5-9-2016.ttl and http://grinder1.multisensorproject.eu/cepfiles/multiTest/41d28d-CONTEXT_EXTRACTION-5-9-2016.ttl
Checking:
- Added prefixes bibo:, upf-pos-spa:, upf-dep-deu:, upf-pos-deu: (before we only had upf-dep-spa:). Please update.
- It would also be nice to translate schema:keywords (currently "preisportal, vergleich, check24, verivox, billiger, idealo, geizhals, tricks, betrug, geld" in both the DE original and the EN translation)
- All the rest is ok
TODO BabelNet Multilingual Labels
The Entity Linking Service can find BN concepts in any language, because BN concepts have multilingual labels. This is unrelated to whether the SIMMO is translated to another language or not. It's up to the UI to filter BN labels and show only labels in the selected language.
However, not all BN concepts have labels in all MS languages. Therefore the UI should implement some language fall-back logic (a check-query sketch follows):
- if no label in the selected language is available, show the EN label
- TODO Boyan: check whether all BN concepts have EN labels.
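A sketch of the check (counts Babelfy-linked concepts lacking an EN label; should return 0 if the fall-back assumption holds):
select (count(distinct ?c) as ?missingEn) {
  [its:taIdentRef ?c; nif-ann:taIdentProv <http://babelfy.org/>]
  filter not exists {?c skos:prefLabel ?lab filter(lang(?lab)="en")}
}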
CANCEL Add Spanish BabelNet Labels
Gerard, as we saw above, Babelfy recognizes "batería de iones de litio", but bn:s01289274n does not have an ES label (it has EN, FR, DE, BG). This concept does have an ES representation on the BN site. Do you need to add Spanish labels?
- Gerard : The EL service is using the 2.5.1 release of BabelNet to fetch the translations, as these cannot be retrieved using Babelfy's API, so it is possible that some synset senses have translations in the latest version online (suppose it matches 3.7) which aren't found in 2.5.1.
- Vladimir: When we got complete BN data (for the concepts found by you plus their ancestors), we used BN online.
So http://babelnet.org/rdf/s01157392n should include all needed labels. (But right now it is down: "Error parsing configuration file file://c:/home/debian/storage/apache-tomcat-7.0.65/webapps/rdf/WEB-INF/config.ttl: Error making the query, see cause for details")
Translated-content Queries
Find Entities
Queries within a single language (eg find concepts) won't change; we just change the graph. Eg here is how to find entities in the original and the translated SIMMOs (hopefully the same entities will be recognized):
# ES Original
select distinct ?entity ?label {
  graph ms-content:156e0d {[] itsrdf:taIdentRef ?entity}.
  ?entity skos:prefLabel ?label
}
# EN Translation
select distinct ?entity ?label {
  graph ms-content:156e0d-en {[] itsrdf:taIdentRef ?entity}.
  ?entity skos:prefLabel ?label
}
Select Single Label
We can restrict entity labels to the selected language, implementing the "language fall-back": coalesce() picks the first bound variable. We don't need the fall-back for EN, since we assume all BN concepts have EN labels.
# ES Original
select distinct ?entity (coalesce(?labelEs,?labelEn) as ?label) {
  graph ms-content:156e0d {[] itsrdf:taIdentRef ?entity}.
  ?entity skos:prefLabel ?labelEn filter(lang(?labelEn)="en")
  optional {?entity skos:prefLabel ?labelEs filter(lang(?labelEs)="es")}
}
# EN Translation
select distinct ?entity ?label {
  graph ms-content:156e0d-en {[] itsrdf:taIdentRef ?entity}.
  ?entity skos:prefLabel ?label filter(lang(?label)="en")
}
Find Translations
To find all translations of a SIMMO, we query across graphs (ie in the default graph, which is the union of all graphs). Remember that $ indicates a query parameter, while ? indicates a free variable:
select * { ?translatedSIMMO bibo:translationOf $SIMMO; dc:language ?translatedLang}
To find all translations in a given language:
select * { ?translatedSIMMO bibo:translationOf ?SIMMO; dc:language $lang}
To find all original SIMMOs (not translations):
select * {?simmo a foaf:Document filter not exists {?simmo bibo:translationOf []}}
To check whether a given SIMMO is a translation:
ask {$simmo bibo:translationOf []}