Linked Open Data for Cultural Heritage

Vladimir Alexiev

Ontotext Webinar, 2016-09-29

1 Intro

  • A bit about me: co-founder of Sirma Group Holding, Bulgaria's largest software group and parent company of Ontotext
    • 30y in IT: 8 at university, 22 in industry
    • Did plenty of project management, business analysis and data modeling, some big projects too
    • Last 8 years focused on data modeling and integration
    • Last 6 years in particular, focused on semantic data and semantic integration
  • I love to poke in other people's data and get in-depth. So there's a lot about data in these slides
  • See My publications: you can sort by type and keyword, full abstracts are available.
    • I've provided a few references below, but if a topic interests you, please search in the publications
  • The shorter version has about 110 slides, so sit back, relax, and enjoy the ride. Should take us 1:20h
    • Ask questions at any time in the chat, I'll answer them all at the end
  • This longer version has 130 slides, including info about Library metadata and ontologies

1.1 GLAM vs Internet

GLAM, CH, DH?

  • Cultural Heritage (CH): the sum of our non-economic heritage
    • Obvious implications to economically significant sectors, eg tourism
    • Some say it's the source of all creativity, would you agree?
    • Includes old and new (eg digitally-born), material and immaterial, tangible and intangible, permanent and temporal (eg interactive installations)
  • Galleries, Libraries, Archives, Museums (GLAM): sisterhood of institutions that care for our CH, each with its own perspective and priorities
  • Digital Humanities (DH): the use of computers in the humanities.
    • Eg some UK universities with DH programs: @KingsDH @UCLDH @DH_OU @CamDigHum

1.2 Google NGrams: Phrases in Books

Search for "library, museum" vs "Google, Facebook, Twitter" in books: the web sites are negligible

google-books-ngrams.png

1.3 Google NGrams: Two Specific Orgs

Compare two specific orgs: "Facebook" is more popular in recent books, compared to "British Museum" over time

google-books-BM-facebook.png

1.4 Google Trends: Search Popularity

Web searches over the last 12 years: "Facebook, Google" are much more popular than "library, museum"

google-search-trends.png

1.5 How To Survive in the Internet Age?

Since ancient times GLAMs have been the centers of knowledge and wisdom

  • Aren’t Google, Wikipedia, Facebook, Twitter and smart-phone apps becoming the new centers of research and culture (or at least popular culture)?
  • Will GLAMs fall victim to teenagers with smartphones browsing Facebook? If the library's attitude is "Come search in our OPAC" then certainly yes
  • How to preserve the role of GLAMs into the new millennium?

To survive, GLAMs must adopt the internet as their default modus operandi

  • Web 1.0: presentation
  • Web 2.0: interaction
  • Web 3.0 (semantic web): data linking, enriching/disambiguating text using NLP/IE approaches

1.6 Why Linked Open Data (LOD) is Important

  • Culture is naturally cross-institutional, cross-border, multilingual, and interlinked
  • LOD allows making connections between (and making sense of) the multitude of digitized cultural artifacts available on the net
  • LOD enables large-scale Digital Humanities research, collaboration and aggregation; technological renewal of CH institutions

CH-linking.png

2 GLAM Content Standards

GLAM data is complex and varied

  • Exception is the rule
  • Many metadata format variations
  • Data comes from a variety of systems

Thus professional organizations have found it useful to define content standards

  • Describe what data to capture (and sometimes how to go about it)
  • Before formalizing how to express it in machine-readable form

Examples are extremely useful for data modelers to decide how to map the data

2.1 Museum Content Standards

Cataloging Cultural Objects: content standard for art, architecture, museums

CCO-cover.jpg

2.1.1 CCO Example: Artwork and Creator Record

CCO-example-Marco-Ricci.png

2.1.2 CCO Example: Hierarchical Link Between 2 Artworks

CCO-example-chartres-portal.png

2.1.3 CCO Example: Creator Extent

How to describe one aspect of the data

CCO-example-creator-extent.png

2.1.4 SPECTRUM

spectrum-logo.jpg UK Museum Collections Management Standard

  • Defines procedures for museums to follow, and the attendant data
  • Covers 21 procedures: Pre-entry, Object entry, Loans in, Acquisition, Inventory control, Location and movement control, Transport, Cataloguing, Object condition checking and technical assessment, Conservation and collections care, Risk management, Insurance and indemnity management, Valuation control, Audit, Rights management, Use of collections, Object exit, Loans out, Loss and damage, Deaccession and disposal, Retrospective documentation
  • Addresses accreditation

2.1.5 SPECTRUM Example: Object Entry

SPECTRUM-object-entry.png

2.2 Archival Content Standards

  • ISAD(G): archival materials
  • ISAAR(CPF): agents (corporations, people, families)
  • ISDF: functions (eg Secretary of some society)
  • ISDIAH: archival holding institutions

Image by D.Pitti, 2015

ICA-standards-timelines.png

2.3 Library Content Standards

  • AACR2 (Anglo-American Cataloging Rules 2)
  • International Standard Bibliographic Description (ISBD)
  • Resource Description and Access (RDA)

Extremely detailed and comprehensive (see RDA later). But they sometimes pay more attention to where to put the commas than to:

  • Data sharing
  • Global availability of resources
  • Sharing the cataloging burden

2.3.1 FRBR, FRSAD, FRAD

Functional Requirements for Bibliographic Records (FRBR), Subject Authority Data (FRSAD), Authority Data (FRAD) (J.Mitchell, M.Zeng, M.Zumer, 2011)

FRBR-FRAD-FRSAD.jpg

2.3.2 FRBR

Starts from user tasks (find, identify, select, obtain, explore). Introduces the important 4-level WEMI model (relates to Uniform Titles):

  • Work: original or derived intellectual work (eg Don Quixote)
  • Expression: translation or edition (eg Don Quixote translation to English)
  • Manifestation: publisher's work (eg with illustrations, foreword by, compilation…). ISBNs are here
  • Item: physical copy: libraries track loan/availability; famous copies (eg Lincoln's Bible); manuscripts are singleton items
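
The four WEMI levels can be sketched as a chain of records, each pointing to the level above it. This is only an illustration: the class fields, the publisher name and the ISBN below are placeholder assumptions, not part of FRBR itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Work:
    title: str                      # abstract intellectual creation


@dataclass
class Expression:
    work: Work                      # the Work that this Expression realizes
    language: str                   # eg a translation into a given language


@dataclass
class Manifestation:
    expression: Expression          # the Expression that this edition embodies
    publisher: str
    isbn: str                       # ISBNs live at the Manifestation level


@dataclass
class Item:
    manifestation: Manifestation    # the edition that this physical copy exemplifies
    barcode: str
    on_loan: bool = False           # libraries track availability per Item


def available_items(items: List[Item], work: Work) -> List[Item]:
    """Return the copies of any edition of `work` that are not on loan."""
    return [i for i in items
            if i.manifestation.expression.work is work and not i.on_loan]


quixote = Work("Don Quixote")
english = Expression(quixote, "en")
# Placeholder publisher and ISBN, for illustration only
edition = Manifestation(english, "Example Press", "000-0-00-000000-0")
copies = [Item(edition, "C1"), Item(edition, "C2", on_loan=True)]
print([i.barcode for i in available_items(copies, quixote)])
```

The point of the chain is that a user task like "find any available copy of Don Quixote" is asked at the Work level but answered at the Item level.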

2.3.3 FRSAD

Anything can be subject (thema), referred to by various names/titles (nomen)

FRSAD.png

2.3.4 FRBR-LRM

FRBR-Library Reference Model (P.Riva, P.Le Bœuf, M.Žumer, Draft for World-Wide Review 2016-02). Merges the previous standards

FRBR-LRM.png

3 GLAM Metadata Schemas

How many of the standards listed in Seeing Standards: A Visualization of the Metadata Universe apply to your work? (by Jenn Riley, Associate Dean for Digital Initiatives at McGill University Library)

GLAM-seeing-standards.png

3.1 Seeing Standards (2)

GLAM-seeing-standards-full.png

3.2 XML Schemas

Do you deal with XML? I bet you do

  • XML Schema (XSD): most widely used, but most unwieldy
  • RelaxNG (RNG): new generation schema language
  • RNG Compact (RNC): non-XML notation, most readable. Eg EAD3 is mastered in RNC, then RNG and XSD produced
  • Schematron: express rules in XPath that can't be captured in XSD/RNG/RNC (eg cross-field validation)
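
To give a feel for what "cross-field validation" means, here is a minimal stand-in written in Python rather than Schematron. The element names (earliestDate, latestDate) are hypothetical and not taken from any real schema; a real Schematron rule would express the same test as an XPath assertion.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names are illustrative, not from a real schema
RECORD = """
<object>
  <title>Portrait of a Man</title>
  <earliestDate>1650</earliestDate>
  <latestDate>1640</latestDate>
</object>
"""


def check_date_order(xml_text: str) -> list:
    """Return error messages for the cross-field rule earliestDate <= latestDate."""
    root = ET.fromstring(xml_text)
    earliest = int(root.findtext("earliestDate"))
    latest = int(root.findtext("latestDate"))
    errors = []
    if earliest > latest:
        errors.append(f"earliestDate {earliest} is after latestDate {latest}")
    return errors


print(check_date_order(RECORD))
```

A grammar-based schema can say that both fields are years, but not that one must precede the other; that is the gap rule-based validation fills.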

Tools:

RNC-flymake.png

3.3 Museum Metadata: CDWA

Categories for the Description of Works of Art (CDWA): realization of CCO, 532 "categories" (data elements).

CDWA-sections.png

3.3.1 CDWA Lite

XML schema implementing part of CDWA. Moderate complexity, about 300 elements. Display vs Indexing (structured) elements, eg for Dimension.

CDWA-data-outline.png

3.3.2 CONA Schema

Cultural Objects Name Authority (CONA): Getty museum data aggregation. Moderate complexity, about 280 elements:

CONA-data-outline.png

3.3.3 SPECTRUM XML

SPECTRUM Schema 4.0b has 10 entities and 592 fields, of which 490 are Object (artwork) fields. I am not aware of any systems producing this.

SPECTRUM-object-data-outline.png

3.3.4 LIDO

Lightweight Information Describing Objects (LIDO). Evolved from CDWA, museumdat, with inspiration from CIDOC CRM. (Images by R.Stein and A.Vitzthum, ATHENA workshop, 2010)

LIDO-data-outline.png

3.3.5 LIDO Schema

  • Complex schema, eg when referring to a related object, you can provide almost as much detail as for the main object. Could leverage opportunities for linking more.
  • Display vs Indexing (structured) elements: inherited from CDWA

LIDO-example.png

3.4 Archive Metadata

  • EAD: Encoded Archival Description. Describes archival materials (documentary units)
  • EAC/CPF: Encoded Archival Context: Corporations, Persons, Families
  • EAG: Encoded Archival Guide. Describes institutions

3.4.1 Archive Metadata Problems

Pay a lot of attention to presentation, not enough to linking (difficult to "semanticize"). Emphasis on documents, not historic agents and events

  • EAG: So-called "controlled access points" are text, and typically not controlled at all
  • EAC: Many institutions don't consider EAC very valuable, and instead put person info in EAD's bioghist element (example below from EADiva)
  • EAC: Related persons are names ("strings"), not links ("things")
  • EAC: Events include lots of info but only Date is separate field (person names could be tagged but often are not)
  • EAC: Family tree modeled as Outline, that's also used for other purposes (just presentation)
<bioghist>
  <head>Chronological Events</head>
  <chronlist>
    <chronitem>
      <date normal="19781028">October 28, 1978</date>
      <event>
        <persname normal="Wossname, Samuel">Sam Wossname</persname> succeeds
        <persname normal="Othername, John">John Othername</persname> as department head.
      </event>
    </chronitem>
    <chronitem>
      <date normal="19790315">March 15, 1979</date>
      <event>Departmental reorganization.</event>
    </chronitem>
  </chronlist>
</bioghist>
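
A short sketch of what a machine can actually get out of the bioghist sample above, using only Python's standard XML parser. The function name and the tuple layout are my own choices for illustration.

```python
import xml.etree.ElementTree as ET

BIOGHIST = """<bioghist>
  <head>Chronological Events</head>
  <chronlist>
    <chronitem>
      <date normal="19781028">October 28, 1978</date>
      <event>
        <persname normal="Wossname, Samuel">Sam Wossname</persname> succeeds
        <persname normal="Othername, John">John Othername</persname> as department head.
      </event>
    </chronitem>
    <chronitem>
      <date normal="19790315">March 15, 1979</date>
      <event>Departmental reorganization.</event>
    </chronitem>
  </chronlist>
</bioghist>"""


def extract_events(xml_text):
    """Yield (normalized date, event text, normalized person names) per chronitem."""
    root = ET.fromstring(xml_text)
    for item in root.iter("chronitem"):
        date = item.find("date").get("normal")
        event = item.find("event")
        # Flatten mixed content (text interleaved with persname tags) to one string
        text = " ".join("".join(event.itertext()).split())
        persons = [p.get("normal") for p in event.iter("persname")]
        yield date, text, persons


events = list(extract_events(BIOGHIST))
for e in events:
    print(e)
```

The date and the tagged names come out cleanly, but the names are still strings rather than links to person records, and the relation between the two people ("succeeds") stays buried in prose.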

3.5 Library Metadata: MARC

MARC is 50 years old, unreadable, and doesn't accommodate new FRBR principles. MARC-XML is not much better

MARC.png

3.5.1 MARC Must Die

A whole emotional subculture, based on a slogan by Roy Tennant, 2002.

Presentation by Sally Chambers, ELAG 2011

MARC-must-die.png

4 GLAM Ontologies

Why do they call conversion to RDF "lifting" and back to some other format "lowering"?

  • RDF is a simple abstracted data model
  • Doesn't have nesting biases like XML: whether a sub-element is nested or referenced by ID. Has fewer syntactic idiosyncrasies
  • (RDF/XML is awful, but there is Turtle for readability, or JSONLD for programmer convenience)
  • The model is self-describing in a distributed way: if a class/property is looked up, should return description and info
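
A toy illustration of the "no nesting bias" point, assuming two made-up XML layouts for the same facts: one nests the creator inside the object, the other references it by ID. Plain Python tuples stand in for RDF triples here; a real conversion would mint URIs and use an RDF library.

```python
import xml.etree.ElementTree as ET

# Two hypothetical XML serializations of the same facts
NESTED = """<objects>
  <object id="o1"><title>Self-portrait</title>
    <creator id="p1"><name>Rembrandt</name></creator>
  </object>
</objects>"""

REFERENCED = """<objects>
  <object id="o1" creator="p1"><title>Self-portrait</title></object>
  <creator id="p1"><name>Rembrandt</name></creator>
</objects>"""


def lift(xml_text):
    """Lift either XML layout into a set of (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    triples = set()
    for obj in root.iter("object"):
        triples.add((obj.get("id"), "title", obj.findtext("title")))
        nested = obj.find("creator")
        # Creator is either a nested child element or an ID reference attribute
        creator_id = nested.get("id") if nested is not None else obj.get("creator")
        triples.add((obj.get("id"), "creator", creator_id))
    for person in root.iter("creator"):
        triples.add((person.get("id"), "name", person.findtext("name")))
    return triples


print(lift(NESTED) == lift(REFERENCED))
```

Both layouts lift to the same three statements, which is the sense in which the graph is "higher" than either serialization.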

4.1 Europeana Data Model

Model used by the Europeana aggregator (53M objects), and adopted by the Digital Public Library of America (DPLA). Based on:

  • OAI ORE (Open Archives Initiative Object Reuse & Exchange): organizing object metadata and digital representations (WebResources)
  • Dublin Core: descriptive metadata
  • SKOS (Simple Knowledge Organization System): conceptual objects (concepts, agents, etc)
  • CIDOC-CRM inspired: events, some relations between objects

Europeana-classes.png

4.1.1 EDM Semantic Graph

graph-LevskiOrdinance.png

4.1.2 EDM Issues/Considerations

  • Criticized that it's not expressive enough. Eg can't capture the specific contribution of an artist to artwork
  • Complication: splits info about an object:
    • EDM External (from provider): edm:ProvidedCHO and ore:Aggregation
    • EDM Internal (at Europeana): edm:ProvidedCHO and 2 <ore:Aggregation, ore:Proxy> pairs
  • Many providers use the minimal features and make mistakes; Europeana didn't do a lot of validation
    • Old objects retro-converted from ESE are poor (only text), though some enrichments added by Europeana
    • Europeana Data Quality Committee formed, to push this strategic point (2015-2020)

Evolving specification (since 2009)

  • Currently considering actual implementation of Events
  • Extensions for manuscripts, music, fashion, etc

4.2 CIDOC CRM

CIDOC CRM: comprehensive reference model used for history, historic events, archaeology, museum data, etc by CIDOC (ICOM documentation committee). Standardized as ISO 21127:2014, still evolving. About 85 classes, fundamental branches: Persistent (endurant) vs Temporal (perdurant), Physical vs Conceptual

cidoc_class_hierarchy.jpg

4.2.1 CIDOC CRM Properties

Classes represent abstract things (eg crm:E24_Physical_Man-Made_Thing), specific things (eg Paintings, Coins) are accommodated with crm:P2_has_type. 135 props (plus their inverses); prop hierarchy (see "- - -" at bottom):

CIDOC-prop-hierarchy.png

4.2.2 CIDOC Graphical Examples

cidoc-graphical-measurement.jpg

4.3 Web Annotation (Open Annotation, OA)

W3C TR: mark, annotate, relate any web resources, eg: Webpage and bookmark, Image and region over it, Document and translation, Paragraph and commentary. Diagram of Complete Example from spec (using my rdfpuml)

OA-eg44.png
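
For the "image and region over it" case, an annotation can be sketched as a small JSON-LD document. The sketch below follows my reading of the W3C Web Annotation Data Model (context URL, TextualBody, FragmentSelector with a media-fragment value); the image URL and comment are invented, and the helper function is hypothetical.

```python
import json


def image_region_annotation(image_url, xywh, comment):
    """Build a Web Annotation attaching a text comment to a rectangular image region."""
    x, y, w, h = xywh
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": {"type": "TextualBody", "value": comment, "format": "text/plain"},
        "target": {
            "source": image_url,
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={x},{y},{w},{h}",   # region as a media fragment
            },
        },
    }


# Example values are made up for illustration
anno = image_region_annotation("http://example.org/image.jpg",
                               (10, 20, 300, 200),
                               "Signature of the artist")
print(json.dumps(anno, indent=2))
```

The same body/target pattern covers the other cases listed above: only the kind of body and the kind of selector change.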

4.4 International Image Interop Framework (IIIF)

Standard API for DeepZoom (hi-res) images. Supported by many servers and viewers. http://iiif.io

IIIF-showcase.png
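
The core of the IIIF Image API is a URL pattern: identifier, region, size, rotation, quality and format as path segments. A minimal sketch, assuming version 2.x parameter values ("full", "default"); the server base URL and identifier are invented.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="full", rotation="0",
                   quality="default", fmt="jpg"):
    """Compose an IIIF Image API request URL from its path segments."""
    # The identifier is percent-encoded so that spaces or slashes stay in one segment
    return f"{base}/{quote(identifier, safe='')}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and identifier: crop a region and scale it to 400px wide
url = iiif_image_url("http://example.org/iiif", "folio 12r",
                     region="100,100,800,600", size="400,")
print(url)
```

Because every crop and zoom level is just a URL, any compliant viewer can work against any compliant server.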

4.4.1 IIIF Presentation API

Based on OA and SharedCanvas. Strong attention to JSONLD representation (convenient for developers). Allows assembling manuscripts from pieces, presenting folios, etc. See Rob Sanderson's presentations, eg IIIF and JSONLD:

IIIF-sharedCanvas.png

4.5 Library Ontologies

War of the Bibliographic Ontologies?

  • BIBO: used for a long time, pragmatic
  • FRBRer: pragmatic realization of FRBR, but little uptake (not rich enough?)
  • FRBRoo: based on CIDOC CRM, perhaps too complex
  • Fabio, Cito, Doco and friends: modern, includes new features (eg citation intent)
  • BibFrame: sponsored by LoC, but soundly criticized for modeling mistakes
  • RDAregistry.info: basic FRBR classes, numerous properties for all kinds of things. Used for 100M records at TEL
  • SchemaBibEx (http://bib.schema.org): builds on a clean model sponsored by the big 4 search engines (Google, MS Bing, Yahoo, Yandex.ru). Developed by OCLC. May end up being used for 300M records at WorldCat.

4.5.1 RDAregistry

Resource Description and Access (RDA). Registry info is well organized

RDAregistry.png

4.5.2 RDAregistry Properties

Many props (306 for Work alone), for specific purposes (eg "appellee" for court decisions, "granting institution" for academic theses). Numeric prop names, but lexical (natural language) also supported. Serves many semantic formats.

RDAregistry-Work.png

4.5.3 A Taste of FRBRoo

EDM–FRBRoo Application Profile Task Force: asked what to add to EDM to better fit FRBRoo.

  • TF members developed a number of examples, eg on publications of "Don Quixote" (T.Aalberg, V.Alexiev, J.Walkowska).

EDM variant:

bima0000007198.edm.png

4.5.3.1 A Taste of FRBRoo

Simpler FRBRoo variant:

bima0000007198.png

4.5.3.2 A Taste of FRBRoo

More complex FRBRoo variant:

bima0000007198.JW.png

4.5.4 FRBR-Inspired

  • "FRBR, Before and After" by K.Coyle (ALA 2016) is an in-depth look at FRBR-inspired models/realizations.
  • Chapter 10 describes the following ontologies: FRBRer, FRBRcore, FaBiO, <indecs>, BIBFRAME, RDA in RDF, webFRBRer, FRBRoo
  • "Mistakes have been made", K.Coyle, SWIB 2015

FRBR-mistakes.png

4.5.5 British Library Data Model

Pragmatic data model that reuses several ontologies, and adds own props

BL-model-serial.png

4.5.6 First Library That Runs on RDF

Oslo Public Library (http://data.deichman.no, since 2014) uses Koha open source software, RDF in the core, and marc2rdf/rdf2marc conversions. Pragmatic data model that reuses several ontologies, and adds own props. Enables a number of agile apps, eg search related books on Kiosk

NO-Oslo-RDF-model.jpeg

4.5.6.1 Oslo Public Library Data
d_res:tnr_749919  rdf:type  bibo:Document , fabio:Manifestation ;
  dc:title  "About time" ;
  d:titleURLized  "about_time" ;
  fabio:hasSubtitle  "Einstein's unfinished revolution" ;
  ctag:tagged  d_keyword:imaginary , d_keyword:dilation , d_keyword:time , 
    d_keyword:tidsreiser , d_keyword:tidsdilatasjon ;
  foaf:depiction  <http://covers.openlibrary.org/b/id/96714-M.jpg> ,
    <http://covers.openlibrary.org/b/id/96715-M.jpg> ,
    <http://www.bokkilden.no/SamboWeb/servlet/VisBildeServlet?produktId=81081> ;
  owl:sameAs  <http://purl.org/NET/book/isbn/0140174613#book> ,
    <http://www4.wiwiss.fu-berlin.de/bookmashup/books/0140174613> ;
  dc:language  lexvo:eng ;
  d:bibliofilID  "931138" ;
  dc:format  <http://data.deichman.no/format/Book> ;
  d:location_signature  "Dav" ;
  dc:publisher  d_org:penguin ;
  bibo:numPages  "316" ;
  d:physicalDescription  "fig." ;
  d:bibsubject  d_subject:einstein_albert , d_subject:tid_metafysikk ;
  fabio:isManifestationOf  d_work:x24918900_about_time ;
  d:signatureNote  "07x0619gq" ;
  d:bindingInfo  <http://data.deichman.no/bindingInfo/h> ;
  d:bsID  "0181541" ;
  dc:description  "Bibliografi: s. 293-294"@no ;
  d:priceInfo  "Nkr 170.00" ;
  foaf:isPrimaryTopicOf  <http://www.goodreads.com/book/show/286461> ,
    <http://www.librarything.com/work/23493> ;
  dc:identifier  "749919" ;
  d:dewey  "115" , "530.11" ;
  d:location_dewey  "530.11" ;
  bibo:isbn  "9780140174618" , "0140174613" .

4.6 Archival Ontologies

3 attempts to represent EAD as RDF, but IMHO none is very good.

  • Eg "The Semantic Mapping of Archival Metadata to the CIDOC CRM Ontology" (Journal of Archival Organization, 9:174–207, 2011) proposes to represent the EAD levels hierarchy (from Fonds down to Items) as five parallel CRM hierarchies

Records in Context (RiC): new upcoming semantic standard by ICA

  • Addresses the scope of EAD, EAC, EAG in one framework. Inspired by national standards, FRBR (FRBR-LRM), CIDOC CRM
  • Progress report (2015), Mlist for comments
  • Conceptual Model 1.0 (Sep 2016): Document key components of archival description, properties of each, relations between them
  • Ontology: to follow after the Conceptual Model is finalized. Expressed in OWL, will include semantic mapping to similar concepts developed by related communities

4.6.1 RiC Sample Network

RiC-example.png

5 GLAM LOD Datasets (LODLAM)

  • Some established thesauri and gazetteers as LOD, some are interconnected: DBpedia; Wikidata, VIAF, FAST, ULAN; GeoNames, Pleiades, TGN; LCSH, AAT, IconClass, Joconde, SVCN, Wordnet, etc.
  • Not shown: large collection LODs like: Europeana (EDM), British Museum (CIDOC CRM), YCBA (CIDOC CRM), Rijksmuseum (EDM)
  • (Diagram based on work by M.Hildebrand)

Culture-datacloud-pretty.png

5.1 Wikidata

Tons of info on everything, including GLAMs, artists, artworks, etc. Eg Frans Hals on Reasonator

WD-FransHals.png

5.1.1 Wikidata Genealogy

Family tree of Barack Obama

WD-Obama-familyTree.png

5.1.2 Sum of All Paintings

Wikidata Project Sum of All Paintings. Data used for:

  • Works by painter across collections (catalogue raisonné). Eg Frans Hals

WD-FransHals-painings.png

5.1.3 Crotos

Excellent image search. Shows links to WD, Wikimedia Commons, original website. Eg Frans Hals on Crotos

WD-FransHals-Crotos.png

5.1.4 You can help too!

Hunting for missing inventory numbers (9.9k of 140k). Important because <collection, inventory number> is used to identify the painting. Eg US (1k), Getty Museum (2)

WD-Getty-missing-invNo.png
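
A hunt like this can be phrased as a SPARQL query against the Wikidata Query Service. The sketch below only builds the query text and request URL (no network call). It assumes the usual Wikidata identifiers: P31 (instance of), Q3305213 (painting), P195 (collection), P217 (inventory number). The collection QID used in the example is my assumption for the Getty Museum and should be checked before use.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def missing_inventory_query(collection_qid):
    """Return SPARQL selecting paintings in a collection that lack an inventory number."""
    return f"""SELECT ?painting WHERE {{
  ?painting wdt:P31 wd:Q3305213 ;           # instance of: painting
            wdt:P195 wd:{collection_qid} .  # collection
  FILTER NOT EXISTS {{ ?painting wdt:P217 [] }}  # no inventory number
}}"""


def query_url(collection_qid):
    """Return a GET URL that submits the query to the endpoint."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": missing_inventory_query(collection_qid)})


# Q731126 is assumed to be the J. Paul Getty Museum; verify on Wikidata
print(missing_inventory_query("Q731126"))
```

Each result is a painting that cannot yet be identified by the <collection, inventory number> pair.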

5.1.5 Let's fix the second one

Find it on Getty's site, add the info like this:

WD-Getty-portrait.png

5.1.6 Histropedia

Timelines of everything. Eg paintings by Leonardo

histropedia-Leonardo.png

5.2 VIAF

Virtual International Authority File: 20 national libraries, 10 other contributors including Getty ULAN and Wikidata. Eg coreferencing cluster of Spinoza:

viaf-spinoza.png

5.2.1 VIAF vs Wikidata (2015)

VIAF-Wikidata-comparison.png

5.3 Global Authority Control

  • 201307 Authority Addicts: The New Frontier of Authority Control on Wikidata, Wikimania 2013
  • 201501 Wikidata Project Authority Control (initiated by Ontotext)
  • 201503 Name Data Sources for Semantic Enrichment study for Europeana of datasets including Person/Organization names. Conclusions:
    • The best datasets to use for name enrichment are VIAF and Wikidata
    • There are few name forms in common between the "library-tradition" datasets (dominated by VIAF) and the "LOD-tradition datasets" (dominated by Wikidata)
    • VIAF has more name variations and permutations, Wikidata has more multilingual names (translations)
    • VIAF is much bigger: 35M persons/orgs. Wikidata has 2.7M persons and maybe 1M orgs
    • Only 0.5M of Wikidata persons/orgs are coreferenced to VIAF, with maybe another 0.5M coreferenced to other datasets, either VIAF-constituent (eg GND) or non-constituent (eg RKDartists)
    • A lot can be gained by leveraging coreferencing across VIAF and Wikidata
    • Wikidata has great tools for crowd-sourced coreferencing
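
Why coreferencing across datasets pays off can be shown with a small sketch: treat every "same entity" link as an edge and compute the connected clusters. This uses a generic union-find, not any method from the study, and all identifiers below are invented.

```python
def coreference_clusters(pairs):
    """Group identifiers into clusters, treating each pair as a 'same entity' link."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)           # union the two clusters

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


# Invented identifiers: one Wikidata item links to both a VIAF and an RKD record
links = [("wikidata:A", "viaf:1"), ("wikidata:A", "rkd:7"), ("viaf:2", "gnd:9")]
clusters = coreference_clusters(links)
print(sorted(sorted(c) for c in clusters))
```

In the example, viaf:1 and rkd:7 were never linked directly; they end up in the same cluster only because a Wikidata item points to both.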

5.3.1 Names of Lucas Cranach

Analyzed records of Lucas Cranach in 7 LOD datasets (Wikidata: Freebase, DBpedia, Yago; VIAF: ISNI, ULAN).

Cranach-venn.png

5.3.2 Wikidata Coreferencing can Enlarge VIAF

Cranach-corefs.jpg

5.3.3 Mix-n-Match

A global Authority on everything: librarian's dream come true! Mix-n-Match is a collaborative tool to create coreferences. 234 authorities, including Getty AAT, TGN, ULAN; RKD artists, works; LoC Authorities; VIAF (not in M-n-M but on WD); BM persons; BBC YourPaintings; Artsy, etc etc

WD-MnM-catalogs.png

5.3.3.1 You can help with Authorities too!

Eg checking matches to Getty AAT. Single sign-on, a click per item. Easy!

WD-MnM-AAT.png

6 LODLAM Projects

GLAM and DH projects present a bewildering variety, eg

  • Publishing Vocabularies/Thesauri as LOD
  • Publishing Museum collections and National Bibliographies as LOD
  • Enrichment of GLAM metadata with relevant thesauri, semantic and faceted search
  • Study of artistic influence over time and space
  • Literary traditions, parallel editions
  • Poetic repertories
  • Studying manuscripts, stemmatology (manuscript derivation)
  • Historiography
  • Studying charters, prosopography ("micro biographies"). "Prosopography is Greek for Facebook", SNAP:DRGN project, 2015

Research functions are sometimes integrated into Virtual Research Environments

6.1 Mellon "Space" Projects

The Andrew Mellon Foundation funds many projects in CH and DH, and a few software projects, including:

  • CollectionSpace: museum collection management
  • ArchivesSpace: archive management
  • ResearchSpace: semantic integration based on CIDOC CRM, search, data & image annotation, data basket, etc
  • ConservationSpace: line of business application for conservation specialists

6.2 ResearchSpace

Executed by the British Museum. Ontotext developed the first prototype (2010-2013). Semantic Search

RS-search-paper-from-London.png

6.2.1 ResearchSpace Search

Powerful and precise search: Drawings by Rembrandt that are about Mammals

RS-search-Rembrandt.png

6.2.2 ResearchSpace Search: Fundamental Relations

First implementation experience of the CIDOC CRM Fundamental Relations approach

RS-search-FR-matrix.png

6.2.3 ResearchSpace Search: One FR (Thing from Place)

RS-search-FR-thing-from-place.png

6.2.4 ResearchSpace Search: Implementation

120 GraphDB rules, woven using a Literate Programming approach. Inference dependencies between props (text=input, gray=intermediate, white=output)

RS-search-implementation-deps.png

6.2.5 ResearchSpace Search: New Implementation

(Not Ontotext work). Watch the video (D.Oldman)

RS-search-new.png

6.2.6 ResearchSpace Data Annotation

RS-data-annotation.png

6.2.7 ResearchSpace Data Annotation Model

RS-data-annotation-model.png

6.2.8 Image Annotation

RS-image-annotation.png

6.2.9 Image Annotation Model

RS-image-annotation-model.png

6.2.10 Image Annotation Architecture

RS-image-annotation-arch.png

6.3 British Museum (BM) and YCBA LOD

  • GraphDB runs the BM SPARQL endpoint. One of the biggest CH RDF collections (917M triples)
  • As part of RS, developed mapping of BM data (2M objects) with BM, using CIDOC CRM
  • This mapping was followed by the Yale Center for British Art (YCBA)
  • Mapping Documentation: very comprehensive but is monolithic and has imprecisions. Includes the (in)famous diagram

BM-mapping-doc.png

6.4 ConservationSpace

Executed by a consortium led by US National Gallery of Art. Developed by Sirma ITT (Ontotext sibling). Based on Ontotext GraphDB (semantic metadata), Alfresco (document management), Smart Documents (Sirma product).

ConservationSpace.png

6.5 Europeana LOD and OAI PMH

Ontotext created and hosted the Europeana SPARQL and OAI PMH services

O is for Open (CultJam 201507).png

6.5.1 Europeana Statistics

Eg chart of newspapers (several millions) by year: can't do this using the Europeana API, but is easy with SPARQL

EDM-chart-EuropeanaNewspapers.png

6.6 Europeana Food and Drink

Food & Drink content, semantically enriched (place and FD topic). EFD Semantic App: open data, SPARQL endpoint, open source (Github). Uses GraphDB and ElasticSearch enterprise connector

EFD-semapp.png

6.6.1 Tasty Bulgarian Recipes

Eg 150 with beer, including pancakes!

EFD-Beer-Pancake.png

6.6.2 Wide Geographic Coverage

Objects from the Roman Empire to Antarctica (Scott's expedition to the South Pole), and everything in-between

EFD-Antarctica.png

6.6.3 EFD Enrichment: FD Gazetteer

Use Wikipedia Categories to extract a FD Gazetteer.

  • "Domain-specific modeling: Towards a Food and Drink Gazetteer", Tagarev, A.; Tolosi, L.; and Alexiev, V, LNCS 9398, p182-196, January 2016 (preprint)

EFD-cats1.png

6.6.4 EFD Enrichment: Pruning FD Category Tree

EFD-cats2.png

6.6.5 EFD Enrichment: French

Selected French as second enrichment language after English, considering category overlap (work by L.Tolosi, x-axis is cat level), available content, NLP capabilities

EFD-cats3.jpg

6.6.6 EFD Place Enrichment

We used standard Ontotext Concept Enrichment Service, which is a mix of DBpedia+Wikidata. But also had to add Geonames, to leverage the place hierarchy

EFD-places1.png

6.6.7 EFD Place Enrichment

Hierarchical semantic facet based on Geonames

EFD-places2.png

6.6.8 EFD Geographic Mapping: Clustering

Once we have places, it's relatively easy to map them. We used the Cluster Mapper library

EFD-geo-clusters.png

6.6.9 EFD Geographic Mapping: Jittering

There are 9k objects marked "Bulgaria". We don't want all flags in the center of Bulgaria, so we jitter them up

EFD-geo-Place-jitter.png
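
Jittering itself is simple. A minimal sketch, assuming a uniform random offset in degrees around the place's coordinates; the radius, the seed and the centre point are illustrative values, not the ones used in the EFD app.

```python
import random


def jitter(lat, lon, radius_deg=0.5, rng=random):
    """Return the point displaced by a small uniform random offset per coordinate."""
    return (lat + rng.uniform(-radius_deg, radius_deg),
            lon + rng.uniform(-radius_deg, radius_deg))


rng = random.Random(42)        # seeded so the marker layout is reproducible
centre = (42.75, 25.5)         # rough centre of Bulgaria (assumed value)
markers = [jitter(*centre, rng=rng) for _ in range(5)]
for m in markers:
    print(m)
```

Seeding the generator matters in practice: without it the flags would jump around on every page load.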

6.6.10 GLAMs Working With Wikidata

Why should GLAMs bother about Wikidata? Because it gives an excellent way to connect and expose your collection data to a multilingual audience

  • Europeana Wikimedia Taskforce report:
    • Recommendation 1: For every Europeana project, considering the possible benefits of a Wikimedia component should be default behavior
    • Recommendation 7: Make Wikidata a central element of Europeana's "portal to platform" strategy
    • Recommendation 8: Europeana should continue to invest in technology that improves the interoperability between GLAMs and Wikimedia platforms
  • GLAMs Working with Wikidata: easily add content about a colorful tradition "blessing of the baskets" ("swiecenie koszyczek" or just "Święconka" in Polish). With proper cats: when we merge them across languages (pl, en, de), we discover the content is about Food and Drink, Easter, and a Polish tradition

Blessing_of_the_baskets_Easter_tradition.jpg

6.7 Getty Vocabulary Program LOD

GVP is well-known and respected in GLAM. Dependencies: AAT-TGN-ULAN-CONA. Center of LODLAM cloud? GVP Training Materials (Diagram by J.Cobb, 2014)

GVP-linked.png

6.7.1 GVP LOD Releases

AAT 2014-02, TGN 2014-08, ULAN 2015-03. Publicized in blog posts by J.Cuno, head of the Getty Trust

GVP-ULAN_LOD.png

6.7.2 Ontotext Scope of Work

  • Semantic/ontology development: http://vocab.getty.edu/ontology
  • Contributed to ISO 25964 ontology (latest standard on thesauri). Provided implementation experience, suggestions and fixes
  • Complete mapping specification
  • Help implement R2RML scripts working off Getty's Oracle database, contribution to Perl implementation (RDB2RDF), R2RML extension (rrx:languageColumn)
  • Work with a wide External Reviewers group (people from OCLC, Europeana, ISO 25964 working group, etc)
  • GraphDB semantic repo, clustered for high-availability
  • Semantic application development (customized Forest user interface) and tech consulting
  • SPARQL 1.1 compliant endpoint: http://vocab.getty.edu/sparql
  • Comprehensive documentation (100 pages): http://vocab.getty.edu/doc
  • Sample queries (100), including charts, geographic queries, etc
  • Per-entity export files, explicit/total data dumps. Many formats: RDF, Turtle, NTriples, JSON, JSON-LD
  • Help desk / support on twitter and google group (see home page)
  • Presentations, papers. On the composition of ISO 25964 hierarchical relations (BTG, BTP, BTI). Alexiev, V.; Lindenthal, J.; and Isaac, A. International Journal on Digital Libraries, August 2015, Springer.

6.7.3 Complete Representation of All GVP Info

See GVP LOD: Ontologies and Semantic Representation, V.Alexiev, CIDOC 2014. External Ontologies:

| Prefix | Ontology | Used for |
|--------|----------|----------|
| bibo: | Bibliography Ontology | Sources |
| dc: | Dublin Core Elements | common |
| dct: | Dublin Core Terms | common |
| foaf: | Friend of a Friend ontology | Contributors |
| iso: | ISO 25964 (latest on thesauri) | iso:ThesaurusArray, BTG/BTP/BTI |
| owl: | Web Ontology Language | Basic RDF representation |
| prov: | Provenance Ontology | Revision history |
| rdf: | Resource Description Framework | Basic RDF representation |
| rdfs: | RDF Schema | Basic RDF representation |
| schema: | Schema.org | common, geo (TGN), bio (ULAN) |
| skos: | Simple Knowledge Organization System | Basic vocabulary representation |
| skosxl: | SKOS Extension for Labels | Rich labels |
| wgs: | W3C World Geodetic Survey geo | Geo (TGN) |
| xsd: | XML Schema Datatypes | Basic RDF representation |

6.7.4 GVP Semantic Representation (1)

GVP-semantic-overview-1.png

6.7.5 GVP Semantic Representation (2)

GVP-semantic-overview-2.png

6.7.6 Key Values (Flags) Are Important

Excel-driven Ontology Generation™. Key val can be mapped to Custom sub-class, Custom (sub-)prop, Ontology Value (eg <term/kind/Abbreviation>)

GVP-getty-codes.png

6.7.7 Associative Relations Are Valuable

More Excel-driven Ontology Generation™

  • Relations come in owl:inverseOf pairs (or owl:SymmetricProperty self-inverse)

GVP-assoc-rels.png
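
What inverse pairs buy you can be sketched in a few lines: given a relation table, every asserted statement implies its mirror image. The relation names below are invented for illustration (they are not actual GVP properties), and tuples stand in for RDF triples.

```python
# Hypothetical relation table of (relation, inverse) pairs.
# A symmetric relation is listed as its own inverse.
INVERSES = [
    ("teacher_of", "student_of"),
    ("sibling_of", "sibling_of"),
]


def inverse_map(pairs):
    """Build a lookup that works in both directions of each pair."""
    inv = {}
    for a, b in pairs:
        inv[a] = b
        inv[b] = a
    return inv


def materialize(triples, pairs):
    """Return the triples plus the inverse statement implied by each one."""
    inv = inverse_map(pairs)
    result = set(triples)
    for s, p, o in triples:
        if p in inv:
            result.add((o, inv[p], s))
    return result


asserted = {("A", "teacher_of", "B"), ("C", "sibling_of", "D")}
for t in sorted(materialize(asserted, INVERSES)):
    print(t)
```

In a triple store the same effect comes from declaring owl:inverseOf and owl:SymmetricProperty and letting the reasoner do the work, so the relation table only needs to be maintained in one place.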

6.7.8 Involved Inference of Hierarchical Relations

GVP-hierarchicalRelationsInference.png

6.7.9 Comprehensive Documentation

Getty Vocabularies Linked Open Data: Semantic Representation. Alexiev, V.; Cobb, J.; Garcia, G.; Harpring, P. Getty Research Institute, 3.2 edition, March 2015.

GVP-doc-TOC.png

6.7.10 Sample Queries (100), Integrated UI

Some charts, eg "Year Joined UN" (TGN), "Pope Reign Durations" (ULAN)

GVP-sample-queries.png

6.7.11 GVP Vocabs Usage

Collected about 100 usages of the vocabs, many in Collection Management and Search. Many described in Getty Vocabs: Why LOD? Why Now?, J.Cobb, 2014. Eg

  • AAT used in Cataloging Calculator: finds bibliographic and authority data: language codes, geographic area codes, publication country codes, AACR2 abbreviations, LC main entry, Cutter numbers, AAT concepts, etc

AAT-CatalogingCalculator.png

6.7.12 AAT in Europeana

AAT-Europeana.jpg

6.8 J.P.Getty Museum

Working with JPGM on publishing LOD. Considering CIDOC CRM, maybe also simpler ontologies. Hoping to generate R2RML from instance examples like:

GVP-objects.png

6.8.1 J.P.Getty Museum and Wikidata

Discussing making data for Wikidata. WD has 480 Getty paintings, but the Museum has 180k artworks. WD query shown as image grid

WD-Getty.png

6.9 American Art Collaborative

American Art Collaborative: 14 US art museums committed to establishing a critical mass of LOD on the semantic web. Consulting on CRM mapping.

AAC-NPG-castAfter.png

6.10 European Holocaust Research Infrastructure

EHRI is a large-scale EU project that involves 23 Holocaust archives (Europe, Israel and the US), DH and IT organizations.

  • In its first phase (2011-2015) it aggregated archival descriptions and materials on a large scale and built a Virtual Research Environment (portal) for Holocaust researchers based on a graph database.
  • In its second phase (2015-2019), EHRI2 seeks to enhance the gathered materials using semantic approaches: enrichment, coreferencing, interlinking. Semantic integration involves four of the 14 EHRI2 work packages and helps integrate databases, free text, and metadata to interconnect historical entities (people, organizations, places, historic events) and create networks.

"Semantic Archive Integration for Holocaust Research: the EHRI Research Infrastructure", V.Alexiev, L.Brazzo, CIDOC Congress 2016.

6.10.1 EHRI: Person Networks

Research question: how person networks influenced the chance of survival. Idea:

  • Rec 123456: firstName “John”, lastName “Smith”, gender Male, dateMarriage 1921-01-05, additional names nameSpouseMaiden “Matienzo”, nameSpouse “Maria Smith”, nameChild “Mike Smith”, nameSibling “Jack Jones”
  • We can create Person records for the people mentioned, make some likely inferences, then try to match to other Person records in the database

EHRI-person-networks.png
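
The idea can be sketched as follows, using the sample record from the slide. The `Person` class, the field-to-relation mapping and the two inferences (spouse gender, spouse's maiden name) are illustrative assumptions, not the actual EHRI implementation. Matching against other Person records is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Person:
    name: str
    gender: Optional[str] = None
    alt_names: List[str] = field(default_factory=list)
    # Each relation is (kind, name of the other person), eg ("spouse", "Maria Smith")
    relations: List[Tuple[str, str]] = field(default_factory=list)


# Assumed mapping from record fields to relation kinds, and the inverse kinds.
RELATION_FIELDS = {"nameSpouse": "spouse", "nameChild": "child", "nameSibling": "sibling"}
INVERSE = {"spouse": "spouse", "child": "parent", "sibling": "sibling"}

RECORD = {
    "id": "123456", "firstName": "John", "lastName": "Smith", "gender": "Male",
    "dateMarriage": "1921-01-05", "nameSpouseMaiden": "Matienzo",
    "nameSpouse": "Maria Smith", "nameChild": "Mike Smith", "nameSibling": "Jack Jones",
}


def expand_record(rec):
    """Create Person records for the main person and everyone mentioned."""
    main = Person(f"{rec['firstName']} {rec['lastName']}", rec.get("gender"))
    people = [main]
    for field_name, kind in RELATION_FIELDS.items():
        other_name = rec.get(field_name)
        if not other_name:
            continue
        other = Person(other_name)
        main.relations.append((kind, other.name))
        other.relations.append((INVERSE[kind], main.name))
        if kind == "spouse":
            # Likely inference (assumption): the spouse of a male is female.
            if rec.get("gender") == "Male":
                other.gender = "Female"
            # Likely inference (assumption): given name + maiden surname is an alternate name.
            maiden = rec.get("nameSpouseMaiden")
            if maiden:
                other.alt_names.append(f"{other_name.split()[0]} {maiden}")
        people.append(other)
    return people


if __name__ == "__main__":
    for person in expand_record(RECORD):
        print(person)
```

The alternate name "Maria Matienzo" matters for matching: records created before the marriage date would presumably list the spouse under her maiden name.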

6.10.2 EHRI: Large-Scale Place Matching

Match USHMM places to Geonames, which also deduplicates them. A pipeline for matching Geonames places in free text was also developed

EHRI-place-matching.png
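
A toy sketch of why matching to a gazetteer also deduplicates: name variants that resolve to the same gazetteer entry end up in one group. The gazetteer, its `geo:` ids and the matching rules (diacritic stripping plus a fuzzy fallback) are made up for illustration. They are not real Geonames ids and not the EHRI pipeline, which would also need to consider place type, country and hierarchy.

```python
import difflib
import unicodedata
from collections import defaultdict

# Hypothetical mini-gazetteer: id -> known names (ids are NOT real Geonames ids).
GAZETTEER = {
    "geo:1": ["Kraków", "Krakow", "Cracow"],
    "geo:2": ["Terezín", "Theresienstadt"],
}


def normalize(name):
    """Lowercase and strip combining diacritics (eg 'Kraków' -> 'krakow')."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def build_index(gazetteer):
    """Map every normalized name to its place id."""
    return {normalize(n): pid for pid, names in gazetteer.items() for n in names}


def match_place(name, index, cutoff=0.8):
    """Exact match on the normalized name, else the closest fuzzy match, else None."""
    key = normalize(name)
    if key in index:
        return index[key]
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None


def deduplicate(names, index):
    """Group source names by matched place id; unmatched names go under None."""
    groups = defaultdict(list)
    for name in names:
        groups[match_place(name, index)].append(name)
    return dict(groups)


if __name__ == "__main__":
    index = build_index(GAZETTEER)
    print(deduplicate(["KRAKOW", "Cracow", "Krakov", "Terezin", "Atlantis"], index))
```

The fuzzy cutoff trades recall for precision. On real data a low cutoff would merge distinct villages with similar names, so candidates would need checking against further attributes.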

6.10.3 EHRI: Oral History Interviews

Analyze 2.5k OH Interviews:

  • ONTO: Place enrichment, Person name recognition
  • INRIA: word2vec experiments
    guard      Cos dist    punishment    Cos dist
    guarding   0.593507    punishments   0.668144
    sentry     0.512083    punish        0.601212
    hlinka     0.496201    punishing     0.543213
    gate       0.490032    beatings      0.527033
    watching   0.484647    penalty       0.497262
    rifle      0.484379    deserved      0.490157
    lookout    0.482025    beaten        0.473870
    patrol     0.477233    straf         0.473338
    soldier    0.475982    offense       0.461230
    guarded    0.474689    executing     0.459965
    police     0.474291    merciless     0.455123
  • semantic "differencing" (interesting)

    KGB - Stalin + Hitler = SS
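
For readers unfamiliar with word vectors, here is a small sketch of the two operations involved: ranking neighbours by cosine similarity, and "differencing" by vector arithmetic. The scores in the table above appear to be cosine similarities (higher means closer), although the column is labelled "Cos dist". The 4-dimensional vectors below are hand-made toys chosen so that the arithmetic works out. They are not taken from the model trained on the interviews.

```python
import math

# Toy vectors (assumption): dimensions = [organization, leader, soviet, german].
VECTORS = {
    "KGB":    [1.0, 0.0, 1.0, 0.0],
    "Stalin": [0.0, 1.0, 1.0, 0.0],
    "Hitler": [0.0, 1.0, 0.0, 1.0],
    "SS":     [1.0, 0.0, 0.0, 1.0],
    "Berlin": [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(target, vectors, exclude=(), topn=3):
    """Rank words by cosine similarity to the target vector, best first."""
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def analogy(a, b, c, vectors, topn=3):
    """Solve 'a - b + c = ?' and exclude the three input words from the answer."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    return most_similar(target, vectors, exclude={a, b, c}, topn=topn)


if __name__ == "__main__":
    print(analogy("KGB", "Stalin", "Hitler", VECTORS))
```

With a trained gensim model (an assumption, the slides do not say which toolkit was used), the corresponding call would be `model.wv.most_similar(positive=["kgb", "hitler"], negative=["stalin"])`.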
    
    

6.10.4 EHRI: Discovering Camps, Ghettos, Stalags

And referencing to Geonames so we can get coordinates

EHRI-camps.png

6.11 Other Projects: WikiArtHistory

Vienna University of Technology (site, paper)

  • Art History networks from Wikipedia, through VIAF id
  • Time and nationality from ULAN

wikiarthistory.png

6.12 ChartEx

NLP analysis of medieval Charters and Deeds. Funded by Digging Into Data cross-country SSH funding initiative. Visualized with BRAT

ChartEx.png

6.13 Numismatics

My good friend Ethan Gruber at the American Numismatic Society has developed a host of amazing software that uses and produces LOD.

  • Numishare: Data platform for coins/medals, 100k coin types
  • Nomisma: Shared authorities for numismatics
  • Kerameikos: Pottery LOD
  • EADitor: EAD Editor: based on XML & XForms, uses/produces LOD
  • xEAC: EAC/CPF Editor: based on XML & XForms, uses/produces LOD

6.13.1 Coins in Time and Space

Spatiotemporal distribution of hoards containing a particular Roman Republican coin type. Below: examples of this type in partner collections

numismatics-distribution.png

6.13.2 Geographic Distribution

Distribution of the Roman denarius: blue dots for mints, heatmap of finds (a lot in the UK Portable Antiquities Scheme)

numismatics-denarius.png

6.13.3 Numishare

Data platform with over 100k coin types. Powers custom collections, eg Art of Devastation: Medallic Art of the Great War

numismatics-AoD.png

6.13.4 Nomisma

Shared authorities for numismatics. Eg a mint:

numismatics-mint.png

6.13.5 CoinHoards

  • Greek coin data provided by CoinHoards.org
  • Geo mapping data provided by nomisma.org
  • Below: reference to the coin in an archival notebook (linked via OA)

numismatics-oa.png

6.13.6 Statistical Charts

Denominations issued by Augustus, Tiberius… rendered in a chart using d3js

numismatics-denominations.png

6.13.7 Kerameikos: Pottery LOD

Kerameikos Project editor. Based on XForms, leverages Getty and BM LOD

numismatics-kerameikos-editor.png

6.13.8 EADitor and xEAC

Blog, Wiki. Based on XForms. Leverages the Getty thesauri and VIAF, imports data as needed

numismatics-xEAC-import.png