Table of Contents
- Introduction
- Mapping the Items
- Metadata standards
- XML/TEI and HTML documents
- Text Analysis
- From * to RDF
- Conclusions
- References and Resources
Introduction
This project was developed for the Information Science and Cultural Heritage exam within the University of Bologna's Master's degree program in Digital Humanities and Digital Knowledge. Its core purpose is to semantically link 10 cultural heritage objects on the web using RDF, with the overarching goal of contributing to the preservation and enhanced accessibility of cultural heritage.
Our workflow began with the selection of objects inspired by the anime and manga Vinland Saga. Each chosen item was then organized and interlinked within a conceptual map in order to visually highlight the relationships between the selected entities.
Subsequently, all associated information (including textual transcriptions in XML/TEI) was captured using appropriate standard metadata and converted into RDF (Resource Description Framework), a machine-readable and interoperable format.
The entire process allowed us to establish robust semantic connections between the objects, their descriptions, and their contexts, thereby enriching the digital landscape of cultural heritage, with the aim of ensuring its longevity and discoverability for future students and users.
Mapping the Items
The items were selected through different forms of linkage:
- narrative connections, established through historical figures, objects and events depicted in the storyline of the anime and the manga;
- item-level relationships, involving objects directly related to the primary item, such as the anime’s opening video, considered as a component or derivative of the main work;
- indirect associations, where objects were linked to related entities rather than to the primary item itself — for instance, Wagner’s Ride of the Valkyries was connected through The Poetic Edda, a textual source relevant to the narrative universe of the anime.
To reflect the variety inherent in cultural heritage, the selected objects included different types (or Classes) — such as a painting, a coin, a book, an audio recording, and a video — sourced from libraries, archives, museums (LAMs) and digital repositories. Each object was described using metadata standards suited to its format and institutional context [See here: Metadata standards].
This approach enabled the integration of heterogeneous resources into a coherent conceptual framework, illustrating how multiple forms of relationship — narrative, structural, and thematic — can support the creation of a unified dataset across diverse domains.
Our workflow began with two concept maps built from the work (in FRBR terms) Vinland Saga. Rather than linking this work directly to the items, we introduced intermediate entities — such as events, concepts, characters, and related works — to create a richer, multi-layered network. These entities are visually differentiated, as can be seen in the legend:
- events appear as grey rectangles
- concepts as cyan clouds
- characters as stylized figures
- related works as cyan circles
The predicates in this early stage mainly come from CIDOC-CRM (events) and FRBRoo (people). When no suitable standard predicate was available, we defined new ones within our VinLOD namespace to maintain semantic consistency.
To reflect the different levels of abstraction, from the abstract work down to the physical item, we followed the FRBR model. For books and music — forms of art that can exist in multiple manifestations — we used an idea-to-manifestation approach: first connecting through the conceptual entity, then linking to the tangible, material object.
The concept maps were created step by step: starting from an initial draft in natural language (Theoretical Map), then refined through research on each object and by studying how institutions had catalogued and described them through appropriate metadata standards. At this stage, we ensured compatibility and enriched the map (Conceptual Map) with additions and modifications, respecting the domain and range requirements of each predicate.
Metadata standards
The selection of metadata standards began by reviewing the websites hosting the objects we intended to include in our project. Wherever possible, we chose to implement the same standards adopted by the hosting institutions, while always considering the need for conversion into RDF and ensuring compatibility with the final data model.
In cases where it was not possible to obtain metadata directly from institutional sources, or where the existing standard was not suitable from a LOD perspective, we carried out our own additional research to determine the most appropriate standards for each specific type of item.
For the metadata retrieved from institutions, we downloaded and converted their files (all of them were XML) into RDF using Python, selectively extracting only the elements that aligned with our initial vision of the map and supplementing them when necessary.
Where objects did not have metadata available for download, we manually created CSV files, which were then converted into RDF [See here: From * to RDF].
Below is a detailed list of all the metadata standards adopted, their purpose, references to official documentation, and the specific objects to which they were applied, together with the rationale behind each choice.
- LIDO/NMO: For the coin object, we chose to adopt the same metadata standard already used by the host institution, namely the Münzkabinett of Berlin. LIDO is an XML schema specifically designed for describing cultural heritage objects and their metadata in a structured and hierarchical way. Its primary strength lies in defining which elements and sub-elements can be used and how they should be ordered, enabling rich, multilingual descriptions suitable for a wide range of cultural, artistic, technological, and natural science objects.
The LIDO schema emerged from a substantial redesign of CDWA Lite and museumdat, incorporating feedback from the community and further analysis aligned with the CIDOC CRM model.
In practical terms, LIDO is particularly useful for organizing the internal structure of metadata: it facilitates a nested, hierarchical classification that proceeds from more general categories to increasingly specific details. However, since it is an XML schema, it does not provide a semantic layer: it does not define the meaning of data in a machine-interpretable way as an ontology would, nor does it support RDF natively.
For this reason, we complemented LIDO with the Numismatic Ontology (NMO), which provides the semantic layer needed to describe the coin in RDF. Through NMO, we were able to add precise semantic properties regarding physical description, chronological attribution, provenance, and collection details, each linked to persistent identifiers. This combination allowed us to retain the structural richness of LIDO while enriching it with the semantic expressiveness and interoperability of RDF provided by NMO.
- MARC21 is a widely adopted metadata standard designed to store and exchange bibliographic data about a broad range of materials: books, manuscripts, maps, music, digital resources, and more. A MARC record consists of three main elements: the record structure (based on international standards like ISO 2709), the content designators (tags, indicators, and subfield codes), and the actual data content, which is typically defined by cataloging rules or controlled vocabularies such as ISBD and LCSH.
Although it has been fundamental for libraries for decades, MARC21 has limitations when it comes to the modern web. Its record-based, sequential structure (and its MARCXML serialization) organizes data well but lacks explicit semantic meaning. Since it was not designed to work natively with RDF, instead of describing our resources directly with MARC21 properties we relied on BIBFRAME. BIBFRAME (Bibliographic Framework) was created by the Library of Congress (LOC) specifically to replace MARC21 and to transition bibliographic metadata into the paradigm of Linked Open Data (LOD).
BIBFRAME provides an RDF vocabulary made up of classes and properties that describe resources and express their connections. This semantic richness allows us to model bibliographic and cultural heritage data in a way that is interoperable and compatible with other datasets on the web.
We also partially used BIBFRAME to describe the Ride of the Valkyries, since MARC21 — and therefore BIBFRAME — is designed to handle not only printed materials but also sound recordings.
- Dublin Core (named after the city of Dublin, Ohio) is a widely used metadata standard built around a basic set of elements designed to describe virtually any type of digital content—such as videos, images, and web pages—as well as physical resources like books, CDs, and even artworks shared over computer networks.
The full name of the initiative is the Dublin Core Metadata Initiative (DCMI), which emerged within the context of the OCLC (Online Computer Library Center), a major U.S.-based cooperative for library services. In March 1995, a conference was held in Dublin, Ohio, bringing together a diverse group of librarians, archivists, publishers, software developers, and representatives from working groups of the Internet Engineering Task Force (IETF). They shared a common goal: to define a standardized way to enhance discovery and access to digital information.
The idea was to establish a core set of descriptive elements that could be supplied directly by content creators or publishers, embedded within digital objects themselves or linked externally. This collective effort led to the design of a metadata architecture suited to the needs of content producers, vendors, and users alike.
Originally introduced in December 1996, the core set included fifteen fundamental elements, which later evolved to support additional refinements and qualifiers while retaining its straightforward structure.
For this reason, we applied Dublin Core across multiple object types—such as manuscripts, audio files, and paintings—because its metadata elements are quite broad and flexible, particularly when describing the roles of creators as well as temporal and spatial properties.
- A similar rationale applies to Schema.org, a collaborative initiative launched in 2011 by major search engines—Google, Bing, Yahoo!, and Yandex—to create a unified vocabulary for structured data on the web. Its primary goal is to standardize how information about various types of content is marked up in HTML, improving search engines’ ability to understand and present rich results. Schema.org covers a broad range of categories, including creative works, events, organizations, and more, making it highly adaptable for different domains. Because of its wide adoption and flexibility, Schema.org is especially useful for describing multimedia resources such as videos, images, paintings, and illustrations. It supports rich semantic annotations that help connect content across the web, facilitating better discoverability and interoperability. This versatility made it a natural choice for our project to describe diverse object types consistently. We mostly used Schema.org for the objects that did not have clearly specified metadata standards or for those taken from Wikidata.
- JVMG/ACLICK: While designing our data mapping strategy, we wanted to avoid relying on overly generic solutions that might not capture the specific nature of the data we worked with. For this reason, we looked for an existing standard that could better suit our needs. After searching online, we came across a standard that is currently under development: the Japanese Visual Media Graph. To better understand its scope and applications, we contacted the institution responsible for it, and eventually decided to adopt it as the core of our mapping process.
The project aims to build an open, graph-based research database focused on Japanese visual media such as anime, manga, games, and visual novels. Targeted at scholars in Japanese studies, it combines rich data harvested from enthusiast communities with advanced search and analysis tools. Initially, the team collaborated closely with these online communities to gain access to their curated data, ensure its quality, and negotiate licensing agreements compatible with research and reuse. Technically, the database is open source, inspired by Wikidata’s architecture, and designed to be extensible, transparent, and researcher-friendly. Now, in its second phase (2023–2026), the project is expanding its data sources, deepening international collaborations, and refining the technical infrastructure. It also supports researchers and students through workshops, open documentation, and example studies that demonstrate how the knowledge graph can be used to advance research in Japanese media.
The JVMG team collaborates with the AnimeClick community, an Italian-language fan site dedicated to Japanese anime, manga, and live-action drama, and has access to their data through an individual licensing agreement. We then selected what we needed from their GitHub repository and from their ontology — partially built using Protégé — and used targeted, domain-specific properties such as aclick:statusInHomeCountry, jvmg:genre, jvmg:titleOriginal, and others specifically created for this type of content (a brief usage sketch appears at the end of this section).
- CIDOC Conceptual Reference Model (CIDOC-CRM) is a comprehensive ontology developed to facilitate the integration, mediation, and interchange of heterogeneous cultural heritage information. Originating from the International Council of Museums (ICOM), CIDOC-CRM provides a formal structure to represent cultural artifacts, their provenance, and the events surrounding their lifecycle, enabling museums, archives, and libraries to share data semantically and consistently. Its strength lies in modeling complex relationships between objects, actors, places, and times in a way that preserves the context and meaning of cultural heritage information.
On the other hand, FRBRoo (Functional Requirements for Bibliographic Records – object oriented) is an ontology that merges the concepts of the FRBR model, widely used in bibliographic and library science, with CIDOC-CRM’s rich semantic framework. Developed collaboratively by the CIDOC CRM Special Interest Group and the library community, FRBRoo extends CIDOC-CRM by specifically addressing bibliographic records, linking intellectual works, their expressions, manifestations, and items within the broader cultural heritage context. This integration allows for a unified, interoperable data model that supports detailed semantic descriptions of both tangible and intangible cultural heritage and bibliographic resources. The relationship between CIDOC-CRM and FRBRoo thus bridges museum documentation and library cataloging, fostering cross-domain data exchange and enhancing the discoverability and reuse of cultural and bibliographic information on the semantic web.
As previously mentioned, we utilized these standards to link entities to objects and to define the classes of the objects themselves. All objects have a corresponding CIDOC-CRM class assigned through rdf:type.
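To give a concrete idea of how such statements look in practice, below is a minimal RDFlib sketch of a class assignment via rdf:type, together with one of the domain-specific JVMG properties mentioned above. The item URI, the chosen CIDOC-CRM class, the JVMG namespace URI, and the literal value are illustrative assumptions, not the project's actual data.

```python
# Minimal sketch: assigning a CIDOC-CRM class via rdf:type and adding a
# domain-specific JVMG property. URIs, the class choice, and the literal
# value are illustrative placeholders, not the project's actual data.
from rdflib import Graph, Namespace, URIRef, Literal, RDF

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")   # commonly used CIDOC-CRM namespace
JVMG = Namespace("https://example.org/jvmg/ont/")         # placeholder namespace URI

g = Graph()
g.bind("crm", CRM)
g.bind("jvmg", JVMG)

manga = URIRef("https://example.org/vinlod/item/vinland-saga-manga")  # placeholder item URI
g.add((manga, RDF.type, CRM["E22_Man-Made_Object"]))                  # class assignment via rdf:type
g.add((manga, JVMG.titleOriginal, Literal("ヴィンランド・サガ", lang="ja")))

print(g.serialize(format="turtle"))
```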
XML/TEI and HTML documents
The TEI (Text Encoding Initiative) is a widely used XML-based standard developed to represent texts in digital form, particularly within the humanities. Its goal is to make textual resources not only digitally preservable but also richly describable and interoperable, so that they can be shared and studied across different platforms and research projects.
XML (Extensible Markup Language) is a flexible markup language designed to structure, store, and transport data.
For this project, we created two XML/TEI files, each dedicated to a different textual object:
- A ballad taken from the Poetic Edda, translated by Henry Adams Bellows
- An excerpt from the manuscript Finding of Wineland
Each file includes a teiHeader, which holds the metadata describing the text. Beyond the mandatory parts — namely the <fileDesc> element (containing title, publication statement, and source description) — we also added an <encodingDesc> section. Inside it, we included:
- A concise <projectDesc> outlining the purpose and scope of our work
- An <editorialDecl> detailing our editorial choices, including how we handled corrections, normalizations, hyphenation, and other interventions necessary to make the text clearer without losing fidelity to the source
After encoding the texts in XML, we transformed them into HTML using XSLT (Extensible Stylesheet Language Transformations). XSLT is a powerful language specifically created to process XML documents and convert them into other formats — most commonly HTML — which can then be displayed in a web browser. This transformation step is crucial because, while XML structures and preserves data, it isn’t meant to be read directly by end users. HTML, on the other hand, is designed for presentation: it structures text, links, images, and interactive elements so that the content can be easily navigated and read online.
Special attention was given to entities we identified as central in our conceptual graph — people, places, and events. Whenever these appeared in the text, we marked them so they would become clickable. This approach supports interoperability, as these links can potentially connect to other datasets or resources, and it offers readers the possibility to explore related information if they wish.
We also included notes, which are displayed interactively: by clicking on highlighted text, the user can see footnotes at the bottom of the page or open contextual notes in a side panel. This choice enriches the reading experience, balancing the scholarly depth of critical apparatus with usability and modern web standards.
The transformation from XML to HTML was carried out in Python using the lxml library, which provides robust tools for parsing XML and applying XSLT stylesheets.
In the script, the XML file and the XSLT stylesheet are first parsed; then, the transformation is performed by creating an XSLT object and applying it to the XML document. Finally, the result is serialized and written to an HTML file.
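As a rough illustration, the core of such a script could look like the following sketch; file names are illustrative, not the project's actual paths.

```python
# Minimal sketch of the XML -> HTML transformation described above.
from lxml import etree

xml_doc = etree.parse("poetic_edda.xml")     # parse the TEI source document
xslt_doc = etree.parse("tei_to_html.xsl")    # parse the XSLT stylesheet
transform = etree.XSLT(xslt_doc)             # create a reusable XSLT transformation object

html_doc = transform(xml_doc)                # apply the stylesheet to the TEI document

# Serialize the result and write it to an HTML file
with open("poetic_edda.html", "wb") as f:
    f.write(etree.tostring(html_doc, pretty_print=True, method="html"))
```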
Overall, by combining the descriptive power of XML and TEI with the flexibility of XSLT and the web-native format of HTML, our project aims to create digital editions that are both faithful to the original texts and engaging for contemporary readers.
Text Analysis
For the two XML/TEI files, we first prepared a Python script to extract the plain text from the HTML files and save it in .txt format, using lxml's etree module to parse and navigate the HTML structure. Once we had the plain text, we developed the main Python script for text analysis.
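The extraction step could look roughly like this minimal sketch; file names are illustrative.

```python
# Minimal sketch of extracting plain text from the generated HTML.
from lxml import etree

tree = etree.parse("poetic_edda.html", etree.HTMLParser())
# itertext() yields every text node in document order
plain_text = "\n".join(t.strip() for t in tree.getroot().itertext() if t.strip())

with open("poetic_edda.txt", "w", encoding="utf-8") as f:
    f.write(plain_text)
```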
The main analysis script defines a class called TextAnalyzer, designed to help explore and analyze text files through various natural language processing techniques. It uses popular libraries such as NLTK for basic text processing, spaCy for named entity recognition, scikit-learn for computing TF-IDF scores, and VADER for sentiment analysis.
With this tool, you can load and clean your text by tokenizing it and removing common English stopwords and punctuation. You can then analyze word frequencies, extract frequent n-grams (such as bigrams), and detect meaningful collocations—pairs of words that appear together more often than by chance.
Additionally, the class allows you to inspect how specific words are used in context via concordance views, identify key terms with TF-IDF weighting, recognize named entities like people or places, and perform sentiment analysis to gauge the overall emotional tone of the text.
Using the TextAnalyzer is simple: after initializing it with the path to your text file, you call the load_and_tokenize() method to prepare the data. Then you can run any of the analysis methods as needed.
This makes TextAnalyzer a flexible and effective tool for text mining, ideal for small to medium-sized corpora and for conducting exploratory textual analysis.
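Below is a condensed sketch of how such a class might be structured; apart from load_and_tokenize(), the method names, the file path, and the reduced scope (word frequencies only) are simplifications of the full implementation.

```python
# Condensed sketch of the TextAnalyzer class; only word-frequency analysis is
# shown, and it assumes the NLTK 'punkt' and 'stopwords' data are installed.
import string
from collections import Counter

import nltk
from nltk.corpus import stopwords


class TextAnalyzer:
    def __init__(self, path):
        self.path = path
        self.tokens = []

    def load_and_tokenize(self):
        # Read the plain text, tokenize it, and drop English stopwords and punctuation
        with open(self.path, encoding="utf-8") as f:
            raw = f.read().lower()
        stops = set(stopwords.words("english")) | set(string.punctuation)
        self.tokens = [t for t in nltk.word_tokenize(raw) if t not in stops]

    def word_frequencies(self, n=10):
        # Simple frequency count over the cleaned tokens
        return Counter(self.tokens).most_common(n)


analyzer = TextAnalyzer("poetic_edda.txt")   # illustrative path
analyzer.load_and_tokenize()
print(analyzer.word_frequencies(10))
```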
Here is an example with the top 10 most common words in the Poetic Edda:
| Term | Count |
|---|---|
| ’ | 31 |
| geirröth | 15 |
| othin | 15 |
| gods | 15 |
| thou | 11 |
| set | 10 |
| shall | 10 |
| agnar | 9 |
| king | 9 |
| men | 9 |
Although the corpus is relatively small (just over 1100 tokens), conducting a textual analysis proves invaluable for understanding the nature of the text. Notably, the most frequent token is the Saxon genitive apostrophe, which is naturally included among the terms due to its semantic significance—while all other punctuation marks have been filtered out as stopwords. This finding aligns well with the epic character of the work, where numerous names appear, typically presented within familial formulas such as ‘father of’ or ‘son of,’ a common feature in mythological narrative traditions.
Let us now consider another example using the TF-IDF method:
| Term | TF-IDF |
|---|---|
| ’ | 0.1343 |
| othin | 0.1123 |
| geirröth | 0.1123 |
| gods | 0.1123 |
| thou | 0.1029 |
| shall | 0.1001 |
| set | 0.1001 |
| men | 0.0969 |
| agnar | 0.0969 |
| king | 0.0969 |
According to the TF-IDF analysis, it is understandable that othin ranks second: as noted in the introduction to the transcription of the Poetic Edda on our website, his name holds a central place in the narrative and, as he himself reveals in the poem, the text is dense with his epithets, which we fully annotated in the <particDesc> because of their significance to the story.
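For reference, scores of this kind can be computed with scikit-learn's TfidfVectorizer, roughly as in the following sketch, assuming the two plain-text files form the corpus; file names are illustrative, and stopword and punctuation filtering is assumed to have been applied upstream.

```python
# Rough sketch of computing TF-IDF scores with scikit-learn over the two texts.
from sklearn.feature_extraction.text import TfidfVectorizer

paths = ["poetic_edda.txt", "finding_of_wineland.txt"]   # illustrative file names
docs = [open(p, encoding="utf-8").read() for p in paths]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)                   # one row per document

# Top 10 terms for the first document (the Poetic Edda excerpt)
terms = vectorizer.get_feature_names_out()
scores = matrix[0].toarray().ravel()
top = sorted(zip(terms, scores), key=lambda x: x[1], reverse=True)[:10]
for term, score in top:
    print(f"{term}\t{score:.4f}")
```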
From * to RDF
As previously said, the main objective and final step of the project was the transformation of heterogeneous data sources into RDF format. In the end, every object we worked on—whether stored in CSV or XML—was converted to RDF.
In order to do so, we developed different Python scripts tailored to the specific format and structure of each input file.
The transformation process always involved:
- Identifying a suitable subject URI for the resource.
- Mapping predicates according to established ontologies.
- Differentiating between literals and resources, with careful attention to domain and range compatibility.
- Enriching the graph with contextual information wherever available, including relationships with people, places, institutions, or time periods.
For the objects without downloadable metadata available—such as the anime opening video, the Wagner recording, the illustration, the painting, the manga, and the anime—we worked with metadata stored in CSV format. These files listed, row by row, the predicates and objects to be associated with a given subject URI.
The logic behind the transformation was modular: for each row, the script parsed the predicate (resolving it to a full URI through a prefix mapping) and determined whether the object should be treated as a URI or a literal. We also took care to bind all relevant namespaces to the RDF graph for readability and standardization.
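A minimal sketch of this CSV-to-RDF logic is shown below; the column names, file names, subject URI, and prefix mapping are assumptions made for illustration, not the project's exact layout.

```python
# Minimal sketch of the CSV-to-RDF logic described above.
# The CSV layout (predicate, object, is_uri columns), file names, and subject
# URI are illustrative assumptions.
import pandas as pd
from rdflib import Graph, Namespace, URIRef, Literal

PREFIXES = {
    "dc": Namespace("http://purl.org/dc/elements/1.1/"),
    "schema": Namespace("https://schema.org/"),
}

def expand(curie):
    # Resolve a prefixed predicate (e.g. "dc:title") to a full URI
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix][local]

g = Graph()
for prefix, ns in PREFIXES.items():
    g.bind(prefix, ns)                       # bind namespaces for readable Turtle output

subject = URIRef("https://example.org/vinlod/item/anime-opening")  # placeholder subject URI
df = pd.read_csv("anime_opening.csv")                              # illustrative file name

for _, row in df.iterrows():
    predicate = expand(row["predicate"])
    # Treat the object as a URI or a literal depending on a flag column
    if str(row["is_uri"]).lower() == "true":
        obj = URIRef(row["object"])
    else:
        obj = Literal(row["object"])
    g.add((subject, predicate, obj))

g.serialize(destination="anime_opening.ttl", format="turtle")
```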
Other objects came with XML metadata, in various schemas—mainly LIDO, MARC and TEI. For these, we wrote scripts that navigated and parsed the XML structure, extracted relevant information, and mapped it to RDF triples.
In the case of the two texts encoded in XML/TEI, we treated the TEI file itself as another manifestation of the work and extracted metadata primarily from the teiHeader, emphasizing bibliographic and descriptive information about the digital edition and its context.
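The TEI case could be sketched roughly as follows; the XPath expressions, property choices, file names, and URIs are illustrative assumptions rather than the project's actual mapping.

```python
# Rough sketch of pulling descriptive metadata from a teiHeader and mapping it to RDF.
from lxml import etree
from rdflib import Graph, Namespace, URIRef, Literal

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
DC = Namespace("http://purl.org/dc/elements/1.1/")

tree = etree.parse("poetic_edda.xml")   # illustrative file name
title = tree.findtext(".//tei:fileDesc//tei:title", namespaces=TEI_NS)
publisher = tree.findtext(".//tei:publicationStmt/tei:publisher", namespaces=TEI_NS)

g = Graph()
g.bind("dc", DC)
edition = URIRef("https://example.org/vinlod/edition/poetic-edda-tei")  # placeholder URI

if title:
    g.add((edition, DC.title, Literal(title)))
if publisher:
    g.add((edition, DC.publisher, Literal(publisher)))

g.serialize(destination="poetic_edda_tei.ttl", format="turtle")
```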
All RDF graphs were serialized in Turtle format. This choice supported better readability and easier version control. The serialized files are structured to reflect the subject–predicate–object pattern clearly and allow further integration or querying via SPARQL endpoints.
The main libraries and tools used included:
- RDFlib for building and serializing RDF graphs.
- pandas for reading and manipulating CSV files.
- lxml for parsing XML files.
Each script was customized for the input format and the type of object being described, but we maintained a shared structure and approach to ensure overall coherence.
This data transformation phase played a crucial role in the project, not only enabling interoperability but also supporting a deeper reflection on metadata quality, structure, and meaning.
The only Turtle file not generated through Python scripts was VinLOD.ttl.
This file was written manually for its very specific purpose within the dataset: it acts as a bridge connecting the work—in this case, Vinland Saga—to the actual objects described in our RDF collection.
In VinLOD.ttl, we defined:
- Classes (e.g., Person, Event, Place) that represent core concepts of the narrative and historical universe.
- Properties describing relationships between these entities (such as genealogical links, participation in historical events, or connections between narrative characters and real historical figures).
- Alignments and semantic mappings to existing ontologies and vocabularies (e.g., CIDOC CRM, FRBRoo, schema.org, Wikidata, DBpedia) using rdfs:subClassOf or owl:sameAs. This alignment ensures that our dataset remains interoperable and understandable within the wider Linked Open Data ecosystem.
- Annotations and multilingual labels to clarify meaning and enhance accessibility.
Its aim was not simply to describe individual objects, but rather to establish a semantic network that connects the narrative world to historically documented people, places, events and, finally, their items.
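To give a sense of the kinds of statements VinLOD.ttl contains, here is a sketch expressed with RDFlib for consistency with the other examples; the class names, labels, and external identifiers are illustrative placeholders rather than the file's actual contents.

```python
# Sketch of the kind of alignment statements VinLOD.ttl contains, expressed with
# RDFlib; class names, labels, and external identifiers are placeholders.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, OWL

VINLOD = Namespace("https://example.org/vinlod/")        # placeholder project namespace
CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
WD = Namespace("http://www.wikidata.org/entity/")

g = Graph()
for prefix, ns in {"vinlod": VINLOD, "crm": CRM, "wd": WD}.items():
    g.bind(prefix, ns)

# A project-specific class aligned to CIDOC-CRM, with multilingual labels
g.add((VINLOD.Event, RDF.type, OWL.Class))
g.add((VINLOD.Event, RDFS.subClassOf, CRM.E5_Event))
g.add((VINLOD.Event, RDFS.label, Literal("Event", lang="en")))
g.add((VINLOD.Event, RDFS.label, Literal("Evento", lang="it")))

# An entity aligned to an external authority via owl:sameAs
character = VINLOD.ExampleCharacter                      # placeholder resource
g.add((character, RDF.type, CRM.E21_Person))
g.add((character, OWL.sameAs, WD["Q000000"]))            # placeholder Wikidata identifier

print(g.serialize(format="turtle"))
```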
Conclusions
This project demonstrates how transforming heterogeneous cultural heritage data into RDF is more than a purely technical exercise: it is a process of knowledge construction. By explicitly modelling relationships, contexts, and multiple layers of meaning, we shift from isolated descriptions to a rich semantic network that supports discovery, reuse, and long-term preservation.
In recent years, the web itself has evolved in this direction. From a primarily document-centric system, it has become an ever-expanding constellation of interconnected data—what we call the Semantic Web. Graph structures now lie at the heart of this transformation: they are the underlying models that allow machines to interpret and connect disparate information, turning raw data into a navigable, contextualized body of knowledge.
Major platforms and search engines—Google, Microsoft, and many others—have embraced this paradigm, building and querying vast knowledge graphs to enhance search, recommendation, and content understanding. The same logic applies to cultural heritage: by structuring data as linked graphs, we make objects and narratives not only visible but also meaningfully discoverable, both by people and by algorithms in the LAM environment.
Through VinLOD, we aimed to show how a small, curated dataset—even one inspired by popular culture like Vinland Saga—can become part of this larger, interoperable web of knowledge. Our conceptual maps, metadata choices, and manual alignments to existing ontologies illustrate that cultural heritage is not just a collection of objects: it is a dynamic network of meanings, contexts, and interpretations that can keep evolving and connecting across time, space, and disciplines.
In the end, the act of modelling and linking data is itself an act of storytelling: one that preserves the past, enriches the present, and opens new paths for future research, creativity, and shared understanding.
References and Resources
BIBFRAME (Bibliographic Framework): https://www.loc.gov/bibframe/
CIDOC Conceptual Reference Model (CIDOC-CRM): https://cidoc-crm.org/
DBpedia: https://www.dbpedia.org/
Dublin Core Metadata Initiative (DCMI): https://www.dublincore.org/specifications/dublin-core/dcmi-terms/
FRBR (Functional Requirements for Bibliographic Records): https://www.loc.gov/catdir/cpso/frbreng.pdf
FRBRoo (Functional Requirements for Bibliographic Records - object oriented): https://cidoc-crm.org/frbroo/
HTML (HyperText Markup Language): https://html.spec.whatwg.org/multipage/
ISO 2709: https://www.iso.org/standard/41319.html
ISBD (International Standard Bibliographic Description): https://www.ifla.org/wp-content/uploads/2019/05/assets/hq/publications/series/44-it.pdf
JVMG (Japanese Visual Media Graph): https://jvmg.iuk.hdm-stuttgart.de/, GitHub repository: https://github.com/Japanese-Visual-Media-Graph/ontologies/blob/main/jvmg.ttl
LOC (Library of Congress): https://www.loc.gov/
LIDO: https://cidoc.mini.icom.museum/working-groups/lido/lido-overview/lido-schema/
lxml: https://lxml.de/
MARC21: https://www.loc.gov/marc/bibliographic/
NLTK (Natural Language Toolkit): https://www.nltk.org/
NMO (Nomisma): https://nomisma.org/ontology
pandas: https://pandas.pydata.org/
RDFlib: https://rdflib.readthedocs.io/en/stable/
Schema.org: https://schema.org/
spaCy: https://spacy.io/
Text Encoding Initiative (TEI): https://tei-c.org/
VADER (Valence Aware Dictionary and sEntiment Reasoner): https://vadersentiment.readthedocs.io/en/latest/
VinLOD Saga Github repository: https://github.com/VinLOD-Saga/VinLOD-Saga
W3C RDF: https://www.w3.org/RDF/
Wikidata: https://www.wikidata.org/wiki/Wikidata:Main_Page
XSLT (Extensible Stylesheet Language Transformations): https://www.w3.org/TR/xslt20/