DOCUMENTATION

of VinLOD Saga


Introduction

This project was developed for the Information Science and Cultural Heritage exam within the University of Bologna's Master's degree program in Digital Humanities and Digital Knowledge. Its core purpose is to semantically link 10 cultural heritage objects on the web using RDF, with the overarching goal of contributing to the preservation and enhanced accessibility of cultural heritage.

Our workflow began with the selection of objects inspired by the anime and manga Vinland Saga. Each chosen item was then organized and interlinked within a conceptual map in order to visually highlight the entity relationships between the selected items.

Subsequently, all associated information (including textual transcriptions in XML/TEI) was captured using appropriate standard metadata and converted into RDF (Resource Description Framework), a machine-readable and interoperable format.

The entire process allowed us to establish robust semantic connections between the objects, their descriptions and their contexts, thereby enriching the digital landscape of cultural heritage, with the scope of ensuring its longevity and discoverability for future students and users.

Mapping the Items

The items were selected through different forms of linkage:

To reflect the variety inherent in cultural heritage, the selected objects included different types (or Classes) — such as a painting, a coin, a book, an audio recording, and a video — sourced from libraries, archives, and museums (LAMs) as well as from digital repositories. Each object was described using metadata standards suited to its format and institutional context [See here: Metadata standards].

This approach enabled the integration of heterogeneous resources into a coherent conceptual framework, illustrating how multiple forms of relationship — narrative, structural, and thematic — can support the creation of a unified dataset across diverse domains.

Our workflow began with two concept maps built from the work (in FRBR terms) Vinland Saga. Rather than linking this work directly to the items, we introduced intermediate entities — such as events, concepts, characters, and related works — to create a richer, multi-layered network. These entities are visually differentiated, as can be seen in the legend:

The predicates in this early stage mainly come from CIDOC-CRM (events) and FRBRoo (people). When no suitable standard predicate was available, we defined new ones within our VinLOD namespace to maintain semantic consistency.

To reflect different levels of abstraction from the manifestation to the item, we followed the FRBR model. For books and music — forms of art that can exist in multiple manifestations — we used an idea-to-manifestation approach: first connecting through the conceptual entity, then linking to the tangible, material object.

The concept maps were created step by step: starting from an initial draft in natural language (Theoretical Map), then refined through research on each object and by studying how institutions had catalogued and described them using appropriate metadata standards. At this stage, we ensured compatibility and enriched the map (Conceptual Map) with additions and modifications, respecting the domain and range requirements of each predicate.

Metadata standards

The selection of metadata standards began by reviewing the websites hosting the objects we intended to include in our project. Wherever possible, we chose to implement the same standards adopted by the hosting institutions, while always considering the need for conversion into RDF and ensuring compatibility with the final data model.

In cases where it was not possible to obtain metadata directly from institutional sources, or where the standard used was not suitable from a LOD perspective, we carried out additional research on our own to determine the most appropriate standards for each specific type of item.

For the metadata retrieved from institutions, we downloaded and converted their files (all of them were XML) into RDF using Python, selectively extracting only the elements that aligned with our initial vision of the map and supplementing them when necessary.

Where objects did not have metadata available for download, we manually created CSV files, which were then converted into RDF [See here: From * to RDF].

Below is a detailed list of all the metadata standards adopted, their purpose, references to official documentation, and the specific objects to which they were applied, together with the rationale behind each choice.

XML/TEI and HTML documents

The TEI (Text Encoding Initiative) is a widely used XML-based standard developed to represent texts in digital form, particularly within the humanities. Its goal is to make textual resources not only digitally preservable but also richly describable and interoperable, so that they can be shared and studied across different platforms and research projects.

XML (Extensible Markup Language) is a flexible markup language designed to structure, store, and transport data.

For this project, we created two XML/TEI files, each dedicated to a different textual object:

Each file includes a teiHeader, which holds the metadata describing the text. Beyond the mandatory parts — namely the <fileDesc> element (containing title, publication statement, and source description) — we also added an <encodingDesc> section. Inside it, we included:

After encoding the texts in XML, we transformed them into HTML using XSLT (Extensible Stylesheet Language Transformations). XSLT is a powerful language specifically created to process XML documents and convert them into other formats — most commonly HTML — which can then be displayed in a web browser. This transformation step is crucial because, while XML structures and preserves data, it isn’t meant to be read directly by end users. HTML, on the other hand, is designed for presentation: it structures text, links, images, and interactive elements so that the content can be easily navigated and read online.

Special attention was given to entities we identified as central in our conceptual graph — people, places, and events. Whenever these appeared in the text, we marked them so they would become clickable. This approach supports interoperability, as these links can potentially connect to other datasets or resources, and it offers readers the possibility to explore related information if they wish.

We also included notes, which are displayed interactively: by clicking on highlighted text, the user can see footnotes at the bottom of the page or open contextual notes in a side panel. This choice enriches the reading experience, balancing the scholarly depth of critical apparatus with usability and modern web standards.

The transformation from XML to HTML was carried out in Python using the lxml library, which provides robust tools for parsing XML and applying XSLT stylesheets.

In the script, the XML file and the XSLT stylesheet are first parsed; then, the transformation is performed by creating an XSLT object and applying it to the XML document. Finally, the result is serialized and written to an HTML file.
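A minimal sketch of this step, with illustrative file names standing in for the project's actual paths:

    from lxml import etree

    # Parse the TEI/XML source and the XSLT stylesheet
    # ("edda.xml" and "tei_to_html.xsl" are placeholder names)
    xml_tree = etree.parse("edda.xml")
    xslt_tree = etree.parse("tei_to_html.xsl")

    # Create the XSLT transformer and apply it to the XML document
    transform = etree.XSLT(xslt_tree)
    html_tree = transform(xml_tree)

    # Serialize the result and write it out as HTML
    with open("edda.html", "wb") as out:
        out.write(etree.tostring(html_tree, method="html",
                                 pretty_print=True, encoding="utf-8"))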

Overall, by combining the descriptive power of XML and TEI with the flexibility of XSLT and the web-native format of HTML, our project aims to create digital editions that are both faithful to the original texts and engaging for contemporary readers.

Text Analysis

For the two XML/TEI files, we first prepared a Python script to extract the plain text from the HTML files and save it in .txt format, using lxml's etree module to parse and navigate the HTML structure (a minimal sketch of this step is shown below). Once we had the plain text, we developed the main Python script for text analysis.
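The sketch below uses placeholder file names and assumes the relevant text sits inside the <body> of the generated HTML:

    from lxml import html

    # Parse the HTML edition ("edda.html" is a placeholder name)
    tree = html.parse("edda.html")

    # text_content() returns the concatenated text of an element,
    # stripped of all markup
    body = tree.find(".//body")
    plain_text = body.text_content() if body is not None else tree.getroot().text_content()

    with open("edda.txt", "w", encoding="utf-8") as out:
        out.write(plain_text)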

The main analysis script defines a class called TextAnalyzer, designed to help explore and analyze text files through various natural language processing techniques. It uses popular libraries such as NLTK for basic text processing, spaCy for named entity recognition, scikit-learn for computing TF-IDF scores, and VADER for sentiment analysis.

With this tool, you can load and clean your text by tokenizing it and removing common English stopwords and punctuation. You can then analyze word frequencies, extract frequent n-grams (such as bigrams), and detect meaningful collocations—pairs of words that appear together more often than by chance.

Additionally, the class allows you to inspect how specific words are used in context via concordance views, identify key terms with TF-IDF weighting, recognize named entities like people or places, and perform sentiment analysis to gauge the overall emotional tone of the text.

Using the TextAnalyzer is simple: after initializing it with the path to your text file, you call the load_and_tokenize() method to prepare the data. Then you can run any of the analysis methods as needed.
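The following heavily simplified sketch shows the shape of such a class; apart from load_and_tokenize(), the names used here (word_frequencies(), the placeholder file name) are illustrative and not the project's actual API:

    from collections import Counter
    import string

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)


    class TextAnalyzer:
        """Simplified sketch: the project version offers many more methods
        (n-grams, collocations, concordances, TF-IDF, NER, sentiment)."""

        def __init__(self, path):
            self.path = path
            self.tokens = []

        def load_and_tokenize(self):
            # Read the plain-text file, tokenize it, and drop
            # common English stopwords and punctuation
            with open(self.path, encoding="utf-8") as f:
                text = f.read().lower()
            stops = set(stopwords.words("english")) | set(string.punctuation)
            self.tokens = [t for t in word_tokenize(text) if t not in stops]

        def word_frequencies(self, n=10):
            # Return the n most common tokens with their counts
            return Counter(self.tokens).most_common(n)


    # Hypothetical usage ("poetic_edda.txt" is a placeholder)
    analyzer = TextAnalyzer("poetic_edda.txt")
    analyzer.load_and_tokenize()
    print(analyzer.word_frequencies(10))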

This makes TextAnalyzer a flexible and effective tool for text mining, ideal for small to medium-sized corpora and for conducting exploratory textual analysis.

Here is an example with the top 10 most common words in the Poetic Edda:

Term Count
's 31
geirröth 15
othin 15
gods 15
thou 11
set 10
shall 10
agnar 9
king 9
men 9

Although the corpus is relatively small (just over 1100 tokens), conducting a textual analysis proves invaluable for understanding the nature of the text. Notably, the most frequent token is the Saxon genitive apostrophe, which is naturally included among the terms due to its semantic significance—while all other punctuation marks have been filtered out as stopwords. This finding aligns well with the epic character of the work, where numerous names appear, typically presented within familial formulas such as ‘father of’ or ‘son of,’ a common feature in mythological narrative traditions.

Let us now consider another example using the TF-IDF method:

Term TF-IDF
's 0.1343
othin 0.1123
geirröth 0.1123
gods 0.1123
thou 0.1029
shall 0.1001
set 0.1001
men 0.0969
agnar 0.0969
king 0.0969

According to the TF-IDF analysis, it is understandable that othin ranks second: as noted in the introduction to the transcription of the Poetic Edda on our website, his name holds a central place in the narrative. As he himself states, the text is dense with his epithets, which have been fully annotated in the <particDesc> because of their significance to the story.

From * to RDF

As stated above, the main objective and final step of the project was the transformation of heterogeneous data sources into RDF format. In the end, every object we worked on—whether stored in CSV or XML—was converted to RDF.

In order to do so, we developed different Python scripts tailored to the specific format and structure of each input file.

The transformation process always involved:

For the objects without downloadable metadata available—such as the anime opening video, the Wagner recording, the illustration, the painting, the manga, and the anime—we worked with metadata stored in CSV format. These files listed, row by row, the predicates and objects to be associated with a given subject URI.

The logic behind the transformation was modular: for each row, the script parsed the predicate (resolving it to a full URI through a prefix mapping) and determined whether the object should be treated as a URI or a literal. We also took care to bind all relevant namespaces to the RDF graph for readability and standardization.
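A condensed sketch of that logic, assuming a CSV with predicate and object columns and a small prefix map (the column names, prefixes, base URI and file names are illustrative):

    import csv

    from rdflib import Graph, Literal, Namespace, URIRef

    # Illustrative prefix map; the project binds its own namespaces,
    # including the custom VinLOD one
    PREFIXES = {
        "dcterms": Namespace("http://purl.org/dc/terms/"),
        "schema": Namespace("https://schema.org/"),
        "vinlod": Namespace("https://example.org/vinlod/"),  # placeholder base URI
    }


    def csv_to_rdf(csv_path, subject_uri, out_path):
        g = Graph()
        for prefix, ns in PREFIXES.items():
            g.bind(prefix, ns)  # bound namespaces keep the Turtle output readable

        subject = URIRef(subject_uri)
        with open(csv_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                # Resolve "prefix:localname" predicates through the prefix map
                prefix, local = row["predicate"].split(":", 1)
                predicate = PREFIXES[prefix][local]
                # Objects that look like URIs become resources, the rest literals
                value = row["object"]
                obj = URIRef(value) if value.startswith("http") else Literal(value)
                g.add((subject, predicate, obj))

        g.serialize(destination=out_path, format="turtle")


    # Hypothetical usage
    csv_to_rdf("manga.csv", "https://example.org/vinlod/manga", "manga.ttl")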

Other objects came with XML metadata, in various schemas—mainly LIDO, MARC and TEI. For these, we wrote scripts that navigated and parsed the XML structure, extracted relevant information, and mapped it to RDF triples.

In the case of the two texts encoded in XML/TEI, we treated the TEI file itself as another manifestation of the work and extracted metadata primarily from the teiHeader, emphasizing bibliographic and descriptive information about the digital edition and its context.
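A condensed sketch of how such a script might pull a couple of fields out of the teiHeader and map them to triples; the XPath expressions, Dublin Core properties, URIs and file names are illustrative rather than the project's exact mapping:

    from lxml import etree
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS

    TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


    def tei_to_rdf(tei_path, edition_uri, out_path):
        tree = etree.parse(tei_path)

        # Read basic bibliographic data from the fileDesc inside the teiHeader
        title = tree.findtext(".//tei:fileDesc/tei:titleStmt/tei:title",
                              namespaces=TEI_NS)
        publisher = tree.findtext(".//tei:fileDesc/tei:publicationStmt/tei:publisher",
                                  namespaces=TEI_NS)

        g = Graph()
        g.bind("dcterms", DCTERMS)
        edition = URIRef(edition_uri)
        if title:
            g.add((edition, DCTERMS.title, Literal(title)))
        if publisher:
            g.add((edition, DCTERMS.publisher, Literal(publisher)))

        g.serialize(destination=out_path, format="turtle")


    # Hypothetical usage
    tei_to_rdf("edda.xml", "https://example.org/vinlod/edda-digital-edition", "edda.ttl")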

All RDF graphs were serialized in Turtle format. This choice supported better readability and easier version control. The serialized files are structured to reflect the subject–predicate–object pattern clearly and allow further integration or querying via SPARQL endpoints.
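For instance, any of the serialized files can be loaded back into an rdflib graph and queried locally with SPARQL (the file name and the query below are just an illustration):

    from rdflib import Graph

    g = Graph()
    g.parse("manga.ttl", format="turtle")  # placeholder file name

    # List up to ten triples whose object is a literal value
    query = """
    SELECT ?s ?p ?o
    WHERE {
        ?s ?p ?o .
        FILTER(isLiteral(?o))
    }
    LIMIT 10
    """
    for s, p, o in g.query(query):
        print(s, p, o)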

The main libraries and tools used included:

Each script was customized for the input format and the type of object being described, but we maintained a shared structure and approach to ensure overall coherence.

This data transformation phase played a crucial role in the project, not only enabling interoperability but also supporting a deeper reflection on metadata quality, structure, and meaning.

The only Turtle file not generated through Python scripts was VinLOD.ttl.

This file was written manually for its very specific purpose within the dataset: it acts as a bridge connecting the work—in this case, Vinland Saga—to the actual objects described in our RDF collection.

In VinLOD.ttl, we defined:

Its aim was not simply to describe individual objects, but rather to establish a semantic network that connects the narrative world to historically documented people, places, events and, finally, their items.

Conclusions

This project demonstrates how transforming heterogeneous cultural heritage data into RDF is more than a purely technical exercise: it is a process of knowledge construction. By explicitly modelling relationships, contexts, and multiple layers of meaning, we shift from isolated descriptions to a rich semantic network that supports discovery, reuse, and long-term preservation.

In recent years, the web itself has evolved in this direction. From a primarily document-centric system, it has become an ever-expanding constellation of interconnected data—what we call the Semantic Web. Graph structures now lie at the heart of this transformation: they are the underlying models that allow machines to interpret and connect disparate information, turning raw data into a navigable, contextualized body of knowledge.

Major platforms and search engines—Google, Microsoft, and many others—have embraced this paradigm, building and querying vast knowledge graphs to enhance search, recommendation, and content understanding. The same logic applies to cultural heritage: by structuring data as linked graphs, we make objects and narratives not only visible but also meaningfully discoverable, both by people and by algorithms in the LAM environment.

Through VinLOD, we aimed to show how a small, curated dataset—even one inspired by popular culture like Vinland Saga—can become part of this larger, interoperable web of knowledge. Our conceptual maps, metadata choices, and manual alignments to existing ontologies illustrate that cultural heritage is not just a collection of objects: it is a dynamic network of meanings, contexts, and interpretations that can keep evolving and connecting across time, space, and disciplines.

In the end, the act of modelling and linking data is itself an act of storytelling: one that preserves the past, enriches the present, and opens new paths for future research, creativity, and shared understanding.

References and Resources

BIBFRAME (Bibliographic Framework): https://www.loc.gov/bibframe/

CIDOC Conceptual Reference Model (CIDOC-CRM): https://cidoc-crm.org/

DBpedia: https://www.dbpedia.org/

Dublin Core Metadata Initiative (DCMI): https://www.dublincore.org/specifications/dublin-core/dcmi-terms/

FRBR (Functional Requirements for Bibliographic Records): https://www.loc.gov/catdir/cpso/frbreng.pdf

FRBRoo (Functional Requirements for Bibliographic Records - object oriented): https://cidoc-crm.org/frbroo/

HTML (HyperText Markup Language): https://html.spec.whatwg.org/multipage/

ISO 2709: https://www.iso.org/standard/41319.html

ISBD (International Standard Bibliographic Description): https://www.ifla.org/wp-content/uploads/2019/05/assets/hq/publications/series/44-it.pdf

JVMG (Japanese Visual Media Graph): https://jvmg.iuk.hdm-stuttgart.de/, GitHub repository: https://github.com/Japanese-Visual-Media-Graph/ontologies/blob/main/jvmg.ttl

LOC (Library of Congress): https://www.loc.gov/

LIDO: https://cidoc.mini.icom.museum/working-groups/lido/lido-overview/lido-schema/

lxml: https://lxml.de/

MARC21: https://www.loc.gov/marc/bibliographic/

NLTK (Natural Language Toolkit): https://www.nltk.org/

NMO (Nomisma): https://nomisma.org/ontology

pandas: https://pandas.pydata.org/

RDFlib: https://rdflib.readthedocs.io/en/stable/

Schema.org: https://schema.org/

spaCy: https://spacy.io/

Text Encoding Initiative (TEI): https://tei-c.org/

VADER (Valence Aware Dictionary and sEntiment Reasoner): https://vadersentiment.readthedocs.io/en/latest/

VinLOD Saga Github repository: https://github.com/VinLOD-Saga/VinLOD-Saga

W3C RDF: https://www.w3.org/RDF/

Wikidata: https://www.wikidata.org/wiki/Wikidata:Main_Page

XSLT (Extensible Stylesheet Language Transformations): https://www.w3.org/TR/xslt20/