Data Descriptions

Overview

Teaching: 10 min
Exercises: 15 min
Questions
  • 1 What are Data Descriptions?

  • 2 How to reuse Data Descriptions?

  • 3 Are there standard ways for doing Data Descriptions?

  • 4 What is the relation between Data Descriptions and Linked Data?

Objectives
  • The participant will learn that Data Descriptions are named differently in various fields.

  • The participant will be able to create a machine-friendly Data Descriptions file.

FAIR principles used in Data Descriptions:

Interoperable

FM-I1 (Use a Knowledge Representation Language) → doi.org/10.25504/FAIRsharing.jLpL6i
FM-I2 (Use FAIR Vocabularies) → doi.org/10.25504/FAIRsharing.jLpL6i

Reusable

FM-R1.3 (Meets Community Standards) → doi.org/10.25504/FAIRsharing.cuyPH9

1. What are Data Descriptions?

Data descriptions are a detailed explanation and documentation of each data attribute or variable in a dataset.

Depending on the use case and field of research, these data descriptions are also known as a codebook or a data glossary.

For example:

Attribute Name | Description | Data type
Litigation | Name of litigation (case law) | Text
Jurisdiction | Organisation of the legal system | Category
Sector | Concerned sector of the case | Category

“Data Descriptions” is sometimes named differently depending on the field

No matter what terminology you use, “Data Descriptions” always refers to a detailed explanation and documentation of each data attribute or variable in a dataset.

2. How to reuse Data Descriptions?

Documentation of any kind takes time, so we should always aim to reuse existing data descriptions that are generally accepted in the community. For example, the variable Litigation is a concept that has been widely used in legal and political studies; therefore, we don’t need to redefine it every time.

For example, in EU Vocabularies, we can find existing descriptions of Litigation. These descriptions belong to an Ontology, i.e. a community-accepted online dictionary for curated terms and definitions. Moreover, it provides a globally unique identifier to the description. → LINK TO EXAMPLE


[Figure: eurovoc 1]
You will get several results when searching for a term and its definition. These results come from the different ontologies that define the term. For example, think of the description of a musical instrument. It might be defined differently in a British dictionary than in an American one.
[Figure: eurovoc 2]
Finally, using this Ontology, you can get a standard definition that is curated by community experts and has a global identifier.

[Figure: eurovoc 3]

Describe your data by reusing Ontology terms

By reusing Ontology terms or community-accepted vocabularies, we aim to create a culture of recycling definitions by default.

Advantages

  • We don’t have to redefine the terms every time
  • We get a permanent link to the resource
  • Since it uses a global identifier, it becomes easier for others to integrate with different data sources

Disadvantages

  • Sometimes, you might not find an Ontology or vocabulary that fits your variable.
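
One way to picture this reuse is a simple lookup from attribute names to ontology identifiers. The sketch below is illustrative only: the `example.org` IRIs are hypothetical placeholders (not real EU Vocabularies identifiers), and the helper name `find_unmapped` is an assumption. It also reflects the disadvantage listed above: an attribute without a matching term is flagged rather than silently skipped.

```python
from typing import Optional

# Map each attribute to the identifier (IRI) of a reused ontology term.
# NOTE: the example.org IRIs are hypothetical placeholders; the real
# identifiers have to be looked up in a registry such as EU Vocabularies.
attribute_terms: dict[str, Optional[str]] = {
    "Litigation": "https://example.org/vocab/litigation",      # hypothetical
    "Jurisdiction": "https://example.org/vocab/jurisdiction",  # hypothetical
    "Sector": None,  # not looked up yet
}


def find_unmapped(terms: dict[str, Optional[str]]) -> list[str]:
    """Return the attribute names that have no ontology term assigned."""
    return [name for name, iri in terms.items() if not iri]


unmapped = find_unmapped(attribute_terms)
print("Attributes still needing a term:", unmapped)
```

Keeping unmapped attributes visible makes it easier to decide later whether to keep searching or to write a local definition.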

3. Are there standard ways for doing Data Descriptions?

There are no standard ways of doing Data Descriptions.
The minimum elements you need to describe your dataset are the Attribute Name and the Link to Description. You can do that in a tabular format. However, following the FAIR principles of Interoperability and Reusability, we must ensure that the data is described using community-standard FAIR vocabularies. Here are some Ontologies for general use that can cover a wide variety of data attributes:

Ontology | Link | About what?
Schema.org | LINK | Definitions of generic things e.g. “Computer”
DBpedia | LINK | Definitions from Wikipedia
Dublin Core | LINK | Definitions about Metadata
EVO: EVent Ontology | LINK | Definitions about historical events (Digital Humanities)
ROAR: Reconstructions and Observations in Archival Resources | LINK | Genealogists, Archaeologists and Archivists use it to describe the reconstruction of lives (or places)
The Music Ontology Specification | LINK | Main concepts and properties for describing music (i.e. artists, albums and tracks) on the Semantic Web
AudioSet ontology | LINK | All types of sounds: everyday sounds, from human and animal sounds, to natural and environmental sounds, to musical and miscellaneous sounds
Lexvo.org | LINK | Definitions about languages, words, characters, and other human language-related entities

There are also public registries where you can find Ontologies:

  • CLARIAH - Curated list of Ontologies for Digital Humanities: github.com/CLARIAH/awesome-humanities-ontologies
  • EU Vocabularies: op.europa.eu/en/web/eu-vocabularies
  • Linked Open Vocabularies: lov.linkeddata.es/dataset/lov/
  • BioPortal: bioportal.bioontology.org/
  • AgroPortal: agroportal.lirmm.fr/
  • EcoPortal: ecoportal.lifewatchitaly.eu/
  • Ontology Lookup Service by the EBI: ebi.ac.uk/ols/index


Exercise - Level Easy 🌶

  1. Visit EU Vocabularies. EU Vocabularies is the reference website for curated vocabularies maintained by the Publications Office of the European Union.
  2. Search for an Ontology term for Sector in the “Search our catalogue” search box.
  3. Select one result whose description of Sector seems sound to you.
  4. What are the definition and ID?

Solution

Definition: A sector can be a subgroup of an economic activity - as in “coal mining sector” - or a group of economic activities - as in “service sector” - or a cross-section of a group of economic activities - as in “informal sector”. “Sector” is also a specific term used in the 1993 United Nations System of National Accounts to denote one of the five mutually exclusive institutional sectors that group together institutional units on the basis of their principal functions, behaviour and objectives, namely: nonfinancial corporations, financial corporations, general government, non-profit institutions serving households (NPISHs) and households.
ID: http://publications.europa.eu/resource/authority/sdmxglossary2018/SECTOR

The Data Descriptions are usually manually written in a tabular format. This document has the length and depth that the data owner sees fit. The general rule of thumb is to describe the dataset related to a publication. Any accessible format like .csv, .xls, or similar is acceptable.
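
As a minimal sketch of such a tabular file, the snippet below writes a data description as CSV text using Python's standard library, including the two minimum elements named earlier (Attribute Name and Link to Description). The Sector link is the identifier from the exercise solution above; the other two links are left empty because they still have to be looked up. The function name `to_csv_text` and the file name `data_descriptions.csv` are only suggestions.

```python
import csv
import io

# Rows of the data description: attribute name, human-readable description,
# and a link to the reused ontology term (empty until a term is chosen).
rows = [
    {"Attribute Name": "Litigation",
     "Description": "Name of litigation (case law)",
     "Link to Description": ""},
    {"Attribute Name": "Jurisdiction",
     "Description": "Organisation of the legal system",
     "Link to Description": ""},
    {"Attribute Name": "Sector",
     "Description": "Concerned sector of the case",
     "Link to Description": "http://publications.europa.eu/resource/authority/sdmxglossary2018/SECTOR"},
]


def to_csv_text(records: list[dict[str, str]]) -> str:
    """Serialise the description rows as CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)

# To save it next to the dataset (the file name is only a suggestion):
# with open("data_descriptions.csv", "w", newline="", encoding="utf-8") as f:
#     f.write(csv_text)
```

Because the result is plain CSV, the same file can be opened in a spreadsheet by a person or parsed by a script.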

[Figure: example of a tabular data description]

If the dataset is a database, a data model must be included both in a machine-readable format (e.g. .sql) and as a human-friendly diagram (e.g. an ER model in .pdf).

There are examples where data descriptions are made available in a human-relatable manner, such as the dataset nutrition labels style (Holland et al., 2018).
An example is A Statutory Article Retrieval Dataset → LINK TO EXAMPLE
Moreover, there are tools and software packages that generate automated “Codebooks” by only looking at the dataset.
An example is Automatic Codebooks from Metadata Encoded in Dataset Attributes → LINK TO EXAMPLE

These initiatives help us standardise data descriptions and are “Human Friendly”, which works well for human readers. However, the FAIR principles FM-I1, FM-I2 and FM-R1.3 explicitly mention the need for Linked Data formats in order to gain the maximum level of Interoperability.

4. What is the relation between Data Descriptions and Linked Data?

When we create comprehensive Data Descriptions that reuse the terminology of existing Ontologies, we can make our dataset available in a Linked Data format, which makes it interoperable with other datasets.

Imagine your dataset can be helpful to another researcher in the future. That researcher can reuse your dataset and combine it with their own by matching variable names through the Ontology identifiers.
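
To illustrate, here is a small sketch of how two datasets with different local column names could be combined once each column is tied to an ontology identifier. Everything in it is made up for illustration: the two record lists, the local column names, the helper `to_shared_keys`, and the `example.org` IRI are hypothetical; only the Sector identifier comes from the exercise solution above.

```python
SECTOR = "http://publications.europa.eu/resource/authority/sdmxglossary2018/SECTOR"
CASE = "https://example.org/vocab/litigation"  # hypothetical placeholder IRI

# Each dataset ships with a data description: local column name -> term IRI.
description_a = {"Litigation": CASE, "Sector": SECTOR}
description_b = {"case_name": CASE, "industry": SECTOR}

# Two made-up datasets that describe the same kind of thing differently.
dataset_a = [{"Litigation": "Case X", "Sector": "Energy"}]
dataset_b = [{"case_name": "Case Y", "industry": "Transport"}]


def to_shared_keys(records, description):
    """Rename local column names to the ontology identifiers they map to."""
    return [
        {description[column]: value for column, value in record.items()}
        for record in records
    ]


# After renaming, both datasets use the same keys and can simply be stacked.
combined = to_shared_keys(dataset_a, description_a) + to_shared_keys(dataset_b, description_b)
for record in combined:
    print(record[CASE], "|", record[SECTOR])
```

The local names never have to agree; the shared identifiers do the matching.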

There are several tools that help you convert your dataset from a conventional format into a Linked Data format (e.g. RDF):

Tool | Source | GUI | Note
Open Refine | LINK | | Installation can be a hassle and it takes a lot of memory
RMLmapper | LINK | | Highly technical; you need to know command-line tools. Preferred option of data engineers
SDM-RDFizer | LINK | | You need to be familiar with programming languages
SPARQL-Generate | LINK | | A good option if you are going to invest time in it, since you can learn the SPARQL language
Virtuoso Universal Server | LINK | | It’s nice, but you have to pay for a license
UM LDWizard | LINK | | It’s free, gets the job done quickly, and you can publish data if you have a TriplyDB account → RECOMMENDED
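
For intuition about what such a conversion produces, the sketch below turns one tabular record into RDF by hand, serialised as N-Triples, using only Python's standard library. It is not how any of the tools above work internally, and a real project would normally rely on one of them or on an RDF library. The record, the `example.org` subject and predicate IRIs, and the function names are all hypothetical.

```python
def escape_literal(value: str) -> str:
    """Escape characters that N-Triples does not allow raw inside a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(subject_iri: str, record: dict[str, str],
                       predicates: dict[str, str]) -> list[str]:
    """Turn one row into N-Triples lines: <subject> <predicate> "value" ."""
    lines = []
    for column, value in record.items():
        predicate_iri = predicates[column]
        lines.append(f'<{subject_iri}> <{predicate_iri}> "{escape_literal(value)}" .')
    return lines


# Hypothetical column -> predicate IRI mapping (placeholders, not real terms).
predicates = {
    "Litigation": "https://example.org/vocab/litigation",
    "Sector": "https://example.org/vocab/sector",
}
# Made-up record standing in for one row of a spreadsheet.
record = {"Litigation": 'Case "X"', "Sector": "Energy"}

triples = record_to_ntriples("https://example.org/case/1", record, predicates)
print("\n".join(triples))
```

Each cell of the row becomes one statement about the same subject, which is the basic shape that the listed tools generate at scale.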

Exercise - Level Hard 🌶🌶🌶

  1. Transform a dataset from XLSX format to RDF format using UM LDWizard

  2. Download the following mock dataset: MOCK DATA

  3. What ontology terms did you reuse to describe the data attributes?

Solution

There is no one single answer 🤓

Discussion

Scenario:
You are a digital history researcher, and your group will digitise technology gadgets and artefacts that are no longer in use (e.g. a VHS player). It’s time to create data descriptions. However, there are no available Ontologies to describe the data records, given that they belong to contemporary history.

Discuss with your team what the researcher should do, given that there are apparently no available Ontologies to describe their data.

Key Points

  • ‘Codebook’ or ‘data glossary’ are some other ways to name Data Descriptions.

  • Ontologies (in information science) are like public online vocabularies of community curated terms and their definitions.

  • By reusing Ontology terms or community-accepted vocabularies, we aim to create a culture of recycling terminology by default.