Data Descriptions

Overview

Teaching: 10 min
Exercises: 15 min

Questions

1 What are Data Descriptions?

2 How to reuse Data Descriptions?

3 Are there standard ways for doing Data Descriptions?

4 What is the relation between Data Descriptions and Linked Data?

Objectives

The participant will learn that Data Descriptions are named differently in various fields.

The participant will be able to create a machine-friendly Data Descriptions file.

FAIR principles used in Data Descriptions:

Interoperable

FM-I1 (Use a Knowledge Representation Language) → doi.org/10.25504/FAIRsharing.jLpL6i
FM-I2 (Use FAIR Vocabularies) → doi.org/10.25504/FAIRsharing.jLpL6i

Reusable

FM-R1.3 (Meets Community Standards) → doi.org/10.25504/FAIRsharing.cuyPH9

1. What are Data Descriptions?

Data descriptions are a detailed explanation and documentation of each data attribute or variable in a dataset.

Depending on the use case and field of research, these data descriptions are also known as:

Codebooks (e.g. in Statistics or Social Sciences)
Data Dictionaries (e.g. in Computer and Data Science)
Labels or Data Tags (e.g. in Crowdsourcing or Humanities)
Data Glossary (e.g. in Business Administration & Finance)
Metadata (a common but incorrect use of the term in this context)

For example:

Attribute Name	Description	Data type
Litigation	Name of litigation (case law)	Text
Jurisdiction	Organisation of the legal system	Category
Sector	Sector concerned by the case	Category
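
The example table can also be stored as a machine-friendly file. The following is a minimal Python sketch, assuming plain CSV is an acceptable format. The names DATA_DESCRIPTIONS and to_csv, and the file name in the final comment, are illustrative assumptions and not part of the lesson material.

```python
import csv
import io

# Rows adapted from the example table in this section.
DATA_DESCRIPTIONS = [
    {"Attribute Name": "Litigation",
     "Description": "Name of litigation (case law)",
     "Data type": "Text"},
    {"Attribute Name": "Jurisdiction",
     "Description": "Organisation of the legal system",
     "Data type": "Category"},
    {"Attribute Name": "Sector",
     "Description": "Sector concerned by the case",
     "Data type": "Category"},
]


def to_csv(rows):
    """Serialise a list of row dictionaries as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(DATA_DESCRIPTIONS)
print(csv_text)

# To keep the result next to the dataset (hypothetical file name):
# with open("data_descriptions.csv", "w", newline="") as handle:
#     handle.write(csv_text)
```

A file like this can be opened in any spreadsheet program and read by scripts alike, which is the sense of "machine-friendly" used in the objectives.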

“Data Descriptions” is sometimes named differently depending on the field

No matter what terminology you use, “Data Descriptions” always refers to a detailed explanation and documentation of each data attribute or variable in a dataset.

2. How to reuse Data Descriptions?

Documentation of any kind takes time. We should therefore aim to reuse existing data descriptions that are generally accepted in the community. For example, the variable Litigation is a concept that has been widely used in legal and political studies, so we don’t need to redefine it every time.

For example, in EU Vocabularies, we can find existing descriptions of Litigation. These descriptions belong to an Ontology, i.e. a community-accepted online dictionary for curated terms and definitions. Moreover, it provides a globally unique identifier to the description. → LINK TO EXAMPLE

eurovoc 1
You will get several results when searching for a term and its definition. These results come from the different ontologies that define the term. For example, think of the description of a musical instrument. It might be defined differently in a British dictionary than in an American one.
eurovoc 2
Finally, using this Ontology, you can get a standard definition that community experts curate and that has a global identifier.

eurovoc 3

Describe your data by reusing Ontology terms

By reusing Ontology terms or community-accepted vocabularies, we aim to create a culture of recycling definitions by default.

Advantages

We don’t have to redefine the terms every time

We get a permanent link to the resource

Since it uses a global identifier, it becomes easier for others to integrate it with different data sources

Disadvantages

Sometimes, you might not find an Ontology or vocabulary that fits your variable.

3. Are there standard ways for doing Data Descriptions?

There are no standard ways of doing Data Descriptions.
The minimum elements you need to describe your dataset are the Attribute Name and the Link to Description. You can do that in a tabular format. However, following the FAIR principles of Interoperability and Reusability, we must ensure that the data is described using community-standard FAIR vocabularies. Here are some Ontologies for general use that can cover a wide variety of data attributes:

Ontology	Link	About what?
Schema.org	LINK	Definitions of generic things e.g. “Computer”
DBpedia	LINK	Definitions from Wikipedia
Dublin Core	LINK	Definitions about Metadata
EVO: EVent Ontology	LINK	Definitions about historical events (Digital Humanities)
ROAR: Reconstructions and Observations in Archival Resources	LINK	Genealogists, Archaeologists and Archivists use it to describe the reconstruction of lives (or places)
The Music Ontology Specification	LINK	Main concepts and properties for describing music (i.e. artists, albums and tracks) on the Semantic Web.
AudioSet ontology	LINK	All types of sounds, everyday sounds, from human and animal sounds, to natural and environmental sounds, to musical and miscellaneous sounds.
Lexvo.org	LINK	Definitions about languages, words, characters, and other human language-related entities

There are also public registries where you can find Ontologies:

CLARIAH - Curated list of Ontologies for Digital Humanities → github.com/CLARIAH/awesome-humanities-ontologies
EU Vocabularies → op.europa.eu/en/web/eu-vocabularies
Linked Open Vocabularies → lov.linkeddata.es/dataset/lov/
BioPortal → bioportal.bioontology.org/
AgroPortal → agroportal.lirmm.fr/
EcoPortal → ecoportal.lifewatchitaly.eu/
Ontology Lookup Service by the EBI → ebi.ac.uk/ols/index

Exercise - Level Easy 🌶

Visit EU Vocabularies. EU Vocabularies is the reference website for curated vocabularies maintained by the Publications Office of the European Union.

Search for an Ontology term for Sector in the “Search our catalogue” search box.

Select one result that describes Sector in a way that seems sound to you.

What are the definition and ID?

Solution

Definition: A sector can be a subgroup of an economic activity - as in “coal mining sector” - or a group of economic activities - as in “service sector” - or a cross-section of a group of economic activities - as in “informal sector”. “Sector” is also a specific term used in the 1993 United Nations System of National Accounts to denote one of the five mutually exclusive institutional sectors that group together institutional units on the basis of their principal functions, behaviour and objectives, namely: nonfinancial corporations, financial corporations, general government, non-profit institutions serving households (NPISHs) and households.
ID: http://publications.europa.eu/resource/authority/sdmxglossary2018/SECTOR

The Data Descriptions are usually manually written in a tabular format. This document has the length and depth that the data owner sees fit. The general rule of thumb is to describe the dataset related to a publication. Any accessible format like .csv, .xls, or similar is acceptable.

tabular
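
A tabular data description can also be checked automatically for the minimum elements named earlier in this section (Attribute Name and Link to Description). The following is a minimal Python sketch under stated assumptions: the column headers are spelled exactly as shown, a link counts as present when it starts with http:// or https://, and the example.org link is a hypothetical placeholder rather than a real ontology term. The Sector identifier is the one from the exercise solution.

```python
import csv
import io

# Sample table. Only the Sector link is taken from the lesson;
# the example.org link is a hypothetical placeholder.
SAMPLE = """Attribute Name,Link to Description
Sector,http://publications.europa.eu/resource/authority/sdmxglossary2018/SECTOR
Litigation,https://example.org/vocab/litigation
Jurisdiction,
"""

REQUIRED = ("Attribute Name", "Link to Description")


def find_problems(csv_text):
    """Return a list of human-readable problems found in the table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column: {c}" for c in missing]

    problems = []
    # Line 1 is the header, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        name = (row["Attribute Name"] or "").strip()
        link = (row["Link to Description"] or "").strip()
        if not name:
            problems.append(f"line {line_number}: empty Attribute Name")
        if not link.startswith(("http://", "https://")):
            problems.append(f"line {line_number}: '{name}' has no http(s) link")
    return problems


for problem in find_problems(SAMPLE):
    print(problem)
```

The check only confirms that a link is present. It does not verify that the link resolves or that the term fits the variable, which remains a human judgement.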

If the dataset is a database, a data model must be included both in a machine-readable format (e.g. .sql) and as a human-friendly diagram (e.g. an ER model in .pdf).
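
As an illustration of a machine-readable data model, the sketch below uses Python's built-in sqlite3 module to read the table definitions back out of a database as SQL text. It assumes an SQLite database. The schema, table names and column names are hypothetical and chosen only to echo the example attributes of this episode; other database systems offer their own export tools.

```python
import sqlite3

# Hypothetical schema for illustration; all names are assumptions.
SCHEMA = """
CREATE TABLE jurisdiction (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE litigation (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    sector TEXT,
    jurisdiction_id INTEGER REFERENCES jurisdiction(id)
);
"""


def export_data_model(connection):
    """Return the CREATE statements of all user tables as one SQL script."""
    rows = connection.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return ";\n\n".join(sql for (sql,) in rows) + ";\n"


connection = sqlite3.connect(":memory:")  # in-memory database for the demo
connection.executescript(SCHEMA)
data_model_sql = export_data_model(connection)
connection.close()

print(data_model_sql)
# The text could then be saved as, for example, data_model.sql.
```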

There are examples where data descriptions are made available in a human-relatable manner, such as the dataset nutrition labels style (Holland et al., 2018).
An example is A Statutory Article Retrieval Dataset → LINK TO EXAMPLE
Moreover, there are tools and software packages that generate automated “Codebooks” by only looking at the dataset.
An example is Automatic Codebooks from Metadata Encoded in Dataset Attributes → LINK TO EXAMPLE.
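
To give a feel for what such tools do, here is a minimal Python sketch that drafts a codebook skeleton from a CSV dataset. It is not the package referenced above. The mock data, the function names and the very coarse type guessing (Integer, Decimal, Text) are assumptions made for illustration, and the Description column is deliberately left empty for a human to complete.

```python
import csv
import io

# Hypothetical mock dataset; the values are invented for illustration.
DATASET = """Litigation,Jurisdiction,Year
Case A,National,2019
Case B,Regional,2021
Case C,National,
"""


def infer_type(values):
    """Guess a coarse data type from the non-empty values of one column."""
    filled = [v for v in values if v != ""]
    if not filled:
        return "Unknown"
    for label, parse in (("Integer", int), ("Decimal", float)):
        try:
            for value in filled:
                parse(value)
            return label
        except ValueError:
            continue
    return "Text"


def draft_codebook(csv_text):
    """Build one codebook entry per column of the dataset."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    codebook = []
    for name in reader.fieldnames or []:
        values = [(row[name] or "") for row in rows]
        codebook.append({
            "Attribute Name": name,
            "Data type": infer_type(values),
            "Distinct values": len({v for v in values if v != ""}),
            "Missing values": sum(1 for v in values if v == ""),
            "Description": "",  # left for a human to fill in
        })
    return codebook


for entry in draft_codebook(DATASET):
    print(entry)
```

An automatically drafted codebook can describe structure (types, missing values), but the meaning of each variable still has to come from the data owner or from a reused Ontology term.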

These initiatives help us standardise data descriptions and are “Human Friendly”, which works well for human readers. However, the FAIR principles FM-I1, FM-I2 and FM-R1.3 explicitly mention the need for Linked Data formats in order to gain the maximum level of Interoperability.

4. What is the relation between Data Descriptions and Linked Data?

When we create comprehensive Data Descriptions that reuse the terminology of existing Ontologies, we can make our dataset available in a Linked Data format, which makes it Interoperable with other datasets.

Imagine your dataset can be helpful to another researcher in the future. That researcher can reuse your dataset and combine it with their own by matching variable names through their Ontology identifiers.
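
The idea of matching variables through Ontology identifiers can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the two tiny datasets, their column names, and the example.org identifier, which is a hypothetical placeholder. Only the Sector identifier comes from the exercise solution in this episode.

```python
# Identifier taken from the exercise solution in this episode.
SECTOR = "http://publications.europa.eu/resource/authority/sdmxglossary2018/SECTOR"
# Hypothetical placeholder, not a real ontology term.
LITIGATION = "https://example.org/vocab/litigation"

# Each researcher maps local column names to ontology identifiers.
your_mapping = {"Sector": SECTOR, "Litigation": LITIGATION}
their_mapping = {"economic_sector": SECTOR, "case_name": LITIGATION}

# Invented records with different local column names.
your_rows = [{"Litigation": "Case A", "Sector": "Energy"}]
their_rows = [{"case_name": "Case B", "economic_sector": "Transport"}]


def relabel(rows, mapping):
    """Rename each column to its ontology identifier; drop unmapped columns."""
    return [
        {mapping[key]: value for key, value in row.items() if key in mapping}
        for row in rows
    ]


# After relabelling, both datasets share the same column keys.
combined = relabel(your_rows, your_mapping) + relabel(their_rows, their_mapping)

for record in combined:
    print(record)
```

The local names "Sector" and "economic_sector" never have to agree. The shared identifier is what tells a machine that both columns describe the same concept.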

There are several tools that help you convert your dataset from a conventional format into a Linked Data format (e.g. RDF format):

Tool	Source	GUI	Note
OpenRefine	LINK	✅	Installation can be a hassle and it takes a lot of memory
RMLmapper	LINK	❌	Highly technical: you need to know command-line tools. Preferred option of data engineers
SDM-RDFizer	LINK	❌	You need to be familiar with programming languages
SPARQL-Generate	LINK	✅	It is a good option if you are going to invest time in it, since you can learn the SPARQL language
Virtuoso Universal Server	LINK	✅	It’s nice, but you have to pay for a license
UM LDWizard	LINK	✅	It’s free, gets the job done quickly, and you can publish data if you have a TriplyDB account → RECOMMENDED
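
To show roughly what these tools produce, the following Python sketch turns tabular rows into RDF in the N-Triples serialisation using only the standard library. It is not how any of the listed tools works internally. The record base address and the Litigation identifier are hypothetical example.org placeholders, and using each Ontology identifier directly as the predicate of a triple is a simplification; a real conversion would choose proper properties and typed values.

```python
# Identifier taken from the exercise solution in this episode.
SECTOR = "http://publications.europa.eu/resource/authority/sdmxglossary2018/SECTOR"
# Hypothetical placeholders, not real resources.
LITIGATION = "https://example.org/vocab/litigation"
BASE = "https://example.org/record/"

COLUMN_IRIS = {"Litigation": LITIGATION, "Sector": SECTOR}

# Invented record for illustration.
ROWS = [{"Litigation": 'The "Alpha" case', "Sector": "Energy"}]


def escape_literal(text):
    """Escape characters that may not appear raw in an N-Triples string."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(rows, column_iris, base_iri):
    """Emit one triple per mapped, non-empty cell; one subject per row."""
    lines = []
    for index, row in enumerate(rows, start=1):
        subject = f"<{base_iri}{index}>"
        for column, value in row.items():
            if column in column_iris and value != "":
                predicate = f"<{column_iris[column]}>"
                lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return "\n".join(lines) + "\n"


ntriples = rows_to_ntriples(ROWS, COLUMN_IRIS, BASE)
print(ntriples)
```

Each output line is one statement of the form subject, predicate, object, which is the basic building block of Linked Data.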

Exercise - Level Hard 🌶🌶🌶

Transform a dataset from XLSX format to RDF format using UM LDWizard

Download the following mock dataset: MOCK DATA

What ontology terms did you reuse to describe the data attributes?

Solution

There is no one single answer 🤓

Discussion

Scenario:
You are a digital history researcher, and your group will digitise technology gadgets and artefacts that are no longer in use (e.g. a VHS player). It is time to create data descriptions. However, there are no available Ontologies to describe the data records, given that they belong to contemporary history.

Discuss with your team what the researcher should do, given that there are apparently no available Ontologies to describe their data.

Key Points

‘Codebook’ or ‘data glossary’ are some other ways to name Data Descriptions.

Ontologies (in information science) are like public online vocabularies of community curated terms and their definitions.

By reusing Ontology terms or community-accepted vocabularies, we aim to create a culture of recycling terminology by default.


Circular Research Data Coursebook. 2nd Edition
