Data Descriptions
Overview
Teaching: 10 min
Exercises: 15 min
Questions
1 What are Data Descriptions?
2 How to reuse Data Descriptions?
3 Are there standard ways for doing Data Descriptions?
4 What is the relation between Data Descriptions and Linked Data?
Objectives
The participant will learn that Data Descriptions are named differently in various fields.
The participant will be able to create a machine-friendly Data Descriptions file.
FAIR principles used in Data Descriptions:
Interoperable
FM-I1 (Use a Knowledge Representation Language) → doi.org/10.25504/FAIRsharing.jLpL6i
FM-I2 (Use FAIR Vocabularies) → doi.org/10.25504/FAIRsharing.jLpL6i
Reusable
FM-R1.3 (Meets Community Standards) → doi.org/10.25504/FAIRsharing.cuyPH9
1. What are Data Descriptions?
Data descriptions are a detailed explanation and documentation of each data attribute or variable in a dataset.
Depending on the use case and field of research, these data descriptions are also known as:
- Codebooks (e.g. in Statistics or Social Sciences)
- Data Dictionaries (e.g. in Computer and Data Science)
- Labels or Data Tags (e.g. in Crowdsourcing or Humanities)
- Data Glossary (e.g. in Business Administration & Finance)
- Metadata (a misuse of the term in this context)
For example:
Attribute Name | Description | Data type |
---|---|---|
Litigation | Name of litigation (case law) | Text |
Jurisdiction | Organisation of the legal system | Category |
Sector | Concerned sector of the case | Category |
“Data Descriptions” is sometimes named differently depending on the field
No matter what terminology you use, “Data Descriptions” always refers to a detailed explanation and documentation of each data attribute or variable in a dataset.
2. How to reuse Data Descriptions?
Documentation of any kind takes time, so we should aim to reuse existing data descriptions that are generally accepted in the community. For example, the variable Litigation is a concept that has been widely used in legal and political studies; therefore, we don’t need to redefine it every time.
For example, in EU Vocabularies, we can find existing descriptions of Litigation. These descriptions belong to an Ontology, i.e. a community-accepted online dictionary of curated terms and definitions. Moreover, the Ontology provides a globally unique identifier for the description. → LINK TO EXAMPLE
You will get several results when searching for a term and its definition. These results come from the different ontologies that define the term. For example, think of the description of a musical instrument. It might be defined differently in a British dictionary than in an American one.
Finally, using this Ontology, you can get a standard definition that community experts curate and that has a global identifier.
Describe your data by reusing Ontology terms
By reusing Ontology terms or community-accepted vocabularies, we aim to create a culture of recycling definitions by default.
Advantages
- We don’t have to redefine the terms every time
- We get a permanent link to the resource
- Since it uses a global identifier, it becomes easier for others to integrate it with different data sources
Disadvantages
- Sometimes, you might not find an Ontology or vocabulary that fits your variable.
3. Are there standard ways for doing Data Descriptions?
There are no standard ways of doing Data Descriptions.
The minimum elements you need to describe your dataset are the Attribute Name and the Link to Description. You can do that in a tabular format. However, following the FAIR principles of Interoperability and Reusability, we must ensure that the data is described using community-standard FAIR vocabularies. Here are some Ontologies for general use that can cover a wide variety of data attributes:
Ontology | Link | About what? |
---|---|---|
Schema.org | LINK | Definitions of generic things e.g. “Computer” |
DBpedia | LINK | Definitions from Wikipedia |
Dublin Core | LINK | Definitions about Metadata |
EVO: EVent Ontology | LINK | Definitions about historical events (Digital Humanities) |
ROAR: Reconstructions and Observations in Archival Resources | LINK | Genealogists, Archaeologists and Archivists use it to describe the reconstruction of lives (or places) |
The Music Ontology Specification | LINK | Main concepts and properties for describing music (i.e. artists, albums and tracks) on the Semantic Web. |
AudioSet ontology | LINK | All types of sounds, everyday sounds, from human and animal sounds, to natural and environmental sounds, to musical and miscellaneous sounds. |
Lexvo.org | LINK | Definitions about languages, words, characters, and other human language-related entities |
There are also public registries where you can find Ontologies
CLARIAH - Curated list of Ontologies for Digital Humanities → github.com/CLARIAH/awesome-humanities-ontologies
EU Vocabularies: → op.europa.eu/en/web/eu-vocabularies
Linked Open Vocabularies → lov.linkeddata.es/dataset/lov/
BioPortal: → bioportal.bioontology.org/
AgroPortal: → agroportal.lirmm.fr/
EcoPortal → ecoportal.lifewatchitaly.eu/
Ontology Lookup Service by the EBI → ebi.ac.uk/ols/index
Exercise - Level Easy 🌶
- Visit EU Vocabularies. EU Vocabularies is the reference website for curated vocabularies maintained by the Publications Office of the European Union.
- Search for an Ontology term for Sector in the “Search our catalogue” search box.
- Select one result that describes Sector and that sounds right to you.
- What are the definition and ID?
Solution
Definition: A sector can be a subgroup of an economic activity - as in “coal mining sector” - or a group of economic activities - as in “service sector” - or a cross-section of a group of economic activities - as in “informal sector”. “Sector” is also a specific term used in the 1993 United Nations System of National Accounts to denote one of the five mutually exclusive institutional sectors that group together institutional units on the basis of their principal functions, behaviour and objectives, namely: nonfinancial corporations, financial corporations, general government, non-profit institutions serving households (NPISHs) and households.
ID: http://publications.europa.eu/resource/authority/sdmxglossary2018/SECTOR
Data Descriptions are usually written manually in a tabular format. The document has the length and depth that the data owner sees fit. The general rule of thumb is to describe the dataset related to a publication. Any accessible format like .csv, .xls, or similar is acceptable.
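As a minimal sketch of such a tabular file, the Python snippet below writes the two minimum elements (Attribute Name and Link to Description) as CSV. The attribute names come from the example table in this lesson and the Sector link is the ID from the exercise solution; the `example.org` links and the column headers are hypothetical placeholders, not real Ontology identifiers.

```python
import csv
import io

# Attribute names are taken from the lesson's example table.
# Only the Sector link is a real identifier (from the exercise solution);
# the example.org links are hypothetical placeholders.
DESCRIPTIONS = [
    {"attribute_name": "Litigation",
     "link_to_description": "https://example.org/vocab/litigation"},
    {"attribute_name": "Jurisdiction",
     "link_to_description": "https://example.org/vocab/jurisdiction"},
    {"attribute_name": "Sector",
     "link_to_description": "http://publications.europa.eu/resource/authority/sdmxglossary2018/SECTOR"},
]


def to_csv(rows):
    """Serialise the data descriptions as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["attribute_name", "link_to_description"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(DESCRIPTIONS)
print(csv_text)  # this text could be saved as data_descriptions.csv
```

Keeping the file to plain columns like these means it opens in a spreadsheet editor and can also be read by a script.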
In case it is a database, a data model must be included in a machine-readable format (e.g. .sql) and as a human-friendly diagram (e.g. an ER model in .pdf).
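To illustrate what a machine-readable data model could look like, here is a small sketch that uses Python's built-in `sqlite3` module to build a schema in memory and dump it back as SQL text. The table and column names are hypothetical and only loosely follow the example attributes in this lesson; a real database would export its own schema with its own tooling.

```python
import sqlite3

# Hypothetical table and column names, loosely based on the example attributes.
SCHEMA = """
CREATE TABLE litigation_case (
    id INTEGER PRIMARY KEY,
    litigation TEXT NOT NULL,
    jurisdiction TEXT,
    sector TEXT
);
"""


def export_schema(schema):
    """Build the schema in an in-memory database and dump it back as SQL text."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        # iterdump() yields the SQL statements needed to recreate the database.
        return "\n".join(conn.iterdump())
    finally:
        conn.close()


schema_sql = export_schema(SCHEMA)
print(schema_sql)  # this text could be saved as a .sql file
```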
There are examples where data descriptions are made available in a human-relatable manner, such as the dataset nutrition labels style (Holland et al., 2018).
An example is A Statutory Article Retrieval Dataset → LINK TO EXAMPLE
Moreover, there are tools and software packages that generate automated “Codebooks” by only looking at the dataset.
An example is Automatic Codebooks from Metadata Encoded in Dataset Attributes → LINK TO EXAMPLE.
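The idea of drafting a codebook from the dataset alone can be sketched in a few lines of Python. This is not how the tool linked above works; it is an illustrative assumption that reads a CSV, lists the attribute names, and guesses a rough data type for each column. The sample rows, the `Year` column, and the type labels are invented for the example.

```python
import csv
import io

# Hypothetical sample data; Litigation and Jurisdiction follow the lesson's
# example table, Year is an invented numeric column.
SAMPLE = """Litigation,Jurisdiction,Year
Case A,National,2015
Case B,Regional,2019
"""


def guess_type(values):
    """Return a rough data type for a column based on its non-empty values."""
    non_empty = [value for value in values if value != ""]
    if not non_empty:
        return "Unknown"
    try:
        for value in non_empty:
            float(value)
        return "Number"
    except ValueError:
        return "Text"


def draft_codebook(csv_text):
    """List each attribute with a guessed type and an empty description."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return [
        {
            "attribute_name": name,
            "data_type": guess_type([row[name] for row in rows]),
            "description": "",  # left blank: a person still has to write it
        }
        for name in (reader.fieldnames or [])
    ]


codebook = draft_codebook(SAMPLE)
for entry in codebook:
    print(entry)
```

A draft like this only saves typing. The descriptions, and the links to Ontology terms, still have to be filled in by someone who knows the data.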
These initiatives are helping us standardise data descriptions and are “Human Friendly”, which works well for human readers. However, the FAIR principles FM-I1, FM-I2 and FM-R1.3 explicitly mention the need for Linked Data formats in order to gain the maximum level of Interoperability.
4. What is the relation between Data Descriptions and Linked Data?
When we create comprehensive Data Descriptions reusing terminologies of existing Ontologies, we can make our dataset available in a Linked Data format, which makes it Interoperable with other datasets.
Imagine your dataset can be helpful to another researcher in the future. That researcher can reuse your dataset and combine it with their own by matching variable names with the Ontology identifiers.
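The matching step described above can be sketched as follows, assuming each dataset comes with a data description that maps its local column names to Ontology identifiers. The Sector identifier is the one from the exercise solution; the column names, the sample values, and the `example.org` identifier are hypothetical.

```python
# The Sector identifier comes from the exercise solution in this lesson.
SECTOR = "http://publications.europa.eu/resource/authority/sdmxglossary2018/SECTOR"
# Hypothetical placeholder, not a real Ontology identifier.
LITIGATION = "https://example.org/vocab/litigation"

# Two datasets that name the same concepts differently (invented examples).
descriptions_a = {"Sector": SECTOR, "Litigation": LITIGATION}
descriptions_b = {"industry": SECTOR, "case_name": LITIGATION}

rows_a = [{"Sector": "Energy", "Litigation": "Case A"}]
rows_b = [{"industry": "Transport", "case_name": "Case B"}]


def relabel(rows, descriptions):
    """Rename each column to the identifier its data description points to."""
    return [
        {descriptions[column]: value for column, value in row.items()}
        for row in rows
    ]


# After relabelling, both datasets share the same keys and can be stacked.
combined = relabel(rows_a, descriptions_a) + relabel(rows_b, descriptions_b)
for row in combined:
    print(row)
```

Because both descriptions point to the same identifiers, the columns line up even though the original names (`Sector` and `industry`) differ.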
There are several tools that help you convert your dataset from a conventional format into a Linked Data format (e.g. RDF):
Tool | Source | GUI | Note |
---|---|---|---|
Open Refine | LINK | ✅ | Installation can be a hassle and takes a lot of memory |
RMLmapper | LINK | ❌ | Highly technical: you need to know command-line tools; the preferred option of data engineers |
SDM-RDFizer | LINK | ❌ | You need to be familiar with programming languages |
SPARQL-Generate | LINK | ✅ | It is a good option if you are going to invest time in it, since you can learn the SPARQL language |
Virtuoso Universal Server | LINK | ✅ | It’s nice but you have to pay for a license |
UM LDWizard | LINK | ✅ | It’s free, gets the job done quickly, and you can publish data if you have a TriplyDB account → RECOMMENDED |
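To give a feeling for what these tools produce, the sketch below turns tabular rows into N-Triples, one of the plain-text RDF serialisations, using only the Python standard library. It does not reproduce any of the tools above. The base IRI, the sample row, and the `example.org` identifier are hypothetical, and using a glossary identifier directly as the predicate is a simplification for illustration; a real mapping would normally pick a suitable property from the chosen Ontology.

```python
# Hypothetical base IRI used to mint one identifier per row.
BASE = "https://example.org/dataset/row/"

# Column name -> identifier. The Sector identifier is from the exercise
# solution; the Litigation one is a hypothetical placeholder.
PREDICATES = {
    "Litigation": "https://example.org/vocab/litigation",
    "Sector": "http://publications.europa.eu/resource/authority/sdmxglossary2018/SECTOR",
}

# Invented sample row; the Notes column has no description and is skipped.
ROWS = [{"Litigation": "Case A", "Sector": "Energy", "Notes": "ignored"}]


def escape_literal(text):
    """Escape the characters that N-Triples does not allow raw in a literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(rows, predicates, base):
    """Emit one triple per described cell: <row> <identifier> "value" ."""
    lines = []
    for index, row in enumerate(rows, start=1):
        subject = f"<{base}{index}>"
        for column, value in row.items():
            if column not in predicates:
                continue  # columns without a data description are left out
            lines.append(
                f'{subject} <{predicates[column]}> "{escape_literal(value)}" .'
            )
    return "\n".join(lines) + "\n"


ntriples = to_ntriples(ROWS, PREDICATES, BASE)
print(ntriples)
```

Only the columns that have a data description end up in the output, which is one practical reason to describe every attribute before converting.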
Exercise - Level Hard 🌶🌶🌶
Transform a dataset from XLSX format to RDF format using UM LDWizard
Download the following mock dataset: MOCK DATA
What ontology terms did you reuse to describe the data attributes?
Solution
There is no one single answer 🤓
Discussion
Scenario:
You are a digital history researcher, and your group will digitise technology gadgets and artefacts that are not used anymore (e.g. VHS player), and it’s time to create data descriptions. However, there are no available Ontologies to describe the data records, given that they are contemporary history.
Discuss with your team what the researcher should do given that apparently there are no available Ontologies to describe their data.
Key Points
‘Codebook’ or ‘data glossary’ are some other ways to name Data Descriptions.
Ontologies (in information science) are like public online vocabularies of community curated terms and their definitions.
By reusing Ontology terms or community-accepted vocabularies, we aim to create a culture of recycling terminology by default.