FAIR on demand: All in One View

Content from Metadata

Last updated on 2025-01-14

Estimated time 12 minutes

Outcomes

1- Define metadata and its various types.

2- Recall the community standards and how to apply them to data and metadata.

3- Define data provenance.

Metadata

Metadata is the description of your data. For example, in a data file containing the values for an assay, such as an Excel spreadsheet, the column headings assign meaning and context to the values. The values are the data, and the column headings are the metadata. Any documentation or explanation accompanying the Excel file is also considered metadata.

As seen in the image below, the data values for heart rate, genotype, and patient ID are coloured orange. The metadata that describes this data (the name of the cohort and the contact e-mail for more information on the dataset) is shown in blue.

What else would metadata reveal?

To understand what else metadata can describe, let’s look again at the previous image. What else would you add to understand the data better?

Additional metadata for provenance and general descriptions can be added. More information about the cohort name, for example, is required because “Human Welsh Cohort” does not tell us much in comparison to other existing Welsh cohorts. Here, we would include a unique ID or working title for the cohort, as well as a project URL describing its origin and composition.

The column headings appear to be complete, though there are some issues with the data in orange. The diabetes status column appears to capture both the type and the stage of the disease.

- In row 3, it is unclear whether the disease is Diabetes Mellitus or Diabetes Insipidus.
- In row 4, it is unclear whether the diabetes mellitus is type 1 or type 2.
- The final row is empty. It is unclear whether this is because the information is missing or because the patient does not have diabetes.

To address this, two separate columns are created, one for the type and one for the stage of the disease, and the disease name states whether it is type 1 or type 2. You can check this in the figure.
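The restructuring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row values, the column names `disease` and `stage`, and the "disease, stage" text format are hypothetical and are not taken from the figure.

```python
# Hypothetical rows: one free-text "diabetes_status" column mixes type and stage.
rows = [
    {"patient_id": "P1", "diabetes_status": "Diabetes mellitus type 2, stage 1"},
    {"patient_id": "P2", "diabetes_status": "Diabetes mellitus type 1, stage 2"},
    {"patient_id": "P3", "diabetes_status": ""},  # empty: unknown, or no diabetes?
]


def split_status(status):
    """Split a 'disease, stage' string into two explicit fields."""
    if not status.strip():
        # Record the gap explicitly instead of leaving an ambiguous blank.
        return {"disease": "not recorded", "stage": "not recorded"}
    disease, _, stage = status.partition(",")
    return {"disease": disease.strip(), "stage": stage.strip() or "not recorded"}


# Build the tidier table with one column per concept.
tidy = [
    {"patient_id": row["patient_id"], **split_status(row["diabetes_status"])}
    for row in rows
]

for row in tidy:
    print(row)
```

Writing "not recorded" rather than leaving a blank is one possible convention; what matters is that the metadata documents which convention was used.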

Building on the previous examples, metadata can show a variety of things, such as your data’s characteristics and its provenance, which explains how your data was created. Metadata is classified into three types: descriptive, structural, and administrative.

Descriptive metadata describes the characteristics of the dataset.
Structural metadata describes how the dataset is generated and structured internally.
Administrative metadata describes who was in charge of the data, who worked on the project, and how much money was spent.

Identify types of metadata in this microarray dataset

Let’s look at an example using microarray data from the ArrayExpress database. This dataset contains data and metadata. The administrative metadata can be found in the orange square. The descriptive metadata is located in the black square and, as you can see, it summarizes the dataset. The structure of the dataset and its files is marked by the dark blue square, which represents the structural metadata.

Exercise

From the FAIRcookbook, can you find the recipe on how to create metadata profiles for your dataset? You can start from here.

Show me the solution

First of all, let’s understand the structure of the FAIRcookbook. For a quick overview, you can watch our RDMBites on the FAIRcookbook: FAIRcookbook RDMBites.

The building unit of the FAIRcookbook is called a recipe. A recipe is a set of instructions for how to FAIRify your data. As you can see in the image, each recipe includes these main items:

1- Graphical overview, which is the mind map for the recipe

2- Ingredients, which give you an idea of the skills needed and the tools you can use to apply the recipe

3- The steps and the process

4- Recommendations on what to read next and references for further reading

So let’s use the search box and type “metadata profiles”. When the results come up, choose “metadata profiles”. As explained earlier, the recipe shows the steps necessary for creating metadata profiles for different data types.

Data and metadata should follow Community standards

Each data type has its own community that develops guidelines to ensure that metadata and data are appropriately described. Make sure to follow the community standards when describing your data, because doing so makes your data more reliable for other researchers. If you decide to use other guidelines, make sure you clearly document this. Using community standards allows your data to be reused and makes it easily interoperable across multiple platforms. We provide examples of various community standards that you can use to ensure that your data is described correctly.

Exercise

RDMkit provides nice domain-specific training on community standards for each domain. Using this training, can you find the bioimage community standards?

Show me the solution

RDMkit is the Research Data Management toolkit for Life Sciences. It provides best practices and guidelines to help you make your data FAIR (Findable, Accessible, Interoperable and Reusable). It also provides a catalogue of tools and resources for research data management.

As you can see in the above image, RDMkit covers a variety of research data management topics. The community standards are covered under the domain tab, which provides community standards for all types of data.

You can find the bioimage community standards at the top of the page. As you can see, it covers the following:

1- What is bioimage data and metadata?

2- Standards of bioimage research data management

3- Bioimage data collection

4- Data publication and archiving

Data Provenance

Provenance is the detailed description of the history of the data and how it was generated. Here is an example from the ArrayExpress database, where an accurate description of the data allows the microarray data to be reused. As you can see in this example from the E-MTAB-6980 dataset, there is a rich description of the study design, organism, platform, and timing of data collection.

Exercise

Can you extract data provenance from this data set E-MTAB-7933?

Show me the solution

As you can see in this picture, you will find the data provenance under the protocol and experimental factors tabs.

Vocabularies are FAIR

The metadata and data should be described using vocabularies that comply with the FAIR principles, which means that the vocabularies should:

F: Use globally unique and persistent identifiers

A: Have accessible documentation that extensively describes the identifiers

I: Be interoperable

R: Be easily reused and interpreted by humans and machines

Exercise

You are a researcher working in the field of food safety and you are running a clinical trial. Do you know how to choose the right vocabularies and ontologies for it?

Show me the solution

It is time to introduce you to FAIRsharing, a resource for standards, databases, and policies. FAIRsharing is an important resource that helps researchers identify suitable repositories, standards, and databases for their data. It also contains the latest policies from governments, funders, and publishers for FAIRer data.

You can use the search wizard to look for guidelines for reporting the data and metadata of randomized controlled trials in livestock and food safety.

In the results section, you can find the REFLECT guidelines.

For each resource or guideline, you will find general information, a relationship graph, and the organizations funding and maintaining the resource.

Linked metadata

When uploading your dataset to any database, you should include the following information:

1- Additional datasets that supplement your data

2- A statement of whether your dataset is built on another dataset
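The two kinds of link listed above could be recorded in a metadata record along these lines. This is a minimal sketch, not a prescribed format: the record layout and the identifiers are hypothetical placeholders, and the relation names `IsSupplementedBy` and `IsDerivedFrom` are borrowed from the DataCite relation types as one possible vocabulary.

```python
import json

# Hypothetical metadata record; the identifiers are placeholders, not real datasets.
record = {
    "identifier": "doi:10.1234/example.main",
    "title": "Example dataset",
    "related_identifiers": [
        # 1- An additional dataset that supplements this one.
        {"relation": "IsSupplementedBy", "identifier": "doi:10.1234/example.supplement"},
        # 2- The dataset that this one is built on.
        {"relation": "IsDerivedFrom", "identifier": "doi:10.1234/example.source"},
    ],
}


def related(record, relation):
    """Return the identifiers linked to the record by the given relation."""
    return [
        link["identifier"]
        for link in record["related_identifiers"]
        if link["relation"] == relation
    ]


print(json.dumps(record, indent=2))
print(related(record, "IsDerivedFrom"))
```

Naming the relation explicitly, rather than only listing other identifiers, is what makes the reference "qualified": a reader or a machine can tell how the two datasets relate.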

Exercise

How could you interlink data to your dataset?

Show me the solution

One way to do this is to follow the FAIRcookbook recipe on interlinking data from different sources, as presented in this graphical overview. You can find the recipe here.

Metadata and data are always available

Maintaining datasets in a public database comes at a cost. Maintaining the metadata is much cheaper: metadata is small and can easily be kept not only in the database but also on researchers’ personal computers, so it can remain available even when the data itself no longer is. This is also the case for sensitive data, where the metadata remains available and provides the researchers’ contact details, instructions on how to get the data, and the data provenance.

Usually, when data is generated, the metadata and the data are stored in separate files. As a researcher, you should ensure that each file refers to the other.
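One way to make a data file and its metadata file refer to each other is sketched below. The file names, the `describes` field, and the leading `# metadata:` comment line in the CSV are all assumptions made for this illustration; your repository or community standard may prescribe a different mechanism.

```python
import json
import tempfile
from pathlib import Path


def write_linked_pair(folder):
    """Write a data file and a metadata file that name each other."""
    folder = Path(folder)
    data_path = folder / "heart_rate.csv"            # hypothetical file name
    meta_path = folder / "heart_rate.metadata.json"  # hypothetical file name
    # The data file names its metadata file in a leading comment line.
    data_path.write_text(
        f"# metadata: {meta_path.name}\npatient_id,heart_rate\nP1,72\n"
    )
    # The metadata file names the data file it describes.
    meta_path.write_text(
        json.dumps({"describes": data_path.name, "contact": "someone@example.org"}, indent=2)
    )
    return data_path, meta_path


def references_resolve(data_path, meta_path):
    """Check that each file points at the other."""
    first_line = data_path.read_text().splitlines()[0]
    data_points_to = first_line.split(":", 1)[1].strip()
    meta_points_to = json.loads(meta_path.read_text())["describes"]
    return data_points_to == meta_path.name and meta_points_to == data_path.name


with tempfile.TemporaryDirectory() as tmp:
    data_path, meta_path = write_linked_pair(tmp)
    both_linked = references_resolve(data_path, meta_path)

print(both_linked)
```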

Resources

You can learn more about how to describe your data using FAIR vocabularies and formal language for knowledge representation from the following:

Recipe from the FAIRCookbook on FAIR and the notion of metadata
RDMkit explanation of machine readability: Machine readability
Read more about vocabularies and ontologies: Vocabularies and ontology
List of ontologies
A nice introduction to metadata from the Ed-DaSH Carpentries course: Introduction to metadata
The following recipe from the FAIRCookbook provides instructions on how to create metadata profiles: Metadata profiles
RDMkit explanation on how to manage metadata: Metadata management
Domain specific repositories:

A lesson on types of repositories, with examples of domain-specific repositories: ED-DaSH lesson on how to choose a data repository, How do we choose a research data repository?

FAIRsharing provides useful information on writing domain-specific metadata; you can find it here
A recipe from the FAIRcookbook on how to Interlink data from different sources
A nice guideline on How can you record data provenance
FAIRcookbook recipe on Audit of the provenance process

FAIR principles

This episode covers the following principles:

(I1) (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation

(I2) (meta)data use vocabularies that follow FAIR principles

(I3) (meta)data include qualified references to other (meta)data

(A2) Metadata are accessible, even when the data are no longer available

Content from Identifiers

Last updated on 2025-01-14

Estimated time 12 minutes

Outcomes

1- Explain the definition and importance of using identifiers

2- Illustrate what persistent identifiers are

3- Give examples of the structure of persistent identifiers

Persistent identifiers

Identifiers are long-lasting references to digital resources such as datasets. They provide the information required to reliably identify, verify, and locate your research data. Commonly, a persistent identifier is a unique record ID in a database, or a unique URL that takes a researcher to the data in question in a database.

That resource might be a publication, a dataset, or a person. Persistent identifiers have to be globally unique: only your data are identified by this ID, and it is never used for anything else anywhere in the world. In addition, these IDs must not become invalid over time. Watch our RDMBites on persistent identifiers to understand more.

Identifiers are a very important concept in the FAIR principles and are considered one of their pillars. They make your data more Findable (F).

Find the PID

Remember our example on metadata types from ArrayExpress in the first lesson. Can you tell what the persistent identifier of this dataset is?

Show me the solution

The PID in this case, called the “Accession” in ArrayExpress, is E-MTAB-7933. If you use this accession number, you will find the dataset. In addition, have you noticed that the data files are also named using this PID?

It is important to note that when you upload your data to a public repository, the repository will create this ID for you automatically.

The structure of persistent identifiers

As you can see in this picture, the structure of an identifier consists of:

1- The initial resolver service: a domain that is unique and specific to each community, e.g. ORCID for researchers and DOI for publications

2- Prefix: a unique number that represents a category, e.g. for a DOI, specific numbers refer to the publisher and directory

3- Suffix: the unique dataset number, which is unique under its prefix
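The three parts listed above can be pulled out of a DOI-style URL with a few lines of Python. This is a minimal sketch that assumes the identifier is written as `resolver/prefix/suffix`; the function name `split_pid` is hypothetical, and the DOI used here is a made-up placeholder rather than a real record.

```python
from urllib.parse import urlsplit


def split_pid(pid_url):
    """Split a DOI-style URL into resolver, prefix, and suffix."""
    parts = urlsplit(pid_url)
    # Everything after the resolver is "prefix/suffix".
    prefix, _, suffix = parts.path.lstrip("/").partition("/")
    if not (parts.scheme and parts.netloc and prefix and suffix):
        raise ValueError(f"not a resolver/prefix/suffix identifier: {pid_url!r}")
    return {
        "resolver": f"{parts.scheme}://{parts.netloc}",
        "prefix": prefix,
        "suffix": suffix,
    }


# Placeholder DOI, used only to show the shape of the identifier.
print(split_pid("https://doi.org/10.1234/example.5678"))
```

Identifiers that do not follow this layout are rejected with a `ValueError` rather than guessed at.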

Resources

The resources listed below provide an overview of the information you need to know about identifiers.

Unique and persistent identifiers: a practical explanation of unique and persistent identifiers from the FAIRCookbook
Identifiers: another nice explanation from RDMkit
Machine actionability: identifiers are also important for machine readability; a nice explanation from RDMkit that describes machine readability
Examples and explanation of different identifiers from FAIRsharing.org https://fairsharing.org/search?recordType=identifier_schema

Content from Access

Last updated on 2025-01-14

Estimated time 12 minutes

Outcomes

1- To illustrate what a communications protocol is and the criteria for an open and free protocol

2- To give examples of databases that use a protocol with different authentication processes

3- To interpret the usage licence associated with different data sets

Standard communication protocol

Simply put, a protocol is a method for connecting two computers. The protocol ensures the security and authenticity of your data. Once the safety and authenticity of the data are verified, the data is transferred to the other computer.

Having a protocol does not guarantee that your data are accessible. However, you can choose a protocol that is free and open and that allows easy exchange of information. One step you can take is to choose the right database: when you upload your data to a database, the database uses a protocol that allows users to load the data in their web browser. This protocol allows easy access to the data while still keeping the data secure.
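To make the idea of an open, standard protocol concrete, the sketch below prepares an HTTPS request for a dataset from its identifier using only the Python standard library. It assumes a DOI resolved through `doi.org`; the DOI itself is a made-up placeholder and the helper name `build_pid_request` is hypothetical. The request is only constructed, not sent, so no network access is needed.

```python
from urllib.request import Request


def build_pid_request(doi):
    """Prepare (but do not send) a GET request that resolves a DOI over HTTPS."""
    return Request(f"https://doi.org/{doi}", method="GET")


# Placeholder DOI, used only to show how the identifier becomes a web address.
request = build_pid_request("10.1234/example.5678")

print(request.get_method())  # the HTTP verb that would be used
print(request.type)          # the protocol scheme
print(request.full_url)      # the address a browser would load
```

Passing the prepared request to `urllib.request.urlopen` would actually retrieve the page, which is the same thing a web browser does when you follow the identifier link.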

Authentication process

Authentication is the process a protocol uses for verification. To understand authentication, suppose we have three people named John Smith and we do not know which one submitted the data. Assigning each person a unique ID that can be interpreted by both machines and humans tells you which person actually submitted the data. Doing so is a form of authentication. Many databases use this approach, such as Zenodo, where you can sign up using your ORCID iD so that the database can identify you.

Exercise

After reading this guide on different protocol types, do you know which protocol is used in ArrayExpress?

Show me the solution

As we explained before on how to use the RDMkit, going through the page, you will find the different types of protocols explained.

From this part, you can see that the protocol used for ArrayExpress is HTTP (HyperText Transfer Protocol) (highlighted in purple).

Data usage licence

This describes the legal rights governing how others can use your data. When you publish your data, you should clearly describe in what capacity your data can be used. Bear in mind that a clear description of the licence is important for both machine and human reusability of your data. There are many licences that can be used, e.g. the MIT licence or Creative Commons licences. These licences provide an accurate description of the rights of data reuse. Please have a look at the resources below to learn more about these licences.

Exercise

From this RDMkit guideline on types of licence, what is the licence used by the following datasets?

1- A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration

2- RNA-seq of circadian timeseries sampling (LL2-3) of 13-14 day old Arabidopsis thaliana Col-0 (24 h to 68 h, sampled every 4 h)

Show me the solution

The link we provided gives a nice explanation of the types of licence. From the relevant section of RDMkit, you can see the type of licence used for each dataset:

1- A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration: CC-BY-4.0

2- RNA-seq of circadian timeseries sampling (LL2-3) of 13-14 day old Arabidopsis thaliana Col-0 (24 h to 68 h, sampled every 4 h): CC-BY-4.0

Sensitive data

Sensitive data are data that, if made publicly available, could cause harm to individuals, groups, nations, or ecosystems, and therefore need to be secured from unauthorised access. To determine whether your data is sensitive, you should consult national laws, which vary by country. The following resources explain more about sensitive data and what to do if your data is sensitive.

Resources

A nice recipe from FAIRcookbook on SSH protocols
A nice explanation from RDMkit on protocols and how they help you protect your data: Protocols and safety of data transfer

Licensing your work is not as simple as it sounds; here are some resources to help you find the correct licence for you:

Why you should assign a licence to your protocol, from RDMkit: here
A nice recipe from FAIRcookbook with step-by-step instructions for:
Licence
Software licence
Data licence
Declaring data permitted uses
To learn more about Creative Commons licences, check this link: Creative Commons licence

To get more information on sensitive data, you can have a look at these resources:

FAIR principles

This episode covers the following principles:

1- (A1) (meta)data are retrievable by their identifier using a standardised communications protocol

2- (R1.1) (meta)data are released with a clear and accessible data usage licence

Content from Registration

Last updated on 2025-01-14

Estimated time 11 minutes

Outcomes

1- Define what a data repository is

2- Illustrate the importance of an indexed data repository

3- Summarize the steps of data indexing in a searchable repository

Indexed data repository

What is a data repository?

It is a general term used to describe any storage space you use to deposit data, metadata, and any associated research. Note that a database is more specific: it is mainly for the storage of your data.

Types of data repository

There are many types of data repository, classified based on:

1- The structure of the data: Data warehouse, Data lake and Data mart

The following table summarizes these differences

Data repository	Data warehouse	Data mart	Data lake
Supported data types	Structured	Highly Structured	Structured, semi-structured, unstructured, binary
Data quality	curated	Highly curated	Raw data

2- The purpose of the data repository:

Controlled access repository
Discipline specific repository
Institutional repository
General data repository

The following image summarizes these types with different examples

Types of data repository, CC.BY from re3data.org

Importance of indexed data repository

To ensure data findability, your data should be uploaded to a public repository where it can be searched for and found. This makes your data comply with the fourth findability principle (F4), which states that (meta)data are registered or indexed in a searchable resource. There are numerous databases where you can upload your data; these are typically organised around a particular type of data. An example is ArrayExpress, for microarray and RNA-seq data. These databases have a set of rules in place to make sure that your data will be FAIR.

After you upload your data into this database, they are assigned an ID and are indexed in the database. So whenever you look for the ID, or even use a keyword for your data, you will find your data.

Take a look at the ArrayExpress database, where all datasets are indexed and you can find any dataset using the search tools. Because the data are indexed, you can retrieve a dataset using keywords as well as its PID. For example, if you want to locate human NSCL cell lines, you can just type this into the search box and find the dataset. Indexing and registering datasets also means they are curated in such a way that you can discover them using different keywords.

For example, you can find the same dataset by using its identifiers or by using keywords chosen by the dataset’s authors to describe it.

When you upload your dataset to a database, it can be curated and easily found using different keywords

By indexing your dataset, you can retrieve it using its PID
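The effect of indexing described above can be illustrated with a toy index in Python. This is a simplified sketch, not how ArrayExpress or any real repository is implemented: the accession numbers, titles, and keywords are invented, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records; accessions and keywords are invented for illustration.
datasets = {
    "EX-0001": {
        "title": "Human lung cell lines microarray",
        "keywords": ["human", "lung", "microarray"],
    },
    "EX-0002": {
        "title": "Mouse liver RNA-seq",
        "keywords": ["mouse", "liver", "rna-seq"],
    },
}


def build_index(datasets):
    """Map every PID and every keyword to the datasets it refers to."""
    index = defaultdict(set)
    for pid, metadata in datasets.items():
        index[pid.lower()].add(pid)  # the dataset is findable by its own PID
        for keyword in metadata["keywords"]:
            index[keyword.lower()].add(pid)  # and by each descriptive keyword
    return index


def search(index, term):
    """Return the PIDs of all datasets matching a PID or keyword."""
    return sorted(index.get(term.lower(), set()))


index = build_index(datasets)
print(search(index, "EX-0001"))  # lookup by identifier
print(search(index, "human"))    # lookup by keyword
```

Both searches lead to the same record, which is the point of registering a dataset in a searchable resource: the identifier and the descriptive metadata are two routes to the same data.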

Exercise

One of the things you can do to index your dataset is to upload it to Zenodo. Can you use one of the resources we recommended before to find out how to do this?

1- RDMkit

2- FAIRcookbook

3- FAIRsharing

Show me the solution

Since you want a technical guideline, FAIRcookbook and RDMkit are the best places to start. We will start with FAIRcookbook. We explained the structure of a recipe earlier, so let’s look for a suitable recipe. As you navigate the FAIRcookbook homepage, you will find different tabs that cover each of the FAIR principles. For instance, if you want recipes on accessibility, you will find all the recipes that can help you make your data accessible.

Follow these steps to find the recipe:

1- In this exercise, we are looking for a recipe on indexing or registering a dataset in a searchable resource, which you can find in the findability tab. Can you find it in this picture?

2- Click on the findability tab

3- On the left side, you will find a navigation bar that will help you find the different recipes that make your data findable.

4- As you can see here, you will find a recipe on registering datasets with Wikidata and another one on depositing to generic repositories-Zenodo use case

Once you click on one of these resources, you will find the following:

Requirements that you need to apply the recipe to your dataset
The instructions
References and further readings
Authors and licence

In our specialized courses, we will give you examples of how to upload your data to a specialized repository.

Why should you upload your data to a database?

1- Databases assign your data a unique persistent identifier.

2- Your data will be indexed, making it easier to find.

3- Some databases will let you easily connect your dataset to other datasets.

4- Databases support dataset licensing, and some offer controlled or limited access to protect your data.

By uploading data to a database, you comply with the following FAIR principles

F1 (Meta)data is assigned a globally unique and persistent identifier
F3 Metadata clearly and explicitly include the identifier of the data they describe
F4 (Meta)data is registered or indexed in a searchable resource

It will also allow your data to be more accessible as the standardized communications protocol and authentication are automatically set for your data

A1 (Meta)data is retrievable by their identifier using a standardised communications protocol
A1.1 The protocol is open, free, and universally implementable
A1.2 The protocol allows for an authentication and authorisation procedure, where necessary
A2 Metadata is accessible, even when the data is no longer available
I3 (Meta)data include qualified references to other (meta)data
R1.1 (Meta)data is released with a clear and accessible data usage license

How to choose the right database for your dataset?

1- Check the community standards for your data. You can find more information in the RDMkit guidelines on domain-specific community standards.

2- Look for resources that describe the databases and check whether a database fits your data. You might consider the following:

Accessibility options
Licence

One of these resources is FAIRsharing, which provides a registry of different databases and repositories. Here is an example where FAIRsharing provides information about a protein database. It includes the following information:

1- General information

2- Which policies use this database?

3- Related community standards

4- Organization maintaining this database

5- Documentation and support

6- Licence

Resources

Our resources provide an overview of data repositories and examples

The FAIR cookbook and RDMkit both provide excellent instructions for uploading your data into databases:

FAIRcookbook recipe on Depositing to generic repositories - Zenodo use case
FAIRcookbook recipe on Registering Datasets in Wikidata
RDMkit guidelines on Data publications and deposition
RDMkit guidelines on Finding and reusing existing data
FAIRcookbook recipe on Search engine optimization
FAIRsharing offers a nice portal to different examples of databases

FAIR principles

This episode covers the following principles:

1- (F4) (meta)data are registered or indexed in a searchable resource

2- (R1.1) (Meta)data are released with a clear and accessible data usage license