skills/open-science/fair-data
FAIR Data Principles
The FAIR principles — Findable, Accessible, Interoperable, Reusable — are a 15-point framework for making research data usable by humans and machines. Originally published by Wilkinson et al. in 2016 (Sci Data 3:160018), FAIR is now the de-facto policy language for funders (NIH, ERC, Wellcome, Horizon Europe), journals (Nature, PLOS, Science), and institutional data-management plans. This skill provides a decision tree for choosing a metadata standard, a repository, and a license, plus a self-assessment rubric to grade any dataset.
When to use
- Drafting a Data Management Plan (DMP) for NIH, NSF, ERC, Wellcome, or Horizon Europe.
- Choosing a domain repository (GEO, SRA, ENA, PRIDE, MetaboLights, BioImage Archive, EMDB, PDB, GenBank) versus a generalist one (Zenodo, Figshare, Dryad).
- Writing a metadata file for a sequencing, proteomics, metabolomics, imaging, or structural dataset.
- Preparing a data availability statement for a manuscript.
- Auditing an existing dataset against the 15 FAIR sub-principles (F1–F4, A1–A2, I1–I3, R1–R1.3).
- Responding to reviewers who say "data not available" or "metadata insufficient."
When NOT to use
- For code release (not data) — see
ors-open-science-code-release. - For licensing decisions specifically — see
ors-open-science-licensing. - For privacy/HIPAA/GDPR of human-subject data — see
ors-ethics-compliance-(separate skill). - For data that is genuinely embargoed, classified, or commercially sensitive — FAIR is aspirational; restrictions change F/A.
Prerequisites
- ORCID iD for the data author (https://orcid.org/).
- Familiarity with the data type (sequencing reads, mass spec peaks, microscope images, etc.).
- A decision on license: CC0 or CC-BY-4.0 are the FAIR defaults; see
ors-open-science-licensing. - Repository account (Zenodo, ENA, PRIDE, etc.).
Core workflow
- Identify data type and minimum information standard. Match the assay to the MIxS-style checklist (see "Document patterns" below).
- Choose domain repository first, generalist second. Use a community-curated repository if one exists; fall back to Zenodo/Figshare for non-standard outputs (figures, code, supplementary files).
- Reserve a persistent identifier (PID). Obtain a DOI from the repository at submission time. PIDs make the data findable AND citable.
- Write machine-readable metadata. Use the repository's required schema (e.g., MINSEQE for RNA-seq, MIAPE for proteomics, REMBI for bioimaging, SDRF for proteomics).
- Apply an open license. CC0 or CC-BY-4.0 for data; CC-BY-4.0 preferred when attribution is desired.
- Use a standard access protocol. HTTPS download is acceptable; controlled-access human data uses dbGaP/EGA with a Data Access Committee.
- Link data to publication and code. Include the dataset DOI in the paper, and the paper DOI in the dataset record (bidirectional citation).
- Self-assess with the FAIR rubric in the "Validation" section.
Document patterns
Pattern 1: The 15 FAIR sub-principles (Wilkinson 2016)
| Group | ID | Sub-principle |
|---|---|---|
| Findable | F1 | (Meta)data are assigned a globally unique and persistent identifier. |
| F2 | Data are described with rich metadata. | |
| F3 | Metadata clearly and explicitly include the identifier of the data they describe. | |
| F4 | (Meta)data are registered or indexed in a searchable resource. | |
| Accessible | A1 | (Meta)data are retrievable by their identifier using a standardised communications protocol. |
| A1.1 | The protocol is open, free, and universally implementable. | |
| A1.2 | The protocol allows for an authentication and authorisation procedure where necessary. | |
| A2 | Metadata are accessible, even when the data are no longer available. | |
| Interoperable | I1 | (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation. |
| I2 | (Meta)data use vocabularies that follow FAIR principles. | |
| I3 | (Meta)data include qualified references to other (meta)data. | |
| Reusable | R1 | (Meta)data are richly described with a plurality of accurate and relevant attributes. |
| R1.1 | (Meta)data are released with a clear and accessible data usage license. | |
| R1.2 | (Meta)data are associated with detailed provenance. | |
| R1.3 | (Meta)data meet domain-relevant community standards. |
Pattern 2: Domain-specific minimum-information standards
| Assay / data type | Standard | Notes |
|---|---|---|
| Microarray expression | MIAME | Minimum Information About a Microarray Experiment (FGED). |
| RNA-seq | MINSEQE | Minimum Information about a Sequencing Experiment. |
| Proteomics (MS) | MIAPE-MS | Minimum Information About a Proteomics Experiment. |
| Proteomics (gel/in-gel) | MIAPE-GE | |
| Glycomics | MIRAGE | |
| Metabolomics (MS) | MIAPE-MS + MetaboLights mandatory | |
| Metabolomics (NMR) | MIAPE-NMR | |
| Genomics/metagenomics | MIxS (GSC) | GSC: Genomic Standards Consortium. |
| Bioimaging | REMBI (2021) | Recommended Metadata for Biological Images. |
| Light microscopy | OME-XML / OME-TIFF | Open Microscopy Environment. |
| Flow cytometry | MIFlowCyt (FCS) | |
| Stem cells | MISFISHIE | |
| Sample metadata (any -omics) | SDRF-Proteomics, ENA sample checklist, ENA library | Sample-Data-Relationship File. |
| Computational models | MIASE / SED-ML / COMBINE | |
| 3D structures | PDBx/mmCIF | |
| Crystallography | mmCIF + structure factors | Deposited in PDB. |
| NMR structures | BMRB | |
| Cryo-EM | EMDB + PDB | Map + model; both required for full deposit. |
| Genomic sequence | GenBank/EMBL/DDBJ flatfile | INSDC coordinated. |
Pattern 3: Repository selection decision tree
Is there a community-curated domain repository for this data type?
├── YES → use it (GEO/SRA/ENA for seq, PRIDE for MS-proteomics,
│ MetaboLights for metabolomics, BioImage Archive for images,
│ EMDB+PDB for structures, GenBank for annotated sequences)
│ • Domain repos enforce metadata standards (R1.3)
│ • Domain repos mint DOIs
│ • Domain repos are indexed in EBI/NCBI portals (F4)
└── NO → use a generalist repository:
• Zenodo (CERN-hosted, 50 GB per record, GitHub integration)
• Figshare (DPI, 5 GB free, 20 GB institutional)
• Dryad (data-only, curation fee, 300 GB)
• Dataverse (institutional option)
→ Generalist repos still mint DOIs; you supply metadata.
Pattern 4: A minimal FAIR data availability statement (for a manuscript)
"Raw RNA-seq reads (fastq) are deposited at the NCBI Sequence Read Archive under BioProject accession PRJNA123456 (reviewer link: https://dataview.ncbi.nlm.nih.gov/...). Processed count matrices, sample metadata, and differential expression tables are deposited at Zenodo (DOI: 10.5281/zenodo.1234567). Analysis code and the Snakemake workflow are archived at Zenodo (DOI: 10.5281/zenodo.7654321). All data are released under CC-BY-4.0; code under MIT."
Common pitfalls
| Pitfall | Why it fails | Fix |
|---|---|---|
| Data on a personal/lab website | No PID, no metadata, no long-term preservation (A1, A2) | Deposit in Zenodo/Figshare to mint a DOI. |
| Data in a journal supplement | Journal supplements disappear with the journal subscription; not a PID (F1, A2) | Deposit independently; link via DOI in the data-availability statement. |
| Metadata in a free-text README | Not machine-readable (I1, I2) | Use the repository's controlled vocabulary (e.g., SDRF-Proteomics, ENA sample checklist). |
| License missing | Fails R1.1; cannot be reused legally | Apply CC0 (data) or CC-BY-4.0 (data) at deposit time. |
| Controlled vocabulary bypassed (e.g., free-text "liver" instead of UBERON:0002107) | I2 fails; not interoperable | Use ontology IDs: UBERON (anatomy), ChEBI (chemicals), NCBI Taxonomy (organisms), EFO (experimental factors). |
| Human data on an open server | Legal/ethical breach (HIPAA, GDPR) | Use controlled access: dbGaP (US) or EGA (EU). |
| "Available upon reasonable request" | Wilkinson 2016 and most funder mandates (NIH 2023, Horizon Europe) treat this as non-compliance | Actually deposit; or, if a Data Access Committee is genuinely required, state the DAC and the access procedure explicitly. |
| No link from paper to data | F3 fails; metadata doesn't reference the data | Include the DOI in the data-availability statement. |
| No link from data to paper | I3 fails; data don't reference the paper | Update the Zenodo record with the published DOI after publication. |
| Metadata describes file format, not content | "Fastq.gz, 12 GB" is not R1 | Describe sample, organism, assay, library prep, instrument, read length, depth. |
| Reusing data without citation | Open data still needs attribution (CC-BY) | Cite by DOI; provide a CITATION.cff or CREDITS file. |
Validation
A quick FAIR self-assessment. Score 1 point per satisfied sub-principle (max 15). Most journals expect ≥ 12/15 for a "data paper."
Findable (max 4)
- F1: DOI or accession ID assigned?
- F2: Metadata file is human-readable, ideally > 20 fields?
- F3: Metadata explicitly includes the data DOI/accession?
- F4: Indexed in a searchable resource (Google Dataset Search, FAIRsharing, re3data)?
Accessible (max 4)
- A1: Retrievable by identifier over HTTPS?
- A1.1: Protocol open, free, universal?
- A1.2: Access protocol supports auth where required (e.g., dbGaP/EGA)?
- A2: Metadata persists even if data is withdrawn?
Interoperable (max 3)
- I1: Format is open and broadly supported (e.g., fastq.gz, mzML, OME-TIFF)?
- I2: Vocabularies are shared/FAIR (ontologies, not free text)?
- I3: Qualified cross-references to other datasets/publications?
Reusable (max 4)
- R1: Rich attributes (sample, instrument, protocol, software versions)?
- R1.1: License declared (CC0, CC-BY-4.0)?
- R1.2: Provenance captured (workflow, parameters, versions)?
- R1.3: Meets community standard (MIAME/MINSEQE/MIAPE/MIxS/REMBI/etc.)?
Total: ____ / 15
Open alternatives
| Commercial / restricted tool | Open alternative | Trade-off |
|---|---|---|
| Geneious (Biomatters) | Benchling (cloud) or pure CLI (BWA + samtools + IGV desktop) | Benchling is open for academics; Geneious is closed-source. |
| PRIDE Inspector (proprietary build) | OpenMS TOPPView, pyOpenMS | All open source; UI differs. |
| Web-based METADATA editors (e.g., Genedata Expressionist) | ENA Webin, BioSample submission, SDRF-Proteomics text editor | ENA is free; manual but standards-compliant. |
| Cloud-locked data (AWS-only, Google-only) | Zenodo, EBI, NCBI, S3 with a DOI in FAIRsharing | Use cloud only for embargoed data; public data should be domain-repository first. |
References
- Wilkinson, M. D. et al.. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018. doi:10.1038/sdata.2016.18.
- GO-FAIR — FAIR principles official site: https://www.go-fair.org/fair-principles/
- FAIRsharing (registry of standards, databases, policies): https://fairsharing.org/
- re3data (Registry of Research Data Repositories): https://www.re3data.org/
- ELIXIR FAIR Cookbook: https://elixir-europe.org/services/fair-cookbook
- FORCE11 FAIR Principles: https://force11.org/info/the-fair-data-principles/
- Creative Commons license chooser: https://creativecommons.org/choose/
- NIH Genomic Data Sharing Policy: https://sharing.nih.gov/genomic-data-sharing-policy
- Genomic Standards Consortium (MIxS checklists): https://gensc.org/mixs/
- EBI MetaboLights: https://www.ebi.ac.uk/metabolights/
- EBI PRIDE (proteomics): https://www.ebi.ac.uk/pride/
- NCBI GEO / SRA: https://www.ncbi.nlm.nih.gov/geo/ and https://www.ncbi.nlm.nih.gov/sra
- ENA (European Nucleotide Archive): https://www.ebi.ac.uk/ena
- BioImage Archive: https://www.ebi.ac.uk/bioimage-archive/
- EMDB / PDB: https://www.ebi.ac.uk/emdb/ and https://www.rcsb.org/
- Zenodo: https://zenodo.org/
- FAIR Data Maturity Model (Research Data Alliance): https://www.rd-alliance.org/
Related skills
ors-open-science-licensing— picking CC0 / CC-BY / CC-BY-SA for the dataset.ors-open-science-code-release— releasing the code that produced the data.ors-open-science-preprints— pairing a data deposit with a preprint for early citation.ors-data-engineering-dvc-data-version-control— DVC for large file versioning.
Changelog
- 1.0.0 (2026-06-10): Initial adaptation by Pradyumna Jayaram. Synthesised Wilkinson et al. 2016 (the canonical source); GO-FAIR; FAIRsharing; ELIXIR FAIR Cookbook; domain standards (MIAME, MINSEQE, MIAPE, MIRAGE, MIxS, REMBI, MIASE); major repositories (ENA, GEO/SRA, PRIDE, MetaboLights, BioImage Archive, EMDB, PDB, GenBank, Zenodo). Decision tree, 15-sub-principle table, and self-assessment rubric are original compositions.
