FAIR Data Principles

The FAIR principles (Findable, Accessible, Interoperable, Reusable) are a set of 15 principles and sub-principles for making research data usable by both humans and machines. First published by Wilkinson et al. in 2016 (Sci Data 3:160018), FAIR is now the de facto policy language for funders (NIH, ERC, Wellcome, Horizon Europe), journals (Nature, PLOS, Science), and institutional data-management plans. This skill provides a decision tree for choosing a metadata standard, a repository, and a license, plus a self-assessment rubric for grading any dataset.

When to use

  • Drafting a Data Management Plan (DMP) for NIH, NSF, ERC, Wellcome, or Horizon Europe.
  • Choosing a domain repository (GEO, SRA, ENA, PRIDE, MetaboLights, BioImage Archive, EMDB, PDB, GenBank) versus a generalist one (Zenodo, Figshare, Dryad).
  • Writing a metadata file for a sequencing, proteomics, metabolomics, imaging, or structural dataset.
  • Preparing a data availability statement for a manuscript.
  • Auditing an existing dataset against the 15 FAIR sub-principles (F1–F4, A1–A2, I1–I3, R1–R1.3).
  • Responding to reviewers who say "data not available" or "metadata insufficient."

When NOT to use

  • For code release (not data) — see ors-open-science-code-release.
  • For licensing decisions specifically — see ors-open-science-licensing.
  • For privacy/HIPAA/GDPR of human-subject data — see ors-ethics-compliance- (separate skill).
  • For data that is genuinely embargoed, classified, or commercially sensitive: FAIR remains aspirational there, because access restrictions limit what Findable and Accessible can deliver.

Prerequisites

  • ORCID iD for the data author (https://orcid.org/).
  • Familiarity with the data type (sequencing reads, mass spec peaks, microscope images, etc.).
  • A decision on license: CC0 or CC-BY-4.0 are the FAIR defaults; see ors-open-science-licensing.
  • Repository account (Zenodo, ENA, PRIDE, etc.).

Core workflow

  1. Identify the data type and its minimum-information standard. Match the assay to the relevant checklist (see Pattern 2 under "Document patterns").
  2. Choose domain repository first, generalist second. Use a community-curated repository if one exists; fall back to Zenodo/Figshare for non-standard outputs (figures, code, supplementary files).
  3. Reserve a persistent identifier (PID). Obtain a DOI or accession number from the repository at submission time. PIDs make the data both findable and citable.
  4. Write machine-readable metadata. Use the repository's required schema (e.g., MINSEQE for RNA-seq, MIAPE and SDRF-Proteomics for proteomics, REMBI for bioimaging).
  5. Apply an open license. CC0 or CC-BY-4.0 for data; CC-BY-4.0 preferred when attribution is desired.
  6. Use a standard access protocol. HTTPS download is acceptable; controlled-access human data uses dbGaP/EGA with a Data Access Committee.
  7. Link data to publication and code. Include the dataset DOI in the paper, and the paper DOI in the dataset record (bidirectional citation).
  8. Self-assess with the FAIR rubric in the "Validation" section.
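
Steps 3 to 7 can be sketched as a small machine-readable record plus a completeness check. This is only an illustration: the field names in REQUIRED_FIELDS and the function missing_fields are hypothetical, not any repository's actual schema, and the DOI and ORCID values are placeholders.

```python
import json

# Hypothetical field names for illustration only; every real repository
# defines its own required schema.
REQUIRED_FIELDS = (
    "identifier",           # step 3: persistent identifier
    "title",
    "creators",
    "license",              # step 5: open license
    "access_url",           # step 6: standard access protocol (HTTPS)
    "related_publication",  # step 7: link from data back to the paper
)


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


record = {
    "identifier": "10.5281/zenodo.1234567",  # placeholder DOI
    "title": "Processed RNA-seq count matrices",
    "creators": [{"name": "Example Author", "orcid": "0000-0000-0000-0000"}],  # placeholder
    "license": "CC-BY-4.0",
    "access_url": "https://doi.org/10.5281/zenodo.1234567",
    "related_publication": "",  # fill in the paper DOI after publication
}

print(missing_fields(record))  # ['related_publication']
print(json.dumps(record, indent=2))
```

The empty related_publication field is flagged on purpose: the data-to-paper link is usually the last piece to be filled in.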

Document patterns

Pattern 1: The 15 FAIR sub-principles (Wilkinson 2016)

| Group | ID | Sub-principle |
|---|---|---|
| Findable | F1 | (Meta)data are assigned a globally unique and persistent identifier. |
| Findable | F2 | Data are described with rich metadata. |
| Findable | F3 | Metadata clearly and explicitly include the identifier of the data they describe. |
| Findable | F4 | (Meta)data are registered or indexed in a searchable resource. |
| Accessible | A1 | (Meta)data are retrievable by their identifier using a standardised communications protocol. |
| Accessible | A1.1 | The protocol is open, free, and universally implementable. |
| Accessible | A1.2 | The protocol allows for an authentication and authorisation procedure where necessary. |
| Accessible | A2 | Metadata are accessible, even when the data are no longer available. |
| Interoperable | I1 | (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation. |
| Interoperable | I2 | (Meta)data use vocabularies that follow FAIR principles. |
| Interoperable | I3 | (Meta)data include qualified references to other (meta)data. |
| Reusable | R1 | (Meta)data are richly described with a plurality of accurate and relevant attributes. |
| Reusable | R1.1 | (Meta)data are released with a clear and accessible data usage license. |
| Reusable | R1.2 | (Meta)data are associated with detailed provenance. |
| Reusable | R1.3 | (Meta)data meet domain-relevant community standards. |

Pattern 2: Domain-specific minimum-information standards

| Assay / data type | Standard | Notes |
|---|---|---|
| Microarray expression | MIAME | Minimum Information About a Microarray Experiment (FGED). |
| RNA-seq | MINSEQE | Minimum Information about a Sequencing Experiment. |
| Proteomics (MS) | MIAPE-MS | Minimum Information About a Proteomics Experiment. |
| Proteomics (gel/in-gel) | MIAPE-GE | |
| Glycomics | MIRAGE | |
| Metabolomics (MS) | MSI reporting guidelines | Metabolomics Standards Initiative; deposit in MetaboLights. |
| Metabolomics (NMR) | MSI reporting guidelines | Metabolomics Standards Initiative; deposit in MetaboLights. |
| Genomics/metagenomics | MIxS (GSC) | GSC: Genomic Standards Consortium. |
| Bioimaging | REMBI (2021) | Recommended Metadata for Biological Images. |
| Light microscopy | OME-XML / OME-TIFF | Open Microscopy Environment. |
| Flow cytometry | MIFlowCyt (FCS) | |
| In situ hybridization / immunohistochemistry | MISFISHIE | |
| Sample metadata (any -omics) | SDRF-Proteomics, ENA sample checklist, ENA library | SDRF: Sample and Data Relationship Format. |
| Computational models | MIASE / SED-ML / COMBINE | |
| 3D structures | PDBx/mmCIF | |
| Crystallography | mmCIF + structure factors | Deposited in PDB. |
| NMR structures | BMRB | |
| Cryo-EM | EMDB + PDB | Map + model; both required for full deposit. |
| Genomic sequence | GenBank/EMBL/DDBJ flatfile | INSDC coordinated. |
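
For scripted checks, a subset of the table above can be held as a simple lookup. This is a minimal sketch: the assay keys and the function name standard_for are hypothetical conventions chosen for this example, and only a few rows are transcribed.

```python
from typing import Optional

# A few rows transcribed from the table above; the keys are hypothetical labels.
MIN_INFO_STANDARD = {
    "microarray": "MIAME",
    "rna-seq": "MINSEQE",
    "proteomics-ms": "MIAPE-MS",
    "glycomics": "MIRAGE",
    "metagenomics": "MIxS",
    "bioimaging": "REMBI",
    "flow-cytometry": "MIFlowCyt",
}


def standard_for(assay: str) -> Optional[str]:
    """Return the minimum-information standard for an assay key, or None if unlisted."""
    return MIN_INFO_STANDARD.get(assay.strip().lower())


print(standard_for("RNA-seq"))      # MINSEQE
print(standard_for("spatial-omics"))  # None
```

Returning None rather than raising keeps the "no community standard exists yet" case explicit, which is when a generalist repository with hand-written metadata becomes the fallback.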

Pattern 3: Repository selection decision tree

Is there a community-curated domain repository for this data type?
├── YES → use it (GEO/SRA/ENA for seq, PRIDE for MS-proteomics,
│         MetaboLights for metabolomics, BioImage Archive for images,
│         EMDB+PDB for structures, GenBank for annotated sequences)
│         • Domain repos enforce metadata standards (R1.3)
│         • Domain repos issue persistent accession numbers (some also mint DOIs)
│         • Domain repos are indexed in EBI/NCBI portals (F4)
└── NO  → use a generalist repository:
          • Zenodo (CERN-hosted, 50 GB per record, GitHub integration)
          • Figshare (DPI, 5 GB free, 20 GB institutional)
          • Dryad (data-only, curation fee, 300 GB)
          • Dataverse (institutional option)
          → Generalist repos still mint DOIs; you supply metadata.
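
The decision tree above can be expressed as a small function. This is a sketch under stated assumptions: the data-type keys and the name choose_repository are hypothetical, the mapping only mirrors the examples listed in the tree, and picking Zenodo as the first fallback is an arbitrary choice for illustration.

```python
# Domain repositories as named in the tree; the keys are hypothetical labels.
DOMAIN_REPOSITORIES = {
    "sequencing": "GEO/SRA/ENA",
    "ms-proteomics": "PRIDE",
    "metabolomics": "MetaboLights",
    "imaging": "BioImage Archive",
    "structure": "EMDB+PDB",
    "annotated-sequence": "GenBank",
}
GENERALIST_REPOSITORIES = ["Zenodo", "Figshare", "Dryad", "Dataverse"]


def choose_repository(data_type: str) -> tuple:
    """Return (repository, is_domain_repository) following the tree above."""
    repository = DOMAIN_REPOSITORIES.get(data_type)
    if repository is not None:
        return repository, True
    # No community-curated repository: fall back to a generalist one.
    return GENERALIST_REPOSITORIES[0], False


print(choose_repository("ms-proteomics"))  # ('PRIDE', True)
print(choose_repository("survey-data"))    # ('Zenodo', False)
```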

Pattern 4: A minimal FAIR data availability statement (for a manuscript)

"Raw RNA-seq reads (fastq) are deposited at the NCBI Sequence Read Archive under BioProject accession PRJNA123456 (reviewer link: https://dataview.ncbi.nlm.nih.gov/...). Processed count matrices, sample metadata, and differential expression tables are deposited at Zenodo (DOI: 10.5281/zenodo.1234567). Analysis code and the Snakemake workflow are archived at Zenodo (DOI: 10.5281/zenodo.7654321). All data are released under CC-BY-4.0; code under MIT."

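A statement like the one in Pattern 4 can be generated from a handful of fields so that accessions and DOIs are never retyped by hand. The function name and parameters below are hypothetical, and the accession and DOIs are the placeholders from the example above.

```python
def data_availability_statement(
    raw_repository: str,
    raw_accession: str,
    processed_doi: str,
    code_doi: str,
    data_license: str = "CC-BY-4.0",
    code_license: str = "MIT",
) -> str:
    """Assemble a data availability statement from identifiers and licenses."""
    return (
        f"Raw data are deposited at {raw_repository} under accession {raw_accession}. "
        f"Processed data are deposited at Zenodo (DOI: {processed_doi}). "
        f"Analysis code is archived at Zenodo (DOI: {code_doi}). "
        f"All data are released under {data_license}; code under {code_license}."
    )


statement = data_availability_statement(
    raw_repository="the NCBI Sequence Read Archive",
    raw_accession="PRJNA123456",            # placeholder accession
    processed_doi="10.5281/zenodo.1234567",  # placeholder DOI
    code_doi="10.5281/zenodo.7654321",       # placeholder DOI
)
print(statement)
```
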
Common pitfalls

| Pitfall | Why it fails | Fix |
|---|---|---|
| Data on a personal/lab website | No PID, no metadata, no long-term preservation (A1, A2) | Deposit in Zenodo/Figshare to mint a DOI. |
| Data in a journal supplement | Supplements can become inaccessible behind a subscription and have no PID of their own (F1, A2) | Deposit independently; link via DOI in the data-availability statement. |
| Metadata in a free-text README | Not machine-readable (I1, I2) | Use the repository's structured metadata format and controlled vocabulary (e.g., SDRF-Proteomics, ENA sample checklist). |
| License missing | Fails R1.1; cannot be reused legally | Apply CC0 or CC-BY-4.0 at deposit time. |
| Controlled vocabulary bypassed (e.g., free-text "liver" instead of UBERON:0002107) | I2 fails; not interoperable | Use ontology IDs: UBERON (anatomy), ChEBI (chemicals), NCBI Taxonomy (organisms), EFO (experimental factors). |
| Human data on an open server | Legal/ethical breach (HIPAA, GDPR) | Use controlled access: dbGaP (US) or EGA (EU). |
| "Available upon reasonable request" | Wilkinson 2016 and most funder mandates (NIH 2023, Horizon Europe) treat this as non-compliance | Actually deposit; or, if a Data Access Committee is genuinely required, state the DAC and the access procedure explicitly. |
| No link from paper to data | F3 fails; metadata doesn't reference the data | Include the DOI in the data-availability statement. |
| No link from data to paper | I3 fails; data don't reference the paper | Update the Zenodo record with the published DOI after publication. |
| Metadata describes file format, not content | "Fastq.gz, 12 GB" does not satisfy R1 | Describe sample, organism, assay, library prep, instrument, read length, depth. |
| Reusing data without citation | Open data still needs attribution (CC-BY) | Cite by DOI; provide a CITATION.cff or CREDITS file. |
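
The controlled-vocabulary pitfall is easy to catch automatically. The sketch below flags sample fields whose values are free text rather than ontology identifiers. It assumes identifiers are written as PREFIX:digits; the prefix spellings other than UBERON and the function name free_text_fields are choices made for this example, and a real validator would also confirm that each ID exists in its ontology.

```python
import re

# Assumed identifier shape: an ontology prefix, a colon, then digits.
ONTOLOGY_ID = re.compile(r"^(UBERON|CHEBI|NCBITaxon|EFO):\d+$")


def free_text_fields(sample: dict) -> list:
    """Return the names of fields whose values are not ontology identifiers."""
    return sorted(
        name for name, value in sample.items()
        if not ONTOLOGY_ID.match(str(value))
    )


bad_sample = {"tissue": "liver", "organism": "NCBITaxon:9606"}
good_sample = {"tissue": "UBERON:0002107", "organism": "NCBITaxon:9606"}

print(free_text_fields(bad_sample))   # ['tissue']
print(free_text_fields(good_sample))  # []
```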

Validation

A quick FAIR self-assessment. Score 1 point per satisfied sub-principle (max 15). As a rule of thumb, aim for ≥ 12/15 before submitting a "data paper."

Findable (max 4)

  • F1: DOI or accession ID assigned?
  • F2: Metadata file is human- and machine-readable, ideally > 20 fields?
  • F3: Metadata explicitly includes the data DOI/accession?
  • F4: Indexed in a searchable resource (Google Dataset Search, FAIRsharing, re3data)?

Accessible (max 4)

  • A1: Retrievable by identifier over HTTPS?
  • A1.1: Protocol open, free, universal?
  • A1.2: Access protocol supports auth where required (e.g., dbGaP/EGA)?
  • A2: Metadata persists even if data is withdrawn?

Interoperable (max 3)

  • I1: Format is open and broadly supported (e.g., fastq.gz, mzML, OME-TIFF)?
  • I2: Vocabularies are shared/FAIR (ontologies, not free text)?
  • I3: Qualified cross-references to other datasets/publications?

Reusable (max 4)

  • R1: Rich attributes (sample, instrument, protocol, software versions)?
  • R1.1: License declared (CC0, CC-BY-4.0)?
  • R1.2: Provenance captured (workflow, parameters, versions)?
  • R1.3: Meets community standard (MIAME/MINSEQE/MIAPE/MIxS/REMBI/etc.)?

Total: ____ / 15
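
The rubric above can be tallied with a short script. This is a minimal sketch: the function name fair_score and the shape of its return value are assumptions made for this example, and the one-point-per-sub-principle scoring simply follows the rubric as written.

```python
SUB_PRINCIPLES = (
    "F1", "F2", "F3", "F4",
    "A1", "A1.1", "A1.2", "A2",
    "I1", "I2", "I3",
    "R1", "R1.1", "R1.2", "R1.3",
)


def fair_score(satisfied) -> dict:
    """Score one point per satisfied sub-principle, with per-group subtotals."""
    satisfied = set(satisfied)
    unknown = satisfied - set(SUB_PRINCIPLES)
    if unknown:
        raise ValueError(f"Unknown sub-principle IDs: {sorted(unknown)}")
    per_group = {"F": 0, "A": 0, "I": 0, "R": 0}
    for sub_principle in satisfied:
        per_group[sub_principle[0]] += 1  # first letter identifies the group
    return {"total": len(satisfied), "max": len(SUB_PRINCIPLES), **per_group}


result = fair_score(["F1", "F3", "A1", "A1.1", "I1", "R1.1"])
print(f"Total: {result['total']} / {result['max']}")  # Total: 6 / 15
```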

Open alternatives

| Commercial / restricted tool | Open alternative | Trade-off |
|---|---|---|
| Geneious (Biomatters) | Benchling (cloud) or pure CLI (BWA + samtools + IGV desktop) | Benchling is free for academics but not open source; the CLI stack is open source; Geneious is closed-source. |
| PRIDE Inspector (proprietary build) | OpenMS TOPPView, pyOpenMS | All open source; UI differs. |
| Web-based metadata editors (e.g., Genedata Expressionist) | ENA Webin, BioSample submission, SDRF-Proteomics text editor | ENA is free; manual but standards-compliant. |
| Cloud-locked data (AWS-only, Google-only) | Zenodo, EBI, NCBI, S3 with a DOI in FAIRsharing | Use cloud only for embargoed data; public data should be domain-repository first. |

References

Related skills

  • ors-open-science-licensing — picking CC0 / CC-BY / CC-BY-SA for the dataset.
  • ors-open-science-code-release — releasing the code that produced the data.
  • ors-open-science-preprints — pairing a data deposit with a preprint for early citation.
  • ors-data-engineering-dvc-data-version-control — DVC for large file versioning.

Changelog

  • 1.0.0 (2026-06-10): Initial adaptation by Pradyumna Jayaram. Synthesised Wilkinson et al. 2016 (the canonical source); GO-FAIR; FAIRsharing; ELIXIR FAIR Cookbook; domain standards (MIAME, MINSEQE, MIAPE, MIRAGE, MIxS, REMBI, MIASE); major repositories (ENA, GEO/SRA, PRIDE, MetaboLights, BioImage Archive, EMDB, PDB, GenBank, Zenodo). Decision tree, 15-sub-principle table, and self-assessment rubric are original compositions.