Why reproducibility matters in GTA workflows

Author

VLab Course Team


1. Why reproducibility matters

Scientific analyses must be transparent, traceable, and repeatable.

In fisheries data, reproducibility ensures that:

  • Other researchers can validate results.
  • Policy-makers can rely on evidence-based advice.
  • Updates to datasets can be compared to earlier versions.

The FAIR principles (Findable, Accessible, Interoperable, Reusable) provide a framework for achieving these goals.


2. How reproducibility is implemented in GTA workflows

RStudio Server on VLab5

VLab5 provides a uniform RStudio environment (same RStudio version, OS image, system libs), which avoids “it works on my machine” issues and makes demos consistent across learners. Use it when you want:

  • A pre-configured R/RStudio environment for everyone.
  • Stable system libraries (GDAL/PROJ, curl, etc.) aligned with the course.
  • Shared instructions: the same scripts and relative paths work for all users.

Alternative: you can run exactly the same steps locally in RStudio; the workflows and scripts are identical.
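
If you work locally, it is worth checking that your R version and geospatial system libraries are close to those of the VLab5 image before running the workflows. A minimal sketch of such a check, assuming the sf package is installed (the values returned on your machine are yours to compare against the course image, which is not hard-coded here):


# R version used for the session
getRversion()

# GDAL/GEOS/PROJ versions that sf is linked against
# (compare with the versions documented for the course image)
sf::sf_extSoftVersion()

# Full session details: OS, locale, attached packages
sessionInfo()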

DOIs and Zenodo

DOIs (Digital Object Identifiers) provide a permanent, citable identifier for each dataset release.
In GTA, every public release is archived on Zenodo, which mints:

  • a version DOI (e.g., 10.5281/zenodo.15496164) for that exact snapshot, and
  • a concept DOI (stable “latest” record) that always points to the newest version.

Why we use them

  • Stable citation in papers and reports,
  • Long-term preservation of files and metadata,
  • Clear versioning for reproducibility (each run points to a specific DOI).
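
Because each run points to a specific version DOI, a script can retrieve its inputs directly from the corresponding Zenodo record. A minimal sketch using Zenodo's public REST API (the record id is the numeric suffix of the version DOI; field names follow the current public API and may evolve):


# Resolve the Zenodo record behind the version DOI 10.5281/zenodo.15496164
library(jsonlite)

record_id <- "15496164"
rec <- fromJSON(paste0("https://zenodo.org/api/records/", record_id))

# Title of the archived release
rec$metadata$title

# Files attached to this exact release
# (field names follow Zenodo's public API and may change)
rec$files$key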

How to cite (example)

Use the version DOI in methods/results, and the concept DOI in general references.

Global Tuna Atlas (IRD release). Level-2 geo-referenced catch, Zenodo,
DOI: 10.5281/zenodo.15496164 (versioned); concept DOI: 10.5281/zenodo.15496164

BibTeX (template)

@dataset{gta_level2_2025,
  title   = {Global Tuna Atlas - Level 2 Geo-referenced Catch},
  author  = {{GTA Team}},
  year    = {2025},
  version = {2025.1},
  doi     = {10.5281/zenodo.15496164},
  url     = {https://doi.org/10.5281/zenodo.15496164},
  publisher = {Zenodo},
  note    = {Use the version DOI for exact reproducibility}
}

Dependency management with renv

renv captures the exact package versions used in the project in a lockfile (renv.lock), making analyses reproducible across sessions and machines. See https://rstudio.github.io/renv/ for details.

Restore the environment (recommended):


# Restore all packages declared in renv.lock
renv::restore()
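
For completeness, the commands below show the typical renv lifecycle when you modify the project yourself; only renv::restore() is needed to reproduce the published analyses.


# Check whether the project library and renv.lock are in sync
renv::status()

# After installing or updating packages, record them in renv.lock
renv::snapshot()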

Docker images

The Level 2 geo-referenced catch and effort datasets ship with a Dockerfile that encapsulates:

  1. Code (repositories + pinned versions),
  2. R packages (via renv), and
  3. Build scripts that produce the final dataset.

This guarantees that a dataset release can be recreated bit-for-bit, provided the same inputs are available.
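
In practice, rebuilding a release means building the image from the shipped Dockerfile and running it. A sketch driven from R, assuming Docker is installed locally (the image tag below is a placeholder, not the actual GTA image name):


# Build the image defined by the Dockerfile shipped with the release
# ("gta-level2:2025.1" is an illustrative tag)
system2("docker", c("build", "-t", "gta-level2:2025.1", "."))

# Run the build scripts inside the container, mounting the current
# directory so the generated dataset is written back to the host
system2("docker", c("run", "--rm",
                    "-v", paste0(getwd(), ":/workspace"),
                    "gta-level2:2025.1"))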

At this stage, the Level 0 and effort datasets still rely on input files manually collected from tRFMO websites, some of which are stored in a private Google Drive folder and are not publicly accessible. Consequently, only the Level 2 datasets can currently be fully rebuilt from open, DOI-based sources without relying on restricted files.


3. What is not reproducible

  • The direct database connection cannot be fully shared:
    • Requires passwords that cannot be distributed publicly.
    • Risk of overload if too many users connect simultaneously.

Instead, reproducibility is achieved by providing:

  • Exported datasets (via Zenodo DOIs).
  • Scripts that rebuild processing steps.

Each user could, if needed, set up their own database instance locally.
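
For users who do set up a local database, the connection itself uses standard R tooling; only the credentials are not distributed. A minimal sketch with DBI and RPostgres (all connection parameters below are placeholders for a self-hosted instance; no credentials for the GTA production database are shared):


library(DBI)

# Placeholder parameters for a local PostgreSQL instance you manage yourself
con <- dbConnect(RPostgres::Postgres(),
                 dbname   = "gta_local",
                 host     = "localhost",
                 port     = 5432,
                 user     = "my_user",
                 password = Sys.getenv("PGPASSWORD"))

# List the tables loaded into the local instance, then close the connection
dbListTables(con)
dbDisconnect(con)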

Key takeaways

  • Docker + renv ensure anyone can rebuild with the same environment.
  • Each dataset ships with a Dockerfile and scripts to generate the release.
  • Level 0 and effort datasets are partially dependent on non-public sources (e.g., Google Drive).