Why reproducibility matters in GTA workflows
1. Why reproducibility matters
Scientific analyses must be transparent, traceable, and repeatable.
In fisheries data, reproducibility ensures that:
- Other researchers can validate results.
- Policy-makers can rely on evidence-based advice.
- Updates to datasets can be compared to earlier versions.
The FAIR principles (Findable, Accessible, Interoperable, Reusable) provide a framework for meeting these goals.
2. How reproducibility is implemented in GTA workflows
RStudio Server on VLab5
VLab5 provides a uniform RStudio environment (same RStudio version, OS image, and system libraries), which avoids “it works on my machine” issues and keeps demos consistent across learners. Use it when you want:
- A pre-configured R/RStudio environment for everyone.
- Stable system libraries (GDAL/PROJ, curl, etc.) aligned with the course.
- Shared instructions: the same scripts and relative paths work for all users.
Alternative: you can run exactly the same steps locally in RStudio; the workflows and scripts are identical.
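When running locally instead of on VLab5, it can help to record what the local environment looks like so that differences from the shared server are easy to spot. A minimal sketch using base R only; `env_summary()` is a hypothetical helper name and is not part of the GTA scripts:

```r
# Collect a few facts about the current R environment (base R only).
# env_summary() is a hypothetical helper, not part of the GTA workflows.
env_summary <- function() {
  list(
    r_version = R.version.string,               # full R version string
    platform  = R.version$platform,             # platform R was built for
    os_type   = .Platform$OS.type,              # "unix" or "windows"
    libcurl   = as.character(libcurlVersion())  # libcurl version used by R
  )
}

info <- env_summary()
str(info)
```

Comparing this output between VLab5 and a local machine gives a quick first check. If the sf package is installed, `sf::sf_extSoftVersion()` additionally reports the GDAL and PROJ versions.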
DOIs and Zenodo
DOIs (Digital Object Identifiers) provide a permanent, citable identifier for each dataset release.
In GTA, every public release is archived on Zenodo, which mints:
- a version DOI (e.g., 10.5281/zenodo.15496164) for that exact snapshot, and
- a concept DOI (stable “latest” record) that always points to the newest version.
Why we use them
- Stable citation in papers and reports,
- Long-term preservation of files and metadata,
- Clear versioning for reproducibility (each run points to a specific DOI).
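To make each run point to a specific DOI, the DOI can be stored once at the top of a script and turned into its resolver URL. A minimal sketch, assuming the version DOI quoted above; `doi_to_url()` is a hypothetical helper, not a GTA or Zenodo function:

```r
# Hypothetical helper: turn a DOI into its https://doi.org resolver URL.
doi_to_url <- function(doi) {
  # A DOI starts with "10.", a numeric registrant code, then a slash
  stopifnot(
    is.character(doi),
    length(doi) == 1L,
    grepl("^10\\.[0-9]+/", doi)
  )
  paste0("https://doi.org/", doi)
}

# Version DOI of the snapshot used in this analysis
gta_doi <- "10.5281/zenodo.15496164"
gta_url <- doi_to_url(gta_doi)
print(gta_url)
```

Keeping the DOI in a single variable means a later re-run against a newer release only requires changing one line, and the DOI used is visible in the script history.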
How to cite (example)
Use the version DOI in methods/results, and the concept DOI in general references.
Global Tuna Atlas (IRD release). Level-2 geo-referenced catch, Zenodo,
DOI: 10.5281/zenodo.15496164 (versioned); concept DOI: 10.5281/zenodo.15496164
BibTeX (template)
@dataset{gta_level2_2025,
title = {Global Tuna Atlas - Level 2 Geo-referenced Catch},
author = {{GTA Team}},
year = {2025},
version = {2025.1},
doi = {10.5281/zenodo.15496164},
url = {https://doi.org/10.5281/zenodo.15496164},
publisher = {Zenodo},
note = {Use the version DOI for exact reproducibility}
}
Dependency management with renv
renv captures the exact package versions used in the project in a lockfile (renv.lock), making analyses reproducible across sessions and machines. Documentation: https://rstudio.github.io/renv/
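For context, the lockfile itself is normally written with renv's own functions. A minimal sketch of that step, assuming renv is installed and the working directory is the project root; the wrapper name `record_environment()` is hypothetical and only there so that nothing runs until it is called:

```r
# Hypothetical wrapper around standard renv calls; nothing is executed
# until record_environment() is called from the project root.
record_environment <- function() {
  if (!requireNamespace("renv", quietly = TRUE)) {
    stop("The renv package is not installed: install.packages('renv')")
  }
  # Report differences between the project library, the lockfile,
  # and the packages used in the code
  renv::status()
  # Write the package versions currently in use to renv.lock
  renv::snapshot()
}
```

Learners following the course normally do not need this step, since the lockfile already exists; it is mainly relevant for maintainers who add or update packages.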
Restore the environment (recommended):
# Restore all packages declared in renv.lock
renv::restore()
Docker images
The Level 2 geo-referenced and effort datasets each ship a Dockerfile that encapsulates:
- Code (repositories + pinned versions),
- R packages (via renv), and
- Build scripts that produce the final dataset.
This guarantees that a dataset release can be recreated bit‑for‑bit provided the same inputs.
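One way to check that a rebuilt file really matches a released one is to compare checksums. A minimal sketch using `tools::md5sum()` from base R; the helper name `same_checksum()` and the demo files are hypothetical stand-ins for a released file and a local rebuild:

```r
# Hypothetical helper: TRUE when both files exist and have the same MD5.
same_checksum <- function(path_a, path_b) {
  sums <- unname(tools::md5sum(c(path_a, path_b)))  # NA for unreadable files
  !anyNA(sums) && sums[1] == sums[2]
}

# Demo with two temporary files standing in for "released" and "rebuilt"
released <- tempfile(fileext = ".csv")
rebuilt  <- tempfile(fileext = ".csv")
writeLines(c("year,value", "2020,1"), released)
writeLines(c("year,value", "2020,1"), rebuilt)
print(same_checksum(released, rebuilt))  # identical content

# Change one value in the rebuilt file and compare again
writeLines(c("year,value", "2020,2"), rebuilt)
print(same_checksum(released, rebuilt))  # content now differs
```

A checksum comparison is stricter than comparing row counts or summaries: any difference in bytes, including row order or number formatting, shows up as a mismatch.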
At this stage, the Level 0 and effort datasets still rely on source files collected manually from tRFMO websites, some of which are stored in a private Google Drive folder and are not publicly accessible. Consequently, only Level 2 datasets can currently be built entirely from open, DOI-based sources, without relying on restricted files.
3. What is not reproducible
- The direct database connection cannot be fully shared:
- Requires passwords that cannot be distributed publicly.
- Risk of overload if too many users connect simultaneously.
Instead, reproducibility is achieved by providing:
- Exported datasets (via Zenodo DOIs).
- Scripts that rebuild processing steps.
Each user could, if needed, set up their own database instance locally.
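For a local instance, credentials can be kept out of shared scripts by reading them from environment variables. A minimal sketch; the variable names (`GTA_DB_HOST`, `GTA_DB_NAME`, `GTA_DB_USER`, `GTA_DB_PASSWORD`), the default values, and the helper `db_settings()` are all assumptions made for illustration, not part of the GTA code:

```r
# Hypothetical helper: read connection settings from environment variables
# so that no password is ever written into a script or repository.
db_settings <- function() {
  password <- Sys.getenv("GTA_DB_PASSWORD", unset = NA)
  if (is.na(password)) {
    stop("GTA_DB_PASSWORD is not set (e.g. define it in ~/.Renviron)")
  }
  list(
    host     = Sys.getenv("GTA_DB_HOST", unset = "localhost"),
    dbname   = Sys.getenv("GTA_DB_NAME", unset = "gta"),
    user     = Sys.getenv("GTA_DB_USER", unset = "gta_user"),
    password = password
  )
}

# Demo only: in practice the variable is set outside the script
Sys.setenv(GTA_DB_PASSWORD = "example-only")
settings <- db_settings()
print(names(settings))
```

The resulting list could then be passed to `DBI::dbConnect()` together with whichever DBI driver matches the local database.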
Key takeaways
- Docker + renv ensure anyone can rebuild with the same environment.
- Each dataset ships with a Dockerfile and scripts to generate the release.
- Level 0 and effort datasets are partially dependent on non-public sources (e.g., Google Drive).