Global Tuna Atlas datasets and workflow architecture
Overview of the Global Tuna Atlas datasets and workflow architecture
The Global Tuna Atlas (GTA) integrates catch and effort data from all five tuna Regional Fisheries Management Organizations (t-RFMOs). GTA workflows aim to make tuna fisheries data FAIR (Findable, Accessible, Interoperable, Reusable) and reproducible.
The dataset families include:
- Nominal catches (annual, aggregated by fleet, gear, species, large area).
- Geo-referenced catches (monthly, 1°/5° grids).
- Effort datasets (multiple measurement units, kept unaggregated to preserve semantics).
- Derived CPUE datasets (catch-per-unit-effort on matched strata).
Each dataset is produced using harmonized formats based on CWP standards and reproducible scripts available in:
- Level 2 & Effort (IRD): https://github.com/firms-gta/geoflow-tunaatlas
- Level 0 (FIRMS): https://github.com/firms-gta/geoflow-gta
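As an illustration of what "CPUE on matched strata" means in the list above, here is a minimal base R sketch. The column names and values are invented for the example and are not the GTA schema; the point is only that catch and effort are joined on shared strata before dividing.

```r
# Toy catch and effort tables; column names and values are illustrative only.
catch <- data.frame(
  year    = c(2020, 2020, 2021),
  gear    = c("PS", "LL", "PS"),
  area    = c("A1", "A1", "A2"),
  catch_t = c(120, 30, 80)
)
effort <- data.frame(
  year         = c(2020, 2021, 2021),
  gear         = c("PS", "PS", "LL"),
  area         = c("A1", "A2", "A2"),
  effort_value = c(40, 20, 500),
  effort_unit  = c("FISHING_DAYS", "FISHING_DAYS", "HOOKS")
)

# Inner join: only strata present in BOTH tables are kept.
strata <- c("year", "gear", "area")
cpue <- merge(catch, effort, by = strata)

# CPUE is catch divided by effort within each matched stratum.
cpue$cpue <- cpue$catch_t / cpue$effort_value
print(cpue)
```

Strata that appear in only one table (here the 2020 LL catch and the 2021 LL effort) drop out of the result, which is why CPUE coverage can be narrower than either input.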
Presentation of dataset specificities
Nominal Catch Dataset (FIRMS Level 0)
- Time span: 1918–2023
- Content: Live-weight equivalent (metric tonnes), mainly retained catches
- Stratification: Year, fleet, gear, large area, species
- Use case: Benchmark of global tuna catch volumes
- DOI: https://doi.org/10.5281/zenodo.5745958
Geo-referenced Catch Datasets
FIRMS Level 0
- Time span: 1950–2023
- Resolution: 1°/5° grid, monthly
- Content: Catches from all t-RFMOs, harmonized
- DOI: https://doi.org/10.5281/zenodo.5747174
IRD Level 2
- Further processed to align geo-referenced totals with nominal totals
- Includes raising, strata updates, and handling of observer vs logbook data (e.g., IATTC)
- Warning: Not suitable for quota/legality studies due to uncertainty in precise georeferencing
- Code: https://github.com/firms-gta/geoflow-tunaatlas
- DOI: https://doi.org/10.5281/zenodo.15496164
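The idea of raising geo-referenced catches to nominal totals can be sketched with a simple proportional rule. This is an assumption made for illustration: the actual Level 2 workflow applies more elaborate, documented rules, and the values below are made up.

```r
# Hypothetical values for a single stratum (e.g., one year x species x gear).
georef <- data.frame(
  cell    = c("c1", "c2", "c3"),
  catch_t = c(10, 30, 40)   # geo-referenced catch, sums to 80 t
)
nominal_total <- 100        # nominal catch reported for the same stratum

# Raising factor: nominal total divided by the geo-referenced total.
raising_factor <- nominal_total / sum(georef$catch_t)

# Scale each cell proportionally so the cells sum to the nominal total.
georef$catch_raised_t <- georef$catch_t * raising_factor
print(georef)
```

Proportional scaling preserves the spatial pattern while changing the magnitudes, which is one reason raised values should not be read as precise location-level records.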
Effort Dataset (IRD Level 0)
- Time span: 1950–2023
- Content: 23 different measurement units; no aggregation
- Specificities: Known duplicates/parallel series (e.g., ICCAT Hours.FAD vs Hours.FSC; WCPFC SETS vs DAYS); end users must harmonize these for their own analysis
- DOI: https://doi.org/10.5281/zenodo.15496164
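Because effort is kept unaggregated, a first step for most analyses is to inspect the units and work with one unit at a time. The sketch below uses a toy table; the column names (`source_authority`, `measurement_unit`, `measurement_value`) and the values are assumptions for the example and should be checked against the published dataset.

```r
# Toy effort records with parallel series per source; values are invented.
effort <- data.frame(
  source_authority  = c("ICCAT", "ICCAT", "WCPFC", "WCPFC"),
  measurement_unit  = c("Hours.FAD", "Hours.FSC", "SETS", "DAYS"),
  measurement_value = c(120, 80, 15, 9)
)

# See which units each source reports before combining anything.
units_by_source <- table(effort$source_authority, effort$measurement_unit)
print(units_by_source)

# Totals are only meaningful within a unit, so always group by unit.
total_by_unit <- aggregate(
  measurement_value ~ source_authority + measurement_unit,
  data = effort, FUN = sum
)

# Restrict an analysis to a single unit rather than summing across units.
sets_only <- effort[effort$measurement_unit == "SETS", ]
print(sets_only)
```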
Exploration resources
Example Rmd notebooks (in the L2/Effort repo):
- Exploring Level 2 catch: https://github.com/firms-gta/geoflow-tunaatlas/blob/master/summary_catch.Rmd
- Exploring Level 0 effort: https://github.com/firms-gta/geoflow-tunaatlas/blob/master/summary_effort.Rmd
Shiny apps:
GTA Workflow Architecture (Level 0 → Level 2)
- Level 0 → Level 2: Level 2 datasets are derived products built from Level 0 (harmonized sources) by applying documented processing (e.g., raising to nominal totals, strata updates, harmonization, QA/QC). Level 2 aligns better with nominal totals, but it is not appropriate for quota/legality studies because uncertainty in precise geo-referencing can remain.
Per-release PDF (impact of each step)
For each IRD Zenodo release, the dataset is built using the workflow script create_global_tuna_atlas_dataset_v2025.R from the firms-gta/geoflow-tunaatlas repository.
During the build, the CWP.dataset R package generates an automated PDF report that:
- lists every workflow step (ingest → harmonize → raise → aggregate/align → QA/QC),
- records the parameters and assumptions used,
- and quantifies the impact on the data (e.g., percent changes, coverage adjustments, strata updates).
Example: https://zenodo.org/record/15496164/files/Recap_of_the_process.pdf
This documentation complements the machine-readable provenance stored in flow definitions and logs.
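To make "quantifies the impact on the data" concrete, the following sketch computes a percent change in total catch between consecutive steps. It does not reproduce the CWP.dataset report; the step names follow the list above and the totals are invented.

```r
# Hypothetical total catch (tonnes) recorded after each workflow step.
steps <- data.frame(
  step    = c("ingest", "harmonize", "raise", "aggregate"),
  total_t = c(1000, 980, 1225, 1225)
)

# Total after the previous step; the first step has no predecessor.
previous <- c(NA, head(steps$total_t, -1))

# Percent change relative to the previous step.
steps$pct_change <- round(100 * (steps$total_t - previous) / previous, 1)
print(steps)
```

A table of this kind shows at a glance which step moves the totals (here the raising step) and which steps leave them unchanged.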
Workflow automation with geoflow
To ensure reproducibility and traceability, GTA workflows are implemented with the geoflow R package, which structures pipelines into tasks with explicit inputs, outputs, and metadata.
In practice, the list of datasets to build and publish is maintained in a machine-readable entities table (CSV).
For GTA, see: geoflow_entities_tuna_global_datasets_IRD_level1_2022 - IRD_level2.csv.
The entities CSV drives each run by declaring, for every dataset (“entity”):
- Identification & description (id, title, abstract)
- Inputs (DOI URLs, RFMO downloads, private paths when applicable)
- Processing hints (task names/parameters, expected temporal/spatial aggregations)
- Outputs (target formats/locations)
- Publication targets (Zenodo community/concept DOI; optionally GeoServer/GeoNetwork/database)
- Metadata (license, keywords, contacts, temporal/spatial coverage, version/release info)
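A minimal sketch of reading and sanity-checking an entities table in base R is shown below. The columns and rows are a simplified, hypothetical stand-in: the real GTA entities CSV has more columns and its own value syntax, so refer to the file named above for the actual layout.

```r
# Simplified, hypothetical entities table (not the real GTA file).
entities_csv <- "Identifier,Title,TemporalCoverage,Data
demo_catch_level0,Demo geo-referenced catch (Level 0),1950/2023,source:catch_level0.csv
demo_catch_level2,Demo geo-referenced catch (Level 2),1950/2023,source:catch_level2.csv
"
entities <- read.csv(text = entities_csv, stringsAsFactors = FALSE)

# Basic checks before a run: required columns present, identifiers unique.
required <- c("Identifier", "Title", "Data")
missing_cols <- setdiff(required, names(entities))
stopifnot(length(missing_cols) == 0, !anyDuplicated(entities$Identifier))

# One row per dataset ("entity") to build.
for (i in seq_len(nrow(entities))) {
  cat(sprintf("%s: %s\n", entities$Identifier[i], entities$Title[i]))
}
```

Keeping such checks next to the CSV helps catch duplicated identifiers or missing fields before a long run starts.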
Execution flow with geoflow:
- Read the entities CSV and instantiate the flow of tasks per entity
- Run steps (ingest → harmonize → raise → aggregate → QA/QC)
- Export data and generate human-readable reports
- Publish outputs with DOIs and record machine-readable provenance
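The execution flow above can be pictured as an ordered chain of tasks with a provenance record per task. The sketch below uses plain R functions as stand-ins; it does not use the geoflow API, and the task bodies (including the fixed 1.25 raising factor) are placeholders.

```r
# Each task maps a data frame to a data frame (stand-ins, not geoflow actions).
tasks <- list(
  ingest    = function(d) d,
  harmonize = function(d) { d$gear <- toupper(d$gear); d },
  raise     = function(d) { d$catch_t <- d$catch_t * 1.25; d },
  aggregate = function(d) stats::aggregate(catch_t ~ year + gear, data = d, FUN = sum)
)

data <- data.frame(
  year    = c(2020, 2020, 2021),
  gear    = c("ps", "PS", "ll"),
  catch_t = c(40, 40, 20)
)

# Run the tasks in order and log what each one produced.
provenance <- data.frame(task = character(), rows = integer(), total_t = numeric())
for (name in names(tasks)) {
  data <- tasks[[name]](data)
  provenance <- rbind(
    provenance,
    data.frame(task = name, rows = nrow(data), total_t = sum(data$catch_t))
  )
}
print(provenance)
```

The log makes it possible to say which task changed the row count or the total, which is the kind of traceability the workflow records in machine-readable form.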
Benefits:
- Precise tracking of data provenance (which task produced what, with which parameters)
- Automated export and DOI-based publication (Zenodo)
- Optional integration with GeoServer/GeoNetwork and databases
- A single, versioned CSV “source of truth” for datasets and metadata
Quick run patterns
# 2) Create Level 2 datasets
# - Downloads required DOI resources when needed
# - Applies raising and strata updates to align with nominal totals
# - Generates an impact PDF via CWP.dataset (per-release)
source("level_2_catch_local.R")
Key takeaways
- Level 2 is derived from Level 0 through documented, reproducible steps (geoflow).
- IRD Zenodo records include a PDF documenting each step and its impact on data.
- The entities CSV defines what to build/publish and underpins traceable, automated runs.