2  BigQuery Datasets

Modified

August 26, 2025

GFW data is stored as BigQuery tables and organized in a handful of key datasets. These datasets contain the automated output of GFW’s data pipelines as well as manually generated tables from development and research projects.

2.1 Staging Process

Description: To distinguish new versions of datasets from those used in production, we are currently developing a new staging process.

Details: The staging process includes the following:

  1. Proof-of-Concept data:
  • Internal only and selected research partners
  • One-off or manually updated tables
  • Manual QA
  • Stored in scratch, project-specific, or gfw_research datasets
  2. Prototype data:
  • Intermediate and prototype tables
  • Automated data runs (updated daily)
  • Semi-automated QA
  • Prototype tables should use the proto_ prefix
  • Can be used by researchers and analysts, but be aware that, although the data has been QA’d, not all bugs may have been resolved
  • Basic documentation will exist for these datasets
  • Stored in pipeline datasets (e.g. pipe_ais_v3_published) with the proto_ prefix
  3. Production data:
  • Production-ready datasets
  • Public/research partners
  • Automated QA
  • Once proto_ tables have been used and QA’d for a given period of time and have finalized documentation and automated QA metrics, they are considered production data and the proto_ prefix is dropped.
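
The naming conventions above can be turned into a quick check of which staging level a table probably belongs to. The following is a minimal sketch and not an official GFW tool: the function name `staging_level`, the `prj_` prefix for project-specific datasets (inferred from the prj_vessel_classification example elsewhere on this page), and the table names in the usage lines are assumptions made for illustration.

```python
PROTO_PREFIX = "proto_"

# Assumption: proof-of-concept data lives in gfw_research or in datasets
# whose names start with scratch_ or prj_ (project-specific datasets).
POC_DATASETS = ("gfw_research",)
POC_DATASET_PREFIXES = ("scratch_", "prj_")


def staging_level(dataset: str, table: str) -> str:
    """Guess the staging level of a table from its dataset and table name."""
    if dataset in POC_DATASETS or dataset.startswith(POC_DATASET_PREFIXES):
        return "proof-of-concept"
    # Only *_published pipeline datasets are classified here; other
    # datasets (e.g. *_internal) are deliberately left as "unknown".
    if dataset.startswith("pipe_") and dataset.endswith("_published"):
        if table.startswith(PROTO_PREFIX):
            return "prototype"
        return "production"
    return "unknown"


# Hypothetical table names, used only to show the convention.
print(staging_level("pipe_ais_v3_published", "proto_example_table"))
print(staging_level("pipe_ais_v3_published", "example_table"))
print(staging_level("gfw_research", "example_table"))
```

A name-based check like this only reflects the convention; it says nothing about whether a table has actually passed QA.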

2.2 AIS PIPE-3

The dataset structure was entirely revamped in PIPE-3 and simplified by generating all AIS tables in two datasets.

2.2.1 pipe_ais_v3_internal

  • Description: This dataset contains all intermediate tables that are generated by the AIS pipeline. These tables are used to generate the final tables in pipe_ais_v3_published. The tables in this dataset are not intended to be shared outside of GFW and usage within GFW should be limited to those who are working on the pipeline.

  • When to use: Only use this dataset if you are working on the AIS pipeline, e.g. as a researcher or data engineer. The intermediate tables in this dataset can also be used for debugging purposes.

2.2.2 pipe_ais_v3_published

  • Description: This dataset contains the final output from the AIS pipeline. The tables in this dataset are intended to be shared with GFW partners and the public.

  • When to use: This dataset is the primary dataset to use for AIS-based analyses by GFW staff and external users.

2.3 Non-AIS Pipelines

Besides AIS, we have a number of other pipelines that currently follow the same structure as the AIS pipeline:

  • Sentinel-1: pipe_sar_v1_internal and pipe_sar_v1_published
  • Sentinel-2: TBD (currently under development)
  • VMS: pipe_vms_v3_internal and pipe_vms_v3_published
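
Because these pipelines share one naming pattern, the dataset names can be derived from a source code and a version number. The sketch below assumes the pattern `pipe_<source>_v<version>_internal` / `_published` seen in the names listed above; the helper `pipeline_datasets` and the `PIPELINES` mapping are hypothetical and exist only for illustration.

```python
from typing import NamedTuple


class PipelineDatasets(NamedTuple):
    internal: str   # intermediate tables, pipeline developers only
    published: str  # final output, intended for analysis


def pipeline_datasets(source: str, version: int) -> PipelineDatasets:
    """Build the internal/published dataset names for one pipeline."""
    base = f"pipe_{source}_v{version}"
    return PipelineDatasets(f"{base}_internal", f"{base}_published")


# Source codes and versions taken from the dataset names listed above.
# Sentinel-2 is omitted because its datasets are still TBD.
PIPELINES = {
    "AIS": pipeline_datasets("ais", 3),
    "Sentinel-1": pipeline_datasets("sar", 1),
    "VMS": pipeline_datasets("vms", 3),
}

for name, datasets in PIPELINES.items():
    print(f"{name}: analyze from {datasets.published}")
```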

2.4 Development Datasets

2.4.1 gfw_research

  • Description: This dataset is intended as a general location to store proof-of-concept (PoC) tables developed by the Research Team that have not yet been incorporated into an automated pipeline. Many PoC tables originate in specific project datasets (e.g. prj_vessel_classification) but not all users have access to all project datasets. Once a PoC has reached a level of stability, it can be copied to gfw_research to facilitate access, as most GFW users and research partners have access to gfw_research. This dataset also includes various reference tables useful for analysis (e.g. EEZ info, flags of convenience, country codes, ice masks, etc.).

A related dataset is gfw_research_precursors. Like the _internal datasets described above, gfw_research_precursors is intended simply as a storage location for intermediate tables that are generally not required for analysis.

  • When to use: Look to this dataset when interested in accessing the latest stable versions of PoC tables being developed by the Research Team.

2.4.2 scratch_[GFW_USERNAME]_ttl[x]

  • Description: Each GFW user has their own “scratch” dataset for development and testing. Tables in these datasets are assigned a “time to live” (ttl) and are deleted after a set number of days, so for long-term projects you should create new project-specific datasets.
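
To estimate when a scratch table will disappear, the username and ttl can be read from the dataset name. This is a sketch under two assumptions that are not confirmed on this page: that `[x]` in the dataset name is a whole number of days, and that the countdown starts at table creation. The username `jane_doe` and both helper functions are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches scratch_[GFW_USERNAME]_ttl[x], assuming [x] is an integer.
SCRATCH_PATTERN = re.compile(r"^scratch_(?P<user>.+)_ttl(?P<days>\d+)$")


def parse_scratch_dataset(name: str) -> tuple[str, int]:
    """Return (username, ttl_days) for a scratch dataset name."""
    match = SCRATCH_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a scratch dataset name: {name!r}")
    return match.group("user"), int(match.group("days"))


def expected_deletion(created: datetime, ttl_days: int) -> datetime:
    """Estimate the deletion time, assuming the ttl counts from creation."""
    return created + timedelta(days=ttl_days)


user, ttl_days = parse_scratch_dataset("scratch_jane_doe_ttl30")
created = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(user, ttl_days, expected_deletion(created, ttl_days).date())
```

If the estimated date falls before the end of your project, that is a sign the tables belong in a project-specific dataset instead.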

2.5 Other Relevant Datasets

2.5.1 pipe_ais_sources_v20201001 and pipe_ais_sources_v20220628

These two datasets contain the raw AIS data that is used to run any AIS pipeline. The only processing applied to tables in these datasets is: 1. parsing (decoding) of raw NMEA data; and 2. normalization of the data to a common schema, as well as deduplication of messages from different data providers.

Historically, they have not been part of the AIS pipeline itself because they are usually not regenerated when a new AIS pipeline version is released.

2.5.2 anchorages

The anchorages pipeline generates tables based on AIS data as well as static data sources of anchorages and their names. This dataset is used across different pipelines, such as AIS and VMS.

2.5.3 pipe_static

This dataset contains static tables generated from static data sources. The data is used across most pipelines to add geographical information such as distance from shore or depth. WARNING: for some tables it is unclear what the underlying data source is!

2.5.4 pipe_regions_layers

This dataset contains region lookup tables that define the shape of each region. These tables are occasionally updated due to corrections, regional disputes, and new or shifting regions. Like the static tables, this dataset is used across most pipelines to determine which regions a given coordinate is located in.

2.5.5 tech_dq_monitoring and tech_anomaly_detection

These two datasets contain tables that are used to monitor the data quality of all GFW data. The data may be used for debugging or ad hoc analyses, but because their primary purpose is to power dashboards and anomaly alerts, the data does not follow the same reliability standards we have for other pipelines. As a result, data might be outdated or unavailable at times.