31 GFW Pipelines
A GFW “pipeline” is an automated process by which GFW converts raw AIS or other data into the datasets and tables used for research and analysis. Each pipeline is a multi-step process that includes ingesting and cleaning raw data, applying GFW algorithms, and saving data to BigQuery tables and/or other locations.
31.1 AIS pipeline
The AIS pipeline is GFW’s primary pipeline and includes the following processes:
- Normalize and combine raw AIS data from Orbcomm and Spire
- Remove noise and group positions into segments (i.e. the “segmenter”)
- Predict which positions are likely fishing positions
- Identify various event types (e.g. transshipment, port visits, fishing, AIS gaps)
AIS data tables and AIS event tables produced by a given version of the AIS pipeline are stored within a BigQuery dataset named pipe_production_vYYYYMMDD, where YYYYMMDD refers to the date the pipeline was created. Additionally, AIS research tables, which are designed for research and analysis, are stored in gfw_research, with each table including the AIS pipeline version (YYYYMMDD) in its name (e.g. pipe_vYYYYMMDD).
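The naming convention above can be sketched in Python. This is a minimal illustration, not GFW code: the function names are hypothetical, and the version date used in the example is made up rather than a real pipeline version.

```python
from datetime import date


def pipeline_dataset(version: date) -> str:
    """Name of the dataset holding AIS data and event tables for one pipeline version."""
    return f"pipe_production_v{version:%Y%m%d}"


def research_version_tag(version: date) -> str:
    """Version tag that appears in the names of research tables in gfw_research."""
    return f"pipe_v{version:%Y%m%d}"


# Hypothetical version date, used only to show the format.
example_version = date(2020, 1, 31)
print(pipeline_dataset(example_version))
print(research_version_tag(example_version))
```

Keeping the version as a date object, and formatting it only when a name is built, avoids typos in hand-written YYYYMMDD strings.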
31.1.1 AIS data tables
AIS data tables are tables produced by the AIS pipeline that contain a version or summary of the entire AIS dataset. The first type of AIS data tables are those that contain a complete copy of the AIS data output during various steps in the pipeline:
- position_messages_YYYYMMDD
- messages_segmented_YYYYMMDD
- spatial_measures_YYYYMMDD
- features_YYYYMMDD
- fishing_score_YYYYMMDD
- messages_scored_YYYYMMDD
For example, the position_messages_YYYYMMDD table is output following data normalization, and the messages_segmented_YYYYMMDD table is the output of the segmenter stage of the AIS pipeline. The final output is the messages_scored_ table, which contains all AIS position messages and their fishing score.
The other group of AIS data tables summarizes the complete AIS data by segment or vessel. These tables are further organized into daily date-sharded tables and overall tables. Daily summary tables are considered “static”, meaning their data is fixed, while overall summary tables are considered “dynamic” because their data is subject to change over time as new data are incorporated.
- segment_vessel_daily_YYYYMMDD and segment_vessel
- segment_info
- segment_identity_daily_YYYYMMDD
- vessel_info
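As a rough illustration of date sharding, the sketch below lists the daily shard names that cover a date range. It assumes that the YYYYMMDD suffix on a daily table is the date of the data in that shard. The helper name and the dates are hypothetical.

```python
from datetime import date, timedelta


def daily_shards(prefix: str, start: date, end: date) -> list[str]:
    """Return one shard name per day from start to end, inclusive."""
    n_days = (end - start).days
    return [f"{prefix}{start + timedelta(days=i):%Y%m%d}" for i in range(n_days + 1)]


# Hypothetical date range spanning a month boundary.
shards = daily_shards("segment_vessel_daily_", date(2020, 1, 30), date(2020, 2, 1))
for name in shards:
    print(name)
```

An overall table such as segment_vessel has no date suffix, so it is read as a single table whose contents may differ from one read to the next.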
31.1.1.1 vessel_id
A key feature of the AIS data tables is vessel_id, which is intended as a unique ID for an identity in the AIS data and is used to link AIS positions to an identity and AIS events…[EXPAND].
31.1.2 AIS research tables
Because AIS data tables, specifically messages_scored_, contain all AIS positions, they are extremely large tables and expensive to query. However, AIS messages are often broadcast every few seconds, which is much finer temporal resolution than needed for most research and analysis work. Therefore, the GFW AIS pipeline produces a series of “research tables” designed specifically to be more efficient for research/analysis. These tables start with the prefix research_.
Key details of the research tables include the following:
- AIS positions in the main research table (research_messages) are thinned to one position per minute per segment
- Every AIS position is assigned an amount of time (hours), which is equal to the time since the previous position in the segment
- There are two different fishing score variables, nnet_score and night_loitering. Use night_loitering for squid_jiggers and nnet_score for all other fishing classes.
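The three points above can be sketched on in-memory rows. This is an illustration under stated assumptions, not the pipeline's implementation. The field names seg_id and timestamp are hypothetical, and three choices are guesses made for the example: keeping the first position in each minute, thinning before computing hours, and giving the first position of a segment 0 hours.

```python
from datetime import datetime


def thin_per_minute(positions):
    """Keep the first position seen for each (segment, minute) pair (assumed rule)."""
    seen, kept = set(), []
    for p in sorted(positions, key=lambda p: (p["seg_id"], p["timestamp"])):
        minute = p["timestamp"].replace(second=0, microsecond=0)
        if (p["seg_id"], minute) not in seen:
            seen.add((p["seg_id"], minute))
            kept.append(p)
    return kept


def add_hours(positions):
    """Attach hours = time since the previous position in the same segment."""
    previous, out = {}, []
    for p in sorted(positions, key=lambda p: (p["seg_id"], p["timestamp"])):
        last = previous.get(p["seg_id"])
        # Assumption: the first position of a segment gets 0 hours.
        hours = 0.0 if last is None else (p["timestamp"] - last).total_seconds() / 3600
        out.append({**p, "hours": hours})
        previous[p["seg_id"]] = p["timestamp"]
    return out


def fishing_score(row, vessel_class):
    """Pick night_loitering for squid_jiggers and nnet_score for every other class."""
    field = "night_loitering" if vessel_class == "squid_jiggers" else "nnet_score"
    return row[field]


# Hypothetical positions for one segment.
raw = [
    {"seg_id": "a", "timestamp": datetime(2020, 1, 1, 0, 0, 5)},
    {"seg_id": "a", "timestamp": datetime(2020, 1, 1, 0, 0, 40)},
    {"seg_id": "a", "timestamp": datetime(2020, 1, 1, 0, 1, 10)},
    {"seg_id": "a", "timestamp": datetime(2020, 1, 1, 0, 3, 10)},
]
thinned = add_hours(thin_per_minute(raw))
```

Because each position carries its own hours value, summing hours over a group of positions gives the time covered by that group, for example the hours of activity with a high fishing score.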
31.1.3 AIS event tables
Lastly, the AIS pipeline includes a series of algorithms that identify certain vessel behaviors, referred to as events:
- Encounters
- Loitering
- Port visits and voyages
- Fishing events
31.2 VMS pipeline
GFW also creates individual VMS pipelines to process national VMS data. The output tables of the VMS pipelines mirror the data and event tables from the AIS pipeline and are stored in datasets named pipe_[country]_production_vYYYYMMDD. The VMS pipeline does not (yet) output VMS research tables.
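Since AIS and VMS datasets differ only in the [country] part of the name, a dataset name can be classified with a small parser. The sketch below is illustrative. It assumes the country token is a run of lowercase letters or digits without underscores, which the text does not specify, and the names in the example are made up.

```python
import re
from datetime import datetime

# Assumption: [country] is lowercase letters/digits with no underscores.
_DATASET_RE = re.compile(
    r"^pipe_(?:(?P<country>[a-z0-9]+)_)?production_v(?P<version>\d{8})$"
)


def parse_pipeline_dataset(name):
    """Split a pipeline dataset name into pipeline type, country, and version date."""
    match = _DATASET_RE.match(name)
    if match is None:
        return None  # not a pipeline dataset name in the expected format
    country = match.group("country")
    version = datetime.strptime(match.group("version"), "%Y%m%d").date()
    return {
        "pipeline": "VMS" if country else "AIS",
        "country": country,
        "version": version,
    }


# Hypothetical dataset names; "xx" stands in for a country token.
print(parse_pipeline_dataset("pipe_production_v20200131"))
print(parse_pipeline_dataset("pipe_xx_production_v20200131"))
```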