4 Big Query Key concepts
This page provides an overview and key concepts for working with GFW data in BigQuery, Google’s serverless, highly scalable enterprise data warehouse.
Information is organized in the following topics:
- BigQuery access and billing
- Projects and datasets
- Naming conventions
- Versioning
- Table formats
- Key concepts
- Best practices
4.1 BigQuery access and billing
4.1.1 Access
Working in BigQuery requires two forms of access:
- Google Cloud billing project: Using BigQuery incurs costs and BigQuery requires users be connected to a GCP billing project before running queries.
- The enabled billing project will be listed in the top menu bar of the BigQuery console
- The billing project for GFW staff is
gfw-google
- Research partners cannot be added to
gfw-google
and must create their own GCP billing project. GFW can help with this process and partners can generally receive ~$5,000 in free Google Credits through the Google for Good program. Contact Tyler (tyler@globalfishingwatch.org) for details
- BigQuery project: Data in BigQuery is organized in projects, which in turn contain datasets and tables. To access GFW data, users must be granted access to the GFW BigQuery project.
- GFW data lives in two primary BigQuery projects:
world-fishing-827
: Primary GFW project containing all internal GFW datasets
global-fishing-watch
: Contains GFW public fishing effort tables
- The project will appear in the left menu bar of the BigQuery console and clicking on the project name will expand to show the list of datasets the user can access.
- If you don’t see a BigQuery project in your console: Click on
ADD DATA
->Pin a project
->Enter project name
-> typeworld-fishing-827
-> selectPIN
- If you don’t see a BigQuery project in your console: Click on
- GFW data lives in two primary BigQuery projects:
4.1.2 Billing
BigQuery costs are incurred in two ways - storing and querying data. + Queries cost $5 per terabyte of data scanned + Queries are “free” for GFW staff, but we need to be careful
4.1.2.1 The BigQuery validator
The BigQuery validator is a built-in query debugger that also estimates query cost. The validator will turn green when a query is valid and be red otherwise. Always use the validator to check the cost of a query before running it!
4.2 GFW datasets
Within a BigQuery project, data is stored in tables and organized in datasets. There are many datasets in the world-fishing-827
project, but the following are the primary datasets for research and analysis:
pipe_production_vYYYYMMDD
: AIS pipeline datasetpipe_[country]_production_vYYYYMMDD
: Country-specific VMS pipeline datasets > + VMS data can only be used by partners with written permission from the relevant national authorities (see VMS for more info).
gfw_research
: Contains a variety of static tables useful for analysis and early versions of new research conceptsvessel_database
: Contains tables related to vessel registry informationanchorages
: Contains the GFW anchorages dataset
If you don’t see at least the pipe_production_vYYYYMMDD
, gfw_research
, vessel_database
, and anchorages
dataset let Tyler or Enrique know.
4.3 Naming conventions
Many datasets and tables in the world-fishing-827
project follow several key naming conventions. Abiding by these conventions will help make it easier for everyone to navigate the large amount of data in world-fishing-827
. It is beyond the scope of this document and unnecessary to prescribe naming conventions for every table. However, tables within a dataset should use clear and self-explanatory names.
4.3.1 General guidelines
- Underscores not dashes
- Lowercase
- Be clear, concise, and consistent
- Set a “time to live” (TTL) on all scratch_ datasets. TTL’s should be used with other datasets and tables when appropriate (see BigQuery docs).
4.3.2 Dataset names
Datasets are the top level containers of our data in BigQuery. They should have standardized names that are easy to understand and navigate. Before creating a new dataset, consider whether the tables could be placed in an existing dataset.
4.3.2.1 Prefixes
The following prefixes are used to organize datasets into groups:
pipe_
: These datasets contain automated tables that are consumed or produced by a GFW pipeline.gfw_
: Datasets related to GFW’s research work and public dataproj_
: Project-specific datasets. Each project should receive its own dataset clearly named with the project topic and/or client (e.g.proj_[topic]_[client]
)scratch_
: These datasets are for preliminary tables generated during early development and should be user specific (e.g.scratch_tyler
)VMS
: Datasets containing raw national VMS data.- When processed by the GFW VMS pipeline, data is saved in a
pipe_[country]
dataset
- When processed by the GFW VMS pipeline, data is saved in a
machine_learning_
: These datasets contain inputs/outputs of our machine learning models
Note: Some datasets may end with the suffix _ttl_[days]
. This indicates that the dataset has a “time to live” (TTL) set and tables in the dataset will be automatically deleted after the set period.
4.4 Versioning
GFW receives new data every day and is constantly working to improve our algorithms and models. As a result, there can be multiple versions of a data product in BigQuery at the same time (e.g. the vessel database). To differentiate versions, GFW uses the _v[YYYYMMDD]
naming convention for the suffix of a dataset or table, where the YYYYMMDD
indicates the date the table was created. When multiple tables have the same name but end in different date strings, BigQuery will “stack” them together with the table name followed by a number in parentheses (e.g. vi_ssvid_v(13)
).
When a dataset or table ends in _v
followed by a date string (_v20201001
), the date refers to the version of the table.
4.5 Table formats
BigQuery tables can have several important formats that are used to organize data and minimize query cost and execution time. In particular, GFW creates tables that may be date sharded or date partitioned. Additionally, tables may be clustered on certain fields. Date sharded and date partitioned tables behave in similar ways but, crucially, require different filtering syntax to limit query size. Therefore, understanding how to recognize if a table is date sharded, partitioned, and/or clustered, and how to query them properly, is critical.
4.5.1 Date sharded tables
Date sharded tables are actually a collection of tables, one for each date, where the data for a given date is stored in its own table where the table name ends in a date string (e.g. messages_scored_20180101
). When there is data for more than one date, the table name will be followed by a number in parentheses, with the number indicating how many days of data are in the sharded table (e.g. messages_scored_(3248)
).
There are two main ways to restrict queries of date sharded tables depending on whether you’re querying one or multiple dates:
For a single date, you can specify the table directly using the full table name:
SELECT *
FROM pipe_production_v20201001.messages_scored_20180101
When querying multiple dates, you use a wildcard character (*
) in the table name and then use the _TABLE_SUFFIX
argument in the WHERE
statement to specify the date range:
SELECT *
FROM `world-fishing-827.pipe_production_v20201001.messages_scored_*`
WHERE _TABLE_SUFFIX BETWEEN '20180101' AND '20180102'
Note: When using the *
wildcard you must surround the table name in backticks.
4.5.2 Date partitioned tables
Similar to date sharded tables, many GFW tables are date partitioned. These are single tables where data is stored in invisible partitions. If a table is partitioned, you will see This is a partitioned table
when viewing the table’s info in BigQuery. Tables can be partitioned on an actual field in the table (e.g. date
) or on the _partitiontime
field, which is a “psuedo colum” that is not actually in the table. You can check the partitioning specification by viewing the table’s details.
Queries against date partitioned tables can filter on the partitioned column using a WHERE
statement, limiting the amount of data scanned and thus the cost.
SELECT *
FROM gfw_research.pipe_v20201001
WHERE _partitiontime = "2019-01-01"
Note: Unlike date sharded tables, use the YYYY-MM-DD
format when filtering date partitioned tables.
4.5.3 Clustered tables
BigQuery tables can be clustered on certain fields. Similar to partitioning, clustering allows table data to be grouped together based on one or more table fields and improve query performance and cost when filtering on the clustered fields. You can check the clustering specification by viewing the table’s details.
Note: Queries that filter on a clustered field may be cheaper than what is estimated by the BigQuery validator.
4.6 Key concepts
BigQuery is powerful but complex tool and understanding certain key concepts can help get the most out of BigQuery.
4.6.1 User Defined Functions
User-defined functions (UDF) are functions created using another SQL expression or JavaScript. These functions accept columns of input, perform actions, and return a value. UDFs can be defined directly at the top of a query or stored as functions in a special type of dataset. GFW maintains a collection of helpful UDFs in the udfs
dataset in world-fishing-827
. These queries can be called in any query using the udfs.function()
syntax.
4.6.2 STRUCTS
and ARRAYS
Two key BigQuery data types used by GFW are ARRAYS
and STRUCTS
, which allow for nesting and/or repeating data.
An ARRAY
is an ordered list of zero or more elements of non-ARRAY values (e.g. STRING
, INT64
). An ARRAY
is listed as having a REPEATED
mode. GFW uses arrays to store multiple types of information, such as the list of identities (e.g. callsigns) broadcast by each MMSI over AIS.
A STRUCT
is a container of ordered fields each with a type (required) and field name (optional). A STRUCT
is listed as having a RECORD
type. Fields are nested within a STRUCT - e.g. struct_field.nested_field
. In general, GFW uses STRUCT
s to group together information from different sources. For example, the vessel info tables use organize vessel identities from AIS and registries in separate STRUCT
s.
4.6.3 Spatial BigQuery
BigQuery now supports a suite of powerful spatial analysis tools. These functions use the GEOGRAPHY
data type, which represents a pointset (points, lines, polygons) on WGS84. Single points can be described by a (longitude, latitude) pair. For describing more complex geographies (lines, polygons), BigQuery supports GeoJSON, Well-known text (WKT), and well-known binary (WKB) data formats in the GEOGRAPHY
column. GFW uses GEOGRAPHY
data types to represent several data types, including the location of “events” (e.g. encounters, port visits, fishing).
4.6.4 Data definition language
BigQuery supports the use of data definition language (DDL), which allow you to create, alter, and delete tables directly in your query statement. This is a particularly helpful way to create partitioned and clustered tables.
4.7 Best Practices
- Always use standard SQL - first line of your query should be
#standardSQL
- Write subqueries using
WITH
statement – avoid overly nested queries - Comment between queries to give space visually.
- Put all tables referenced at the top of the query.
- When using UDFs, try to use SQL based instead of JavaScript
- For numerical parameters that will be used in multiple places, you can use a UDF at the top of the query that returns the value
- To limit costs when developing or debugging queries:
- Restrict the date range or partitioned or sharded tables
- Use a
LIMIT 1000
clause at the end of a query. Comment this out when running for real
- For complex queries, consider including a (commented out) test query that does a sanity check