5 BigQuery Key Concepts
This page provides an overview and key concepts for working with GFW data in BigQuery, Google’s serverless, highly scalable enterprise data warehouse.
Information is organized in the following topics:
- BigQuery access and billing
- Projects and datasets
- Naming conventions
- Versioning
- Table formats
- Key concepts
- Best practices
5.1 BigQuery access and billing
5.1.1 Access
Working in BigQuery requires two forms of access:
- Google Cloud billing project: Using BigQuery incurs costs, and BigQuery requires users to be connected to a GCP billing project before running queries.
  - The enabled billing project is listed in the top menu bar of the BigQuery console.
  - The billing project for GFW staff is `gfw-google`.
  - Research partners cannot be added to `gfw-google` and must create their own GCP billing project. GFW can help with this process, and partners can generally receive ~$5,000 in free Google credits through the Google for Good program. Contact Tyler (tyler@globalfishingwatch.org) or Andrea (andreasancheztapia@globalfishingwatch.org) for details.
- BigQuery project: Data in BigQuery is organized in projects, which in turn contain datasets and tables. To access GFW data, users must be granted access to the GFW BigQuery project.
- GFW data lives in two primary BigQuery projects:
  - `world-fishing-827`: GFW project containing all internal GFW datasets related to research and development.
  - `global-fishing-watch`: GFW project containing the output of automated GFW pipelines, as well as GFW public datasets.
- The project will appear in the left menu bar of the BigQuery console, and clicking on the project name will expand it to show the list of datasets the user can access.
- If you don’t see a BigQuery project in your console: click `+ ADD` -> `Search for and star a project` / `Star a project by name` -> enter the project name (e.g. `world-fishing-827`) -> select `STAR`.
5.1.2 Billing
BigQuery costs are incurred in two ways: storing data and querying data.
- Queries cost $6.25 per terabyte of data scanned.
- GFW staff can use the `gfw-google` billing project. Queries using this project are “free” for GFW staff, but we still need to be careful about query costs.
5.1.2.1 The BigQuery validator
The BigQuery validator is a built-in query debugger that also estimates query cost. The validator turns green when a query is valid and red otherwise. Always use the validator to check the cost of a query before running it!
5.2 GFW datasets
Within a BigQuery project, data is stored in tables and organized in datasets. There are many datasets in the `world-fishing-827` and `global-fishing-watch` projects. See the BigQuery Table Reference page for the list of key datasets.
5.3 Naming conventions
Many datasets and tables follow several key naming conventions. Abiding by these conventions will help make it easier for everyone to navigate the large amount of GFW data in BigQuery. It is beyond the scope of this document and unnecessary to prescribe naming conventions for every table. However, tables within a dataset should use clear and self-explanatory names.
5.3.1 General guidelines
- Underscores not dashes
- Lowercase
- Be clear, concise, and consistent
- Set a “time to live” (TTL) on all `scratch_` datasets. TTLs should also be used with other datasets and tables when appropriate (see the BigQuery docs and the sketch below).
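For example, a default TTL can be set on an existing dataset using DDL. A minimal sketch, assuming a user scratch dataset named `scratch_tyler` (the 7-day value is illustrative):

#standardSQL
-- Set a default 7-day expiration for all new tables in a scratch dataset
ALTER SCHEMA scratch_tyler
SET OPTIONS (default_table_expiration_days = 7)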
5.3.2 Dataset names
Datasets are the top level containers of our data in BigQuery. They should have standardized names that are easy to understand and navigate. Before creating a new dataset, consider whether the tables could be placed in an existing dataset.
5.3.2.1 Prefixes
The following prefixes are used to organize datasets into groups:
- `pipe_[source]`: These datasets contain automated tables that are consumed or produced by a GFW pipeline. The `source` indicates the type of data included in the pipeline.
- `gfw_`: Datasets related to GFW’s research work and public data.
- `proj_`/`prj`: Project-specific datasets. Each project should receive its own dataset clearly named with the project topic (e.g. `proj_[topic]`).
- `scratch_`: These datasets are for preliminary tables generated during early development and should be user specific (e.g. `scratch_tyler`).
- `VMS`: Datasets containing raw national VMS data. When processed by the GFW VMS pipeline, data is saved in a `pipe_[country]` dataset.
- `machine_learning_`: These datasets contain inputs/outputs of our machine learning models.
Note: Some datasets may end with the suffix `_ttl_[days]`. This indicates that the dataset has a “time to live” (TTL) set and tables in the dataset will be automatically deleted after the set period.
5.4 Versioning
5.4.1 Pipeline versioning
GFW strives to use a semantic versioning approach to version our data pipelines. These version numbers are indicated in the dataset name using the `_v` flag following the `pipe_[source]` prefix. For example, the `pipe_ais_v3_published` dataset name indicates that it contains published tables for the 3rd version of the AIS pipeline.
5.4.2 Date versioning
GFW is constantly receiving and processing new data. As a result, certain tables may be updated periodically to incorporate new data. To differentiate versions of the same table created at different points in time, GFW uses the `_v[YYYYMMDD]` naming convention for the suffix of a table, where `YYYYMMDD` indicates the date the table was created. When multiple tables have the same name but end in different date strings, BigQuery will “stack” them together, with the table name followed by a number in parentheses (e.g. `vi_ssvid_v(13)`).
When a dataset or table ends in `_v` followed by a date string (e.g. `_v20201001`), the date refers to the version of the table. However, when there is no `_v` preceding the date string, the date indicates that the table includes data for that specific date, as explained in the next section on table formats.
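For example, a specific dated version of a table can be queried directly rather than relying on the latest view (the dataset and version date here are illustrative):

#standardSQL
-- Query one dated version of the vessel info table (illustrative name)
SELECT *
FROM `world-fishing-827.gfw_research.vi_ssvid_v20201001`
LIMIT 1000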
Note: Using the latest view versus versioned tables. For `product_events_{EVENT_TYPE}` tables, we only preserve the most recent version, and it is recommended to use the view version of the table rather than a specific dated version. In a future pipeline release we will remove the specific dated versions of these tables. It is generally recommended to use the latest view instead of versioned tables, and the way tables are versioned may change in the future. Furthermore, access to versioned tables may be restricted to certain users in future pipeline releases.
5.4.3 Compatibility tables
Occasionally, we need to apply breaking changes within a pipeline version. For example, during the pipe3 lifetime we had to make a breaking change to the definition of `product_events_port_visits`. When this happens, we add a versioning suffix, in this case `product_events_port_visits_v2`. This “compatibility table” will be maintained for the lifetime of the pipeline version and will replace the original table in the next pipeline version. In this case, `product_events_port_visits_v2` will become `product_events_port_visits` in pipe4 and the original `product_events_port_visits` table will be removed.
5.5 Table formats
BigQuery tables can have several important formats that are used to organize data and minimize query cost and execution time. In particular, GFW creates tables that may be date sharded or date partitioned. Additionally, tables may be clustered on certain fields. Date sharded and date partitioned tables behave in similar ways but, crucially, require different filtering syntax to limit query size. Therefore, understanding how to recognize if a table is date sharded, partitioned, and/or clustered, and how to query them properly, is critical.
5.5.1 Date partitioned tables
Many GFW tables are date partitioned. These are single tables where data is stored in invisible partitions. If a table is partitioned, you will see `This is a partitioned table` when viewing the table’s info in BigQuery. Tables are generally partitioned on an actual field in the table (e.g. `date` or `event_start`). You can check the partitioning specification by viewing the table’s details.
Queries against date partitioned tables can filter on the partitioned column using a `WHERE` statement, limiting the amount of data scanned and thus the cost:
SELECT *
FROM `world-fishing-827.pipe_ais_v3_published.messages`
-- Partition filter: restricts the scan to a single day of data
WHERE DATE(timestamp) = "2019-01-01"
5.5.2 Date sharded tables
Unlike date partitioned tables, date sharded tables are actually a collection of tables, one for each date, where the data for a given date is stored in its own table whose name ends in a date string (e.g. `messages_scored_20180101`). When there is data for more than one date, the table name will be followed by a number in parentheses, with the number indicating how many days of data are in the sharded table (e.g. `messages_scored_(3248)`).
There are two main ways to restrict queries of date sharded tables depending on whether you’re querying one or multiple dates:
For a single date, you can specify the table directly using the full table name:
-- Query a single shard directly via its full, dated table name
SELECT *
FROM `world-fishing-827.pipe_ais_v3_internal.messages_scored_20180101`
When querying multiple dates, use the wildcard character (`*`) in the table name and then filter on the `_TABLE_SUFFIX` pseudo column in the `WHERE` statement to specify the date range:
SELECT *
FROM `world-fishing-827.pipe_ais_v3_internal.messages_scored_*`
WHERE _TABLE_SUFFIX BETWEEN '20180101' AND '20180102'
Note: Unlike date partitioned tables, use the `YYYYMMDD` format when filtering date sharded tables. Additionally, when using the `*` wildcard you must surround the table name in backticks.
5.5.3 Clustered tables
BigQuery tables can be clustered on certain fields. Similar to partitioning, clustering groups table data together based on one or more table fields, improving query performance and cost when filtering on the clustered fields. You can check the clustering specification by viewing the table’s details.
Note: Queries that filter on a clustered field may be cheaper than what is estimated by the BigQuery validator.
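As a sketch, a query can combine a partition filter with a filter on a clustered field. Here we assume `ssvid` is a clustered field of the messages table, which you should confirm in the table’s details:

#standardSQL
SELECT *
FROM `world-fishing-827.pipe_ais_v3_published.messages`
WHERE DATE(timestamp) = "2019-01-01" -- partition filter
  AND ssvid = '123456789'            -- clustered-field filter (assumed clustering)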
5.6 Key concepts
BigQuery is a powerful but complex tool, and understanding certain key concepts can help you get the most out of it.
5.6.1 User Defined Functions
User-defined functions (UDFs) are functions created using another SQL expression or JavaScript. These functions accept columns of input, perform actions, and return a value. UDFs can be defined directly at the top of a query or stored as functions in a special type of dataset. GFW maintains a collection of helpful UDFs in the `udfs` dataset in `world-fishing-827`. These functions can be called in any query using the `udfs.function()` syntax.
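For example, a temporary UDF can be defined inline at the top of a query. A minimal sketch (this function is illustrative and not part of the `udfs` dataset):

#standardSQL
-- Temporary SQL UDF: convert speed from knots to km/h (illustrative)
CREATE TEMP FUNCTION knots_to_kmh(speed_knots FLOAT64) AS (
  speed_knots * 1.852
);

SELECT knots_to_kmh(10.0) AS speed_kmh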
5.6.2 STRUCTs and ARRAYs
Two key BigQuery data types used by GFW are `ARRAY`s and `STRUCT`s, which allow for nesting and/or repeating data.
An `ARRAY` is an ordered list of zero or more elements of non-`ARRAY` values (e.g. `STRING`, `INT64`). An `ARRAY` is listed as having a `REPEATED` mode. GFW uses arrays to store multiple types of information, such as the list of identities (e.g. callsigns) broadcast by each MMSI over AIS.
A `STRUCT` is a container of ordered fields, each with a type (required) and field name (optional). A `STRUCT` is listed as having a `RECORD` type. Fields are nested within a `STRUCT`, e.g. `struct_field.nested_field`. In general, GFW uses `STRUCT`s to group together information from different sources. For example, the vessel info tables organize vessel identities from AIS and registries in separate `STRUCT`s.
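To make these types concrete, here is a self-contained sketch (the vessel identities are made up and do not come from a GFW table):

#standardSQL
-- Build a row containing a STRUCT and an ARRAY, then flatten the ARRAY with UNNEST
WITH vessels AS (
  SELECT
    '123456789' AS ssvid,
    STRUCT('EXAMPLE BOAT' AS shipname, 'ABC123' AS callsign) AS registry_info,
    ['ABC123', 'XYZ789'] AS callsigns -- ARRAY<STRING>, shown with REPEATED mode
)
SELECT
  ssvid,
  registry_info.shipname, -- access a nested STRUCT field
  callsign
FROM vessels
LEFT JOIN UNNEST(callsigns) AS callsign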
5.6.3 Spatial BigQuery
BigQuery now supports a suite of powerful spatial analysis tools. These functions use the `GEOGRAPHY` data type, which represents a pointset (points, lines, polygons) on WGS84. Single points can be described by a (longitude, latitude) pair. For describing more complex geographies (lines, polygons), BigQuery supports GeoJSON, well-known text (WKT), and well-known binary (WKB) data formats in the `GEOGRAPHY` column. GFW uses `GEOGRAPHY` data types to represent several kinds of data, including the location of “events” (e.g. encounters, port visits, fishing).
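A small sketch of the `GEOGRAPHY` type in action (the points and polygon are made up):

#standardSQL
-- Count points that fall inside a WKT polygon
WITH points AS (
  SELECT ST_GEOGPOINT(lon, lat) AS geom
  FROM UNNEST([STRUCT(0.5 AS lon, 0.5 AS lat), STRUCT(2.0, 2.0)])
)
SELECT COUNT(*) AS points_inside
FROM points
WHERE ST_CONTAINS(
  ST_GEOGFROMTEXT('POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))'),
  geom
)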
5.6.4 Data definition language
BigQuery supports the use of data definition language (DDL) statements, which allow you to create, alter, and delete tables directly in your query. This is a particularly helpful way to create partitioned and clustered tables, as sketched below.
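A minimal sketch of DDL creating a date partitioned, clustered table (the project, dataset, and field names are hypothetical):

#standardSQL
-- Create a table partitioned by event date and clustered by vessel id
CREATE TABLE `my-project.scratch_tyler.example_events`
PARTITION BY DATE(event_start)
CLUSTER BY ssvid
AS
SELECT
  '123456789' AS ssvid,
  TIMESTAMP('2019-01-01 00:00:00') AS event_start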
5.7 Best Practices
- Always use standard SQL; the first line of your query should be `#standardSQL`
- Write subqueries using `WITH` statements and avoid overly nested queries
- Comment between queries to give space visually
- Put all tables referenced at the top of the query
- When using UDFs, prefer SQL-based UDFs over JavaScript
- For numerical parameters that will be used in multiple places, you can use a UDF at the top of the query that returns the value
- To limit costs when developing or debugging queries:
  - Restrict the date range of partitioned or sharded tables
  - Use a `LIMIT 1000` clause at the end of a query; comment this out when running for real
- For complex queries, consider including a (commented out) test query that does a sanity check (see the sketch below)
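A short sketch pulling several of these practices together (the `speed_knots` field is assumed for illustration):

#standardSQL
-- Parameter UDF: define a reusable numerical parameter once
CREATE TEMP FUNCTION min_speed_knots() AS (0.5);

WITH
-- Tables referenced at the top of the query
messages AS (
  SELECT *
  FROM `world-fishing-827.pipe_ais_v3_published.messages`
  -- Restrict the date range while developing
  WHERE DATE(timestamp) = "2019-01-01"
)

SELECT *
FROM messages
WHERE speed_knots > min_speed_knots() -- assumed field name
-- Comment out the LIMIT when running for real
LIMIT 1000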