25 HOW TO - Do point-in-polygon test in BigQuery (BQ) using regions
Owner: Hannah Linder Last edited time: 23 de mayo de 2024 2:14 Created time: 1 de marzo de 2024 13:10
[WORK IN PROGRESS]
This guide describes how to use the regions tables that are present in bigquery to do ad-hoc analysis to determine which of a set of lat/lon values are inside the set of regions.
25.0.1 Architecture
We use sets of polygons in the data pipeline to determine when events (fishing events, encounter events, etc) are located inside pre-determined polygons, such as MPAs, EEZs and the like. These sets of polygons are called “regions” and all the code that is used to create and use them is in the pipe-regions github repo.
Each layer is loaded into bigquery in a separate layer. The production dataset is world-fishing-827.pipe_regions_layers
. For example, the WDPA layer containing all the marine MPAs is in world-fishing-827.pipe_regions_layers.WDPA_Mar_2022_Marine
. All the associated fields from the original source are present in that table, so if you want to filter to a subset you could do something like WHERE status="Proposed"
.
BTW documentation for all these layers is right here in notion.
25.0.2 Adding Regions
Let’s say you have a big query table full of records with positions (lat, lon) and for each position record you want to get all of the region polygons that contain the position and put that in a new big query table
The following query will give you a new table with all the same rows as the source_table, with one additional field called regions which is a JSON string with the ids of the containing regions.
CREATE TEMP FUNCTION s2_level() AS (10);
WITH
as (
positions SELECT
*,
FROM `{ source_table }`
),as (
regions SELECT
*,
FROM `{ regions_table }`
),as (
gridded_positions SELECT
*,
as s2_cell
S2_CELLIDFROMPOINT(geo, s2_level()) FROM positions
),as (
gridded_regions SELECT
* except(s2_cells),
FROM regions
CROSS JOIN UNNEST (s2_cells) as s2_cell
),as (
create_position_region_matches SELECT
id, layer,
msgid, FROM gridded_positions p
JOIN gridded_regions r
ON p.s2_cell = r.s2_cell
WHERE ST_INTERSECTS(p.geo, r.geo)
),as (
position_by_layer_by_id SELECT
layer, id
msgid, FROM create_position_region_matches
GROUP BY msgid, layer, id
),as (
position_by_layer SELECT
msgid, CONCAT('"', layer, '":' ,TO_JSON_STRING(array_agg(id))) as json_fragment
FROM position_by_layer_by_id
GROUP BY msgid, layer
),as (
position_with_region
SELECRT
msgid,CONCAT( "{", STRING_AGG(json_fragment, ","), "}")) as regions
PARSE_JSON(FROM position_by_layer
GROUP BY msgid
),as (
full_position_with_region SELECT
*
FROM positions
LEFT JOIN position_with_region
USING (msgid)
)
SELECT * FROM full_position_with_region