How to use a custom query to access your BigQuery data?

Overview

In this tutorial you will learn how to write BigQuery SQL query and then use it as a custom query to create a Data Source in the Google Data Studio.
To understand the following steps it makes sense to gain an understanding of the structure of your data Ingenious provides in BigQuery.

It consists of two types of tables:

Fact tables (operational data)
- Ad Impressions
- Clicks
- Untracked Requests
- Conversions
- etc
Dimension tables (master data)
- Advertisers
- Partners
- Entities
- Ad Spaces
- Conversion-Targets
- Tracking-Categories
- Creatives
- Vouchers
- etc

More about facts and dimensions: Difference between Fact Table and Dimension Table - GeeksforGeeks

Google Big Query has a column-based table model. When querying data, only the columns specified in the "Select" are physically read. We have built our DWH models as "wide" as possible and have already joined a lot of master data where possible. A JOIN combines rows in two tables that have matching IDs into a single row. For this reason, you should never use "Select *", because a row can have a lot of columns, which can result in enormous query volumes. Think about what data you really need for an analysis.

The jointed master data is stored next to the original Id in a struct. We always try to use the same terminology. Here is a short summary of used acronyms.

Table name	Id name	Acronym

Table name	Id name	Acronym
platform_network	networkId	net
partnerships_advertiser	advertiserId	adv
partnerships_partner	partnerId	prt
partnerships_adspace	adspaceId	asp
trail_channel	channelId or trackingChannelId	ch
trail_conversiontarget	conversionTargetId	tc

When creating analyses, you should always provide all relevant columns in a model for graphical reporting tools. This saves you complicated joins in the reporting tools.

We may not have added master data to some models. This can be due to various reasons and you are welcome to bring this to our attention and we will look into the matter. You can always join the master data yourself and here we explain how this works.

Building the Query step by step

For better clarity we will build the Query bottom up using Common Table Expressions (CTE, see https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#with-clause ) which looks like

WITH alias AS (        
  SELECT * FROM first_table
)
SELECT * FROM alias;

This allows us to threat the query like a pipeline and go step by step from simply to more complicated query.

1. Filter the fact table by partitioning field

With the first query you can generally access fact tables. In most cases, fact tables in BigQuery are partitioned to increase query efficiency and reduce costs. Therefore, it is always necessary to specify a restriction in the "WHERE" part of the SQL query. Your fact tables are partitioned by date, i.E. the table finance_conversion is partitioned by the date of tracked timestamp trackedAt.

This query selects all conversions tracked on 1st of January 2020 and displays a 100 rows

SELECT *
FROM `i19s-insights.insights.finance_conversion`
WHERE DATE(trackedAt) BETWEEN '2020-01-01' AND '2020-01-01'
LIMIT 100;

2. Join a dimension to the fact table

This example shows how dimension tables are joined to fact tables. In our data warehouse we have already joined most of the existing dimensions. However, we believe that you should stay with this documentation as it is a good template for learning about data structure and methods of working with nested records.

The table "platform_network" under "net" and the table "platform_advertiser" under "adv" have already been joined to the table "finance_conversion". We assume in the following that this has not yet happened.

WITH
  fact_raw AS (
    SELECT conversionId, networkId
    FROM `i19s-insights.insights.finance_conversion`
    WHERE DATE(trackedAt) BETWEEN '2020-01-01' AND '2020-01-01'
    LIMIT 100
  ),
  net AS (  --> 1
    SELECT networkId, --> 2
      name, logoUrl, CONCAT(name, ' (', networkId, ')') AS label
    FROM `i19s-insights.insights.platform_network`   --> 3
  ),
  fact AS (
    SELECT * --> 4
    FROM fact_raw
  )
SELECT *
FROM fact
JOIN net ON fact.networkId = net.networkId;

(1) An alias is defined for the subquery, this alias can be reused. Here the name is "net".
(2) The fields that are required for later use are selected. You can also use the CONCAT function to display your own field names and mark them as labels. This label is available later in the Data studio.
(3) Specifies the origin of the dimension data. Replace the name of the dataset.
(4) Include only fields you need in report to save costs.

3. Join multiple dimension tables

Now we can join other dimension tables similar to previous section:

WITH
  fact_raw AS (
    SELECT conversionId, networkId, advertiserId , targetId, categoryId 
    FROM `i19s-insights.insights.finance_conversion`
    WHERE DATE(trackedAt) BETWEEN '2020-01-01' AND '2020-01-01'
    LIMIT 100
  ),
  net AS (
    SELECT networkId,
      name, logoUrl, CONCAT(name, ' (', networkId, ')') AS label
    FROM `i19s-insights.insights.platform_network`
  ),
  adv AS (
    SELECT advertiserId, status, name, nameEnglish, websiteUrl, CONCAT(name, ' (',  advertiserId, ')') as label
    FROM `i19s-insights.insights.partnerships_advertiser`
  ),
  ct AS (
    SELECT conversionTargetId , status, label as tech_label, name, CONCAT(name, ' (',  conversionTargetId, ')') as label
    FROM `i19s-insights.insights.trail_conversiontarget` 
  ),
  tcat AS (
    SELECT trackingCategoryId , categoryTitle, status, visibility, alias, CONCAT(categoryTitle, ' (',  trackingCategoryId, ')') as label
    FROM `i19s-insights.insights.trail_trackingcategory`  
  ),
  fact AS (
    SELECT *
    FROM fact_raw
  )
SELECT *
FROM fact
LEFT JOIN net  ON fact.networkId  = net.networkId
LEFT JOIN adv  ON fact.advertiserId = adv.advertiserId 
LEFT JOIN ct   ON fact.targetId   = ct.conversionTargetId 
LEFT JOIN tcat ON fact.categoryId = tcat.trackingCategoryId

4. Prevent naming conflicts and create transparency

As you can easily see, all data is merged on one level. When there are name conflicts, BigQuery appends "_1" to a table name. This does not look very nice. The following query wraps jointed tables in a STRUCT to create more transparency.

WITH
  fact_raw AS (
    SELECT conversionId, networkId, advertiserId , targetId, categoryId 
    FROM `i19s-insights.insights.finance_conversion`
    WHERE DATE(trackedAt) BETWEEN '2020-01-01' AND '2020-01-01'
    LIMIT 100
  ),
  net AS (
    SELECT networkId,
      name, logoUrl, CONCAT(name, ' (', networkId, ')') AS label
    FROM `i19s-insights.insights.platform_network`
  ),
  adv AS (
    SELECT advertiserId, status, name, nameEnglish, websiteUrl, CONCAT(name, ' (',  advertiserId, ')') as label
    FROM `i19s-insights.insights.partnerships_advertiser`
  ),
  ct AS (
    SELECT conversionTargetId , status, label as tech_label, name, CONCAT(name, ' (',  conversionTargetId, ')') as label
    FROM `i19s-insights.insights.trail_conversiontarget` 
  ),
  tcat AS (
    SELECT trackingCategoryId , categoryTitle, status, visibility, alias, CONCAT(categoryTitle, ' (',  trackingCategoryId, ')') as label
    FROM `i19s-insights.insights.trail_trackingcategory`  
  ),
  fact AS (
    SELECT *
    FROM fact_raw
  )
SELECT 
  fact.*,
  ( SELECT AS STRUCT net.* ) AS net,
  ( SELECT AS STRUCT adv.* ) AS adv,
  ( SELECT AS STRUCT ct.* ) AS ct,
  ( SELECT AS STRUCT tcat.* ) AS tcat,
FROM fact 
LEFT JOIN net  ON fact.networkId  = net.networkId
LEFT JOIN adv  ON fact.advertiserId = adv.advertiserId 
LEFT JOIN ct   ON fact.targetId   = ct.conversionTargetId 
LEFT JOIN tcat ON fact.categoryId = tcat.trackingCategoryId

5. Join dimension into REPEATED field of fact table

When we try to join the Ad Space dimension the way we did above, we will find that there is no adspaceId we can use for JOIN on the top level. The adspaceId is part of a nested REPEATED field fact.touchpoint, instead and can’t be addressed directly for the JOIN. In order to add dimension data there, we have to UNNEST the repeated field, enrich it with the master data via JOIN and combine the result into a new repeated ARRAY field. The following query will show how to do it. For simplicity, we omit joining the master data on the top level we did in the previous section.

WITH
  fact_raw AS (
    SELECT conversionId, touchpoint
    FROM `i19s-insights.insights.finance_conversion`
    WHERE DATE(trackedAt) BETWEEN '2020-01-01' AND '2020-01-01'
    LIMIT 100
  ),
  ads AS (
    SELECT adspaceId, name, status, websiteUrl, categoryId, partnerId, channelId, CONCAT(name, ' (',  adspaceId, ')') as label
    FROM `i19s-insights.insights.partnerships_adspace` 
  ),
fact AS ( --> 1
  SELECT * --> 2
    EXCEPT(touchpoint),
    ARRAY( --> 3
      SELECT AS STRUCT unnested_field.*, --> 4
      (SELECT AS STRUCT ads.*) AS ads --> 5
      FROM UNNEST(touchpoint) AS unnested_field --> 6
      JOIN ads on unnested_field.adspaceId = ads.adspaceId --> 7
    ) touchpoint --> 8
  FROM fact_raw
)
SELECT *
FROM fact

(1) An alias is defined for the subquery, this alias can be reused. Here the name is "fact".
(2) Select all data except the array data structure. In this case "touchpoint". It not only makes sense but is recommended to select certain fields.
(3) The new array is built, which contains the enriched touchpoint data.
(4) All fields (or only the required ones) from the original dimension table are selected as STUCT.
(5) Joined masterdata as STRUCT
(6) Unnest unpacks an array into a result set.
(7) The dimension is joined to the fact table using the ID.
(8) The array gets the same name as before. This replaces the original array with an array which is extended by the dimension data.

Left joins not allowed here, causes error: "Query error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN"

6. Extend dimension with additional master data

A deeper look into a part of the Query in 5. shows that the Ad Space dimension contains some id columns without additional data:

SELECT adspaceId, name, status, websiteUrl, categoryId, partnerId, channelId, CONCAT(name, ' (',  adspaceId, ')') as label
FROM `i19s-insights.insights.partnerships_adspace`

We can use the same approach to extend the dimension itself by joining additional dimension tables. The following query joins additional fields to the AdSpace dimension (fact table is omitted for clarity):

Network
Partner
Tracking Channel

This way our dimension is enriched by names, labels etc.

WITH
partner AS (
  SELECT
    partnerId,
    entityId,
    prename,
    name,
    countryCode,
    companyName,
    status
  FROM `i19s-insights.insights.partnerships_partner` 
),
cat AS (
  SELECT
   networkCategoryId,
   label, 
   status,
  FROM `i19s-insights.insights.platform_networkcategory`  
),
tc AS (
  SELECT
    channelId,
    alias,
    type,
    CONCAT( alias, ' (',  channelId, ')') as label
  FROM `i19s-insights.insights.trail_channel`   
)
SELECT
  _ads.adspaceId,
  _ads.name,
  _ads.status,
  _ads.websiteUrl,
  _ads.categoryId,
  _ads.partnerId,
  _ads.channelId,    
  CONCAT(_ads.name, ' (',  _ads.adspaceId, ')') as label,
  ( SELECT AS STRUCT _cat.* )     as cat, 
  ( SELECT AS STRUCT _partner.* ) as partner,
  ( SELECT AS STRUCT _tc.* )      as tc
FROM `i19s-insights.insights.partnerships_adspace`  _ads
JOIN partner _partner ON _ads.partnerId         = _partner.partnerId
JOIN cat     _cat     ON _ads.categoryId        = _cat.networkCategoryId
JOIN tc      _tc      ON _ads.channelId = _tc.channelId;

7. Merging extended dimensions into fact query (putting 5. and 6. together)

The next query selects the facts from the fact data table and adds AdSpace dimension to the REPEATED field, additionally the AdSpace dimension itself is extended by additional information related to the Partner, Category and Tracking channel IDs contained in the AdSpace data:

WITH fact_raw AS (
  SELECT * FROM `i19s-insights.insights.finance_conversion`
  WHERE DATE(trackedAt) between '2020-01-01' and '2020-01-01'
  LIMIT 100
),
partner AS (
  SELECT
    partnerId,
    entityId,
    prename,
    name,
    countryCode,
    companyName,
    status
  FROM `i19s-insights.insights.partnerships_partner` 
),
cat AS (
  SELECT
   networkCategoryId,
   label, 
   status
  FROM `i19s-insights.insights.platform_networkcategory`  
),
tc AS (
  SELECT
    channelId,
    alias,
    type
  FROM `i19s-insights.insights.trail_channel`  
),
ads AS (
  SELECT
    _ads.adspaceId,
    _ads.name,
    _ads.status,
    _ads.websiteUrl,
    _ads.categoryId,
    _ads.partnerId,
    _ads.channelId,    
    CONCAT(_ads.name, ' (',  _ads.adspaceId, ')') as label,
    ( SELECT AS STRUCT _cat.* )     as cat, 
    ( SELECT AS STRUCT _partner.* ) as partner,
    ( SELECT AS STRUCT _tc.* )      as tc
  FROM `i19s-insights.insights.partnerships_adspace`  _ads
  JOIN partner _partner ON _ads.partnerId         = _partner.partnerId
  JOIN cat     _cat     ON _ads.categoryId        = _cat.networkCategoryId
  JOIN tc      _tc      ON _ads.channelId = _tc.channelId
),
fact AS (
  SELECT 
    * EXCEPT(touchpoint),
    ARRAY(
      SELECT AS STRUCT tp.*,
      (SELECT AS STRUCT ads.*) AS ads
      FROM UNNEST(touchpoint) AS tp
      join ads on tp.adspaceId = ads.adspaceId
    ) touchpoint
  FROM fact_raw
)
SELECT * FROM fact;

Similarly, other REPEATED fields (or STRUCTS) in the fact table or dimension tables can be enriched by dimension data using the same approach.

8. Putting it all together (7 plus top level dimensions)

Now we add the dimensions related to the columns in the fact table as shown in “3. Join multiple dimension tables” and get our whole query

WITH fact_raw AS (
  SELECT * FROM `i19s-insights.insights.finance_conversion`
  WHERE DATE(trackedAt) between '2020-01-01' and '2020-01-01' -- YYYY-MM-DD
  LIMIT 100
),
net AS (
  SELECT
    networkId,
    name,
    logoUrl,
    CONCAT(name, ' (',  networkId, ')') as label
  FROM `i19s-insights.insights.platform_network`
),
adv AS (
  SELECT
    advertiserId,
    status,
    name,
    nameEnglish,
    websiteUrl,
    CONCAT(name, ' (',  advertiserId, ')') as label
  FROM `i19s-insights.insights.partnerships_advertiser`
),
ct AS (
  SELECT
    conversionTargetId,
    status,
    label as tech_label,
    name,
    CONCAT(name, ' (',  conversionTargetId, ')') as label
  FROM `i19s-insights.insights.trail_conversiontarget`
),
tcat AS (
  SELECT
    trackingCategoryId,
    categoryTitle,
    status,
    visibility,
    alias,
    CONCAT(categoryTitle, ' (',  trackingCategoryId, ')') as label
  FROM `i19s-insights.insights.trail_trackingcategory`
),
partner AS (
  SELECT
    partnerId,
    entityId,
    prename,
    name,
    countryCode,
    companyName,
    status
  FROM `i19s-insights.insights.partnerships_partner`
),
cat AS (
  SELECT
   networkCategoryId,
   label, 
   status
  FROM `i19s-insights.insights.platform_networkcategory`
),
tc AS (
  SELECT
    channelId,
    alias,
    type
  FROM `i19s-insights.insights.trail_channel`
),
ads AS (
  SELECT
    _ads.adspaceId,
    _ads.name,
    _ads.status,
    _ads.websiteUrl,
    _ads.categoryId,
    _ads.partnerId,
    _ads.channelId,    
    CONCAT(_ads.name, ' (',  _ads.adspaceId, ')') as label,
    ( SELECT AS STRUCT _cat.* )     as cat, 
    ( SELECT AS STRUCT _partner.* ) as partner,
    ( SELECT AS STRUCT _tc.* )      as tc
  FROM `i19s-insights.insights.partnerships_adspace` _ads
  JOIN partner _partner ON _ads.partnerId         = _partner.partnerId
  JOIN cat     _cat     ON _ads.categoryId        = _cat.networkCategoryId
  JOIN tc      _tc      ON _ads.channelId         = _tc.channelId
),
fact AS (
  SELECT 
    * EXCEPT(touchpoint),
    ARRAY(
      SELECT AS STRUCT tp.*,
      (SELECT AS STRUCT ads.*) AS ads
      FROM UNNEST(touchpoint) AS tp
      join ads on tp.adspaceId = ads.adspaceId
    ) touchpoint
  FROM fact_raw
)
SELECT 
  fact.*
  ,( SELECT AS STRUCT net.*  ) AS net
  ,( SELECT AS STRUCT adv.*  ) AS adv
  ,( SELECT AS STRUCT ct.*   ) AS ct
  ,( SELECT AS STRUCT tcat.* ) AS tcat
FROM fact
LEFT JOIN net  ON fact.networkId  = net.networkId
LEFT JOIN adv  ON fact.advertiserId    = adv.advertiserId
LEFT JOIN ct   ON fact.targetId   = ct.conversionTargetId
LEFT JOIN tcat ON fact.categoryId = tcat.trackingCategoryId;

i19s Partner Marketing Knowledge