In this tutorial you will learn how to write BigQuery SQL query and then use it as a custom query to create a Data Source in the Google Data Studio.
To understand the following steps it makes sense to gain an understanding of the structure of your data Ingenious provides in BigQuery.
It consists of two types of tables:
Fact tables (operational data)
Ad Impressions
Clicks
Untracked Requests
Conversions
Dimension tables (master data)
Advertisers
Partners
Entities
Ad Spaces
Conversion-Targets
Tracking-Categories
Creatives
Vouchers
More about facts and dimensions: https://www.geeksforgeeks.org/difference-between-fact-table-and-dimension-table/
The data in the fact tables are clearly structured and have a defined scheme. But it doesn’t contain much information about the master data like Advertiser’s name or URL. Instead it contains only IDs of master data. In order to present the Advertiser with its label instead of ID, it is necessary to to combine (JOIN) operational data with master data. During a JOIN, rows in two tables which have matching IDs are combined in a single row. Several reporting tools can do this immediately in the Report, but this feature is not available in most free tools. In example, Data Studio can combine (BLEND) only data sources with the same set of IDs and filters, which doesn’t allow to join at the same time multiple master data tables, i.E. Advertisers using advertiserId and Conversion Targets using targetId.
For better clarity we will build the Query bottom up using Common Table Expressions (CTE, see https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#with-clause) which looks like
WITH alias AS ( SELECT * FROM first_table ) SELECT * FROM alias; |
This allows us to threat the query like a pipeline and go step by step from simply to more complicated query.
With the first query you can generally access fact tables. In most cases, fact tables in BigQuery are partitioned to increase query efficiency and reduce costs. Therefore, it is always necessary to specify a restriction in the "WHERE" part of the SQL query. Your fact tables are partitioned by date, i.E. the table finance_conversion is partitioned by the date of tracked timestamp trackedAt.
This query selects all conversions tracked on 1st of January 2020 and displays a 100 rows
SELECT * FROM `ingenious-169318.insights_999.finance_conversion` WHERE DATE(trackedAt) between '2020-01-01' and '2020-01-01' -- YYYY-MM-DD LIMIT 100; |
In order to test this and following queries in a BigQuery console, replace insights_999 by the name of dataset containing your data
It makes no sense to look at fact tables only. It is necessary to add more dimensions to the fact tables.
WITH fact_raw AS ( SELECT * FROM `ingenious-169318.insights_999.finance_conversion` WHERE DATE(trackedAt) between '2020-01-01' and '2020-01-01' -- YYYY-MM-DD LIMIT 100 ), net AS ( --> 1 SELECT networkId, --> 2 name, logoUrl, CONCAT(name, ' (', networkId, ')') as label FROM `ingenious-169318.insights_999.network` --> 3 ), fact AS ( SELECT * --> 4 FROM fact_raw ) SELECT * FROM fact JOIN net ON fact.networkId = net.networkId; |
(1) An alias is defined for the subquery, this alias can be reused. Here the name is "net".
(2) The fields that are required for later use are selected. You can also use the CONCAT function to display your own field names and mark them as labels. This label is available later in the Data studio.
(3) Specifies the origin of the dimension data. Replace the name of the dataset.
(4) Include only fields you need in report to save costs.
The query above is not runnable in BigQuery due to column naming conflicts: both fact and dimension tables have a column networkId so that you will get the error in big query “Duplicate column names in the result are not supported. Found duplicate(s): networkId“
The following query fixes this problem by wrapping the columns of the dimension table net in a STRUCT:
WITH fact_raw AS ( SELECT * FROM `ingenious-169318.insights_999.finance_conversion` WHERE DATE(trackedAt) between '2020-01-01' and '2020-01-01' -- YYYY-MM-DD LIMIT 100 ), net AS ( SELECT networkId, name, logoUrl, CONCAT(name, ' (', networkId, ')') as label FROM `ingenious-169318.insights_999.network` ), fact AS ( SELECT * FROM fact_raw ) SELECT fact.* ,( SELECT AS STRUCT net.* ) AS net --> Pay attention here FROM fact LEFT JOIN net ON fact.networkId = net.networkId; |
This query has no conflicts since the networkId form the fact table can be addressed directly, and the networkId from the network dimension table is encapsulated and accessible as net.networkId in the query result.
Now we can join other dimension tables similar to previous section:
WITH fact_raw AS ( SELECT * FROM `ingenious-169318.insights_999.finance_conversion` WHERE DATE(trackedAt) between '2020-01-01' and '2020-01-01' -- YYYY-MM-DD LIMIT 100 ), net AS ( SELECT networkId, name, logoUrl, CONCAT(name, ' (', networkId, ')') as label FROM `ingenious-169318.insights_999.network` ), adv AS ( SELECT id, status, name, nameEnglish, websiteUrl, CONCAT(name, ' (', id, ')') as label FROM `ingenious-169318.insights_999.advertiser` ), ct AS ( SELECT id, status, label as tech_label, name, CONCAT(name, ' (', id, ')') as label FROM `ingenious-169318.insights_999.conversiontarget` ), tcat AS ( SELECT id, categoryTitle, status, visibility, alias, CONCAT(categoryTitle, ' (', id, ')') as label FROM `ingenious-169318.insights_999.trackingcategory` ), fact AS ( SELECT * --> 4 FROM fact_raw ) SELECT fact.* ,( SELECT AS STRUCT net.* ) AS net ,( SELECT AS STRUCT adv.* ) AS adv ,( SELECT AS STRUCT ct.* ) AS ct ,( SELECT AS STRUCT tcat.* ) AS tcat FROM fact LEFT JOIN net ON fact.networkId = net.networkId LEFT JOIN adv ON fact.ownerId = adv.id LEFT JOIN ct ON fact.targetId = ct.id LEFT JOIN tcat ON fact.categoryId = tcat.id; |
When we try to join the Ad Space dimension the way we did above, we will find that there is no adspaceId we can use for JOIN on the top level. The adspaceId is part of a nested REPEATED field fact.touchpoint, instead and can’t be addressed directly for the JOIN. In order to add dimension data there, we have to UNNEST the repeated field, enrich it with the master data via JOIN and combine the result into a new repeated ARRAY field. The following query will show how to do it. For simplicity, we omit joining the master data on the top level we did in the previous section.
ads AS ( SELECT id, name, status, websiteUrl, categoryId, partnerId, trackingChannelId, CONCAT(name, ' (', id, ')') as label FROM `ingenious-169318.insights_999.adspace` ), fact AS ( --> 1 SELECT * --> 2 EXCEPT(touchpoint), ARRAY( --> 3 SELECT AS STRUCT unnested_field.*, --> 4 (SELECT AS STRUCT ads.*) AS ads --> 5 FROM UNNEST(touchpoint) AS unnested_field --> 6 join ads on unnested_field.adspaceId = ads.id --> 7 ) touchpoint --> 8 FROM fact_raw ); |
(1) An alias is defined for the subquery, this alias can be reused. Here the name is "fact".
(2) Select all data except the array data structure. In this case "touchpoint". It not only makes sense but is recommended to select certain fields.
(3) The new array is built, which contains the enriched touchpoint data.
(4) All fields (or only the required ones) from the original dimension table are selected as STUCT.
(5) Joined masterdata as STRUCT
(6) Unnest unpacks an array into a result set.
(7) The dimension is joined to the fact table using the ID.
(8) The array gets the same name as before. This replaces the original array with an array which is extended by the dimension data.
Left joins not allowed here, causes error: "Query error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN"
A deeper look into a part of the Query in 4. shows that the Ad Space dimension contains some id columns without additional data:
... SELECT id, name, status, websiteUrl, categoryId, -- Id of network category partnerId, -- partner Id trackingChannelId, -- tracking channel id CONCAT(name, ' (', id, ')') as label FROM `ingenious-169318.insights_999.adspace` |
We can use the same approach to extend the dimension itself by joining additional dimension tables. The following query joins additional fields to the AdSpace dimension (fact table is omitted for clarity):
Network
Partner
Tracking Channel
This way our dimension is enriched by names, labels etc.
WITH partner AS ( SELECT id, entityId, prename, name, countryCode, companyName, status FROM `ingenious-169318.insights_999.partner` ), cat AS ( SELECT networkCategoryId, label, status, FROM `ingenious-169318.insights_999.networkcategory` ), tc AS ( SELECT trackingChannelId, name, type, CONCAT( name, ' (', trackingChannelId, ')') as label FROM `ingenious-169318.insights_999.trackingchannel` ) SELECT _ads.id, _ads.name, _ads.status, _ads.websiteUrl, _ads.categoryId, _ads.partnerId, _ads.trackingChannelId, CONCAT(_ads.name, ' (', _ads.id, ')') as label, ( SELECT AS STRUCT _cat.* ) as cat, ( SELECT AS STRUCT _partner.* ) as partner, ( SELECT AS STRUCT _tc.* ) as tc FROM `ingenious-169318.insights_999.adspace` _ads JOIN partner _partner ON _ads.partnerId = _partner.id JOIN cat _cat ON _ads.categoryId = _cat.networkCategoryId JOIN tc _tc ON _ads.trackingChannelId = _tc.trackingChannelId; |
The next query selects the facts from the fact data table and adds AdSpace dimension to the REPEATED field, additionally the AdSpace dimension itself is extended by additional information related to the Partner, Category and Tracking channel IDs contained in the AdSpace data:
WITH fact_raw AS ( SELECT * FROM `ingenious-169318.insights_999.finance_conversion` WHERE DATE(trackedAt) between '2020-01-01' and '2020-01-01' -- YYYY-MM-DD LIMIT 100 ), partner AS ( SELECT id, entityId, prename, name, countryCode, companyName, status FROM `ingenious-169318.insights_999.partner` ), cat AS ( SELECT networkCategoryId, label, status FROM `ingenious-169318.insights_999.networkcategory` ), tc AS ( SELECT trackingChannelId, name, type FROM `ingenious-169318.insights_999.trackingchannel` ), ads AS ( SELECT _ads.id, _ads.name, _ads.status, _ads.websiteUrl, _ads.categoryId, _ads.partnerId, _ads.trackingChannelId, CONCAT(_ads.name, ' (', _ads.id, ')') as label, ( SELECT AS STRUCT _cat.* ) as cat, ( SELECT AS STRUCT _partner.* ) as partner, ( SELECT AS STRUCT _tc.* ) as tc FROM `ingenious-169318.insights_999.adspace` _ads JOIN partner _partner ON _ads.partnerId = _partner.id JOIN cat _cat ON _ads.categoryId = _cat.networkCategoryId JOIN tc _tc ON _ads.trackingChannelId = _tc.trackingChannelId ), fact AS ( SELECT * EXCEPT(touchpoint), ARRAY( SELECT AS STRUCT tp.*, (SELECT AS STRUCT ads.*) AS ads FROM UNNEST(touchpoint) AS tp join ads on tp.adspaceId = ads.id ) touchpoint FROM fact_raw ) SELECT * FROM fact; |
Similarly, other REPEATED fields (or STRUCTS) in the fact table or dimension tables can be enriched by dimension data using the same approach.
Now we add the dimensions related to the columns in the fact table as shown in “3. Join multiple dimension tables” and get our whole query
WITH fact_raw AS ( SELECT * FROM `ingenious-169318.insights_999.finance_conversion` WHERE DATE(trackedAt) between '2020-01-01' and '2020-01-01' -- YYYY-MM-DD LIMIT 100 ), net AS ( SELECT networkId, name, logoUrl, CONCAT(name, ' (', networkId, ')') as label FROM `ingenious-169318.insights_999.network` ), adv AS ( SELECT id, status, name, nameEnglish, websiteUrl, CONCAT(name, ' (', id, ')') as label FROM `ingenious-169318.insights_999.advertiser` ), ct AS ( SELECT id, status, label as tech_label, name, CONCAT(name, ' (', id, ')') as label FROM `ingenious-169318.insights_999.conversiontarget` ), tcat AS ( SELECT id, categoryTitle, status, visibility, alias, CONCAT(categoryTitle, ' (', id, ')') as label FROM `ingenious-169318.insights_999.trackingcategory` ), partner AS ( SELECT id, entityId, prename, name, countryCode, companyName, status FROM `ingenious-169318.insights_999.partner` ), cat AS ( SELECT networkCategoryId, label, status FROM `ingenious-169318.insights_999.networkcategory` ), tc AS ( SELECT trackingChannelId, name, type FROM `ingenious-169318.insights_999.trackingchannel` ), ads AS ( SELECT _ads.id, _ads.name, _ads.status, _ads.websiteUrl, _ads.categoryId, _ads.partnerId, _ads.trackingChannelId, CONCAT(_ads.name, ' (', _ads.id, ')') as label, ( SELECT AS STRUCT _cat.* ) as cat, ( SELECT AS STRUCT _partner.* ) as partner, ( SELECT AS STRUCT _tc.* ) as tc FROM `ingenious-169318.insights_999.adspace` _ads JOIN partner _partner ON _ads.partnerId = _partner.id JOIN cat _cat ON _ads.categoryId = _cat.networkCategoryId JOIN tc _tc ON _ads.trackingChannelId = _tc.trackingChannelId ), fact AS ( SELECT * EXCEPT(touchpoint), ARRAY( SELECT AS STRUCT tp.*, (SELECT AS STRUCT ads.*) AS ads FROM UNNEST(touchpoint) AS tp join ads on tp.adspaceId = ads.id ) touchpoint FROM fact_raw ) SELECT fact.* ,( SELECT AS STRUCT net.* ) AS net ,( SELECT AS STRUCT adv.* ) AS adv ,( SELECT AS STRUCT ct.* ) AS ct ,( SELECT AS STRUCT tcat.* ) AS tcat FROM fact LEFT JOIN net ON fact.networkId = net.networkId LEFT JOIN adv ON fact.ownerId = adv.id LEFT JOIN ct ON fact.targetId = ct.id LEFT JOIN tcat ON fact.categoryId = tcat.id; |