Simplifying data warehousing for marketing analytics using GCP Big Query
For many SFMC users like myself, dealing with reporting and data analysis is painful at the best of times. We’re typically constrained to the platforms at hand and their capabilities, which a lot of the time don’t actually allow us to solve our immediate problems with ease.
In this article I’ll share how you can build a simple data warehouse on GCP Big Query that will allow you to write complex, fast processing SQL queries and quick data visualisations in a fraction of the time using a simple ELT process to transfer and transform your data from SFMC.
If you have GA360 your web data will be available in BQ via the GA360-BQ integration. There will likely be a custom dimension in your GA dataset linking SFMC Subscriber Id to your GA client Ids enabling you to link online behaviour to your SFMC data sets.
Is it ELT or ETL?
In the Martech industry we tend to get thrown a lot of acronyms and in some cases they don’t actually reflect the task or piece of technology correctly.
ETL = Extract, Transform, Load
Eg. Extract data from a source system, Transform it in a staging location or in transit and Load it into the target system.
ELT = Extract, Load, Transform
Eg. Extract data from a source system, Load the data into your platform and Transform it there.
There is no better or worse option for data processing and it will typically come down to other factors to determine your approach and architecture. For this example ELT is perfect as we can use SQL to transform the data rather than having to build, deploy, manage services to transform our data in transit (Dataflow, Apache beam, Dataproc etc).
There is also awesome vendors out there that have productised ETL solutions rather than building something from scratch. These solutions usually have predefined templates and scheduling for easy use.
This may seem fairly complex but it’s actually fairly UI driven only requires 1 Cloud Function (Python script below) and your existing SQL knowledge.
I have broken the GCP component into 3 projects which will allow you to control user level access to the data in its varying states, this may not be necessary for 100% of users.However, this is the typical and best practice approach.
- SFMC Automation runs to extract relevant data from DEs, convert into csv files (Data Extracts), transfer the files to SFMC’s SFTP directory (File Transfers).
- Cloud Function runs to copy SFMC SFTP folder content to GCP Cloud Storage
- BigQuery Transfer service imports data into BQ staging tables (Bronze Project)
** — these fields would all be string format, you will need to read from these to reformat to the desired field types later on
— 1 transfer per table.
- Scheduled Big Query SQL Queries read from the Bronze Project and write clean, reformatted data into the Silver project tables
- Scheduled Big Query SQL Queries read from the Silver Project and write de-normalised, aggregated data into the Gold project (may not be necessary depending on who’s using the solution).
- You can trigger cloud functions on schedule(Cloud scheduler + Pub/Sub) or via HTTPS so you may like to create a SSJS Activity to call the cloud function at the end of your automation.
- You will need to create a dedicated folder in your SFMC SFTP for this and move files in and out once processed (or overwrite the files)
Cloud Function to copy files from SFMC SFTP
This CF will copy the contents of the SFMC FTP Folder into your Cloud Storage Bucket of choice. It can be enhanced to unzip files, manage SSH Keys and leverage GCP Keys rather than hard coded values.
**CF’s can run for up to 9mins and have a max memory of 8GB
Official CF documentation can be found here.
Big Query Transfers
Big Query Transfers are a great UI driven way to schedule file imports into BQ from various sources including Cloud Storage, AWS S3,AWS Redshift, Teradata , Google Ad platforms etc.
This service will move dedicated files from your GCS Bucket into a dedicated BQ Table (Bronze Project). This would be your raw data in string form.
Offical documentation can be found here.
Scheduled queries in BQ will allow us to schedule data cleansing SQL queries that may overwrite or append to your table of choice.
**Note — BQ Supports table partitioning & Clustering (this will help speed up your queries and limit the amount of data scanned).
Official documentation can be found here.
If you need to create a dashboard or a quick chart based on a table you have just written to via SQL you can easily link a data studio or just click “Explore Data” which will export to a quick view of data studio.
I hope this shed some light and de-mystified how you might build a marketer friendly data warehousing solution.
There are various methods to getting data into BQ, real time streaming using Dataflow, BQ APIs, batch files, ETL products etc.
Ultimately your solution will come down to your data requirements, effort and investment.
GCP is a PAYG set of products so be mindful of how you use and design your solutions. There is a handy calculator and billing console that will help to estimate your monthly costs in addition to providing threshold notifications.
See GCP Pricing calculator here.