GCP — SFMC Data Pipeline pt 1
In this article I will explain a simple architecture pattern I use in GCP to leverage the SFMC APIs, and how to reuse that pattern for various use cases across SFMC and Interaction Studio.
In addition to this, I'll show you how you can use this pattern to create a lightweight, reusable solution to push large amounts of data into any SFMC BU without having to rebuild your pipeline or automations within each BU.
No more 20MB browser limits, Import Activities and SFTP file uploads.
This is perfect for those ad-hoc requests where you need to get file-based data into SFMC and you need it to happen yesterday. It will also work if you need to move data from BigQuery or Cloud Storage into SFMC, and it can become a building block of bigger solutions.
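For the BigQuery case, for example, the hand-off into this pattern can be as simple as exporting a table into the Cloud Storage bucket that triggers the pipeline described below. A minimal sketch, assuming entirely hypothetical project, dataset, table and bucket names:

```python
from google.cloud import bigquery

# Hypothetical names -- swap in your own project, dataset, table and bucket.
client = bigquery.Client(project="my-gcp-project")

# Export the table as a CSV into the "files" bucket; dropping the file there
# is what kicks off the Cloud Function described later in this article.
extract_job = client.extract_table(
    "my-gcp-project.marketing.subscribers_to_load",
    "gs://sfmc-files-bucket/subscribers_to_load.csv",
    job_config=bigquery.ExtractJobConfig(destination_format="CSV"),
)
extract_job.result()  # block until the export completes
```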
Technically speaking this is a simple data pipeline, but really, it’s more of a data sprinkler system.
Architecture
Before I get into the code (my samples are in Python) and platform, let's talk through the architecture pattern that I tend to reuse for a lot of things. It's fairly simple in nature but ultimately very powerful; once created, it can render manual UI activities almost obsolete (see the Considerations section below).
This is a really basic visualisation of the solution:
Solution Ingredients
2x GCP Cloud Storage Buckets
1x Cloud Function (triggered via Cloud Storage)
1x Target DE in SFMC
1x SFMC Connected App with relevant scope
Process
1. Load a JSON file containing the BU-specific client secret, client ID and the unique string section of the SFMC tenant-specific URL into your config bucket (you can control access to this bucket via IAM policies).
The file contents should look like this:
{
  "client_id": "XXXXX",
  "client_secret": "xxxx",
  "base_url": "XXXXXX-XXXXX"
}
2. Load your CSV (my example uses a CSV) or JSON file with the data that you want in your Data Extension into the files bucket.
3. This will invoke the Cloud Function, which will do the following things (a sketch follows this list):
- Read the triggering event from the files bucket
- Retrieve the file that triggered the event (the GCP equivalent of %%FileFromTrigger%% in Automation Studio)
- Convert the file into data in CF memory
- Delete the file from the bucket (we don’t want to store sensitive data here)
- Post to SFMC Auth endpoint and retrieve a token
- Compile the data into batched JSON payloads of x rows each
- Post each batch of data to your SFMC DE of choice via one of a few SFMC DE endpoints
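Here's a minimal sketch of what such a Cloud Function could look like. Everything in it is illustrative: the bucket names, config file name, batch size and target DE key are placeholders you'd swap for your own, and the async data extension rows endpoint is just one of the SFMC DE endpoints you could post to.

```python
import csv
import io
import json
import os

import requests
from google.cloud import storage

# Illustrative values -- replace with your own buckets, config file and DE key.
CONFIG_BUCKET = os.environ.get("CONFIG_BUCKET", "sfmc-config-bucket")
CONFIG_FILE = os.environ.get("CONFIG_FILE", "config.json")
TARGET_DE_KEY = os.environ.get("TARGET_DE_KEY", "My_Target_DE")
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "500"))

storage_client = storage.Client()


def load_to_sfmc(event, context):
    """Background Cloud Function fired when a file lands in the files bucket."""
    # Read the triggering event and retrieve the file that caused it
    bucket = storage_client.bucket(event["bucket"])
    blob = bucket.blob(event["name"])
    contents = blob.download_as_text()

    # Convert the CSV into rows (a list of dicts) in memory
    rows = list(csv.DictReader(io.StringIO(contents)))

    # Delete the file -- we don't want sensitive data sitting in the bucket
    blob.delete()

    # Pull the BU-specific credentials from the config bucket
    config_blob = storage_client.bucket(CONFIG_BUCKET).blob(CONFIG_FILE)
    config = json.loads(config_blob.download_as_text())

    # Exchange the client id/secret for an access token
    auth_resp = requests.post(
        f"https://{config['base_url']}.auth.marketingcloudapis.com/v2/token",
        json={
            "grant_type": "client_credentials",
            "client_id": config["client_id"],
            "client_secret": config["client_secret"],
        },
    )
    auth_resp.raise_for_status()
    token = auth_resp.json()["access_token"]

    # Post the rows to the target DE in batches via the async rows endpoint
    rest_url = (
        f"https://{config['base_url']}.rest.marketingcloudapis.com"
        f"/data/v1/async/dataextensions/key:{TARGET_DE_KEY}/rows"
    )
    headers = {"Authorization": f"Bearer {token}"}
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        resp = requests.post(rest_url, headers=headers, json={"items": batch})
        resp.raise_for_status()
```

Note that the CSV column headers are used as-is as the DE field names, which is what lets the same function serve any target DE without hard-coding a schema.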
Fairly simple right?
Essentially we have just eliminated the need to create an Import Activity, load the file onto the SFMC SFTP site and invoke the activity for each Data Extension that we want to import data into.
Scaling across multiple SFMC accounts
OK, that sounds really cool but….
Q: What if I work across multiple BUs / many different SFMC MIDs, do I have to duplicate this solution every time?!
A: No, you don't. We programmatically determine which MID's credentials to use and which MID to send each request to, rather than hard-coding the values; we also don't need to know the field names of the DE, as the script will handle that for us.
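One way that per-MID routing could work (purely illustrative; the naming scheme below is my own, not something the pattern requires) is to prefix each data file with the target MID and keep one config JSON per MID in the config bucket, so the function picks the right credentials at runtime:

```python
import json

from google.cloud import storage

storage_client = storage.Client()
CONFIG_BUCKET = "sfmc-config-bucket"  # hypothetical config bucket name


def config_for_file(file_name: str) -> dict:
    """Pick the per-MID config using a hypothetical '<MID>__<whatever>.csv' naming scheme."""
    mid = file_name.split("__", 1)[0]
    blob = storage_client.bucket(CONFIG_BUCKET).blob(f"{mid}.json")
    return json.loads(blob.download_as_text())
```

With that in place, dropping 510001234__new_subscribers.csv into the files bucket would send the rows to the BU whose credentials live in 510001234.json, with no change to the function itself.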
Visually it would look something like this:
Ingredients
2x GCP Cloud Storage Buckets
1x Cloud Function
Many SFMC Connected Apps (and config JSON files)
Many SFMC Data Extensions within their respective MIDs
Considerations:
1. SFMC has no “hard” API limits, but you may get shut off or get a call from your AE if you abuse the resources. Therefore, be smart with your requests and break your data into chunks.
2. The REST API I'll use in this solution is asynchronous, and therefore the data in the file should map directly to your target DE; you can build error handling into your Cloud Function, but it will require additional requests.
Alternatively, you can use the sync API, but your target DE will need a primary key and your payload structure will be slightly different (both payload shapes are sketched after these considerations).
3. A CF's max run time is 9 minutes, so you will want to manage your array size to suit. This is a bit of a balancing act: keep your payload under the 2MB HTTPS limit, but also restrict the number of calls you make, which is the slowest part of the function in terms of processing time. The CF has no problem managing large amounts of data in memory.
I've run this with 500–1000 rows per request against a 100-field DE with fields of 100 characters each (a test scenario, of course); a rough batch-sizing helper is also sketched after these considerations.
If you have large data files (a few hundred MB to GBs), then consider the typical FTP approach (you can configure and trigger a request via API too if you need to, and I'll share how you could do that in a later post), OR use GCP App Engine rather than a CF.
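To make considerations 2 and 3 a little more concrete, here's an illustrative comparison of the two payload shapes plus a rough batch-sizing helper. The field names and the helper are made up for the example; the endpoint paths follow the SFMC REST documentation.

```python
import json

# Consideration 2 -- async vs sync payload shapes (illustrative field names).
# Async rows endpoint (rows map straight onto the DE columns, no keys section):
#   POST https://{subdomain}.rest.marketingcloudapis.com/data/v1/async/dataextensions/key:{DE_key}/rows
async_payload = {
    "items": [
        {"SubscriberKey": "0001", "EmailAddress": "a@example.com"},
        {"SubscriberKey": "0002", "EmailAddress": "b@example.com"},
    ]
}

# Sync rowset endpoint (the target DE needs a primary key, which goes under "keys"):
#   POST https://{subdomain}.rest.marketingcloudapis.com/hub/v1/dataevents/key:{DE_key}/rowset
sync_payload = [
    {"keys": {"SubscriberKey": "0001"}, "values": {"EmailAddress": "a@example.com"}},
    {"keys": {"SubscriberKey": "0002"}, "values": {"EmailAddress": "b@example.com"}},
]


# Consideration 3 -- estimate rows per request from one sample row and the 2MB ceiling.
def rows_per_batch(sample_row: dict, max_bytes: int = 2_000_000, headroom: float = 0.8) -> int:
    """Rough rows-per-request estimate based on the JSON size of a representative row."""
    row_bytes = len(json.dumps(sample_row).encode("utf-8"))
    return max(1, int((max_bytes * headroom) // row_bytes))
```

Feed rows_per_batch() one representative row from your file to get a starting batch size, then tune it against your own run times.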
Next time: GCP — SFMC Data Pipeline pt 2
In the next post I’ll share the GCP config and Python code snippets I used to build this solution and some other optional extras.