I've been in the HubSpot ecosystem since 2022, helping businesses architect their tech stacks, data models, and automations, establishing common standards for data exchange between the platforms they use, and building up the confidence of sales, marketing, and customer success teams in the tools available to them.
It's a bit niche, but I love what I do.
Lately, I've taken an interest in niching down even more: I've been helping companies de-silo the platforms they use. Siloing is basically when a system/platform/tool functions on its own, without taking input from, or providing output to, other systems in a business's tech stack. This is bad because without effective data exchange, the teams using these tools are at an information disadvantage and can't make well-informed decisions. Basically, "Knowledge is Power, France is bacon".
Long story short: I need to set up a daily sync of sales intelligence data from Definitive Health Care (DHC) to HubSpot so that sales and marketing teams in HubSpot can make use of it. Currently, the data is refreshed once a quarter (!!!), which means a LOT of the intelligence data is stale or straight up wrong.
Technical Challenges
Every 24 hours, we need to sync 10,000 companies, plus data on the employees at each of those companies. Assuming an average of 10 employees per company, that's 100k contact records (on top of the 10k company records) that need to be synced every 24 hours.
Rate limits are a bitch.
HubSpot has a rate limit of 120 calls/min. DHC doesn't have a defined rate limit (but early testing showed a performance ceiling of 100 calls/min).
Let's take the 120 calls/min rate limit and assume the naive approach of one call per record. At 100k records, that's 100,000 ÷ 120 calls/min ÷ 60 min/hr ≈ 13.89 hours to sync everything. Roughly 14 hours is technically fine, since it fits within the 24-hour window.
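For the record, here's that back-of-envelope math as a snippet (assuming the worst case of one API call per record):

```python
RECORDS = 100_000        # ~10k companies x ~10 contacts each
CALLS_PER_MIN = 120      # HubSpot's rate limit

# Naive approach: one API call per record, no batching
minutes = RECORDS / CALLS_PER_MIN
print(f"{minutes / 60:.2f} hours")   # -> 13.89 hours
```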
But think of what happens to the dev/debug cycle. It takes 14 hours to run a test, discover a bug or an edge case, then implement a fix, then test the implementation for another 14 hours.
Frustrating.
How Can We Make This Faster?
One word: batching.
Companies
DHC luckily has endpoints that let us fetch data in batches of 100 at a time. This makes getting the base Company objects from DHC fast: about 100 calls for 10k companies, which works out to roughly 100 seconds.
Then for each Company, we need to get its financial data, and attach it to the Company. This takes an additional 100 seconds when batched.
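Here's a rough sketch of what the batched company pull could look like. The base URL, endpoint path, auth header, and response shape are placeholders rather than the real DHC API surface; the financials call would follow the same pattern.

```python
import requests

DHC_BASE = "https://api.example-dhc.com"        # placeholder base URL
HEADERS = {"Authorization": "Bearer <token>"}   # placeholder auth
BATCH_SIZE = 100

def fetch_companies(company_ids: list[str]) -> list[dict]:
    """Fetch DHC company records in batches of 100 (hypothetical endpoint)."""
    companies = []
    for i in range(0, len(company_ids), BATCH_SIZE):
        chunk = company_ids[i:i + BATCH_SIZE]
        resp = requests.post(
            f"{DHC_BASE}/companies/batch",   # placeholder path
            json={"ids": chunk},
            headers=HEADERS,
            timeout=30,
        )
        resp.raise_for_status()
        # "results" is an assumed response key
        companies.extend(resp.json()["results"])
    return companies
```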
On the HubSpot side, once we have all the intelligence data for 10k companies, we can batch upsert the records. The batching limit here is 100 items/call. So we'll need 100 calls for 10k companies. We can do this in under 60 seconds (as per 120 rpm limit).
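A minimal sketch of the HubSpot side, using the CRM v3 batch upsert endpoint. The `dhc_id` matching property is an assumption for illustration; whatever unique property the portal actually keys on goes there. At 100 calls total we're well under the rate limit, so there's no throttling here.

```python
import requests

HUBSPOT_TOKEN = "<private-app-token>"   # placeholder
BATCH_SIZE = 100

def upsert_companies(records: list[dict]) -> None:
    """Upsert company records into HubSpot, 100 at a time."""
    url = "https://api.hubapi.com/crm/v3/objects/companies/batch/upsert"
    headers = {"Authorization": f"Bearer {HUBSPOT_TOKEN}"}
    for i in range(0, len(records), BATCH_SIZE):
        chunk = records[i:i + BATCH_SIZE]
        payload = {
            "inputs": [
                {
                    # 'dhc_id' as the matching property is an assumption
                    "idProperty": "dhc_id",
                    "id": r["dhc_id"],
                    "properties": r["properties"],
                }
                for r in chunk
            ]
        }
        resp = requests.post(url, json=payload, headers=headers, timeout=30)
        resp.raise_for_status()
```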
Contacts
This one's a little bit tricky. We need to get contacts at each of the companies in DHC, but we can't batch this. So we'll have to make 10k raw API calls to DHC. At 100 rpm, this will take 100 minutes, or 1:40 hours.
The good part is, while we're getting these contacts, we can stream the responses to HubSpot in parallel (with batching), so we don't have to wait the full 1:40 hours before data starts updating in HubSpot. However, HubSpot's batch endpoints fail the whole batch if even one item in it fails to upsert, so we'll need extra care around error handling: capturing enough information about failed batch upserts to retry them later, or at least to know they happened.
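Here's a rough producer/consumer sketch of that streaming idea. `fetch_dhc_contacts` and `upsert_contact_batch` are stand-ins for the real API helpers, and the throttle is deliberately crude.

```python
import queue
import threading
import time

BATCH_SIZE = 100
DHC_CALLS_PER_MIN = 100

def sync_contacts(company_ids, fetch_dhc_contacts, upsert_contact_batch):
    """Fetch contacts per company from DHC and stream them to HubSpot in parallel.

    fetch_dhc_contacts(company_id) -> list[dict] and upsert_contact_batch(list[dict])
    are hypothetical helpers standing in for the real API calls.
    """
    q: queue.Queue = queue.Queue()
    done = object()   # sentinel marking the end of the stream

    def producer():
        for company_id in company_ids:
            start = time.monotonic()
            for contact in fetch_dhc_contacts(company_id):
                q.put(contact)
            # crude throttle: keep DHC at or below ~100 calls/min
            elapsed = time.monotonic() - start
            time.sleep(max(0, 60 / DHC_CALLS_PER_MIN - elapsed))
        q.put(done)

    def consumer():
        batch = []
        while True:
            item = q.get()
            if item is done:
                break
            batch.append(item)
            if len(batch) == BATCH_SIZE:
                upsert_contact_batch(batch)
                batch = []
        if batch:   # flush the final partial batch
            upsert_contact_batch(batch)

    t_prod = threading.Thread(target=producer)
    t_cons = threading.Thread(target=consumer)
    t_prod.start()
    t_cons.start()
    t_prod.join()
    t_cons.join()
```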
To Summarize
We've brought the total expected run time from ~14 hours, down to:
100 sec + 100 sec + 60 sec + 1:40 hours = 4:20 minutes (nice) + 1:40 hours = 1:44:20 hours.
That's roughly an 8x speedup. I'll take that win.
What now?
Well, I need to implement this. I've tested the batch fetching of DHC data, but I still need to test how well batch upserts work in HubSpot's API. I'm primarily concerned with handling failed batches: where do I store that info, and how do I retry each item in a failed batch so I can isolate the individual objects that fail?
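One option I'm leaning towards, sketched below with hypothetical `upsert_batch` / `upsert_single` helpers: when a batch call fails, append its inputs to a local JSONL file, then retry each item individually so one bad record can't take down the other 99.

```python
import json

FAILED_LOG = "failed_batches.jsonl"

def upsert_with_fallback(batch, upsert_batch, upsert_single):
    """Try a full batch; on failure, log it and retry items one by one."""
    try:
        upsert_batch(batch)
        return
    except Exception as exc:
        # record the whole failed batch so nothing is silently lost
        with open(FAILED_LOG, "a") as f:
            f.write(json.dumps({"error": str(exc), "inputs": batch}) + "\n")

    # retry individually to isolate the specific offending records
    for item in batch:
        try:
            upsert_single(item)
        except Exception as exc:
            with open(FAILED_LOG, "a") as f:
                f.write(json.dumps({"error": str(exc), "input": item}) + "\n")
```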
I might post an update here in the coming days, stay tuned!