Our customers use Supaglue to sync up to tens of millions of records from their customers’ CRMs to their own databases. The CRM data is often used to power AI and other user-facing product experiences, so it’s critical that the syncs happen quickly and reliably.
In this post, we discuss the initial implementation of our Salesforce-to-Postgres pipeline and how we significantly improved its speed through various performance optimizations.
In the early days of Supaglue, we wanted to ship a working end-to-end product as quickly as possible. To move fast, the initial Salesforce → Postgres sync implementation was quite naive. It looked something like this:
1. Issue a request to Salesforce’s REST API (which uses SOQL, Salesforce’s SQL-like query language) for all contacts.
2. The response returns one page of records along with a cursor for the next page of results. We map the records into a call to our Postgres ORM, Prisma, which generates a SQL statement to batch-upsert those records.
3. Repeat steps 1-2 for all pages.
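A sketch of this loop looks something like the following. The SOQL query, API version, and the Prisma model in the comment are illustrative placeholders, not our actual schema:

```typescript
// Sketch of the naive paged sync. Instance URL, access token, API version,
// and the Prisma model name are illustrative placeholders.
const SOQL = "SELECT Id, FirstName, LastName, Email FROM Contact";

// Build the REST query URL for the first page of results.
function queryUrl(instanceUrl: string, soql: string): string {
  return `${instanceUrl}/services/data/v57.0/query?q=${encodeURIComponent(soql)}`;
}

async function naiveSync(instanceUrl: string, accessToken: string): Promise<void> {
  let url: string | undefined = queryUrl(instanceUrl, SOQL);
  while (url) {
    const res = await fetch(url, {
      headers: { Authorization: `Bearer ${accessToken}` },
    });
    const page = (await res.json()) as {
      records: Record<string, unknown>[];
      nextRecordsUrl?: string;
    };
    // In the real implementation, this was a batched upsert through Prisma,
    // e.g. something along the lines of:
    // await prisma.contact.createMany({ data: mapRecords(page.records), skipDuplicates: true });

    // Salesforce returns a relative nextRecordsUrl (the cursor) until the last page.
    url = page.nextRecordsUrl ? `${instanceUrl}${page.nextRecordsUrl}` : undefined;
  }
}
```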
While this pattern is very common and easy to generalize to other CRM providers, it took too long to sync data for customers with millions of CRM records.
Salesforce’s Bulk API 2.0 is the recommended way to query large amounts of data. Bulk API 2.0 enables asynchronous processing of SOQL queries, breaking the work into several phases:
1. First, submit a query job, which returns a jobId.
2. Then, repeatedly poll for the status of the query job.
3. When it’s complete, you can issue requests to download the records, which Salesforce divides into chunks (you can think of them as pages) whose sizes Salesforce decides. The response’s Sforce-Locator header holds the cursor for the next chunk, and the records are in the response body.
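These phases can be sketched as follows. The endpoints follow Bulk API 2.0’s documented shape; the API version and polling interval are illustrative:

```typescript
// Sketch of driving a Bulk API 2.0 query job.
function createJobBody(soql: string): string {
  return JSON.stringify({ operation: "query", query: soql });
}

async function runBulkQuery(instanceUrl: string, token: string, soql: string): Promise<string> {
  const base = `${instanceUrl}/services/data/v57.0/jobs/query`;
  const headers = { Authorization: `Bearer ${token}`, "Content-Type": "application/json" };

  // 1. Submit the query job; the response contains the job id.
  const created = await fetch(base, { method: "POST", headers, body: createJobBody(soql) });
  const { id: jobId } = (await created.json()) as { id: string };

  // 2. Repeatedly poll the job status until Salesforce reports it complete.
  for (;;) {
    const status = await fetch(`${base}/${jobId}`, { headers });
    const { state } = (await status.json()) as { state: string };
    if (state === "JobComplete") break;
    if (state === "Failed" || state === "Aborted") throw new Error(`Bulk query job ${state}`);
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }

  // 3. Chunks can now be downloaded from `${base}/${jobId}/results`,
  //    following the Sforce-Locator header from chunk to chunk.
  return jobId;
}
```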
The chunks of data were often very large (we regularly saw 200k+ records per chunk), so we would split each chunk into smaller batches to upsert into Postgres using Prisma.
While this was more efficient than the naive paging solution, it still had issues. In particular, there’s no reason to wait until an entire chunk of data is read into memory from Bulk API 2.0 before starting the work of upserting records in batches to Postgres. So, we decided to stream the data from Salesforce into Postgres. The high-level flow looked something like this:
1. Stream the response body of a Bulk API 2.0 results request using native fetch.
2. Parse the CSV response and map each record to our common schema with a Transform.
3. Re-serialize the records to CSV and COPY them into a temporary table.
4. Upsert from the temporary table into the main table.
Let’s explore these steps in detail.
In our TypeScript/Node.js codebase, we had been using axios to issue requests to Salesforce. Unfortunately, axios does not support consuming the response body as a stream, so we had to move to the native fetch API instead. Here’s some sample code for how to do that:
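This sketch assumes Node 18+ (for global fetch and Readable.fromWeb); the results URL and token are placeholders:

```typescript
import { Readable } from "node:stream";

// Convert the WHATWG stream returned by native fetch into a Node.js Readable
// so it can feed a stream.pipeline(). The cast bridges the DOM ReadableStream
// type and Node's web-stream type.
function toNodeStream(body: ReadableStream<Uint8Array>): Readable {
  return Readable.fromWeb(body as any);
}

async function fetchChunkStream(resultsUrl: string, token: string): Promise<Readable> {
  const res = await fetch(resultsUrl, {
    headers: { Authorization: `Bearer ${token}`, Accept: "text/csv" },
  });
  if (!res.ok || !res.body) {
    throw new Error(`Salesforce responded with ${res.status}`);
  }
  // res.body is consumed incrementally: downstream consumers see records as
  // they arrive instead of waiting for the whole chunk to buffer in memory.
  return toNodeStream(res.body);
}
```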
There’s plenty of nuance and complexity to discuss with streams in Node.js, but that should be saved for another post.
To map to our common schema, we just chain another Transform onto the pipeline, like this:
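For example, assuming the upstream CSV parser emits one object per Salesforce row (the common-schema field names here are illustrative):

```typescript
import { Transform } from "node:stream";

// Hypothetical mapping step from raw Salesforce field names to our common
// schema. Runs in objectMode so it receives one parsed record at a time.
function mapToCommonSchema(): Transform {
  return new Transform({
    objectMode: true,
    transform(record: Record<string, string>, _encoding, callback) {
      callback(null, {
        id: record.Id,
        first_name: record.FirstName ?? null,
        last_name: record.LastName ?? null,
        email: record.Email ?? null,
      });
    },
  });
}
```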
We need to map this back into CSV format to make use of Postgres’ COPY command (we’ll talk about it next). We found a nice CSV stringifier with a streaming API, which takes the incoming Salesforce records and formats them for Postgres’ COPY:
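Below is a minimal hand-rolled stand-in showing what such a streaming stringifier does (in production a library such as csv-stringify handles this robustly). The column list is illustrative and must match the column order in the COPY statement:

```typescript
import { Transform } from "node:stream";

// Illustrative column order; must match the COPY statement's column list.
const COLUMNS = ["id", "first_name", "last_name", "email"];

// Quote fields containing commas, quotes, or newlines; double embedded quotes.
function csvField(value: unknown): string {
  const s = value == null ? "" : String(value);
  return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Takes mapped record objects and emits one CSV line per record.
function toCsvLines(): Transform {
  return new Transform({
    writableObjectMode: true,
    transform(record: Record<string, unknown>, _encoding, callback) {
      callback(null, COLUMNS.map((c) => csvField(record[c])).join(",") + "\n");
    },
  });
}
```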
The recommended way to ingest large amounts of data into Postgres is the COPY command. Postgres can COPY data from stdin or a file in a variety of formats, including CSV. We cannot copy directly into the main table, however, since we only want the latest version of each record, not one row per version. Therefore, we stream our data into a temporary table with a schema identical to that of the main table, using the pg and pg-copy-streams Node.js modules. Temporary tables are convenient here because they are only visible to the current session.
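Concretely, each sync session might run statements like the following (the table and column names are illustrative):

```sql
-- Temporary table with the same shape as the main table; it is visible only
-- to this session and dropped automatically when the session ends.
CREATE TEMPORARY TABLE temp_crm_contact (LIKE crm_contact INCLUDING DEFAULTS);

-- The Node.js process pipes the CSV stream into this statement's stdin
-- via the pg and pg-copy-streams modules.
COPY temp_crm_contact (id, first_name, last_name, email) FROM STDIN WITH (FORMAT csv);
```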
Now that our records are streamed into the temporary table, we need to upsert them into the main table. We used an INSERT INTO ... ON CONFLICT ... DO UPDATE SET ... SQL query to upsert records:
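With the same illustrative table and column names as above, the upsert looks like:

```sql
-- Move records from the temporary table into the main table; on primary-key
-- conflict, overwrite the existing row's fields with the incoming values.
INSERT INTO crm_contact (id, first_name, last_name, email)
SELECT id, first_name, last_name, email
FROM temp_crm_contact
ON CONFLICT (id) DO UPDATE SET
  first_name = EXCLUDED.first_name,
  last_name  = EXCLUDED.last_name,
  email      = EXCLUDED.email;
```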
While we’re happy with our performance for where we are today, we’d like to explore further strategies to improve our syncs in the future.
Sync performance is core to what we do and the product experience we deliver to our customers. If you’re thinking about syncing data from your customers’ CRMs, please reach out. We’d love to chat!