MERGE vs Insert/Delete: Database Constraints in CDC ETL Systems 

While working on an ETL pipeline project recently, a business requirement completely changed the way I think about incremental data loading and Change Data Capture (CDC) patterns. 

Let me tell you what happened and why it matters for anyone building modern data pipelines. Before diving into technical complexities, let's set up an actual scenario. Imagine a company approaches you with a requirement: they want the employee data stored in their HR system to be regularly and accurately synced to their central data warehouse for accurate reporting and analytics. On the initial run, all employee records were loaded from the HR system into the data warehouse. For every incremental run after that, we were supposed to sync only newly added employees and update any existing records that had been modified. This is what the destination table looked like after the first few runs.

When Everything Was Perfect

The original flow of my ETL pipeline was as follows (a rough SQL sketch of these steps comes right after the list):

  1. Find the maximum created_at from the destination: MAX(created_at) = "2023-09-18 08:40:00".
  2. Find the maximum updated_at from the destination: MAX(updated_at) = "2023-09-18 08:40:00".
  3. Fetch incremental data from the source based on the 'created_at' column, where created_at > "2023-09-18 08:40:00".
  4. Load it into the destination table under the staging schema.
  5. Fetch modified records from the source based on the 'updated_at' column, where updated_at > "2023-09-18 08:40:00".
  6. Load them into the destination table, creating duplicates.
  7. Clean up duplicates by deleting old versions of a record and keeping only the latest version.
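Here is a rough SQL sketch of steps 1 through 6. Note that source.employees is my assumed name for the HR source table, and I'm pretending the source is reachable from the same database; in reality an ETL tool handles the extract and load:

-- steps 1 & 2: read the watermarks from the destination
SELECT MAX(created_at) AS max_created_at,
       MAX(updated_at) AS max_updated_at
FROM staging.employees;

-- steps 3 & 4: pull brand-new rows from the source and land them in staging
INSERT INTO staging.employees
SELECT *
FROM source.employees
WHERE created_at > '2023-09-18 08:40:00';   -- max_created_at from above

-- steps 5 & 6: pull modified rows and land them as extra copies (duplicates!)
INSERT INTO staging.employees
SELECT *
FROM source.employees
WHERE updated_at > '2023-09-18 08:40:00';   -- max_updated_at from above

Step 7, the deduplication, is the DELETE shown below.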

On the following run, two new employees had been added to the source system, and there was a salary change for one of the existing employees.

After inserting the newly created records (employees 10 and 11) and the modified version of an existing record (employee_id = 3), the destination table contained duplicates. Here is my implementation of the deduplication logic:

DELETE FROM staging.employees a
WHERE EXISTS (
    SELECT 1 FROM staging.employees b
    WHERE a.employee_id = b.employee_id   -- same employee
    AND a.created_at = b.created_at       -- same original record
    AND a.updated_at < b.updated_at       -- row a is the older version
);

This query removes outdated duplicates by keeping only the latest version of each record based on a few rules:

  • Compare each row in the employees table (a) with every other row (b) in the same table.
  • Check whether both rows have the same values in the 'employee_id' and 'created_at' columns; if so, they represent the same logical record.
  • Check whether the 'updated_at' value (the CDC column) in row a is older than the one in row b.
  • If such a newer version (b) exists, then row a (the older one) gets deleted.

Everything was working perfectly... well, until it wasn't. Then, on a random Friday, a business requirement dropped like a plot twist:

“We need indexes on ALL tables for faster query execution.”  

Now, normally indexes are your friends: they make query execution faster by acting as a quick reference guide for the database. But some of them come with a uniqueness constraint, and that is exactly where my not-so-elegant insert/delete logic crashed.
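For context, the kind of index that broke things looks something like this; the exact DDL is my assumption, since the requirement only said "indexes on all tables":

-- hypothetical unique index on the destination table's business key
CREATE UNIQUE INDEX ux_employees_employee_id
    ON staging.employees (employee_id);

With that in place, the database rejects a second row for the same employee_id the moment it is inserted; it does not care that a clean-up DELETE is scheduled to run later.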

Here’s What Happened: 

  1. Fetch incremental data from the source and insert it into the destination table.
  2. Fetch modified records from the source.
  3. Insert them into the destination table – uniqueness constraint violated.
  4. The delete logic never gets to run – pipeline failure.

A uniqueness constraint basically says that you cannot have duplicate keys, and my logic violated it by deliberately creating duplicates first and cleaning them up afterwards.
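To make it concrete, here is a minimal sketch of the failing step, assuming the modified rows are staged in temp_modified_records (the same name used in the MERGE example later):

-- employee_id = 3 already exists in staging.employees from a previous run;
-- temp_modified_records holds the modified version of that same employee
INSERT INTO staging.employees (employee_id, department_id, full_name, email,
                               salary, hire_date, created_at, updated_at)
SELECT employee_id, department_id, full_name, email,
       salary, hire_date, created_at, updated_at
FROM temp_modified_records;
-- rejected on the spot by the unique index on employee_id,
-- so the dedup DELETE that was supposed to run afterwards never gets the chance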

Why Insert/Delete Logic Is Problematic

Let’s break down the core issues with this approach, especially in modern data architectures: 

1. Data Inconsistencies 

CDC captures changes in your source by keeping track of newly inserted, updated, and deleted records. In real-time scenarios, enforcing uniqueness constraints is essential for maintaining data integrity. By creating duplicates first and cleaning them up later, you open a window in which the destination holds multiple versions of the same record and any constraint check will fail.

2. Performance Overhead 

By inserting everything first and then doing a full table scan to delete duplicates, you are essentially doing double the work. The whole point of an incremental loading strategy is to minimize resource utilization, and this approach does the opposite.

3. Atomicity Issues 

Atomicity means that all the operations in a transaction either fully complete or do not happen at all. With insert-then-delete, the two steps are separate statements: if the insert succeeds and the delete fails, the destination is left with duplicates and nothing rolls the insert back. The pipeline simply isn't atomic.
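A minimal sketch of the gap, assuming the usual autocommit behaviour where each standalone statement commits on its own:

-- statement 1 commits on its own under autocommit
INSERT INTO staging.employees
SELECT * FROM temp_modified_records;

-- if the job dies right here, the duplicate rows above are already committed and visible,
-- and statement 2 (the dedup DELETE shown earlier) never runs to clean them up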

4. Concurrency Problems 

In a concurrent environment, this approach can leave the data corrupted: while one pipeline run is inserting, another might be deleting, leading to inconsistent states.

The Hero Arc: Enter MERGE/UPSERT 

After my pipeline threw its tantrum, I had to level up and implement a MERGE operation. A MERGE statement takes the modified records from the source and merges them into the target table: if a record does not exist in the target, it is inserted; if it already exists, it is updated. In other words, MERGE handles the "insert if new, update if exists" logic in a single atomic operation. Here's how I replaced the problematic insert/delete logic with a robust MERGE:

MERGE INTO staging.employees AS t
USING temp_modified_records AS s
ON t.employee_id = s.employee_id AND t.created_at = s.created_at
-- existing employee with a newer version in the source: update in place
WHEN MATCHED AND s.updated_at > t.updated_at THEN
    UPDATE SET department_id = s.department_id,
               full_name = s.full_name,
               email = s.email,
               salary = s.salary,
               hire_date = s.hire_date,
               updated_at = s.updated_at
-- brand-new employee: insert
WHEN NOT MATCHED THEN
    INSERT (employee_id, department_id, full_name, email, salary, hire_date, created_at, updated_at)
    VALUES (s.employee_id, s.department_id, s.full_name, s.email, s.salary, s.hire_date, s.created_at, s.updated_at);

This single statement completely replaces the entire insert/delete workflow, handling both new records and updates atomically while respecting all database constraints. If Aleena Ali exists, update her salary to 80,000 PKR; if she does not exist, insert her as a new record. No constraint violations, no duplicates, no problems.

Here is why MERGE statements are superior: 

  1. MERGE handles everything in one transaction. No more race conditions, no more constraint violations, no more 3 AM debugging sessions. 
  2. "UPSERT" is the general name for this "update if it exists, insert if it doesn't" operation, keyed on the table's primary key. One statement instead of several separate insert/select/delete statements means better performance and fewer locks (a PostgreSQL-style spelling of the same idea follows this list).
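Not every engine spells this the same way. If you are on a PostgreSQL-style database rather than one with standard MERGE (PostgreSQL only gained MERGE in version 15), the same upsert can be written with INSERT ... ON CONFLICT. A minimal sketch, assuming a unique constraint or index on employee_id:

INSERT INTO staging.employees AS t
    (employee_id, department_id, full_name, email, salary, hire_date, created_at, updated_at)
SELECT employee_id, department_id, full_name, email, salary, hire_date, created_at, updated_at
FROM temp_modified_records
ON CONFLICT (employee_id) DO UPDATE
SET department_id = EXCLUDED.department_id,
    full_name = EXCLUDED.full_name,
    email = EXCLUDED.email,
    salary = EXCLUDED.salary,
    hire_date = EXCLUDED.hire_date,
    updated_at = EXCLUDED.updated_at
WHERE EXCLUDED.updated_at > t.updated_at;   -- only overwrite with a newer source version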

Real Talk: Industry Best Practices

Based on my experience and industry best practices, here are the key considerations: 

  • No matter how many times you run your pipeline, it should produce the same results. This property, usually called idempotency, is essential for reliability in a production environment.
  • Always work with your database constraints, not against them. Modern data platforms like Snowflake, BigQuery, and Azure Synapse have optimized MERGE implementations specifically for this reason.
  • Keep an eye on MERGE operation performance as your data grows. The good news is that it scales much better than insert/delete patterns. 

The data engineering community has been moving away from insert/delete patterns for good reasons. Incremental load testing in ETL means verifying that the process correctly identifies and applies only the data that is new or modified since the last load. Modern ETL frameworks treat atomic transactions, scalability in distributed environments, and constraint compatibility as first-class requirements.

Bottom Line

Look, I get it: the insert/delete approach seems intuitive at first, and I was there too. But in production environments with real business constraints, it is like building a house on quicksand.

MERGE/UPSERT operations are the industry standard for handling incremental data and CDC for a reason. They’re atomic, constraint-friendly, performant, and designed for exactly this use case. 

My advice? Start with MERGE from day one. Don’t wait for your pipeline to break in production like mine did. Your future self (and your on-call schedule) will thank you.

