
Change Data Capture (CDC) is an important tool in real-time, event-driven architectures. Learn about CDC and its role in real-time data in this helpful overview.
Contents
The Usage of CDC Column
One of my colleagues asked me that whether, is it a good idea to use ‘MODIFIED_ON’ or ‘CHANGED_ON’ column(s) for incremental load purpose.
While using a 'MODIFIED_ON'
column can technically fulfill both Change Data Capture (CDC) and incremental load requirements in some scenarios, it is generally NOT considered a good practice in data warehousing and can lead to significant issues and complexities as per my past DWH experiences.
Why CDC Column Alone is Problematic for Incremental Loads and CDC
While a ‘MODIFIED_ON’ column might seem like a simple solution for incremental data loads, relying on it exclusively presents several critical drawbacks for data warehousing:
- A
'MODIFIED_ON'
column only tracks updates to existing records. It provides no information about records that have been deleted from the source system. Your incremental load logic based solely on ‘MODIFIED_ON’ will not be able to identify and propagate these deletions to your data warehouse. - While most updates will change the ‘MODIFIED_ON’ timestamp, there might be scenarios where a change occurs in the underlying data without the ‘MODIFIED_ON’ column being updated (e.g., system bugs, manual data manipulation without triggering the update mechanism). This can lead to data inconsistencies in your data warehouse.
- The granularity of the ‘MODIFIED_ON’ timestamp might not be sufficient to capture all changes, especially in high-volume transactional systems. Multiple changes within the same timestamp could be missed.
- Relying solely on scanning the entire source table based on a date range in the ‘MODIFIED_ON’ column for every incremental load can be inefficient, especially for large tables.
- It doesn’t explicitly identify what changed. This can be important for certain downstream processes or auditing requirements.
Good Practices for CDC and Incremental Loads in Data Warehousing
The preferred approaches for implementing CDC and incremental loads are more robust and explicitly designed for these purposes:
- Change Data Capture (CDC) Mechanisms
- Database transaction logs record all changes made to the database. CDC tools can parse these logs to identify inserts, updates, and deletes in near real-time. Examples include Debezium, Apache Kafka Connect with database connectors, and database-specific features like SQL Server Change Data Capture (CDC) and Oracle GoldenGate.
- Triggers can be implemented on source tables to record changes in a separate audit table. This approach can add overhead to the source system.
- Periodically taking snapshots of the data and comparing them, or maintaining version numbers or status flags (e.g., “active,” “deleted”) within the source table. This is less real-time but can be suitable for certain scenarios.
- Adding a specific column to the source table that explicitly indicates when a record was created or last changed for the purpose of CDC. This is different from just relying on a generic ‘MODIFIED_ON’.
- Incremental Load Strategies
- Using a timestamp column that reliably indicates when a record was created or last meaningfully changed. This is often used in conjunction with a proper CDC mechanism to ensure all changes are captured.
- If the source system provides sequential identifiers or version numbers for records, these can be used to identify new or updated records.
- Using flags to indicate records that need to be processed (e.g., a “to_be_processed” flag that is set upon creation or modification).
Why Industries Don’t Rely Solely on CDC Column?
The reason you don’t see industries heavily relying on just ‘MODIFIED_ON’ for both CDC and incremental loads is due to the inherent limitations mentioned above, particularly the inability to reliably track deletions and the potential for missed updates. Data warehouses are built on the principle of data integrity and accuracy and relying on a generic ‘MODIFIED_ON’ column introduces too much risk of data inconsistencies.
Relying solely on a ‘MODIFIED_ON’ column for both Change Data Capture (CDC) and incremental loads in a data warehouse is generally avoided due to significant limitations that compromise data integrity and accuracy. A fundamental issue is the inability of a ‘MODIFIED_ON’ column to track record deletions from the source system.
This column only registers updates to existing records, leaving the data warehouse unaware of any removals. Consequently, incremental load logic based solely on the ‘MODIFIED_ON’ timestamp will fail to propagate these deletions, leading to an incomplete and potentially misleading representation of the source data in the warehouse. Analytical reports might then include information about entities that no longer exist in the operational systems, undermining the reliability of business insights.
Furthermore, the accuracy of data in the warehouse is at risk when relying solely on ‘MODIFIED_ON’. While ideally, any data modification should trigger an update to this timestamp, real-world systems can exhibit inconsistencies. System bugs, manual data manipulations that bypass standard update mechanisms, or data migration processes might alter data without a corresponding change in the ‘MODIFIED_ON’ value.
This can result in missed updates during incremental loads, causing the data in the warehouse to diverge from the true state of the source systems. The cumulative effect of these missed changes over time can significantly erode the integrity of the data warehouse, making it an unreliable foundation for decision-making.
Finally, relying on scanning the entire source table based on a ‘MODIFIED_ON’ date range for every incremental load can introduce performance inefficiencies, especially as the volume of data grows. Additionally, the ‘MODIFIED_ON’ column merely indicates when a record was last changed, not what specifically changed. This lack of context can be crucial for certain downstream processes or auditing requirements. For these reasons, data warehousing best practices favor employing dedicated CDC mechanisms and incremental load strategies that are specifically designed to capture all types of data changes, including deletions, in a reliable and efficient manner, thereby ensuring the integrity and accuracy of the analytical data.
Summary
While MODIFIED_ON can seem like a convenient single solution, it’s a shortcut that often leads to problems in the long run. For robust and reliable CDC and incremental loads, it’s crucial to employ dedicated CDC mechanisms and well-defined incremental load strategies that explicitly address the challenges of tracking all types of data changes, including deletions. Using a dedicated CDC mechanism and then using a reliable timestamp (which might be CREATED_ON, MODIFIED_ON if its update is guaranteed for all meaningful changes, or a dedicated CDC timestamp) for incremental loading the captured changes into the data warehouse is a sounder approach.