15 September 2025

Streaming CDC With Debezium: Keeping Data in Sync Across Heterogeneous Stores

Understand how Debezium's log-based change data capture eliminates batch ETL latency by reading database transaction logs and publishing structured events to Apache Kafka in under 200 milliseconds. This post compares CDC approaches, walks through a production Kafka Connect setup for PostgreSQL and MySQL, and demonstrates how event-driven pipelines enable cache invalidation, search indexing, and real-time analytics without coupling consumers to the source database.

Adyantrix Team

Adyantrix Editorial Team

Introduction

Most organisations run several data stores side by side: relational systems such as PostgreSQL and MySQL alongside document stores like MongoDB, message queues, search indices, and analytical warehouses. Keeping them consistent is no trivial undertaking. Traditional approaches, including scheduled ETL batch jobs and application-level dual writes, introduce latency and fragility, and every new integration adds another place for bugs to hide.

Enter Debezium: an open-source change data capture (CDC) platform that monitors database transaction logs in real time and broadcasts every insert, update, and delete as a structured event stream. Rather than polling the database at intervals or burdening application code with synchronisation logic, Debezium taps directly into the native replication mechanisms already built into modern databases. The result is a non-intrusive, low-latency pipeline that keeps downstream consumers perpetually in step with the source of truth.

This post examines the mechanics behind Debezium, explains why streaming CDC has become the preferred pattern for heterogeneous data synchronisation, walks through a production-grade implementation, and explores the measurable business outcomes organisations have achieved after adoption.

What is Change Data Capture (CDC)?

Change Data Capture is a design pattern, not a single product, that identifies and records mutations as they occur within a persistent data store. Comparing full snapshots of a table at two points in time is expensive and misses intermediate states, so CDC implementations observe changes as they happen instead. The log-based variant reads the database's own write-ahead log (WAL) or binary log and extracts a chronologically ordered record of every committed transaction.

Three primary approaches exist in practice:

  • Log-based CDC: Reads the database replication log (PostgreSQL's WAL, MySQL's binlog, SQL Server's transaction log). This is the most reliable method because the log is already maintained by the database engine for its own recovery and replication needs. Debezium uses this approach exclusively.
  • Trigger-based CDC: Attaches database triggers to tables of interest that write change records to a shadow table on every DML operation. This is portable but adds write overhead and couples the change-capture logic to the database schema.
  • Query-based CDC: Periodically polls tables using a timestamp or sequence column (updated_at, version) to identify rows that changed since the last run. This is the simplest to implement but misses hard-deleted rows and introduces latency bounded by the polling interval.

Log-based CDC, as implemented by Debezium, removes the polling interval and captures hard deletes, the two weaknesses of query-based CDC, and it does so without the per-write overhead that triggers add. For many production use cases, those costs are what rule the other approaches out.
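The difference between polling and log reading is easy to see in a toy model. The sketch below is not Debezium code: it simulates a table with an updated_at column and an append-only list standing in for the WAL, and every name in it (Row, write, poll_changes) is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    name: str
    updated_at: int  # logical clock tick of the last write


# Simulated source table plus an append-only change log (a stand-in for the WAL).
table = {}
change_log = []


def write(op, row_id, name="", tick=0):
    """Apply a write to the table and append it to the log, as the engine would."""
    if op == "delete":
        table.pop(row_id, None)
    else:
        table[row_id] = Row(row_id, name, tick)
    change_log.append((op, row_id))


def poll_changes(since):
    """Query-based CDC: ids of rows whose updated_at is newer than the last poll."""
    return [row.id for row in table.values() if row.updated_at > since]


write("insert", 1, "widget", tick=1)
write("insert", 2, "gadget", tick=2)
last_poll_tick, log_position = 2, len(change_log)  # both readers are caught up here

write("update", 1, "widget v2", tick=3)
write("delete", 2)  # hard delete: the row simply disappears from the table

polled_ids = poll_changes(since=last_poll_tick)
logged = change_log[log_position:]

print(polled_ids)  # [1]
print(logged)      # [('update', 1), ('delete', 2)]
```

The poller sees only the update, because the deleted row no longer exists to be selected. The log reader sees both changes, in commit order.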

Debezium: An Overview

Debezium is an open-source project, originally created and sponsored by Red Hat and built on top of Apache Kafka Connect. It is released under the Apache Licence 2.0 but is not an Apache Software Foundation project. It provides a library of source connectors, each tuned to the replication protocol of a specific database engine. Supported databases include MySQL, PostgreSQL, SQL Server, Oracle, MongoDB, Db2, Cassandra (via commit log), and Vitess, among others.

The architecture is straightforward. A Debezium connector runs as a task inside a Kafka Connect worker. On first start, the connector performs an initial snapshot, reading the current contents of the targeted tables in a consistent, non-blocking fashion, and emits each existing row as a read event. Once the snapshot is complete, the connector switches to streaming mode and follows the database log in real time. Every committed change is converted into a JSON (or Avro) event that conforms to Debezium's envelope schema and is published to a dedicated Kafka topic, typically named <topic-prefix>.<schema>.<table>.

Each event carries both the before and after images of the row, the operation type (c for create, u for update, d for delete, r for read/snapshot), the source database metadata, and a precise timestamp and log position. This richness of context makes the event stream equally suitable for real-time replication, audit logging, cache invalidation, search index updates, and feeding machine-learning feature pipelines.
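To make the envelope concrete, here is a minimal sketch that parses one event and reports what changed. The sample payload is hand-written for illustration (real events carry more source fields and, depending on converter settings, a schema section), and describe_change is a hypothetical helper, not part of any Debezium library. On PostgreSQL, how much of the before image is populated depends on the table's REPLICA IDENTITY setting; the sample assumes a full before image.

```python
import json

# Hand-written sample payload; real events carry additional "source" fields.
SAMPLE_EVENT = """
{
  "before": {"id": 42, "status": "pending"},
  "after":  {"id": 42, "status": "shipped"},
  "source": {"connector": "postgresql", "schema": "public",
             "table": "orders", "lsn": 24023128, "ts_ms": 1757930000000},
  "op": "u",
  "ts_ms": 1757930000120
}
"""

OPERATIONS = {"c": "create", "u": "update", "d": "delete", "r": "read"}


def describe_change(raw):
    """Summarise one envelope: operation, qualified table, and changed columns."""
    event = json.loads(raw)
    before = event.get("before") or {}  # null for creates and snapshot reads
    after = event.get("after") or {}    # null for deletes
    columns = sorted(set(before) | set(after))
    changed = [c for c in columns if before.get(c) != after.get(c)]
    source = event["source"]
    return {
        "operation": OPERATIONS[event["op"]],
        "table": f'{source["schema"]}.{source["table"]}',
        "changed_columns": changed,
    }


summary = describe_change(SAMPLE_EVENT)
print(summary)
# {'operation': 'update', 'table': 'public.orders', 'changed_columns': ['status']}
```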

Key Benefits of Using Debezium

1. Real-time Data Synchronisation

Debezium enables sub-second propagation of changes from a source database to any number of downstream systems. In practice, end-to-end latency — from a committed write on PostgreSQL to a message available on a Kafka topic — is typically in the range of 50–200 milliseconds under normal load. This is orders of magnitude faster than the five-to-fifteen-minute windows common in batch ETL, and it is sufficient for the vast majority of operational use cases, including live dashboards, fraud-detection pipelines, and personalisation engines that need current inventory or pricing signals.
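One way to check a latency figure like this against your own pipeline is to compare the two timestamps in each envelope: source.ts_ms (when the change was made in the database) and the top-level ts_ms (when the connector processed it). The sketch below uses made-up timestamps and a simple nearest-rank percentile. It measures capture lag only, not the full path to a sink, and it assumes the database and connector clocks are reasonably synchronised.

```python
import math


def capture_lag_ms(event):
    """Milliseconds between the source change and Debezium processing it."""
    return event["ts_ms"] - event["source"]["ts_ms"]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical events reduced to the two timestamps that matter here.
events = [
    {"ts_ms": 1_000_040, "source": {"ts_ms": 1_000_000}},
    {"ts_ms": 2_000_085, "source": {"ts_ms": 2_000_000}},
    {"ts_ms": 3_000_120, "source": {"ts_ms": 3_000_000}},
    {"ts_ms": 4_000_310, "source": {"ts_ms": 4_000_000}},
]

lags = [capture_lag_ms(e) for e in events]
print(lags)                  # [40, 85, 120, 310]
print(percentile(lags, 50))  # 85
print(percentile(lags, 95))  # 310
```

Tracking a high percentile rather than the mean makes occasional slow events, such as the 310 ms outlier here, visible.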

2. Event-driven Architecture

Because Debezium publishes changes to Kafka topics, any consumer — a microservice, a stream processor, a data warehouse loader, a cache invalidation listener — can subscribe independently and process events at its own pace without coupling directly to the source database. This decoupling is a first-class architectural benefit. Adding a new consumer requires no changes to the source application or database schema. The Kafka topic becomes a durable, replayable record of all state transitions, meaning a newly deployed service can bootstrap its own read model by consuming from the beginning of the topic.

3. Flexibility and Scalability

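Kafka's partitioned, distributed log allows Debezium-fed pipelines to scale horizontally. A well-provisioned Kafka cluster can sustain millions of events per second, and multiple consumer groups can read the same topic concurrently without coordinating with each other. Connectors are distributed across the workers of a Kafka Connect cluster; most Debezium source connectors, including PostgreSQL and MySQL, run a single task each, so scaling out means spreading connectors across workers rather than parallelising one connector. Offsets are stored automatically, so a restarted connector resumes from its last recorded log position. Delivery is at-least-once by default; exactly-once delivery depends on Kafka Connect's exactly-once source support (Kafka 3.3 and later) and a Debezium version that supports it.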

4. Open-source Support

Debezium is released under the Apache Licence 2.0 and has a mature ecosystem with active maintenance. The connector library covers all major relational and several NoSQL databases. The community publishes a Debezium Server distribution — a standalone Java process for teams that do not want to operate a full Kafka cluster — which can route events directly to Amazon Kinesis, Google Pub/Sub, Azure Event Hubs, or HTTP endpoints. This flexibility makes Debezium viable for both cloud-native deployments and organisations incrementally modernising on-premises infrastructure.

Production Implementation: Step-by-Step

The following walkthrough covers a realistic production setup: a PostgreSQL 15 source, Debezium 2.x running on Kafka Connect, and two downstream sinks — an Elasticsearch index for full-text search and a ClickHouse analytical store.

Step 1 — Configure PostgreSQL for Logical Replication

Debezium's PostgreSQL connector uses logical replication, which must be enabled at the server level. In postgresql.conf, set:

wal_level = logical
max_replication_slots = 4
max_wal_senders = 4

Create a dedicated replication user with the minimum required privileges:

CREATE ROLE debezium_user REPLICATION LOGIN PASSWORD 'strong_password';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO debezium_user;
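Before registering a connector, it can be worth checking these settings automatically, for example in a deployment pipeline. The following is a minimal sketch under simplifying assumptions: it only understands "name = value" lines, treats everything after # as a comment, and falls back to PostgreSQL's built-in defaults (wal_level = replica, and 10 for both limits) when a setting is absent. parse_conf and replication_problems are hypothetical helper names.

```python
def parse_conf(text):
    """Parse 'name = value' lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        name, sep, value = line.partition("=")
        if sep:
            settings[name.strip().lower()] = value.strip().strip("'\"")
    return settings


def replication_problems(settings, connectors=1):
    """Return readable problems; an empty list means the settings look usable."""
    problems = []
    if settings.get("wal_level", "replica") != "logical":
        problems.append("wal_level must be 'logical'")
    for name in ("max_replication_slots", "max_wal_senders"):
        if int(settings.get(name, "10")) < connectors:
            problems.append(f"{name} must be at least {connectors}")
    return problems


CONF = """
wal_level = logical          # required for logical decoding
max_replication_slots = 4
max_wal_senders = 4
"""

ok = replication_problems(parse_conf(CONF))
bad = replication_problems(parse_conf("wal_level = replica\nmax_wal_senders = 0"))
print(ok)   # []
print(bad)  # ["wal_level must be 'logical'", 'max_wal_senders must be at least 1']
```

Note that a change to wal_level only takes effect after a server restart, so a check like this belongs before the maintenance window, not after it.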

Step 2 — Deploy Kafka and Kafka Connect

Use the Confluent Platform or the vanilla Apache Kafka distribution. For containerised environments, the official Debezium container images bundle Kafka Connect with the connector plugins pre-installed. A minimal docker-compose.yml would define services for ZooKeeper (or KRaft in Kafka 3.x), the Kafka brokers, and a Kafka Connect worker. If you build on an image that does not bundle Debezium, set the worker's plugin path (CONNECT_PLUGIN_PATH on Confluent images) to the directory holding the downloaded Debezium connector archives.

Step 3 — Register the Debezium PostgreSQL Connector

POST the following JSON payload to the Kafka Connect REST API at http://connect-host:8083/connectors:

{
  "name": "pg-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "pg-primary.internal",
    "database.port": "5432",
    "database.user": "debezium_user",
    "database.password": "strong_password",
    "database.dbname": "production",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_publication",
    "table.include.list": "public.orders,public.products,public.customers",
    "topic.prefix": "prod_pg",
    "snapshot.mode": "initial",
    "decimal.handling.mode": "double",
    "heartbeat.interval.ms": "10000"
  }
}

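Key configuration decisions here: topic.prefix sets the namespace for all topic names; it replaced database.server.name in Debezium 2.0, so only topic.prefix is needed on 2.x. plugin.name: pgoutput uses PostgreSQL's built-in logical decoding plugin rather than decoderbufs, avoiding an additional C extension on the database server. heartbeat.interval.ms lets the connector keep confirming its WAL position (LSN, log sequence number) while the captured tables are quiet, which helps prevent unbounded WAL accumulation on the source. decimal.handling.mode: double is convenient for consumers but can lose precision; consider precise or string for monetary columns.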
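The registration call itself is an ordinary HTTP POST, so it is easy to script. The sketch below uses only the Python standard library and an abbreviated config. It builds the request without sending it, so it can be inspected or tested offline. The environment variable DEBEZIUM_DB_PASSWORD and both helper functions are assumptions made for this example.

```python
import json
import os
import urllib.request

CONNECT_URL = "http://connect-host:8083/connectors"


def build_register_request(name, config, url=CONNECT_URL):
    """Build (but do not send) the POST that registers a connector."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def register(request, timeout=10):
    """Send the request; Kafka Connect answers 201 Created on success."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "pg-primary.internal",
    "database.port": "5432",
    "database.user": "debezium_user",
    # DEBEZIUM_DB_PASSWORD is a hypothetical variable name for this sketch.
    "database.password": os.environ.get("DEBEZIUM_DB_PASSWORD", "change-me"),
    "database.dbname": "production",
    "topic.prefix": "prod_pg",
    "plugin.name": "pgoutput",
    "table.include.list": "public.orders,public.products,public.customers",
}

request = build_register_request("pg-source-connector", config)
print(request.get_method(), request.full_url)
# register(request)  # uncomment to send against a reachable Connect worker
```

Reading the password from the environment keeps it out of version control. Kafka Connect also supports config providers for externalising secrets, which is usually the better option in production.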

Step 4 — Deploy Sink Connectors

For Elasticsearch, use the Confluent Elasticsearch Sink Connector and point it at the prod_pg.public.products topic. For ClickHouse, use either ClickHouse's built-in Kafka table engine or the clickhouse-kafka-connect sink connector. Sinks generally expect a flat row rather than the full Debezium envelope, so a common pattern is to flatten each event to its after image with Debezium's ExtractNewRecordState transform and then upsert on the primary key carried in the message key.
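What a sink has to do with each envelope can be sketched in a few lines. Inside Kafka Connect this flattening is normally delegated to Debezium's ExtractNewRecordState single message transform; the hand-written version below only illustrates the idea, with a plain dict standing in for the search index and "id" assumed to be the primary-key column.

```python
def unwrap(event, key_column="id"):
    """Turn one Debezium envelope into an (action, row) pair for a sink."""
    op = event["op"]
    if op in ("c", "u", "r"):          # create, update, snapshot read
        return ("upsert", event["after"])
    if op == "d":                      # delete: only the key is needed
        return ("delete", {key_column: event["before"][key_column]})
    raise ValueError(f"unsupported operation: {op!r}")


def apply_to_index(index, event):
    """Apply one envelope to a dict that stands in for the sink."""
    action, row = unwrap(event)
    if action == "upsert":
        index[row["id"]] = row
    else:
        index.pop(row["id"], None)


events = [
    {"op": "c", "before": None, "after": {"id": 7, "price": 10}},
    {"op": "u", "before": {"id": 7, "price": 10}, "after": {"id": 7, "price": 12}},
    {"op": "c", "before": None, "after": {"id": 8, "price": 5}},
    {"op": "d", "before": {"id": 8, "price": 5}, "after": None},
]

index = {}
for event in events:
    apply_to_index(index, event)

print(index)  # {7: {'id': 7, 'price': 12}}
```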

Step 5 — Implement Schema Evolution Handling

As the source schema evolves (new columns are added, data types change), Debezium embeds the updated schema in the events it emits. When using Avro, run a schema registry (Confluent Schema Registry or AWS Glue Schema Registry) to store the schemas and enforce compatibility rules. FORWARD compatibility suits many production workloads: events written with a new schema must remain readable by consumers still on the previous one, so fields may be added, but a field may be removed only if it was optional (had a default). That keeps older consumers working while schemas are updated.
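The forward-compatibility rule can be expressed as a small check. This is a deliberately simplified model, not Avro's full schema-resolution algorithm: schemas are plain dicts of field name to {"type", optional "default"}, and any type change is treated as incompatible, although Avro permits some type promotions. A real deployment would rely on the schema registry's own compatibility check.

```python
def forward_problems(old, new):
    """Problems that stop readers on the OLD schema reading NEW data."""
    problems = []
    for name, spec in old.items():
        if name not in new:
            # The old reader still expects this field and needs a default for it.
            if "default" not in spec:
                problems.append(f"removed field without a default: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
    # Fields that exist only in the new schema are ignored by old readers.
    return problems


old = {
    "id": {"type": "long"},
    "email": {"type": "string"},
    "nickname": {"type": "string", "default": ""},
}
added_and_dropped_optional = {
    "id": {"type": "long"},
    "email": {"type": "string"},
    "phone": {"type": "string"},
}
dropped_required = {
    "id": {"type": "long"},
    "nickname": {"type": "string", "default": ""},
}

print(forward_problems(old, added_and_dropped_optional))  # []
print(forward_problems(old, dropped_required))
# ['removed field without a default: email']
```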

Step 6 — Monitor and Alert

Debezium exposes JMX metrics covering streaming lag (MilliSecondsBehindSource, the time between a change on the source and the connector processing it), event throughput, and snapshot progress. The volume of WAL retained for the replication slot is not a connector metric; read it from PostgreSQL's pg_replication_slots view. Scrape the JMX metrics with a Prometheus JMX exporter and build Grafana dashboards tracking:

  • Connector lag in seconds — alert if > 30 seconds
  • Events per second — baseline and anomaly detection
  • Replication slot retained WAL size — alert if > 1 GB to prevent disk exhaustion on the source
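In practice these thresholds would live in Prometheus alerting rules, but the logic is simple enough to sketch directly. The function below is a hypothetical illustration: it takes a lag reading in milliseconds (as reported by MilliSecondsBehindSource) and a retained-WAL figure in bytes, and reads "1 GB" as 1 GiB.

```python
LAG_LIMIT_MS = 30 * 1000     # "alert if > 30 seconds"
WAL_LIMIT_BYTES = 1024 ** 3  # "alert if > 1 GB", read here as 1 GiB


def fired_alerts(milliseconds_behind_source, retained_wal_bytes):
    """Return the names of the thresholds that the current readings exceed."""
    fired = []
    if milliseconds_behind_source > LAG_LIMIT_MS:
        fired.append("connector_lag")
    if retained_wal_bytes > WAL_LIMIT_BYTES:
        fired.append("retained_wal")
    return fired


print(fired_alerts(1_200, 200 * 1024 ** 2))   # []
print(fired_alerts(45_000, 200 * 1024 ** 2))  # ['connector_lag']
print(fired_alerts(45_000, 3 * 1024 ** 3))    # ['connector_lag', 'retained_wal']
```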

Real-world Case Studies

Fintech: Real-time Fraud Signal Propagation

A mid-sized European payments processor was running a nightly ETL job to synchronise its transactional PostgreSQL database with a Redis cache used by its fraud-scoring service. The 12-hour lag meant the fraud model was operating on stale merchant risk scores, leading to a measurable increase in false negatives during morning trading peaks.

After deploying Debezium against the transactions and merchant tables, the fraud team replaced their batch job with a Kafka Streams application that consumed the CDC event stream, recomputed risk features incrementally, and wrote updated scores back to Redis within 200 milliseconds of each transaction commit. False-negative fraud rates dropped by 18% in the first month, and the overnight maintenance window for the ETL job was eliminated entirely.

E-commerce: Keeping Search Indices Current

A European online retailer with a catalogue of over four million SKUs relied on a nightly Elasticsearch reindex to keep product search results accurate. Price changes and stock-level updates made during the day were invisible to the search experience until the following morning. During high-traffic events such as flash sales, this created a visible discrepancy between the browsable catalogue and the actual checkout state, generating a significant volume of abandoned carts and customer complaints.

By implementing Debezium on their MySQL product database and routing events through a lightweight Kafka Streams processor to Elasticsearch, the retailer achieved near-real-time index updates. The average latency from a price change in the admin panel to that change appearing in search results fell from approximately 14 hours to under three seconds. Customer support ticket volume related to price discrepancies dropped by 40% within two months of go-live.

Healthcare: Audit-grade Data Lineage

A healthcare technology company operating under HIPAA was required to maintain an immutable audit trail of all modifications to patient records stored across three separate database systems — a legacy SQL Server instance, a PostgreSQL EHR platform, and a MongoDB document store for unstructured clinical notes. Their existing audit approach relied on application-layer logging, which meant any direct database write bypassing the application was invisible to the audit trail.

Debezium connectors for all three databases were deployed and configured to emit events to a dedicated Kafka topic. A downstream consumer appended these events to an append-only audit log in Amazon S3 in Parquet format, partitioned by date and table. Because Debezium captures changes at the database log level rather than the application layer, the audit trail was now complete and tamper-evident regardless of how changes reached the database. Compliance auditors confirmed the approach satisfied their HIPAA audit-control requirements at the next annual assessment.

Best Practices and Common Pitfalls

Manage Replication Slot Lag Carefully

PostgreSQL's logical replication slots prevent WAL segments from being recycled until the slot consumer has acknowledged them. If the Debezium connector is paused, falls behind, or is deleted without first dropping the slot, WAL files accumulate on the source server and can exhaust disk space, causing a production outage. Always set an alert on retained WAL size and configure max_slot_wal_keep_size (available from PostgreSQL 13) to enforce an upper bound, accepting that the slot will be invalidated rather than filling the disk.
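Retained WAL is the distance between the server's current write position and the slot's restart_lsn. On the server this is what pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) computes against pg_replication_slots. The sketch below reproduces the arithmetic client-side, which is handy when the two LSNs arrive as text from a metrics exporter; the function names are made up for this example.

```python
def lsn_to_int(lsn):
    """Convert a textual LSN ('high/low' in hexadecimal) to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn, restart_lsn):
    """Bytes of WAL the server must keep for a slot at restart_lsn."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


print(retained_wal_bytes("0/3000000", "0/1000000"))  # 33554432 (32 MiB)
print(retained_wal_bytes("1/0", "0/FFFFFFFF"))       # 1
```

The second call shows why the two halves cannot be compared separately: the position rolls over from the low word into the high word.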

Plan for Initial Snapshot Duration

On large tables, the initial snapshot can run for hours or days. During this period the connector holds a transaction open on the source database, which can hold back autovacuum cleanup and delay operations that need conflicting locks. One option is to skip the data snapshot with snapshot.mode: no_data (older releases call this never on the PostgreSQL connector and schema_only on the MySQL connector), pre-populate downstream systems by other means such as a bulk load, and stream only new changes. Alternatively, use Debezium's incremental snapshot feature (available from 1.6+), which chunks the snapshot into smaller transactions and interleaves it with streaming changes.
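The chunking idea behind incremental snapshots can be illustrated with keyset pagination: read a bounded slice in primary-key order, remember the last key, and continue from there in a new short query. The sketch below shows only that part. Debezium's actual implementation also uses watermark signals to reconcile each chunk with changes streamed concurrently, which is omitted here, and read_rows is a stand-in for a SQL query.

```python
def snapshot_chunks(read_rows, chunk_size):
    """Yield the table in primary-key order, one bounded chunk at a time.

    read_rows(after_key, limit) stands in for:
    SELECT ... WHERE id > after_key ORDER BY id LIMIT limit
    """
    last_key = None
    while True:
        chunk = read_rows(last_key, chunk_size)
        if not chunk:
            return
        yield chunk
        last_key = chunk[-1]["id"]


# In-memory stand-in for the source table.
table = {i: {"id": i, "total": i * 10} for i in range(1, 8)}


def read_rows(after_key, limit):
    keys = sorted(k for k in table if after_key is None or k > after_key)
    return [table[k] for k in keys[:limit]]


chunks = []
for chunk in snapshot_chunks(read_rows, chunk_size=3):
    chunks.append([row["id"] for row in chunk])
    if len(chunks) == 1:
        table[8] = {"id": 8, "total": 80}  # a write that lands mid-snapshot

print(chunks)  # [[1, 2, 3], [4, 5, 6], [7, 8]]
```

Because no single query spans the whole table, no long-lived transaction is needed, and a row inserted while the snapshot is running is still picked up by a later chunk.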

Design Idempotent Consumers

Kafka provides at-least-once delivery by default. Network retries, connector restarts, and rebalances can result in the same event being delivered more than once. All downstream consumers must be idempotent: upsert on primary key rather than insert, use conditional updates, or deduplicate using the log position included in each Debezium envelope (source.lsn on PostgreSQL). The ts_ms timestamp alone is not unique enough for deduplication. For exactly-once semantics end to end, combine Kafka transactions on the producing side with an idempotent or transactional sink.
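A minimal sketch of an idempotent consumer follows, assuming PostgreSQL-style events where source.lsn increases with each change and "id" is the primary key. The class is hypothetical: it remembers the highest LSN applied per key and silently skips anything at or below it, so redelivered events leave the state unchanged. Other connectors expose different position fields (MySQL, for example, reports a binlog file and position).

```python
class IdempotentStore:
    """Applies Debezium events at most once per key, using the source LSN."""

    def __init__(self):
        self.rows = {}
        self.applied_lsn = {}  # highest LSN applied for each primary key

    def apply(self, event):
        """Apply the event; return False if it was a duplicate or stale."""
        image = event["before"] if event["op"] == "d" else event["after"]
        key, lsn = image["id"], event["source"]["lsn"]
        if lsn <= self.applied_lsn.get(key, -1):
            return False
        if event["op"] == "d":
            self.rows.pop(key, None)
        else:
            self.rows[key] = image
        self.applied_lsn[key] = lsn  # kept after deletes to reject stale replays
        return True


create_1 = {"op": "c", "before": None, "after": {"id": 1, "qty": 1}, "source": {"lsn": 100}}
update_1 = {"op": "u", "before": None, "after": {"id": 1, "qty": 5}, "source": {"lsn": 200}}
create_2 = {"op": "c", "before": None, "after": {"id": 2, "qty": 9}, "source": {"lsn": 150}}
delete_2 = {"op": "d", "before": {"id": 2, "qty": 9}, "after": None, "source": {"lsn": 250}}

store = IdempotentStore()
delivered = [create_1, update_1, create_1, create_2, delete_2, create_2]
results = [store.apply(e) for e in delivered]

print(results)     # [True, True, False, True, True, False]
print(store.rows)  # {1: {'id': 1, 'qty': 5}}
```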

Use Tombstone Events for Deletes

When Debezium emits a delete event, it publishes two messages: the delete event itself (with the before image of the deleted row) followed by a tombstone — a message with a null value and the same key. The tombstone signals to Kafka's log compaction that this key is eligible for removal from the compacted topic. Ensure your consumers handle null-value messages gracefully; many sink connectors require explicit configuration to process tombstones as delete operations rather than treating them as errors.
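The role of the tombstone is easiest to see by modelling compaction itself. The sketch below is a simplification: real Kafka compaction runs in the background and keeps tombstones for a configurable retention period before discarding them. The end state is the same, though: only the latest value per key survives, and keys whose latest value is null disappear.

```python
def compact(log):
    """Return the end state of a compacted topic as {key: latest value}."""
    latest = {}
    for key, value in log:
        latest[key] = value  # later records overwrite earlier ones for a key
    return {key: value for key, value in latest.items() if value is not None}


with_tombstone = [
    ("order-1", '{"op": "c"}'),
    ("order-2", '{"op": "c"}'),
    ("order-1", '{"op": "d"}'),  # the delete event still has a value
    ("order-1", None),           # the tombstone: same key, null value
]
without_tombstone = with_tombstone[:3]

print(compact(with_tombstone))     # {'order-2': '{"op": "c"}'}
print(compact(without_tombstone))  # order-1 survives, holding its delete event
```

Without the tombstone, the delete event would remain in the compacted topic indefinitely as the latest record for that key.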

Isolate Connectors by Criticality

Do not run connectors for high-throughput, low-latency pipelines in the same Kafka Connect worker cluster as connectors for lower-priority batch synchronisation jobs. Worker rebalances triggered by slow connectors can pause all connectors on the worker, introducing unexpected lag spikes in production-critical pipelines. Separate worker clusters by SLA tier.

Measuring the Business Impact

Organisations that move from batch ETL to streaming CDC with Debezium consistently report improvements across four dimensions:

Data freshness: The headline metric. Latency moves from minutes or hours to seconds or sub-seconds, unlocking use cases — real-time personalisation, live fraud scoring, instant inventory visibility — that are simply not viable with batch pipelines.

Operational complexity: Removing scheduled batch jobs eliminates a class of failure modes (failed job runs, partial loads, dependency chains) and reduces the on-call burden on data engineering teams. Debezium's offset management and automatic restart logic handle transient failures without manual intervention.

Source database load: Counter-intuitively, CDC often reduces load on the source database compared to polling-based approaches. Log reading is a sequential I/O operation and does not require table scans. Eliminating heavy batch queries that run against the live database during business hours frees up capacity for application traffic.

Compliance posture: Log-based CDC provides a complete, ordered record of all state transitions at the database level, which is a stronger basis for audit trails than application-layer logging. For regulated industries this can be the deciding factor in architecture selection.

A reasonable benchmark for a greenfield Debezium deployment on a mid-sized relational database (50–200 tables, 5,000–50,000 writes per second) is end-to-end pipeline latency (source commit to sink acknowledged) of under one second under typical load, with connector lag staying under five seconds during bursts.

Conclusion

Debezium has matured into the de facto standard for log-based change data capture in the open-source ecosystem. Its ability to bridge heterogeneous data stores — relational and document databases, on-premises and cloud, OLTP and analytical — without requiring changes to application code or source schemas makes it uniquely well-suited to the polyglot persistence architectures that characterise modern enterprises. When combined with Apache Kafka's durable, scalable event log, it provides a foundation for data pipelines that are simultaneously real-time, resilient, and operationally tractable.

The path from proof-of-concept to production requires careful attention to operational concerns — replication slot management, schema evolution strategy, consumer idempotency, and monitoring — but each of these challenges has well-understood solutions. Organisations that invest in getting these foundations right unlock a genuinely different category of data capability: one where every part of the business reacts to the world as it is, not as it was several hours ago.

At Adyantrix, we design and implement production-grade data engineering pipelines for organisations that need more than generic consulting advice. Whether you are evaluating streaming CDC for the first time, migrating an existing batch ETL estate to Debezium, or troubleshooting replication lag in a pipeline already in production, our data engineering team brings hands-on experience across the full Debezium ecosystem — connector configuration, Kafka topology design, schema registry governance, and observability setup. Get in touch to discuss how we can help your organisation turn its databases into a reliable, real-time data backbone.


