How Walmart uses Apache Kafka in production

Walmart has been running Apache Kafka in production since at least 2016, and the scale of the deployment today reflects nearly a decade of architectural iteration. As of June 2024, Walmart’s Kafka infrastructure processes trillions of messages per day, serves 25,000+ consumers across public and private clouds, and maintains 99.99% availability. The engineering problems Kafka solves at Walmart range from near-real-time search indexing and replenishment planning across 5,000+ stores to fraud detection on every online transaction and a Customer Data Platform ingesting around 40 billion events per day.

Company overview

Walmart is the world’s largest retailer by revenue, operating roughly 10,500 stores across 19 countries and a significant e-commerce business. Its technology organisation, Walmart Global Tech, manages the platform infrastructure that supports in-store systems, e-commerce, supply chain, and customer-facing applications simultaneously.

Kafka arrived at Walmart before 2016, initially on shared bare-metal clusters that replaced Apache Flume for log collection and tracking-pixel feeds. By late 2016, engineers Ning Zhang and Anil Kumar had documented the first architectural shift: moving from shared clusters to a self-service model where teams deployed dedicated clusters via OneOps, Walmart’s internal OpenStack-based platform. That shift established the architectural pattern Walmart has built on ever since: many purpose-built Kafka clusters rather than a single shared monolith.

Key milestones

Walmart’s Kafka use cases

Walmart uses Kafka across multiple systems, each solving a distinct operational problem. The use cases described here are drawn from published engineering posts by named Walmart engineers between 2016 and 2024.

Near-real-time search index

Anil Kumar, a Global eCommerce Engineer at Walmart Labs, described Kafka as “the backbone for our New Near Real Time (NRT) Search Index, where changes are reflected on the site in seconds” in a 2016 Confluent post. The system processes billions of pricing and inventory updates per day. Before Kafka, search index latency was measured in hours; the Kafka-backed pipeline reduced that to seconds.

Item-setup pipeline and microservices backbone

Kafka connects hundreds of microservices in Walmart’s item-setup pipeline, covering product normalisation, offers validation, pricing, inventory, and logistics data. Teams in different geographies operate autonomously because Kafka decouples producers from consumers: each team publishes to its own topics without coordinating with downstream consumers. The architecture uses many small clusters with hundreds of topics rather than one shared cluster, limiting the blast radius of any individual failure.

Real-time inventory management

Suman Pattnaik, Director of Engineering at Walmart, described a real-time inventory system in a 2020 Confluent post that replaced batch-based inventory processes. The system ingests from more than 10 sources of event streaming data and maintains a denormalised canonical view of inventory positions across stores, e-commerce, and distribution centres.

Real-time replenishment

The replenishment platform, described by Pattnaik in a 2022 Confluent post, spans 5,000+ stores, 150+ distribution centres, and 1,000+ vendors across 24+ countries. Change data capture feeds a denormalised view that passes through a planning engine holding inventory positions, forecasts, and safety stocks. The system processed close to 100 million SKUs in under three hours at 85 GB/min throughput as of 2022.
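As a rough sanity check, the figures above can be converted to per-second rates. The sketch below assumes the full three-hour window and decimal gigabytes; the variable names are illustrative only.

```python
# Back-of-envelope conversion of the replenishment figures quoted above.
SKUS = 100_000_000          # "close to 100 million SKUs"
WINDOW_SECONDS = 3 * 3600   # "under three hours", taken as the 3 h upper bound
GB_PER_MINUTE = 85          # quoted throughput

# The run finished in under three hours, so this is a lower bound.
min_skus_per_second = SKUS / WINDOW_SECONDS
gb_per_second = GB_PER_MINUTE / 60

print(f"at least {min_skus_per_second:,.0f} SKUs/s")
print(f"about {gb_per_second:.2f} GB/s")
```

Under those assumptions the planning run sustains upwards of roughly 9,000 SKUs per second at about 1.4 GB/s.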

Customer Data Platform (CDP)

The Customer Backbone team, led by Navinder Pal Singh Brar (Staff Engineer) and Deepak Goyal, built a multi-tenant platform on Kafka Streams that ingests customer activity events (search, add-to-cart, transactions), builds customer identity and state using RocksDB, and triggers ML models (bid models, fraud detection, omnichannel reorder) per-event or in batch. The CDP processes around 40 billion events per day and serves processed knowledge at under 10 ms (95th-percentile) latency.

Fraud detection

Walmart runs fraud detection on every online transaction. The fraud detection application uses Kafka Streams and was the primary driver behind Walmart’s open-source contributions KIP-535 and KIP-562, which are described in the special techniques and challenges sections below.

Event stream analytics

A Druid-based analytics cluster fed from Kafka via Apache Storm ingests roughly 1 billion events per day (2 TB of raw data) and serves sub-second OLAP query latencies, replacing Hadoop/Hive/Presto workflows that previously took hours. Amaresh Nayak documented this pipeline on the Walmart Global Tech Blog in November 2017.

Cassandra CDC

Scott Harvester and Nitin Chhabra documented a pipeline in July 2022 that uses Debezium to capture change data from Cassandra 4.x and publish it to Kafka, supporting high-volume e-commerce databases that generate 60,000 data changes per second.

Scale and throughput

Walmart’s Kafka architecture

Multi-cluster topology

Walmart has consistently favoured many purpose-built clusters over a shared monolith. The early architecture, as of 2016, ran 5 local clusters and 2 aggregation clusters across 7 geographic datacentres. Producers wrote to local clusters only; consumers listened to both local and mirrored topics simultaneously. A patched MirrorMaker that supported complete topic renaming replicated data between sites. The setup was designed to degrade gracefully on datacentre failure and to resume from the point of failure on recovery.
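To make the local-plus-mirrored consumption pattern concrete, here is a minimal sketch of how topic renaming keeps mirrored copies from colliding with local topics. The `<site>.<topic>` naming scheme, the site names, and both function names are assumptions for illustration; the source does not describe the actual renaming convention.

```python
from typing import List

def mirrored_name(topic: str, source_site: str) -> str:
    """Rename a topic for its mirrored copy so names never collide across sites.
    The '<site>.<topic>' scheme is a hypothetical convention."""
    return f"{source_site}.{topic}"

def subscriptions(topic: str, local_site: str, all_sites: List[str]) -> List[str]:
    """Topics a consumer at local_site reads: the local topic plus the
    renamed mirror of the same topic from every other site."""
    remote = [mirrored_name(topic, s) for s in all_sites if s != local_site]
    return [topic] + remote

sites = ["dc-east", "dc-west", "dc-central"]
subs = subscriptions("item-updates", "dc-east", sites)
print(subs)
```

Because producers only ever write to the unprefixed local topic, a consumer can tell from the topic name alone which site a record originated from.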

The item-setup pipeline used a separate “many small clusters with hundreds of topics” model (Anil Kumar, 2016). Each team maintained its own cluster, limiting cross-team coordination and blast radius.

WCNP and hybrid cloud (current)

Consumer applications now run as container images on WCNP (Walmart Cloud Native Platform), an enterprise-grade Kubernetes-based multi-cloud container orchestration framework spanning private cloud and public cloud (Azure and Google Cloud). The MPS architecture introduced in 2024 targets stateless consumer services that auto-scale on WCNP without touching Kafka partition counts.

Kafka Streams as a distributed NoSQL database

The Customer Backbone team uses Kafka Streams not just for stream processing but as the stateful data store for the Customer Data Platform. Each customer profile is maintained in RocksDB. Enriched profiles feed downstream Kafka Streams applications for recommendations and fraud detection, creating a graph of event-driven microservices. This pattern requires custom extensions to make Kafka Streams operationally stable at scale (see Special Techniques below).

Replenishment architecture

The replenishment system uses 18 Kafka brokers with 500+ partitions per topic. CDC feeds a denormalised view that passes through a planning engine. The system operates in an active-passive replication model between datacentres and includes a fallback REST service to a database if Kafka becomes unavailable.

Cassandra CDC pipeline

Debezium 1.9 reads Cassandra 4.x commit logs continuously and publishes changes to Kafka. A Redis + RedisBloom (Bloom filter) deduplication layer, partitioned hourly, handles the 9x record fan-out caused by 3-region, replication-factor-3 deployments. Walmart’s internal Data Acquisition Tool (DAQ) orchestrates the pipeline on Azure.

Druid analytics pipeline

Kafka delivers events to Apache Storm (using Trident), which enriches them via a custom log scraper and writes to Druid. Druid pre-aggregates (rollup) at ingestion and stores data in inverted-index, columnar format for sub-second OLAP queries.
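The rollup step is what keeps query latency low: many raw events collapse into one stored row per time bucket and dimension combination. The sketch below imitates that idea in plain Python. It is not Druid's implementation, and the event fields (`page`, `country`, `bytes`) and the one-minute granularity are assumptions.

```python
from collections import defaultdict

def rollup(events, granularity_seconds=60):
    """Pre-aggregate raw events into one row per (time bucket, dimensions)."""
    rows = defaultdict(lambda: {"count": 0, "bytes": 0})
    for e in events:
        bucket = e["ts"] - (e["ts"] % granularity_seconds)  # truncate timestamp
        key = (bucket, e["page"], e["country"])             # dimension columns
        rows[key]["count"] += 1                             # metric: event count
        rows[key]["bytes"] += e["bytes"]                    # metric: summed bytes
    return dict(rows)

BASE = 1_700_000_040  # an epoch second that falls on a minute boundary
events = [
    {"ts": BASE + 5,  "page": "/cart", "country": "US", "bytes": 120},
    {"ts": BASE + 42, "page": "/cart", "country": "US", "bytes": 80},
    {"ts": BASE + 50, "page": "/home", "country": "US", "bytes": 300},
    {"ts": BASE + 75, "page": "/cart", "country": "US", "bytes": 50},
]
rolled = rollup(events)
print(len(events), "events ->", len(rolled), "rows")
```

Here four raw events become three stored rows; with real traffic, where most events share a bucket and dimensions, the reduction is far larger.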

Producer architecture

For the replenishment system, Suman Pattnaik documented explicit producer tuning in 2022: linger.ms, batch.size (16,000 bytes), and acks=all.

Consumer architecture

The consumer configuration for the replenishment system disables auto-commit and explicitly tunes max.poll.records, max.poll.interval.ms, the session timeout, and the heartbeat interval.
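These settings interact, so it can help to check them together. The sketch below uses real Kafka consumer property names, but every value is an assumption chosen for illustration, since the source does not publish Walmart's values. The two helper functions are hypothetical.

```python
# Illustrative consumer settings. Property names are real Kafka consumer
# configs; the values are assumptions, not Walmart's settings.
consumer_config = {
    "enable.auto.commit": False,      # commit offsets only after processing
    "max.poll.records": 500,
    "max.poll.interval.ms": 300_000,
    "session.timeout.ms": 45_000,
    "heartbeat.interval.ms": 3_000,
}

def per_record_budget_ms(cfg):
    """Average processing time available per record before the poll
    interval expires and the consumer is removed from the group."""
    return cfg["max.poll.interval.ms"] / cfg["max.poll.records"]

def heartbeat_ok(cfg):
    """Kafka's documentation suggests a heartbeat interval no higher than
    one third of the session timeout."""
    return cfg["heartbeat.interval.ms"] * 3 <= cfg["session.timeout.ms"]

print(per_record_budget_ms(consumer_config), "ms per record")
print("heartbeat setting ok:", heartbeat_ok(consumer_config))
```

If average processing time per record approaches the budget, either max.poll.records should come down or max.poll.interval.ms should go up; otherwise slow batches will trigger rebalances.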

As of 2024, many consumer workloads are decoupled from direct Kafka consumption via the Messaging Proxy Service (described below).

Kafka Connect ecosystem

Kafka Connect is used to sink topics to HDFS for long-term storage, with max.poll.interval.ms, max.poll.records, and session.timeout.ms tuned for the HDFS connector. Debezium (acting as a Kafka Connect source) handles Cassandra CDC ingestion.

Special techniques and engineering innovations

Messaging Proxy Service (MPS)

The most significant architectural innovation in Walmart’s recent Kafka history is the Messaging Proxy Service, published by Ravinder Matte, Vilas Athavale, Sid Anand, and colleagues in June 2024.

The standard approach to scaling Kafka consumers is to increase partition count to match parallelism. The problem with this at Walmart’s scale is that consumer group rebalances become frequent and disruptive: pod scaling events, rolling deployments, or processing delays exceeding max.poll.interval.ms all trigger rebalances that cause consumer lag and SLA violations.

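MPS inserts an HTTP proxy layer between Kafka and consumer services. The proxy reads from Kafka partitions itself and forwards messages to the consumer services over HTTP.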

Consumer pods become stateless services that can auto-scale on Kubernetes without triggering Kafka rebalances. Consumer count and partition count are fully decoupled. According to the Walmart engineering team, this eliminated most rebalances in production, with the exception of rare cluster restarts or network events.
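The decoupling can be illustrated with a small in-process simulation. This is a sketch of the general proxy pattern under stated assumptions, not the MPS implementation: there is no Kafka client or HTTP call here, `deliver` stands in for a POST to a stateless consumer service, and offset commits and ordering guarantees are left out entirely.

```python
import queue
import threading

def run_proxy(partitions, deliver, worker_count, buffer_size=100):
    """Drain every partition through a bounded buffer into worker_count
    delivery workers. deliver() stands in for an HTTP POST."""
    pending = queue.Queue(maxsize=buffer_size)   # bounded: applies back-pressure
    delivered = []
    lock = threading.Lock()

    def reader(partition_id, messages):
        # One reader per partition: the only part tied to partition count.
        for offset, msg in enumerate(messages):
            pending.put((partition_id, offset, msg))

    def writer():
        while True:
            item = pending.get()
            if item is None:                     # shutdown signal
                return
            partition_id, offset, msg = item
            deliver(msg)
            with lock:
                delivered.append((partition_id, offset))

    writers = [threading.Thread(target=writer) for _ in range(worker_count)]
    readers = [threading.Thread(target=reader, args=(p, m))
               for p, m in partitions.items()]
    for t in writers + readers:
        t.start()
    for t in readers:
        t.join()
    for _ in writers:
        pending.put(None)                        # one stop signal per writer
    for t in writers:
        t.join()
    return sorted(delivered)

partitions = {0: ["a", "b", "c"], 1: ["d", "e"]}
seen = []
result = run_proxy(partitions, seen.append, worker_count=8)
print(result)
```

Two partitions are served by eight delivery workers, and changing worker_count never changes who reads which partition, which is why scaling the consumer side no longer causes a rebalance.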

Cold Bootstrap for Kafka Streams recovery

Deepak Goyal presented this technique at Kafka Summit NYC 2019. Kafka Streams recovers stateful tasks by replaying changelog topics, but for the Customer Data Platform’s large RocksDB stores, those changelogs held gigabytes of data, making standby recovery very slow.

Cold Bootstrap replaces changelog replay with a direct RocksDB snapshot copy: when a node fails, the standby copies the active node’s RocksDB snapshot directly via JSch, then resumes from the saved offset. This achieves zero event loss with significantly faster recovery, and eliminates the need for indefinite changelog topic retention.
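The difference between the two recovery paths can be sketched with a toy state store. Everything below is a simplified model under assumptions: a Python dict stands in for RocksDB, a list stands in for the event log, and the function names are hypothetical. The file transfer step (JSch in the source) is reduced to a deep copy.

```python
import copy

def apply_event(state, event):
    """Fold one event into keyed state (here: a running count per customer)."""
    key = event["customer"]
    state[key] = state.get(key, 0) + 1

def replay(log, start_offset=0, state=None):
    """Rebuild state by applying log[start_offset:]; returns (state, next_offset)."""
    state = {} if state is None else state
    for event in log[start_offset:]:
        apply_event(state, event)
    return state, len(log)

def take_snapshot(state, next_offset):
    """Active node: capture the state together with the offset it reflects."""
    return {"state": copy.deepcopy(state), "offset": next_offset}

def cold_bootstrap(snapshot, log):
    """Standby: start from the copied snapshot, then apply only newer events."""
    return replay(log, snapshot["offset"], copy.deepcopy(snapshot["state"]))

log = [{"customer": c} for c in "abacab"]
active_state, offset = replay(log[:4])        # active has processed 4 events
snap = take_snapshot(active_state, offset)
recovered, _ = cold_bootstrap(snap, log)      # only the last 2 events are applied
full, _ = replay(log)                         # what a full replay would produce
print(recovered == full)
```

Storing the offset alongside the snapshot is what makes the result identical to a full replay: the standby knows exactly where the copied state ends and the live stream begins.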

Dynamic repartitioning in Kafka Streams

Goyal’s team added support for elastic repartitioning of Kafka Streams state across new partitions at runtime, enabling scaling to any number of partitions and nodes without a full restart. This was not available in stock Kafka Streams at the time.

Rack/AZ-aware task assignment

Active and standby Kafka Streams tasks for the same partition are explicitly assigned to different racks or availability zones, improving resilience without additional hardware.
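A minimal sketch of the placement rule follows. It is not the Kafka Streams assignor or Walmart's extension; the round-robin strategy, node names, and rack labels are assumptions, and the only property it aims to show is that a task's active and standby copies never share a rack.

```python
def assign_tasks(tasks, nodes):
    """nodes maps node name -> rack. Returns task -> (active, standby),
    with the two nodes always in different racks."""
    if len(set(nodes.values())) < 2:
        raise ValueError("need nodes in at least two racks")
    names = sorted(nodes)
    assignment = {}
    for i, task in enumerate(tasks):
        active = names[i % len(names)]                       # round-robin actives
        # Candidates in any other rack, rotated so standbys spread out too.
        others = [n for n in names if nodes[n] != nodes[active]]
        standby = others[i % len(others)]
        assignment[task] = (active, standby)
    return assignment

nodes = {"n1": "az-a", "n2": "az-a", "n3": "az-b", "n4": "az-c"}
tasks = ["0_0", "0_1", "0_2", "0_3", "0_4"]
plan = assign_tasks(tasks, nodes)
for task, (active, standby) in plan.items():
    print(task, active, nodes[active], "->", standby, nodes[standby])
```

With this constraint, losing a whole rack or availability zone can take out either copy of a task's state but never both.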

Availability-first Kafka Streams (KIP-535 and KIP-562)

Navinder Pal Singh Brar presented this work at Kafka Summit 2020. The fraud detection application requires Kafka Streams reads on every transaction, but Kafka Streams’ default behaviour blocks reads during consumer rebalances. At Walmart’s transaction volume, this was incompatible with latency requirements.

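Walmart contributed two KIPs to the Apache Kafka project: KIP-535, which allows state stores to serve stale reads during a rebalance, and KIP-562, which allows a key to be fetched from a single partition rather than iterating over all the stores on an instance.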

Both were merged into Apache Kafka. Brar holds four US patents related to Kafka Streams and was named a Confluent Community Catalyst.
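The availability-first idea can be sketched as a routing decision made by the caller: prefer the active copy, and fall back to a standby if it is fresh enough. The code below is a conceptual model only and does not use the Kafka Streams API; the replica dictionaries, field names, and `max_lag` threshold are assumptions.

```python
def choose_replica(replicas, max_lag):
    """Pick where to send a read.

    replicas: list of dicts with 'host', 'role' ('active' or 'standby'),
    'available' (False while restoring or rebalancing) and 'lag'
    (records behind the changelog). Prefers the active copy, otherwise
    the freshest standby within max_lag. Returns None if nothing qualifies.
    """
    for r in replicas:
        if r["role"] == "active" and r["available"]:
            return r
    standbys = [r for r in replicas
                if r["role"] == "standby" and r["available"] and r["lag"] <= max_lag]
    return min(standbys, key=lambda r: r["lag"], default=None)

during_rebalance = [
    {"host": "node-1", "role": "active",  "available": False, "lag": 0},
    {"host": "node-2", "role": "standby", "available": True,  "lag": 120},
    {"host": "node-3", "role": "standby", "available": True,  "lag": 15},
]
picked = choose_replica(during_rebalance, max_lag=100)
print(picked["host"] if picked else "no replica within lag bound")
```

The lag bound is where the consistency trade-off becomes explicit: a larger bound means more reads succeed during a rebalance, at the cost of staler answers.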

Custom partitioner for dirty-write prevention

The replenishment system uses a custom partitioner based on a murmur hash function, ensuring each item-store combination always lands on a single partition and is consumed by a single consumer. This prevents database deadlocks from concurrent writes that would occur if the same item-store pair were processed by multiple consumers simultaneously.
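The routing property can be sketched as follows. The hash below follows the 32-bit MurmurHash2 variant found in Kafka's Java client; whether Walmart's partitioner uses that exact variant is an assumption, as are the `item:store` key format and the function names.

```python
def murmur2(data: bytes) -> int:
    """32-bit MurmurHash2, following the variant in Kafka's Java client."""
    seed, m, r = 0x9747B28C, 0x5BD1E995, 24
    mask = 0xFFFFFFFF
    length = len(data)
    h = (seed ^ length) & mask
    for i in range(0, length - length % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = (h * m) & mask
        h ^= k
    tail = data[length - length % 4:]        # 0 to 3 leftover bytes
    if len(tail) == 3:
        h ^= tail[2] << 16
    if len(tail) >= 2:
        h ^= tail[1] << 8
    if len(tail) >= 1:
        h ^= tail[0]
        h = (h * m) & mask
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h

def partition_for(item_id: str, store_id: str, num_partitions: int) -> int:
    """Map an item-store pair to exactly one partition."""
    key = f"{item_id}:{store_id}".encode("utf-8")   # hypothetical key format
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions

p1 = partition_for("sku-123", "store-42", 500)
p2 = partition_for("sku-123", "store-42", 500)
print(p1 == p2)
```

Since the partition depends only on the key bytes and the partition count, every update for a given item-store pair is serialised through one partition and therefore one consumer, as long as the partition count stays fixed.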

Bloom-filter deduplication for Cassandra CDC

Each Cassandra write in a 3-region, RF=3 cluster generates 9 CDC records. Out-of-order delivery and partial column updates compound the problem. Walmart uses Redis-backed RedisBloom Bloom filters partitioned hourly, with error rates tested at 1 in 1 million, to deduplicate records before they reach downstream consumers. Production configuration achieved 623,000 deduplicated records per minute across 3 nodes.
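The deduplication idea can be sketched with an in-memory Bloom filter per hour. This is a stand-in for RedisBloom, not its API: the filter size, hash construction, and record identifier are all assumptions, and the class names are hypothetical.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add_if_new(self, item: str) -> bool:
        """Set the item's bits; return True if it was not already present."""
        new = False
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                new = True
                self.bits[byte] |= 1 << bit
        return new

class HourlyDeduplicator:
    def __init__(self):
        self.filters = {}   # hour bucket -> BloomFilter

    def is_first_copy(self, record_id: str, epoch_seconds: int) -> bool:
        hour = epoch_seconds // 3600
        if hour not in self.filters:
            self.filters[hour] = BloomFilter()
        return self.filters[hour].add_if_new(record_id)

dedup = HourlyDeduplicator()
copies = ["write-42"] * (3 * 3)   # 3 regions x RF 3 = 9 CDC records
passed = [c for c in copies if dedup.is_first_copy(c, 1_700_000_000)]
print(len(copies), "copies ->", len(passed), "forwarded")
```

Two limitations of this sketch are worth noting: a Bloom filter false positive drops a genuine record, which is why the tested error rate matters, and copies of one write that straddle an hour boundary land in different filters and are not deduplicated here.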

Walmart also enhanced Debezium to process Cassandra commit logs without waiting for log file completion, reducing CDC latency further.

Operating Kafka at scale

Deployment model

Walmart’s Kafka infrastructure spans self-managed clusters on private cloud (via OneOps historically, WCNP currently) and public cloud (Azure and Google Cloud). The deployment is hybrid: Kafka brokers remain within Walmart’s infrastructure while consumer workloads run on WCNP across clouds.

Monitoring stack (2016)

Ning Zhang documented the monitoring stack in December 2016. Walmart forked Yahoo’s Kafka Manager and added custom monitoring graphs. jmxtrans bridges Kafka JMX metrics to Graphite and Ganglia backends; Grafana is layered on top of Graphite for dashboards. Alerts covered: Kafka process down, under-replicated partitions, leader loss, low disk space, and high CPU utilisation.

No sourced material describes how the monitoring stack evolved post-2016 when Walmart moved to WCNP and Kubernetes.

Producer and consumer tuning

For the replenishment system, Pattnaik documented explicit tuning of linger.ms, batch.size (16,000 bytes), and acks=all on the producer side, with max.poll.records, max.poll.interval.ms, and session timeout configured on the consumer side.
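As a sketch, the producer side of that tuning might be written down like this. The acks and batch.size values come from the text above; the linger.ms value is an assumption, because the source says it was tuned but not to what. The helper functions and the 500-byte record size are hypothetical.

```python
producer_config = {
    "acks": "all",          # from the text: wait for all in-sync replicas
    "batch.size": 16_000,   # from the text: bytes per partition batch
    "linger.ms": 10,        # ASSUMPTION: tuned, but the value is not published
}

def to_properties(cfg):
    """Render a config mapping as Java-style .properties lines."""
    return "\n".join(f"{k}={v}" for k, v in sorted(cfg.items()))

def max_records_per_batch(cfg, avg_record_bytes):
    """Rough upper bound on records per batch, ignoring batch overhead
    and compression."""
    return cfg["batch.size"] // avg_record_bytes

print(to_properties(producer_config))
print(max_records_per_batch(producer_config, 500), "records of ~500 bytes per batch")
```

Raising linger.ms gives the producer more time to fill each batch, trading a little latency for fewer, larger requests; acks=all trades latency for durability.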

Broker hardware

For the inventory system, Suman Pattnaik specified that brokers use multiple disks with RAID configurations for log.dirs storage, with consumer counts aligned to partition counts and CPU/memory maintained at around 50% utilisation headroom.

Developer experience

The self-service OneOps model (documented by Ning Zhang in 2016) gave teams a GUI for deploying dedicated Kafka clusters, covering operations, monitoring, auto-repair, auto-replacement, and auto-scaling. Anil Kumar’s team built internal tooling for tracking pipeline flows, SLA metrics, message send/receive latencies per producer and consumer, and alerting on backlogs and throughput degradation. A Kafka REST proxy was deployed so that non-JVM services could produce and consume without native Kafka client libraries.

Challenges and how they solved them

Consumer rebalancing causing lag at scale

Ravinder Matte’s team described consumer group rebalancing as “the most common challenge in operationalising Kafka consumers at scale” in the June 2024 post. At Walmart’s scale — 25,000+ consumers — rebalances triggered by pod scaling, rolling deployments, or processing delays exceeding max.poll.interval.ms caused high consumer lag that disrupted SLAs.

Solution: The Messaging Proxy Service decouples Kafka partition consumption from consumer services. Most rebalances are now eliminated except for rare cluster restarts or network events.

Slow Kafka Streams standby recovery

Changelog topics holding gigabytes of state for the Customer Data Platform caused very slow recovery when a Kafka Streams node failed, as the standby had to replay the entire log.

Solution: Cold Bootstrap — the standby copies the active node’s RocksDB snapshot directly via JSch, then resumes from the saved offset. This avoids replaying the changelog, achieves zero event loss, and eliminates the need for indefinite changelog retention. (Deepak Goyal, Kafka Summit NYC 2019.)

Inability to elastically scale Kafka Streams

Kafka Streams lacked native repartitioning support, which prevented the Customer Data Platform from scaling its stateful applications without full restarts.

Solution: Walmart added dynamic repartitioning to Kafka Streams, redistributing state across new partitions at runtime. This was later contributed to the community via the KIP process. (Deepak Goyal, Kafka Summit NYC 2019.)

Availability vs. consistency in fraud detection

Kafka Streams’ default behaviour blocks reads during consumer rebalances. For a fraud detection application running on every transaction, this was incompatible with latency SLAs.

Solution: Walmart contributed KIP-535 and KIP-562 to Apache Kafka, explicitly choosing availability over strict consistency for this workload. “You’re basically trading consistency for availability,” Brar stated in a TechTarget interview. (Navinder Pal Singh Brar, Kafka Summit 2020.)

Shared cluster resource contention

Early shared bare-metal clusters suffered from capacity competition between tenants, no authentication or isolation, and buggy clients exhausting resources.

Solution: Migrated to dedicated clusters via OneOps, giving each team its own Kafka cluster with GUI-driven operations, auto-repair, and auto-scaling. (Ning Zhang, November 2016.)

Cassandra CDC record fan-out

A single Cassandra write in a 3-region, RF=3 deployment generates 9 CDC records, arriving out of order and containing only changed columns.

Solution: Debezium 1.9 in continuous log processing mode, combined with Redis-backed RedisBloom Bloom filters partitioned hourly, deduplicates records at 60,000 changes per second in production. Walmart also enhanced Debezium to process commit logs without waiting for file completion. (Scott Harvester and Nitin Chhabra, July 2022.)

Poison-pill messages causing head-of-line blocking

Unprocessable messages blocked partition consumption in direct consumer deployments.

Solution: The Messaging Proxy Service routes failed messages to a Dead Letter Queue with retry logic, isolating them from the main processing path. (Ravinder Matte et al., June 2024.)
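The retry-then-park pattern can be sketched in a few lines. This is a generic illustration, not MPS code: the retry count, the in-memory dead letter list, and the function names are assumptions, and a real deployment would publish failures to a separate Kafka topic.

```python
def process_with_dlq(messages, handler, max_attempts=3):
    """Try each message up to max_attempts; park persistent failures in a
    dead letter list so later messages are not blocked."""
    processed, dead_letters = [], []
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
                processed.append(msg)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    dead_letters.append(
                        {"message": msg, "error": repr(exc), "attempts": attempt}
                    )
    return processed, dead_letters

def handler(msg):
    # Hypothetical handler that can never process one particular payload.
    if msg == "poison":
        raise ValueError("cannot parse")

ok, dlq = process_with_dlq(["a", "poison", "b"], handler)
print(ok, [d["message"] for d in dlq])
```

The message after the poison pill is still processed, which is the whole point: the failure is recorded for later inspection instead of stalling the partition.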

Full tech stack

Category | Tools | Notes
Message broker | Apache Kafka | Kafka 0.10.1.0 documented in 2016; current version not specified in sourced material
Stream processing | Kafka Streams, Apache Storm (Trident), Apache Spark Streaming | Kafka Streams: Customer Data Platform, fraud detection, inventory; Storm: Druid ingestion; Spark: inventory and A/B testing
State store | RocksDB | Local state store for Kafka Streams applications in the CDP
Connectors | Kafka Connect (HDFS sink), Debezium 1.9 (Cassandra CDC source) | HDFS connector for long-term storage; Debezium enhanced for continuous log processing
Cross-datacentre replication | Kafka MirrorMaker (patched) | Internal patches added complete topic renaming to prevent name collisions across sites (2016)
HTTP interface | Kafka REST Proxy | Enables non-JVM producers and consumers (2016)
Analytics | Apache Druid | Sub-second OLAP queries on roughly 1 billion events/day; pre-aggregation at ingestion
Operational database | Apache Cassandra | Primary operational store for inventory and customer data; CDC source for Debezium
Deduplication | Redis, RedisBloom | Bloom-filter deduplication for Cassandra CDC fan-out at 60,000 changes/second
Storage sinks | Apache HDFS, Apache Hudi | HDFS via Kafka Connect for long-term retention; Hudi for lakehouse table format (as of 2023)
Monitoring (2016) | Kafka Manager (Yahoo fork, enhanced), jmxtrans, Graphite, Ganglia, Grafana | Custom monitoring graphs added to Kafka Manager; Grafana layered on Graphite
Cluster coordination | Apache ZooKeeper | Referenced in 2019 material; current coordination mechanism not specified
Deployment / infra (legacy) | OneOps (OpenStack-based) | Self-service Kafka cluster deployment with GUI, auto-repair, and auto-scaling (pre-WCNP)
Deployment / infra (current) | WCNP (Walmart Cloud Native Platform) | Kubernetes-based multi-cloud orchestration spanning Azure and Google Cloud
Internal proxy | Messaging Proxy Service (MPS) | Internal HTTP proxy decoupling Kafka consumption from consumer pod scaling (2024)
Client libraries | Spring Kafka, Akka Streams | Spring Kafka in replenishment system; Akka Streams for early reactive microservice consumption (2016)
Public cloud | Azure, Google Cloud | Both used for WCNP workloads; Azure hosts Cassandra CDC pipeline

Key contributors

Key takeaways for your own Kafka implementation

Sources and further reading

  1. Ning Zhang, “Tech Transformation: Real-time Messaging at Walmart,” Walmart Global Tech Blog, November 2016
  2. Ning Zhang, “Kafka Ecosystem on Walmart’s Cloud,” Walmart Global Tech Blog, December 2016
  3. Anil Kumar, “Apache Kafka for Item Setup at Walmart,” Confluent Blog (Walmart Labs), October 2016
  4. Amaresh Nayak, “Event Stream Analytics at Walmart with Druid,” Walmart Global Tech Blog, November 2017
  5. Suman Pattnaik and Prasanna Subburaj, “When Kafka Meets the Scaling and Reliability Needs of World’s Largest Retailer,” Kafka Summit SF 2019
  6. Deepak Goyal, “Kafka Streams at Scale: Walmart’s Approach,” Kafka Summit NYC 2019
  7. Navinder Pal Singh Brar, “Multi-tenant Kafka Streams CDP at Walmart,” Strata Data Conference New York, September 2019
  8. Suman Pattnaik, “Real-time Inventory Management at Walmart Using Kafka,” Confluent Blog, May 2020
  9. Navinder Pal Singh Brar, “Availability-first Kafka Streams at Walmart Scale,” Kafka Summit 2020
  10. Sean Michael Kerner (citing Navinder Pal Singh Brar), “Kafka Users Northrop Grumman, Walmart Highlight Event Streaming,” TechTarget, August 2020
  11. Suman Pattnaik, “How Walmart Uses Kafka for Real-time Omnichannel Replenishment,” Confluent Blog, May 2022
  12. Scott Harvester and Nitin Chhabra, “Walmart’s Cassandra CDC Solution,” Walmart Global Tech Blog, July 2022
  13. Samuel Guleff, “Lakehouse at Fortune 1 Scale,” Walmart Global Tech Blog, May 2023
  14. Ravinder Matte, Vilas Athavale, Sid Anand et al., “Reliably Processing Trillions of Kafka Messages Per Day,” Walmart Global Tech Blog, June 2024
