How Apple uses Apache Kafka in production

Apple runs Apache Kafka as a shared internal platform serving multiple engineering teams across the company. Since at least 2018, Apple’s engineers have presented at Kafka Summit on the operational realities of running Kafka as a managed, multi-tenant service. Their publicly documented work spans partition management at scale, a custom zero-data-movement balancing algorithm, Kubernetes-native tiered storage, and an ongoing migration from password-based authentication to mTLS.

Company overview

Apple designs hardware, software, and services including the iPhone, iPad, Mac, Apple Watch, iCloud, Apple Music, Apple TV+, and the App Store. Those services operate at a scale that requires low-latency, fault-tolerant data infrastructure: iCloud alone serves hundreds of millions of users, while the App Store processes purchases across more than 175 countries.

Kafka sits within the data infrastructure layer that keeps these services fed with real-time event data. The iCloud Data organisation, for example, uses stream-processing technologies that include Kafka for building scalable data pipelines. Apple began operating Kafka as a multi-tenant internal service no later than 2018, when its engineers first presented on the architecture at Kafka Summit London.

Key Kafka milestones:

Apple’s Kafka use cases

Apple’s Kafka deployment centres on internal data infrastructure rather than a single named product use case. Their engineering talks describe Kafka as a platform shared across teams, not as a per-product deployment.

The iCloud Data organisation cites Kafka as part of the technology set used to build scalable, timely data pipelines for internal teams that ship product features. Apple’s job postings for iCloud data engineering roles reference Kafka alongside Apache Flink, Kafka Streams, and Apache Spark Streaming, suggesting that Kafka is used for real-time data movement between services as well as stream processing.

The Kafka as a Service model Apple described in 2018 implies a broad adoption pattern: rather than individual teams standing up their own clusters, a central platform team manages shared clusters, and product teams onboard as tenants.

Scale and throughput

Apple does not publish aggregate cluster statistics publicly. The figures available come directly from conference presentations:

Apple’s Kafka architecture

Apple’s Kafka clusters run on Kubernetes using the Strimzi operator on AWS EKS. Broker nodes use Amazon EBS volumes for local storage. Helm is used to manage Strimzi deployments across environments. This Kubernetes-native approach means cluster lifecycle operations, scaling, and upgrades are managed through Kubernetes custom resources rather than direct broker administration.

Since February 2024, Apple has operated Kafka with tiered storage (KIP-405) in production. Older log segments are offloaded to Amazon S3, separating data retention from local broker capacity. The Remote Log Metadata Manager (RLMM) uses an internal Kafka client secured with SSL/TLS using a PKCS12 keystore and truststore, configured via properties under the rlmm.config. prefix (rsm.config. is the corresponding prefix for the remote storage manager).

Security is layered. Apple has operated Kafka with SASL authentication and ACL-based authorisation since at least 2018. As of 2025, the platform team is actively migrating toward mTLS using short-lived X.509 certificates, driven by the security advantages of certificate-based identity over long-lived passwords in a multi-tenant environment.

Producer and consumer architecture

Apple’s partition placement algorithm (described below) is built on the assumption that data ingestion workloads assign events randomly to partitions with no ordering requirements, and that all partitions from a topic are consumed evenly. This is consistent with a high-throughput event ingestion pattern where producers distribute load across partitions and consumers process each partition independently.

Consumer lag monitoring is handled by Burrow, which is deployed alongside CMAK for cluster administration.
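For readers new to the metric, consumer lag on a partition is the gap between the newest offset written to the log and the offset the consumer group has committed. The following is a minimal sketch of that arithmetic, not Burrow's API: the function name and the plain-dictionary inputs are hypothetical, and treating a partition with no committed offset as starting from offset 0 is a simplifying assumption.

```python
from typing import Dict


def partition_lag(log_end_offsets: Dict[int, int],
                  committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Return per-partition lag: log end offset minus committed offset.

    Both arguments map partition number -> offset. This is an
    illustration of the metric only; it does not talk to Kafka.
    """
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: no committed offset means nothing has been consumed,
        # and the log is taken to start at offset 0.
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero so a stale end offset never yields negative lag.
        lag[partition] = max(end_offset - committed, 0)
    return lag


if __name__ == "__main__":
    lag = partition_lag({0: 100, 1: 50}, {0: 90})
    print(lag, "total:", sum(lag.values()))
```

Burrow itself goes further than a single snapshot like this: it evaluates a consumer group's lag over a sliding window of commits, so a group is flagged on its trend and not on one momentary value.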

Operating model

The platform is provided to internal teams as a managed service. This means the central Kafka team is responsible for cluster provisioning, upgrades, monitoring, and operational tooling, while product teams interact with the platform as tenants. Multi-tenancy brings authentication, authorisation, quota management, and topic namespace concerns, all of which Apple has addressed in successive talks.

Special techniques and engineering innovations

Zero-data-movement cluster balancing

Apple engineers Haochen Li and Yaodong Yang developed a partition placement strategy that achieves optimal cluster load balancing without moving any data between brokers. Traditional rebalancing tools, including Cruise Control, rebalance clusters by reassigning partitions, which involves copying data from one broker to another. Apple’s approach avoids that cost entirely by placing partitions correctly at the point of creation.

The algorithm enforces two properties across every topic: the number of replicas per broker is equal, and the number of leader replicas per broker is equal. The formula applied is partition_count = scale_number * broker_count, where scale_number is the number of leader replicas per broker for a topic. This is applied at topic creation, when partition count changes, and when brokers are added to a cluster.
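To make the two balance properties concrete, here is a minimal sketch of a placement that satisfies them. It assumes a simple round-robin ring, which is an illustrative choice and not necessarily how Apple's Topic Operator assigns replicas; the function name and parameters are hypothetical. The first broker in each replica list is treated as the preferred leader, as in Kafka's own assignment format.

```python
from collections import Counter
from typing import Dict, List


def place_partitions(broker_ids: List[int], scale_number: int,
                     replication_factor: int) -> Dict[int, List[int]]:
    """Return {partition: [leader, follower, ...]} for one topic.

    partition_count = scale_number * broker_count, so every broker ends
    up leading exactly scale_number partitions.
    """
    broker_count = len(broker_ids)
    if replication_factor > broker_count:
        raise ValueError("replication_factor cannot exceed broker count")
    partition_count = scale_number * broker_count

    assignment = {}
    for partition in range(partition_count):
        # Assumption: leaders rotate round-robin, and followers are the
        # next brokers around the ring.
        assignment[partition] = [
            broker_ids[(partition + offset) % broker_count]
            for offset in range(replication_factor)
        ]
    return assignment


if __name__ == "__main__":
    plan = place_partitions([1, 2, 3], scale_number=2, replication_factor=2)
    leaders = Counter(replicas[0] for replicas in plan.values())
    replicas = Counter(b for r in plan.values() for b in r)
    print("leaders per broker:", dict(leaders))
    print("replicas per broker:", dict(replicas))
```

Because the partition count is an exact multiple of the broker count, the ring wraps a whole number of times, and each broker holds scale_number leaders and scale_number * replication_factor replicas in total.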

The strategy is implemented as a Topic Operator and has been deployed to production. The result, as described by Haochen Li and Yaodong Yang, was well-balanced clusters with measurable storage cost savings and no performance degradation. The team submitted a Kafka Improvement Proposal (KIP) to upstream the approach to the Apache Kafka project.

Tiered storage on Strimzi

Apple ran a structured integration effort to bring Kafka tiered storage (KIP-405) into production on their Strimzi-managed clusters. The work began in February 2023 with an evaluation using the KIP-405 author’s Kafka 2.8.x development branch and progressed through a series of phases:

The integration required upgrading Strimzi from 0.27.x to 0.38.x and Java from 11 to 17. Apple contributed native tiered storage support back to the Strimzi project, which was merged in Strimzi 0.40.0. This means you can now configure a custom remote storage manager in Strimzi through the standard Kafka custom resource, a capability that originated from Apple’s production work.

Performance tuning for tiered storage required several configuration adjustments: max.fetch.wait.time was increased to tolerate the additional latency of remote object fetches; remote.log.reader.threads and thread pool sizes were tuned for throughput; log segment sizes were optimised to keep RLMM metadata overhead manageable; and S3 transfers were parallelised, using multipart uploads for writes and ranged fetches for reads.

mTLS migration

Gaurav Narula from Apple described a hybrid migration strategy for moving from SASL password authentication to mTLS. The challenge with any mTLS migration in a large shared platform is that you cannot require all clients to switch simultaneously without breaking production workloads.

Apple’s approach adds mTLS support directly to the SASL listener, so both authentication methods are accepted on the same port. Existing clients continue to authenticate with passwords while new or migrated clients present X.509 certificates. KafkaPrincipal identities are preserved across both methods, which means ACLs and quota configurations do not need to change when a client migrates. For inter-broker communication, each broker is configured with distinct server and client certificates.
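The point about preserved identities can be illustrated with a small sketch. The talk summary does not say how Apple derives a principal from a certificate, so the rule below (use the certificate's CN as the user name) is an assumption, and both function names are hypothetical. On a real broker this kind of mapping is expressed through the ssl.principal.mapping.rules setting; the Python here only mimics the idea.

```python
import re

# Matches the CN attribute of a distinguished name such as
# "CN=orders-service,OU=payments,O=Example".
# Simplification: escaped commas inside attribute values are not handled.
_CN_PATTERN = re.compile(r"(?:^|,)\s*CN=([^,]+)")


def principal_from_sasl(username: str) -> str:
    """Principal string for a client that authenticated with a password."""
    return f"User:{username}"


def principal_from_certificate(subject_dn: str) -> str:
    """Principal string for a client that presented an X.509 certificate.

    Assumption: the certificate CN carries the same name the client used
    as its SASL user name.
    """
    match = _CN_PATTERN.search(subject_dn)
    if match is None:
        raise ValueError(f"no CN attribute in subject: {subject_dn!r}")
    return f"User:{match.group(1).strip()}"


if __name__ == "__main__":
    before = principal_from_sasl("orders-service")
    after = principal_from_certificate("CN=orders-service,OU=payments,O=Example")
    # Same principal either way, so ACLs and quotas keyed on it still apply.
    print(before, after, before == after)
```

When both paths resolve to the same principal string, a client can switch from passwords to certificates without any change to the ACLs or quotas attached to that principal.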

The use of short-lived X.509 certificates addresses two specific weaknesses of SASL passwords: credential leak impact is bounded by certificate lifetime, and certificate-based authentication is not vulnerable to brute-force attacks.

Operating Kafka at scale

Kubernetes-native operations: Strimzi handles cluster provisioning, rolling upgrades, and configuration changes through Kubernetes custom resources. This removes the need for manual broker-level administration and integrates Kafka lifecycle management with existing container platform tooling.

Monitoring and observability: Burrow provides consumer lag monitoring. CMAK provides a cluster administration interface. Both run alongside the Strimzi operator in the Kafka platform stack.

Cluster rebalancing: Cruise Control is deployed for automated partition load balancing, though Apple’s zero-data-movement algorithm addresses a significant portion of imbalance at topic-creation time rather than reactively.

Partition reassignment at scale: Noa Resare’s 2019 talk described the operational challenge of managing large-scale partition reassignments: tracking progress, understanding the impact on producers and consumers during broker restarts, and applying debugging and mitigation strategies when reassignments do not proceed as expected. At large partition counts, these operations become significant engineering events rather than routine maintenance.

Upgrade strategy: The tiered storage work illustrates Apple’s approach to major upgrades: start with an evaluation phase on a development branch, build a prototype integration, run dogfooding before committing to production, and execute the production rollout as a distinct step. The Kafka 2.8 to 3.6 upgrade and the Strimzi 0.27 to 0.38 jump were managed as part of the tiered storage project rather than independently.

Challenges and how they solved them

Cluster imbalance without data movement

As Kafka clusters grow or traffic patterns shift, broker resource utilisation becomes uneven. CPU, disk, and leader distribution diverge, leading to over-provisioned capacity on some brokers and bottlenecks on others. The conventional remedy is partition reassignment, but at scale this involves copying large volumes of data between brokers, creating I/O pressure and operational risk.

Apple developed the zero-data-movement placement algorithm to solve this at the source. Enforcing balanced placement at topic-creation time largely avoids runtime imbalance. When imbalance would otherwise arise, during cluster expansion or partition scaling, the same algorithm is applied to restore balance without data movement.

Cost of EBS-backed data retention

Storing all log data on EBS volumes attached to broker nodes means storage costs scale directly with retention windows. For teams that need to retain large event histories for backfill, reprocessing, or compliance, this becomes a significant expense. Tiered storage decouples retention from broker capacity: older segments are offloaded to S3, which is substantially cheaper than EBS, while brokers remain sized for hot data and throughput.
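A rough sizing calculation shows why this matters. The sketch below is a back-of-the-envelope estimate under stated assumptions, not Apple's capacity model: it ignores compression, index files, and the active segment, assumes a steady ingest rate, and counts the remote tier as a single copy of the data. The function, its parameters, and the example numbers are all hypothetical. In Kafka terms, retention_hours corresponds to retention.ms and local_retention_hours to local.retention.ms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StorageEstimate:
    broker_bytes: int   # total across all broker volumes, replicas included
    remote_bytes: int   # object storage, counted as one copy


def estimate_storage(ingress_bytes_per_sec: int,
                     replication_factor: int,
                     retention_hours: int,
                     local_retention_hours: Optional[int] = None) -> StorageEstimate:
    """Estimate steady-state storage with or without tiered storage.

    With local_retention_hours=None, all retained data lives on brokers.
    Otherwise brokers keep only the local window and the remote tier
    holds up to the full retention window.
    """
    bytes_per_hour = ingress_bytes_per_sec * 3600
    if local_retention_hours is None:
        broker = bytes_per_hour * retention_hours * replication_factor
        return StorageEstimate(broker_bytes=broker, remote_bytes=0)
    broker = bytes_per_hour * local_retention_hours * replication_factor
    remote = bytes_per_hour * retention_hours
    return StorageEstimate(broker_bytes=broker, remote_bytes=remote)


if __name__ == "__main__":
    # Hypothetical workload: 10 MB/s ingest, RF 3, 7 days of retention.
    print(estimate_storage(10_000_000, 3, 168))
    # Same workload with 6 hours kept on broker volumes.
    print(estimate_storage(10_000_000, 3, 168, local_retention_hours=6))
```

The broker-side figure scales with the local window multiplied by the replication factor, while the remote figure scales with the full retention window only once, which is where the saving comes from.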

The challenge Apple faced was that Strimzi did not natively support tiered storage configuration at the time. Solving this required building the integration, testing it across multiple Kafka versions, and contributing the result back to Strimzi before production deployment.

Password-based authentication in a multi-tenant service

In a shared Kafka platform with many producer and consumer clients, SASL password credentials create a persistent risk surface. Long-lived credentials can be leaked and remain valid indefinitely. They are also vulnerable to brute-force attacks in a way that certificate-based authentication is not.

The transition to mTLS sharply reduces these risks, since a leaked certificate is only useful until it expires, but migrating a large multi-tenant platform is not straightforward: you cannot break existing clients while switching authentication mechanisms. Apple’s hybrid SASL-plus-mTLS approach on a shared listener resolves this by making the migration incremental and non-disruptive.

Large-scale partition reassignment

At the partition counts Apple operates, reassigning partitions, whether to recover from broker imbalance, decommission a broker, or respond to capacity changes, becomes a complex operational exercise. Tracking progress across many concurrent reassignments, managing the impact on producers and consumers, and debugging failures when they occur requires both tooling and operational discipline. Noa Resare’s 2019 talk shared the lessons Apple accumulated from running these operations repeatedly at scale.

Full tech stack

| Category | Tools | Notes |
| --- | --- | --- |
| Message broker | Apache Kafka 3.6+ | Core streaming platform; tiered storage available from 3.6 |
| Kafka operator | Strimzi | Kubernetes-native cluster lifecycle management on EKS |
| Container platform | AWS EKS | Kubernetes runtime for broker and operator workloads |
| Local broker storage | Amazon EBS | Block storage attached to broker nodes |
| Tiered (remote) storage | Amazon S3 (via KIP-405 plugin) | Cost-effective long-term event retention |
| Packaging | Helm | Strimzi deployment management |
| Cluster rebalancing | Cruise Control | Automated partition load balancing |
| Consumer lag monitoring | Burrow | Consumer group lag visibility |
| Cluster administration | CMAK | Kafka cluster management UI |
| Authentication | mTLS (X.509 / PKCS12), SASL | Client and inter-broker authentication; migrating toward mTLS |
| JVM | Java 17 | Kafka broker and tooling runtime |

Key contributors

Key takeaways for your own Kafka implementation

Sources and further reading

  1. Noa Resare, Apple — “Experiences Operating Apache Kafka® at Scale”, Kafka Summit NY 2019
  2. Apple — “Kafka as a Service: A Tale of Security and Multi Tenancy”, Kafka Summit London 2018, video
  3. Haochen Li and Yaodong Yang, Apple — “Balance Kafka Cluster with Zero Data Movement”, Kafka Summit London 2023
  4. Bo Gao and Lixin Yao, Apple — “Leveraging Tiered Storage in Strimzi-Operated Kafka for Cost-Effective Streaming Applications”, Kafka Summit London 2024
  5. Gaurav Narula, Apple — “Leave Your Passwords Behind: Embracing mTLS in Kafka”, Confluent Current 2025
  6. dttung2905/kafka-in-production — Apple entries
