How Apple uses Apache Kafka in production

factor-house · June 2, 2026 · 12 min read

Apple runs Apache Kafka as a shared internal platform serving multiple engineering teams across the company. Since at least 2018, Apple’s engineers have presented at Kafka Summit and Confluent Current on the operational realities of running Kafka as a managed, multi-tenant service at large scale. Their publicly documented work spans partition management at scale, a custom zero-data-movement balancing algorithm, Kubernetes-native tiered storage, and an ongoing migration from password-based authentication to mTLS.

Company overview

Apple designs hardware, software, and services including the iPhone, iPad, Mac, Apple Watch, iCloud, Apple Music, Apple TV+, and the App Store. Those services operate at a scale that requires low-latency, fault-tolerant data infrastructure: iCloud alone serves hundreds of millions of users, while the App Store processes purchases across more than 175 countries.

Kafka sits within the data infrastructure layer that keeps these services fed with real-time event data. The iCloud Data organisation, for example, uses stream-processing technologies that include Kafka for building scalable data pipelines. Apple began operating Kafka as a multi-tenant internal service no later than 2018, when they first presented on the architecture at Kafka Summit London.

Key Kafka milestones:

2018: Apple presents “Kafka as a Service: A Tale of Security and Multi Tenancy” at Kafka Summit London, confirming that Kafka is operated as a managed internal platform with multi-tenancy and authentication controls.
2019: Noa Resare presents operational lessons from running Kafka at large scale at Kafka Summit New York, covering partition reassignment challenges at scale.
2023: Haochen Li and Yaodong Yang present a zero-data-movement partition placement algorithm at Kafka Summit London and deploy it to production as a Topic Operator.
February 2024: Tiered storage on Kafka 3.6 reaches production, following a year-long integration effort using Strimzi on AWS EKS.
2024: Apple’s tiered storage contribution is merged into Strimzi 0.40.0, making native tiered storage support available to the wider community.
2025: Gaurav Narula presents a hybrid mTLS migration strategy at Confluent Current, describing how Apple is moving from SASL password authentication to short-lived X.509 certificates.

Apple’s Kafka use cases

Apple’s Kafka deployment centres on internal data infrastructure rather than a single named product use case. Their engineering talks describe Kafka as a platform shared across teams, which is consistent with how large technology companies typically use Kafka at enterprise scale.

The iCloud Data organisation cites Kafka as part of the technology set used to build scalable, timely data pipelines for internal teams that ship product features. Apple’s job postings for iCloud data engineering roles reference Kafka alongside Apache Flink, Kafka Streams, and Apache Spark Streaming, suggesting that Kafka is used for real-time data movement between services as well as stream processing.

The Kafka as a Service model Apple described in 2018 implies a broad adoption pattern: rather than individual teams standing up their own clusters, a central platform team manages shared clusters, and product teams onboard as tenants.

Scale and throughput

Apple does not publish aggregate cluster statistics. The figures available come directly from conference presentations:

Clusters: Multiple Kafka clusters operated as a shared internal service; exact count not publicly disclosed.
Brokers per cluster: 6 brokers confirmed in the 2024 tiered storage production environment.
Per-broker throughput (measured in the Strimzi tiered storage test environment): 270 MB/s with a single consumer; 250 MB/s with three consumers; 90 MB/s with five consumers across 400 topics.
Partitions: Described as reaching a volume where large-scale reassignment operations are a regular operational concern; exact count not publicly disclosed.
Storage: EBS-backed local storage per broker, supplemented by S3 tiered storage for older data following the 2024 production rollout.

Apple’s Kafka architecture

Apple’s Kafka clusters run on Kubernetes using the Strimzi operator on AWS EKS. Broker nodes use Amazon EBS volumes for local storage. Helm is used to manage Strimzi deployments across environments. This Kubernetes-native approach means cluster lifecycle operations, scaling, and upgrades are managed through Kubernetes custom resources rather than direct broker administration.

Since February 2024, Apple has operated Kafka with tiered storage (KIP-405) in production. Older log segments are offloaded to Amazon S3, separating data retention from local broker capacity. The Remote Log Metadata Manager (RLMM) uses an internal Kafka client secured with SSL/TLS using a PKCS12 keystore and truststore, configured via the rlmm.config. prefix (in upstream Kafka, rsm.config. is the default prefix for the remote storage manager plugin, while rlmm.config. is the default for RLMM properties).
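As a rough illustration of the kind of broker settings involved, the sketch below renders a properties fragment using upstream Kafka's KIP-405 configuration names as best understood here. The plugin class name, class path, listener name, and file paths are hypothetical placeholders; Apple's actual values are not public, and keystore passwords are deliberately left out.

```python
def tiered_storage_broker_properties(keystore_path, truststore_path):
    """Return an illustrative broker config fragment for KIP-405 tiered storage."""
    # Properties for the RLMM's internal Kafka client sit under the RLMM prefix.
    client = "rlmm.config.remote.log.metadata.common.client."
    return {
        "remote.log.storage.system.enable": "true",
        # Hypothetical plugin class and path; substitute your own S3 plugin.
        "remote.log.storage.manager.class.name": "com.example.S3RemoteStorageManager",
        "remote.log.storage.manager.class.path": "/opt/kafka/plugins/tiered-storage/*",
        "remote.log.storage.manager.impl.prefix": "rsm.config.",
        "remote.log.metadata.manager.impl.prefix": "rlmm.config.",
        # Hypothetical listener name the RLMM client connects to.
        "remote.log.metadata.manager.listener.name": "INTERNAL",
        client + "security.protocol": "SSL",
        client + "ssl.keystore.type": "PKCS12",
        client + "ssl.keystore.location": keystore_path,
        client + "ssl.truststore.type": "PKCS12",
        client + "ssl.truststore.location": truststore_path,
    }


def render_properties(props):
    """Render a dict as key=value lines, one per property."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


print(render_properties(tiered_storage_broker_properties(
    "/etc/kafka/tls/broker.p12", "/etc/kafka/tls/truststore.p12")))
```

On a Strimzi-managed cluster you would normally set these through the Kafka custom resource rather than editing broker properties directly.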

Security is layered. Apple has operated Kafka with SASL authentication and ACL-based authorisation since at least 2018. As of 2025, the platform team is actively migrating toward mTLS using short-lived X.509 certificates, driven by the security advantages of certificate-based identity over long-lived passwords in a multi-tenant environment.

Producer and consumer architecture

Apple’s partition placement algorithm (described below) is built on the assumption that data ingestion workloads assign events randomly to partitions with no ordering requirements, and that all partitions from a topic are consumed evenly. This is consistent with a high-throughput event ingestion pattern where producers distribute load across partitions and consumers process each partition independently.
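To see why that assumption matters, here is a small simulation (not Apple's code) of keyless events assigned to partitions uniformly at random. The event count, partition count, and seed are arbitrary illustrative values.

```python
import random
from collections import Counter


def simulate_random_partitioning(num_events, num_partitions, seed=0):
    """Assign each event to a uniformly random partition and count the results."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    counts = Counter(rng.randrange(num_partitions) for _ in range(num_events))
    return [counts[p] for p in range(num_partitions)]


counts = simulate_random_partitioning(num_events=60_000, num_partitions=12)
mean = sum(counts) / len(counts)
spread = (max(counts) - min(counts)) / mean
print(counts)
print(f"relative spread between busiest and quietest partition: {spread:.3f}")
```

With enough events, per-partition volume converges on the mean, so balancing partition and leader counts per broker is a reasonable proxy for balancing load. Keyed workloads with hot keys would break that proxy.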

Consumer lag monitoring is handled by Burrow, which is deployed alongside CMAK for cluster administration.
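For readers new to lag monitoring, the quantity being tracked is the gap between the newest offset in a partition and the offset a consumer group has committed. The sketch below shows only that per-partition calculation, with made-up offsets; Burrow itself evaluates how lag behaves over a window of commits rather than comparing a single value to a threshold.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag as log end offset minus committed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(partition)
        # No commit yet: lag is undefined in this sketch, so report None rather than guess.
        lag[partition] = None if committed is None else max(0, end_offset - committed)
    return lag


def total_lag(lag):
    """Sum lag over the partitions that have a committed offset."""
    return sum(value for value in lag.values() if value is not None)


lag = consumer_lag(
    log_end_offsets={0: 100, 1: 250, 2: 40},
    committed_offsets={0: 90, 1: 250},
)
print(lag, total_lag(lag))
```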

Operating model

The platform is provided to internal teams as a managed service. This means the central Kafka team is responsible for cluster provisioning, upgrades, monitoring, and operational tooling, while product teams interact with the platform as tenants. Multi-tenancy brings authentication, authorisation, quota management, and topic namespace concerns, all of which Apple has addressed in successive talks.

Special techniques and engineering innovations

Zero-data-movement cluster balancing

Apple engineers Haochen Li and Yaodong Yang developed a partition placement strategy that achieves optimal cluster load balancing without moving any data between brokers. Traditional rebalancing tools, including Cruise Control, rebalance clusters by reassigning partitions, which involves copying data from one broker to another. Apple’s approach avoids that cost entirely by placing partitions correctly at the point of creation.

The algorithm enforces two properties across every topic: the number of replicas per broker is equal, and the number of leader replicas per broker is equal. The formula applied is partition_count = scale_number * broker_count, where scale_number is the number of leader replicas per broker for a topic. This is applied at topic creation, when partition count changes, and when brokers are added to a cluster.
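The two properties can be illustrated with a simple ring placement. This is a minimal sketch under stated assumptions, not Apple's implementation: it ignores rack awareness and existing topics, and the function names are hypothetical. It only shows that when partition_count is a multiple of broker_count, equal leader and replica counts per broker are easy to achieve.

```python
from collections import Counter


def balanced_assignment(broker_ids, scale_number, replication_factor):
    """Place scale_number * len(broker_ids) partitions on a ring of brokers."""
    n = len(broker_ids)
    if not 1 <= replication_factor <= n:
        raise ValueError("replication_factor must be between 1 and the broker count")
    partition_count = scale_number * n  # partition_count = scale_number * broker_count
    assignment = {}
    for p in range(partition_count):
        # The first replica is the preferred leader; followers take the next brokers on the ring.
        assignment[p] = [broker_ids[(p + j) % n] for j in range(replication_factor)]
    return assignment


def per_broker_counts(assignment):
    """Count preferred leaders and total replicas hosted by each broker."""
    leaders, replicas = Counter(), Counter()
    for replica_list in assignment.values():
        leaders[replica_list[0]] += 1
        replicas.update(replica_list)
    return leaders, replicas


assignment = balanced_assignment(broker_ids=[1, 2, 3, 4, 5, 6], scale_number=2, replication_factor=3)
leaders, replicas = per_broker_counts(assignment)
print(len(assignment), dict(leaders), dict(replicas))
```

With six brokers, a scale_number of 2, and a replication factor of 3, this yields 12 partitions, with every broker leading 2 partitions and hosting 6 replicas.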

The strategy is implemented as a Topic Operator and has been deployed to production. The result, as described by Haochen Li and Yaodong Yang, was well-balanced clusters with measurable storage cost savings and no performance degradation. The team submitted a Kafka Improvement Proposal (KIP) to upstream the approach to the Apache Kafka project.

Tiered storage on Strimzi

Apple ran a structured integration effort to bring Kafka tiered storage (KIP-405) into production on their Strimzi-managed clusters. The work began in February 2023 with an evaluation using the KIP-405 author’s Kafka 2.8.x development branch and progressed through a series of phases:

April 2023: Prototype integration of Strimzi and S3 plugin.
September 2023: Kafka 2.8 development branch fully tested.
November 2023: Kafka 3.6.0 dogfooding phase.
February 2024: Production rollout on Kafka 3.6.

The integration required upgrading Strimzi from 0.27.x to 0.38.x and Java from 11 to 17. Apple contributed native tiered storage support back to the Strimzi project, which was merged in Strimzi 0.40.0. This means you can now configure a custom remote storage manager in Strimzi via standard Kafka custom resource definitions, a capability that originated from Apple’s production work.
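As a sketch of what that configuration looks like, the snippet below builds the tiered storage portion of a Strimzi Kafka custom resource as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The field names follow the Strimzi documentation as understood here, so check them against the version you run. The cluster name, plugin class, class path, and plugin config keys are hypothetical, and a real Kafka resource needs listeners, storage, and other required fields that are omitted.

```python
import json


def kafka_cr_with_tiered_storage(cluster_name, rsm_class, rsm_class_path, rsm_config):
    """Build a partial Strimzi Kafka resource with a custom remote storage manager."""
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "Kafka",
        "metadata": {"name": cluster_name},
        "spec": {
            "kafka": {
                # Other required kafka fields (listeners, storage, ...) are omitted here.
                "tieredStorage": {
                    "type": "custom",
                    "remoteStorageManager": {
                        "className": rsm_class,
                        "classPath": rsm_class_path,
                        "config": rsm_config,  # plugin-specific keys
                    },
                },
            },
        },
    }


resource = kafka_cr_with_tiered_storage(
    cluster_name="example-cluster",
    rsm_class="com.example.S3RemoteStorageManager",  # hypothetical plugin class
    rsm_class_path="/opt/kafka/plugins/tiered-storage/*",
    rsm_config={"storage.bucket.name": "example-bucket"},  # hypothetical plugin key
)
print(json.dumps(resource, indent=2))
```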

Performance tuning for tiered storage required several configuration adjustments: the consumer fetch wait time (fetch.max.wait.ms) was increased to tolerate the additional latency of remote object fetches; remote.log.reader.threads and thread pool sizes were tuned for throughput; log segment sizes were optimised to keep RLMM metadata overhead manageable; and AWS multipart uploads were configured to run concurrently alongside range fetch APIs for parallel I/O.
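The segment-size point is easiest to see with arithmetic. Each segment copied to remote storage has metadata tracked by the RLMM, so the number of remote segments is a rough proxy for metadata volume. The figures below are illustrative, not Apple's.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def remote_segment_count(retained_bytes_per_partition, segment_bytes, partitions):
    """Estimate how many remote segments the RLMM has to track."""
    segments_per_partition = math.ceil(retained_bytes_per_partition / segment_bytes)
    return segments_per_partition * partitions


# Illustrative workload: 1,000 partitions, each retaining 100 GiB in remote storage.
small_segments = remote_segment_count(100 * GIB, 128 * MIB, partitions=1000)
large_segments = remote_segment_count(100 * GIB, 1 * GIB, partitions=1000)
print(small_segments, large_segments)
```

In this example, moving from 128 MiB to 1 GiB segments cuts the tracked segment count from 800,000 to 100,000. The trade-off is that larger segments take longer to roll, so data reaches remote storage later.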

mTLS migration

Gaurav Narula from Apple described a hybrid migration strategy for moving from SASL password authentication to mTLS. The challenge with any mTLS migration in a large shared platform is that you cannot require all clients to switch simultaneously without breaking production workloads.

Apple’s approach adds mTLS support directly to the SASL listener, so both authentication methods are accepted on the same port. Existing clients continue to authenticate with passwords while new or migrated clients present X.509 certificates. KafkaPrincipal identities are preserved across both methods, which means ACLs and quota configurations do not need to change when a client migrates. For inter-broker communication, each broker is configured with distinct server and client certificates.
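The value of preserving principals can be shown with a small model. This is an assumption-laden sketch, not Apple's implementation: it imitates a mapping rule in the style of upstream Kafka's ssl.principal.mapping.rules, which extracts the common name from a certificate's subject. Whether Apple uses that mechanism is not stated, and the service name and ACL table are hypothetical.

```python
import re

# Same shape as the Kafka rule RULE:^CN=(.*?),.*$/$1/ (keep only the common name).
_CN_RULE = re.compile(r"^CN=(.*?),.*$")


def principal_from_sasl(username):
    """Principal for a client that authenticated with a SASL username."""
    return f"User:{username}"


def principal_from_certificate(subject_dn):
    """Principal for a client that authenticated with an X.509 certificate."""
    match = _CN_RULE.match(subject_dn)
    # Fall back to the full distinguished name when the rule does not match.
    name = match.group(1) if match else subject_dn
    return f"User:{name}"


# Hypothetical ACL table keyed by principal: topics each principal may read.
read_acls = {"User:orders-service": {"orders"}}

before = principal_from_sasl("orders-service")
after = principal_from_certificate("CN=orders-service,OU=Platform,O=Example")
print(before == after, read_acls[after])
```

Because both authentication paths resolve to the same principal string, the ACL and quota entries keyed on it keep working after the client switches to a certificate.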

The use of short-lived X.509 certificates addresses two specific weaknesses of SASL passwords: credential leak impact is bounded by certificate lifetime, and certificate-based authentication is not vulnerable to brute-force attacks.

Operating Kafka at scale

Kubernetes-native operations: Strimzi handles cluster provisioning, rolling upgrades, and configuration changes through Kubernetes custom resources. This removes the need for manual broker-level administration and integrates Kafka lifecycle management with existing container platform tooling.

Monitoring and observability: Burrow provides consumer lag monitoring. CMAK provides a cluster administration interface. Both run alongside the Strimzi operator in the Kafka platform stack.

Cluster rebalancing: Cruise Control is deployed for automated partition load balancing, though Apple’s zero-data-movement algorithm addresses a significant portion of imbalance at topic-creation time rather than reactively.

Partition reassignment at scale: Noa Resare’s 2019 talk described the operational challenge of managing large-scale partition reassignments: tracking progress, understanding the impact on producers and consumers during broker restarts, and applying debugging and mitigation strategies when reassignments do not proceed as expected. At large partition counts, these operations become significant engineering events rather than routine maintenance.

Upgrade strategy: The tiered storage work illustrates Apple’s approach to major upgrades: start with an evaluation phase on a development branch, build a prototype integration, run dogfooding before committing to production, and execute the production rollout as a distinct step. The Kafka 2.8 to 3.6 upgrade and the Strimzi 0.27 to 0.38 jump were managed as part of the tiered storage project rather than independently.

Challenges and how they solved them

Cluster imbalance without data movement

As Kafka clusters grow or traffic patterns shift, broker resource utilisation becomes uneven. CPU, disk, and leader distribution diverge, leading to over-provisioned capacity on some brokers and bottlenecks on others. The conventional remedy is partition reassignment, but at scale this involves copying large volumes of data between brokers, creating I/O pressure and operational risk.

Apple developed the zero-data-movement placement algorithm to solve this at the source. By enforcing balanced placement at topic-creation time, the problem of runtime imbalance is largely avoided. When it does occur, during cluster expansion or partition scaling, the same algorithm is applied to achieve balance without data movement.
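One way to spot drift, for instance right after a broker joins, is to measure how far leader and replica counts diverge across brokers. The helper below is a hypothetical diagnostic, not part of Apple's Topic Operator; it takes a partition-to-replica-list mapping in which the first replica is the preferred leader.

```python
def placement_skew(assignment, broker_ids):
    """Return (leader skew, replica skew): max minus min count across brokers."""
    leaders = {broker: 0 for broker in broker_ids}
    replicas = {broker: 0 for broker in broker_ids}
    for replica_list in assignment.values():
        leaders[replica_list[0]] += 1  # first replica is the preferred leader
        for broker in replica_list:
            replicas[broker] += 1
    return (
        max(leaders.values()) - min(leaders.values()),
        max(replicas.values()) - min(replicas.values()),
    )


assignment = {0: [1, 2], 1: [2, 3], 2: [3, 1]}
print(placement_skew(assignment, broker_ids=[1, 2, 3]))     # balanced on three brokers
print(placement_skew(assignment, broker_ids=[1, 2, 3, 4]))  # broker 4 has just joined, empty
```

A skew of (0, 0) means both properties hold for the topic. Anything larger indicates that partitions need to be added or placed differently to restore balance.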

Cost of EBS-backed data retention

Storing all log data on EBS volumes attached to broker nodes means storage costs scale directly with retention windows. For teams that need to retain large event histories for backfill, reprocessing, or compliance, this becomes a significant expense. Tiered storage decouples retention from broker capacity: older segments are offloaded to S3, which is substantially cheaper than EBS, while brokers remain sized for hot data and throughput.
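A back-of-the-envelope model makes the decoupling concrete. The sketch below assumes that remote storage holds one copy of each segment for the full retention window while brokers keep replicated data only for the local retention window. It ignores active segments, indexes, compression, and disk headroom, and the workload numbers are illustrative rather than Apple's.

```python
def storage_footprint_gib(ingest_gib_per_day, retention_days, replication_factor,
                          local_retention_days=None):
    """Estimate EBS and S3 footprint with or without tiered storage."""
    if local_retention_days is None:
        # No tiering: every replica keeps the full retention window on EBS.
        return {"ebs_gib": ingest_gib_per_day * retention_days * replication_factor,
                "s3_gib": 0}
    if local_retention_days > retention_days:
        raise ValueError("local retention cannot exceed total retention")
    return {
        # Replicated hot data only.
        "ebs_gib": ingest_gib_per_day * local_retention_days * replication_factor,
        # Assumed: one remote copy covering the whole retention window.
        "s3_gib": ingest_gib_per_day * retention_days,
    }


print(storage_footprint_gib(500, retention_days=30, replication_factor=3))
print(storage_footprint_gib(500, retention_days=30, replication_factor=3,
                            local_retention_days=2))
```

For 500 GiB/day, 30 days of retention, and a replication factor of 3, the EBS footprint falls from 45,000 GiB to 3,000 GiB with a two-day local window, with 15,000 GiB held in S3.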

The challenge Apple faced was that Strimzi did not natively support tiered storage configuration at the time. Solving this required building the integration, testing it across multiple Kafka versions, and contributing the result back to Strimzi before production deployment.

Password-based authentication in a multi-tenant service

In a shared Kafka platform with many producer and consumer clients, SASL password credentials create a persistent risk surface. Long-lived credentials can be leaked and remain valid indefinitely. They are also vulnerable to brute-force attacks in a way that certificate-based authentication is not.

The transition to mTLS with short-lived certificates addresses both risks: a leaked certificate is only useful until it expires, and there is no password to brute-force. Migrating a large multi-tenant platform is not straightforward, however: you cannot break existing clients while switching authentication mechanisms. Apple’s hybrid SASL-plus-mTLS approach on a shared listener resolves this by making the migration incremental and non-disruptive.

Large-scale partition reassignment

At the partition counts Apple operates, reassigning partitions, whether to recover from broker imbalance, decommission a broker, or respond to capacity changes, becomes a complex operational exercise. Tracking progress across many concurrent reassignments, managing the impact on producers and consumers, and debugging failures when they occur requires both tooling and operational discipline. Noa Resare’s 2019 talk shared the lessons Apple accumulated from running these operations repeatedly at scale.
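Tracking progress comes down to comparing each partition's current replica list with its target. The helper below is a hypothetical sketch of that comparison, not tooling described in the talk; in practice the current state would come from the cluster's metadata.

```python
def reassignment_progress(target, current):
    """Return (fraction complete, pending partitions) for a reassignment plan."""
    if not target:
        return 1.0, []
    # A partition is done once its replica list matches the target exactly;
    # mid-reassignment it still lists both old and new replicas.
    done = {tp for tp, replicas in target.items() if current.get(tp) == replicas}
    pending = sorted(tp for tp in target if tp not in done)
    return len(done) / len(target), pending


target = {
    ("events", 0): [4, 5, 6],
    ("events", 1): [5, 6, 4],
    ("events", 2): [6, 4, 5],
    ("events", 3): [4, 6, 5],
}
current = dict(target)
current[("events", 2)] = [6, 4, 5, 1]  # old replica 1 not yet removed

fraction, pending = reassignment_progress(target, current)
print(f"{fraction:.0%} complete, pending: {pending}")
```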

Full tech stack

Category	Tools	Notes
Message broker	Apache Kafka 3.6+	Core streaming platform; tiered storage available from 3.6
Kafka operator	Strimzi	Kubernetes-native cluster lifecycle management on EKS
Container platform	AWS EKS	Kubernetes runtime for broker and operator workloads
Local broker storage	Amazon EBS	Block storage attached to broker nodes
Tiered (remote) storage	Amazon S3 (via KIP-405 plugin)	Cost-effective long-term event retention
Packaging	Helm	Strimzi deployment management
Cluster rebalancing	Cruise Control	Automated partition load balancing
Consumer lag monitoring	Burrow	Consumer group lag visibility
Cluster administration	CMAK	Kafka cluster management UI
Authentication	mTLS (X.509 / PKCS12), SASL	Client and inter-broker authentication; migrating toward mTLS
JVM	Java 17	Kafka broker and tooling runtime

Key contributors

Noa Resare: Apple engineer; spoke on operating Kafka at large scale, Kafka Summit NY 2019.
Haochen Li: Apple software engineer; co-developed the zero-data-movement partition placement algorithm, Kafka Summit London 2023.
Yaodong Yang: Apple software engineer; co-developed the zero-data-movement partition placement algorithm, Kafka Summit London 2023.
Bo Gao: Apple engineer; led the Strimzi tiered storage integration project, Kafka Summit London 2024.
Lixin Yao: Apple engineer; co-led the Strimzi tiered storage integration project, Kafka Summit London 2024.
Gaurav Narula: Apple engineer; presented the hybrid mTLS migration approach, Confluent Current 2025.

Key takeaways for your own Kafka implementation

Address cluster imbalance at creation time, not at rebalance time. Apple’s zero-data-movement algorithm enforces balanced partition and leader placement when topics are created or scaled. This reduces or eliminates the need for reactive rebalancing operations, which are expensive in large clusters.
Tiered storage is a practical path to decoupling retention from broker capacity. Apple’s journey from initial evaluation (February 2023) through prototype (April 2023) to production (February 2024) demonstrates that tiered storage on Strimzi is achievable with deliberate testing phases, though it requires careful performance tuning, particularly around fetch latency and thread pool sizing.
Migrating authentication mechanisms in a shared platform requires a hybrid transition period. Adding mTLS to the existing SASL listener rather than replacing it allows clients to migrate at their own pace. Preserving KafkaPrincipal identities means existing ACLs remain valid throughout the transition.
If you operate Kafka as a managed service for internal teams, multi-tenancy concerns appear early. Security enforcement (authentication, authorisation, quotas), partition reassignment at scale, and cluster observability all become harder in a shared environment than in a single-team deployment. Apple’s public talks trace a consistent investment in each of these areas over several years.
Contributing upstream reduces long-term maintenance burden. Apple’s Strimzi tiered storage integration was contributed back to the project and merged in Strimzi 0.40.0. If you are building integrations on top of open source Kafka tooling, upstream contributions mean you are no longer carrying a fork.

Sources and further reading

Noa Resare, Apple: “Experiences Operating Apache Kafka® at Scale”, Kafka Summit NY 2019
Apple: “Kafka as a Service: A Tale of Security and Multi Tenancy”, Kafka Summit London 2018
Haochen Li and Yaodong Yang, Apple: “Balance Kafka Cluster with Zero Data Movement”, Kafka Summit London 2023
Bo Gao and Lixin Yao, Apple: “Leveraging Tiered Storage in Strimzi-Operated Kafka for Cost-Effective Streaming Applications”, Kafka Summit London 2024
Gaurav Narula, Apple: “Leave Your Passwords Behind: Embracing mTLS in Kafka”, Confluent Current 2025
dttung2905/kafka-in-production: Apple entries
