Kafka cluster monitoring

Kafka clusters are typically monitored from the broker up: process health checks, JVM metrics, disk alerts, per-broker request latencies. That instrumentation is necessary, but it has a blind spot. A broker can report normal metrics in isolation while the cluster as a whole is in a degraded state. Partition leadership can be unbalanced. Replication can be falling behind across the fleet. The controller can fail silently. None of these conditions are visible from a single broker’s perspective.

This article covers cluster-level monitoring: the aggregated, cross-broker view that reveals how the cluster is functioning as a unit. It covers the metrics that matter at this layer, how to collect them across a broker fleet, how to alert on them effectively, and how to diagnose the most common cluster-level failures.

Per-broker internals — request thread pools, JVM heap, disk I/O — are covered in the Kafka broker monitoring article. Consumer-side lag monitoring is covered in the Kafka consumer monitoring article. See the Kafka monitoring guide for the full picture.

Key takeaways

What is Kafka cluster monitoring?

Cluster monitoring is the practice of observing the aggregated state of all brokers rather than the internals of any single one. The distinction matters in practice.

Broker monitoring tells you whether an individual broker is healthy: its request thread utilisation, JVM heap pressure, disk throughput, and local ISR state. Cluster monitoring tells you whether the cluster as a whole is functioning: whether replication is consistent across the fleet, whether partition leadership is balanced, whether the controller is in a valid state, and whether the cluster has capacity headroom.

Both layers are necessary. Cluster health checks can miss per-broker pathologies that are not yet visible at the aggregate level. A JVM GC pause on one broker, for example, may not immediately register as a replication deficit on the cluster health dashboard. Conversely, per-broker monitoring alone will not surface a partition imbalance, a controller failure, or a sustained ISR shrink rate that only becomes meaningful when measured across the fleet.

The same JMX interface exposes data at both levels. The difference is in how you query and aggregate it.

Key cluster-level metrics

Cluster-level metrics fall into two categories: metrics that are properties of the cluster as a whole (controller state, partition distribution), and metrics that are only meaningful when aggregated across brokers (replication health, throughput). The four groups below cover both.

Control plane health

The controller is the broker responsible for partition leader elections and cluster metadata management. There is always exactly one active controller in a functioning cluster. Deviations are critical.

| Metric name | JMX MBean path | Description | Target / alert threshold |
| --- | --- | --- | --- |
| ActiveControllerCount | kafka.controller:type=KafkaController,name=ActiveControllerCount | Number of active controllers on this broker (0 or 1). Must sum to exactly 1 across all brokers. | Alert (critical) if cluster-wide sum is not exactly 1 |
| OfflinePartitionsCount | kafka.controller:type=KafkaController,name=OfflinePartitionsCount | Number of partitions with no available leader. Reported by the active controller. | Alert (critical) if > 0 |
| LeaderElectionRateAndTimeMs | kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs | Rate and duration of partition leader elections. Frequent or slow elections indicate controller instability or broker churn. | Alert (warning) if rate is sustained above baseline; alert (critical) if P99 duration is elevated |
| EventQueueSize | kafka.controller:type=ControllerEventManager,name=EventQueueSize | Pending administrative events queued on the controller. | Alert (warning) if > 100 |
| AvgIdleRatio | kafka.controller:type=ControllerEventManager,name=AvgIdleRatio | Fraction of time the controller event thread is idle. A value approaching 0 indicates a saturated controller. | Alert (warning) if approaching 0 |

KRaft mode: In KRaft mode (production-ready since Kafka 3.3 and the only supported mode from 4.0, which removes ZooKeeper), the following metrics replace ZooKeeper-era control plane telemetry. Verify MBean paths against the Kafka version in use.

| Metric name | JMX MBean path | Description | Target / alert threshold |
| --- | --- | --- | --- |
| LastAppliedRecordLagMs | kafka.controller:type=KafkaController,name=LastAppliedRecordLagMs | Delay between metadata commits on the active controller and local application on standby controllers. Elevated values mean metadata changes are propagating slowly. | Alert (warning) if elevated and sustained |
| MetadataErrorCount | kafka.controller:type=KafkaController,name=MetadataErrorCount | Count of failed metadata log operations. | Alert (critical) if > 0 |
| current-state (quorum state) | kafka.server:type=raft-metrics,name=current-state | State of the KRaft quorum member on this node (for example leader, follower, candidate, or observer). | Alert (critical) if no controller in the quorum reports leader |
| commit-latency-avg | kafka.server:type=raft-metrics,name=commit-latency-avg | Average latency for KRaft metadata log commits. | Alert (warning) if elevated |

A KRaft quorum of size n (typically an odd number: 3 or 5) requires a strict majority of floor(n/2) + 1 active nodes to elect a leader and accept metadata commits. If the active quorum drops below this threshold, the control plane enters a read-only mode. Brokers continue to process produce and consume requests for existing partitions, but topic creation, partition reassignment, and ISR state modifications are blocked. If a partition leader fails while the quorum is unavailable, client requests to that partition will time out.
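The majority rule is simple arithmetic, and working it through shows why odd quorum sizes are preferred. The function names below are illustrative only and do not correspond to any Kafka API:

```python
def quorum_majority(n: int) -> int:
    """Smallest number of voters that forms a strict majority of n."""
    if n < 1:
        raise ValueError("quorum size must be at least 1")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many controllers can be lost while a majority remains."""
    return n - quorum_majority(n)


for size in (3, 4, 5):
    print(f"quorum={size} majority={quorum_majority(size)} "
          f"tolerated_failures={tolerated_failures(size)}")
```

A four-node quorum tolerates one failure, the same as a three-node quorum, so the extra node adds cost without adding fault tolerance.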

Replication fleet health

These are the most operationally significant cluster-level metrics. They reflect whether data is being replicated as configured across all brokers, not just on one.

| Metric name | JMX MBean path | Description | Target / alert threshold |
| --- | --- | --- | --- |
| UnderReplicatedPartitions | kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | Partitions where the current ISR count is less than the replication factor. Sum across all brokers for the cluster total. | Alert (critical) if sum > 0 for more than 5 minutes |
| UnderMinIsrPartitionCount | kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount | Partitions with fewer ISRs than min.insync.replicas. Producers configured with acks=all will receive errors. | Alert (critical) if > 0 |
| UncleanLeaderElectionsPerSec | kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec | Rate of elections where a non-ISR replica was promoted to leader. Indicates data loss has occurred or is imminent. | Alert (critical) if > 0 |
| IsrShrinksPerSec | kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | Rate at which replicas are leaving the ISR. Track per broker and as a cluster sum. | Alert (warning) if sustained above zero |
| IsrExpandsPerSec | kafka.server:type=ReplicaManager,name=IsrExpandsPerSec | Rate at which replicas are rejoining the ISR. | Monitor alongside IsrShrinksPerSec: sustained shrinks paired with expands indicate ISR instability |

Note that UnderReplicatedPartitions is technically a per-broker metric, but it only becomes useful as a cluster-level aggregate. A URP count of zero on one broker says nothing about the remaining brokers. Sum it across the fleet.

Partition distribution

These metrics indicate whether work is balanced evenly across the broker fleet. Imbalances affect both throughput and fault tolerance.

| Metric name | JMX MBean path | Description | Target / alert threshold |
| --- | --- | --- | --- |
| PartitionCount | kafka.server:type=ReplicaManager,name=PartitionCount | Total partition replicas on this broker. Compare across brokers. | Alert (warning) if the spread between the highest and lowest broker exceeds 20% |
| LeaderCount | kafka.server:type=ReplicaManager,name=LeaderCount | Number of partition leaders on this broker. Leaders handle all produce traffic and, by default, all consumer fetches for their partitions. | Alert (warning) if leader distribution is significantly uneven across brokers |

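Uneven leader counts after a broker restart or rolling upgrade are common and typically self-correct via preferred leader election, provided auto.leader.rebalance.enable is left at its default of true. Partition replica placement does not rebalance on its own; it changes only through reassignment. Alert only on sustained imbalance.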
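One way to separate a transient post-restart dip from a sustained imbalance is to require several consecutive scrapes above the threshold before flagging it. The sketch below assumes spread is defined as (max - min) / max, and that three consecutive scrapes count as sustained; both are illustrative choices rather than Kafka conventions:

```python
from typing import Dict, List

SPREAD_THRESHOLD = 0.20   # the 20% figure used for PartitionCount above
SUSTAINED_SCRAPES = 3     # assumption: consecutive scrapes required


def relative_spread(counts: Dict[int, int]) -> float:
    """(max - min) / max across brokers; 0.0 when there is nothing to compare."""
    if not counts:
        return 0.0
    highest, lowest = max(counts.values()), min(counts.values())
    return 0.0 if highest == 0 else (highest - lowest) / highest


def sustained_imbalance(history: List[Dict[int, int]]) -> bool:
    """True only if each of the most recent scrapes exceeds the threshold."""
    recent = history[-SUSTAINED_SCRAPES:]
    if len(recent) < SUSTAINED_SCRAPES:
        return False
    return all(relative_spread(c) > SPREAD_THRESHOLD for c in recent)


# Broker 3 restarts, temporarily loses its leaders, then gets them back.
history = [
    {1: 100, 2: 100, 3: 100},
    {1: 150, 2: 150, 3: 0},
    {1: 100, 2: 100, 3: 100},
]
print(sustained_imbalance(history))  # a single skewed scrape does not trigger
```

The same shape works for LeaderCount or PartitionCount; only the input dictionaries change.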

Throughput and capacity signals

These metrics inform both capacity planning and short-term incident response.

| Metric name | JMX MBean path | Description | Target / alert threshold |
| --- | --- | --- | --- |
| BytesInPerSec | kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec | Bytes received per second from producers, per broker. Track per broker and as a cluster total. | Alert (warning) when a broker's throughput exceeds 70-80% of its network or disk write capacity |
| BytesOutPerSec | kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec | Bytes sent per second to consumers, per broker. Replication traffic to followers is reported separately as ReplicationBytesOutPerSec. | Monitor alongside BytesInPerSec and ReplicationBytesOutPerSec: each leader also sends (replication factor - 1) copies of its ingest to followers |
| MessagesInPerSec | kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec | Messages received per second. Useful when message size varies across topics. | Cross-check against BytesInPerSec |

Cluster vs broker monitoring: where the line sits

The cluster and broker layers are not always cleanly separable. Several metrics appear in per-broker JMX output but only carry meaning when interpreted at the cluster level.

UnderReplicatedPartitions is the clearest example: it is exposed per broker, but aggregating it across the fleet gives you the total replication deficit. A single broker reporting zero URPs is not informative if another broker is lagging. ActiveControllerCount is only interpretable as a cluster-wide sum — a value of 1 on one broker is normal; a cluster-wide sum of 0 or 2 is critical. Individual broker throughput metrics (BytesInPerSec per broker) are per-broker values that feed cluster-level capacity planning by revealing which brokers are carrying disproportionate load.

For replication and controller metrics, always aggregate. For throughput, collect per-broker and compare across the fleet. The Kafka broker monitoring article covers the internals — request thread saturation, JVM heap, disk flush latency — that complement the cluster view covered here.
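As a rough illustration of the two treatments, the sketch below sums controller and replication gauges across brokers and compares throughput per broker against the fleet mean. The sample values, the function names, and the 1.5x "hot broker" factor are all hypothetical:

```python
from typing import Dict, List

# Hypothetical per-broker samples, keyed by broker id.
samples: Dict[int, Dict[str, float]] = {
    1: {"ActiveControllerCount": 0, "UnderReplicatedPartitions": 0, "BytesInPerSec": 40e6},
    2: {"ActiveControllerCount": 1, "UnderReplicatedPartitions": 3, "BytesInPerSec": 42e6},
    3: {"ActiveControllerCount": 0, "UnderReplicatedPartitions": 0, "BytesInPerSec": 95e6},
}


def cluster_sum(samples: Dict[int, Dict[str, float]], metric: str) -> float:
    """Aggregate a per-broker gauge into a cluster-wide total."""
    return sum(s[metric] for s in samples.values())


def hot_brokers(samples: Dict[int, Dict[str, float]], metric: str,
                factor: float = 1.5) -> List[int]:
    """Brokers whose value exceeds `factor` times the fleet mean."""
    values = {broker: s[metric] for broker, s in samples.items()}
    if not values:
        return []
    mean = sum(values.values()) / len(values)
    return sorted(b for b, v in values.items() if v > factor * mean)


print("controllers:", cluster_sum(samples, "ActiveControllerCount"))
print("URP total:", cluster_sum(samples, "UnderReplicatedPartitions"))
print("hot brokers:", hot_brokers(samples, "BytesInPerSec"))
```

In this sample, broker 1 alone would report zero under-replicated partitions, while the cluster total is 3.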

Multi-broker observability setup

The collection challenge

JMX is a per-process interface. To get a cluster-wide picture, you need to scrape every broker’s JMX endpoint and aggregate the results centrally. In a three-broker cluster this is manageable manually; in a fleet of dozens, service discovery and automated aggregation become essential.

The most common production approach is Prometheus with the JMX Exporter, because it handles service discovery natively and the aggregation logic lives in PromQL.

Prometheus and JMX Exporter

Run the JMX Exporter as an in-process Java agent on each broker. Compared with a standalone exporter that polls JMX remotely, the agent needs no remote JMX port or JMX authentication setup, and it avoids the extra thread overhead of remote JMX connections:

KAFKA_OPTS="-javaagent:/opt/prometheus/jmx_prometheus_javaagent-0.16.1.jar=7071:/etc/kafka/kafka-jmx-config.yml"

The agent exposes Kafka's JMX MBeans as Prometheus-formatted metrics on an HTTP endpoint (port 7071 in the example above). Configure Prometheus with a scrape job that discovers all broker endpoints. In Kubernetes with the Prometheus Operator, a ServiceMonitor resource handles this automatically; in bare-metal or VM deployments, use static targets in prometheus.yml.

Aggregation across brokers happens in PromQL. To get the cluster-wide under-replicated partition count:

sum(kafka_server_replicamanager_underreplicatedpartitions)
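The same query can be run from a script through the Prometheus HTTP API's instant-query endpoint (/api/v1/query). This is a minimal sketch using only the standard library; the Prometheus address is a placeholder, and the metric name assumes the exporter configuration produces the lowercase name shown above:

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus:9090"  # placeholder address
URP_QUERY = "sum(kafka_server_replicamanager_underreplicatedpartitions)"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def scalar_from_response(payload: dict) -> float:
    """Extract the single value from an instant-vector response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    result = payload["data"]["result"]
    if not result:
        raise LookupError("query returned no series")
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def cluster_urp(base_url: str = PROMETHEUS_URL) -> float:
    """Fetch the cluster-wide URP total from Prometheus."""
    url = instant_query_url(base_url, URP_QUERY)
    with urllib.request.urlopen(url, timeout=5) as resp:
        return scalar_from_response(json.load(resp))
```

An empty result is raised as an error rather than treated as zero, since a missing series usually means the scrape is broken rather than that replication is healthy.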

Kafka-specific JMX Exporter YAML configuration files are maintained by the community. The Bitnami and confluentinc examples on GitHub are widely used starting points and include pre-built allowlists for the metrics covered in this article.

Other collection approaches

Datadog: Uses JMXFetch via the Agent daemon. Autodiscovery maps JMX ports to integrations automatically. The default cap of 350 metrics per instance means that, for larger clusters, you need to choose carefully which metrics are collected.

Kpow (Factor House): Uses the native Kafka Admin Client and Consumer APIs rather than JMX — no sidecar agent or broker-side configuration changes required.

If you’re evaluating monitoring tools, the Kafka observability tools comparison article covers the trade-offs in more detail.

Cluster health check script

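A short Python script that checks cluster-level conditions gives you a pass/fail signal suitable for cron jobs, CI pipelines, or on-call runbooks. The value is in running it consistently on a schedule and routing its output to your alerting channel. The six conditions below form the full checklist; the implementation that follows covers three of them (controller presence, under-replicated partitions, and leader skew), and the remaining checks follow the same pattern.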

What the script checks

  1. ActiveControllerCount sum equals exactly 1
  2. OfflinePartitionsCount equals 0
  3. UnderReplicatedPartitions sum equals 0
  4. UnderMinIsrPartitionCount equals 0
  5. ISR shrink rate is below a configurable threshold
  6. Leader count distribution — no single broker holds more than a configurable percentage of all leaders

Implementation

The Kafka Admin Client exposes cluster metadata directly without requiring JMX access. The script below uses kafka-python, where KafkaAdminClient.describe_cluster() returns broker and controller state and describe_topics() returns partition, leader, and replica assignments for each topic. confluent-kafka exposes equivalent metadata with different return types.

The UnderReplicatedPartitions and UnderMinIsrPartitionCount gauges are JMX metrics and are not exposed via the Admin API; for those, query the JMX Exporter HTTP endpoint if it is already deployed, or fall back to jmxterm.

```python
import sys

import requests
from kafka import KafkaAdminClient

BROKERS = ["broker1:9092", "broker2:9092", "broker3:9092"]
JMX_EXPORTER_HOSTS = ["broker1:7071", "broker2:7071", "broker3:7071"]
LEADER_SKEW_THRESHOLD = 0.20
URP_METRIC = "kafka_server_replicamanager_underreplicatedpartitions"


def check_active_controller(admin):
    # kafka-python returns cluster metadata as a dict; controller_id is
    # negative when no controller is known.
    cluster_meta = admin.describe_cluster()
    controller_id = cluster_meta.get("controller_id")
    if controller_id is None or controller_id < 0:
        return False, "No active controller"
    return True, f"Controller: broker {controller_id}"


def check_urp_from_jmx(hosts):
    """Sum the URP gauge across all exporters and list unreachable hosts."""
    total_urp = 0.0
    unreachable = []
    for host in hosts:
        try:
            r = requests.get(f"http://{host}/metrics", timeout=5)
            r.raise_for_status()
            for line in r.text.splitlines():
                # Skip "# HELP" and "# TYPE" comment lines.
                if URP_METRIC in line and not line.startswith("#"):
                    total_urp += float(line.split()[-1])
        except (requests.RequestException, ValueError) as e:
            print(f"  Warning: could not read metrics from {host}: {e}")
            unreachable.append(host)
    return total_urp, unreachable


def check_leader_skew(admin):
    # Start every known broker at zero so a broker holding no leaders
    # still counts towards the skew.
    brokers = admin.describe_cluster()["brokers"]
    leader_counts = {b["node_id"]: 0 for b in brokers}
    for topic in admin.describe_topics():
        for partition in topic["partitions"]:
            leader = partition["leader"]
            if leader < 0:
                continue  # leaderless partition; not a skew problem
            leader_counts[leader] = leader_counts.get(leader, 0) + 1
    counts = list(leader_counts.values())
    if not counts or max(counts) == 0:
        return True, "No partitions"
    skew = (max(counts) - min(counts)) / max(counts)
    ok = skew <= LEADER_SKEW_THRESHOLD
    return ok, f"Leader skew: {skew:.1%} (threshold {LEADER_SKEW_THRESHOLD:.0%})"


def run_health_check():
    admin = KafkaAdminClient(bootstrap_servers=BROKERS)
    results = []
    try:
        ok, msg = check_active_controller(admin)
        results.append(("ActiveControllerCount", ok, msg))

        # An unreachable exporter fails the check: its URP count is unknown.
        urp, unreachable = check_urp_from_jmx(JMX_EXPORTER_HOSTS)
        detail = f"URP count: {int(urp)}"
        if unreachable:
            detail += f" (unreachable: {', '.join(unreachable)})"
        results.append(
            ("UnderReplicatedPartitions", urp == 0 and not unreachable, detail)
        )

        ok, msg = check_leader_skew(admin)
        results.append(("LeaderSkew", ok, msg))
    finally:
        admin.close()

    print("Kafka cluster health check")
    all_ok = True
    for check, passed, detail in results:
        status = "PASS" if passed else "FAIL"
        print(f"  [{status}] {check}: {detail}")
        if not passed:
            all_ok = False

    return all_ok


if __name__ == "__main__":
    sys.exit(0 if run_health_check() else 1)
```
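Checks 3 and 4 can also be approximated without JMX, because the topic metadata returned by kafka-python's describe_topics() includes the replica list and the ISR for each partition. The sketch below works on that dict shape; it assumes a single cluster-wide min.insync.replicas value and ignores topic-level overrides, so treat the second count as an estimate. The function name is illustrative:

```python
from typing import Dict, List


def count_isr_deficits(topics: List[Dict], min_insync_replicas: int) -> Dict[str, int]:
    """Count partitions whose ISR is smaller than the replica set or the min ISR."""
    under_replicated = 0
    under_min_isr = 0
    for topic in topics:
        for p in topic["partitions"]:
            if len(p["isr"]) < len(p["replicas"]):
                under_replicated += 1
            if len(p["isr"]) < min_insync_replicas:
                under_min_isr += 1
    return {"under_replicated": under_replicated, "under_min_isr": under_min_isr}


# Hand-written sample in the shape describe_topics() returns.
sample = [
    {"topic": "orders", "partitions": [
        {"partition": 0, "leader": 1, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
        {"partition": 1, "leader": 2, "replicas": [2, 3, 1], "isr": [2]},
    ]},
]
print(count_isr_deficits(sample, min_insync_replicas=2))
```

In a live check, the sample would be replaced by admin.describe_topics(). Unlike the per-broker JMX gauge, this counts each partition once, so no cross-broker summing is needed.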

Limitations

The consumer monitoring article contains a consumer lag script that complements this one. Together they cover the three main health dimensions: cluster-wide replication and control plane state (this article), per-broker internals (broker monitoring), and consumer group lag (consumer monitoring).

Capacity planning using cluster metrics

Capacity planning at the cluster level is about detecting trends early enough to act before they become incidents. The signals below are leading indicators — most warrant investigation and planning rather than immediate paging.

When to add brokers

The primary signals that a new broker is needed:

Partition count planning

Partitions are the unit of parallelism in Kafka. Too few limits throughput; too many increases controller overhead and replication cost.

Track total partition count and partition count per broker. The controller’s EventQueueSize rising during periods of high topic creation activity is a signal that the cluster is approaching the limits of what the controller can manage comfortably.

For older Kafka versions (pre-2.6), a commonly cited practical limit is approximately 4,000 partitions per broker, though this varies considerably with hardware and workload. With KRaft, the practical limit is significantly higher — the consolidated metadata management architecture removes the ZooKeeper bottleneck that was the primary constraint in earlier versions. Consult the release notes and benchmark reports for the specific version you are running.

Replication factor auditing

Topics created with a replication factor of 1 in a multi-broker cluster represent a silent durability risk. A single broker failure takes those partitions offline with no replication fallback.

Audit replication factors on a schedule using kafka-topics.sh --describe or the Admin API:

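```python
from kafka import KafkaAdminClient

MIN_REPLICATION_FACTOR = 3  # flag anything below this

admin = KafkaAdminClient(bootstrap_servers=["broker1:9092"])
try:
    # kafka-python: list_topics() returns topic names and
    # describe_topics() returns one dict per topic.
    topic_names = list(admin.list_topics())
    topics = admin.describe_topics(topic_names)

    for t in topics:
        for partition in t["partitions"]:
            rf = len(partition["replicas"])
            if rf < MIN_REPLICATION_FACTOR:
                print(f"{t['topic']} partition {partition['partition']}: "
                      f"replication factor {rf}")
finally:
    admin.close()
```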

Retention and storage forecasting

Monitor disk utilisation per broker and extrapolate the growth rate from the past 7 and 30 days. BytesInPerSec is the primary driver of storage growth once you account for the replication factor and compression codec.
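Both estimates reduce to simple arithmetic. The sketch below shows a linear days-until-full extrapolation and a steady-state footprint estimate. It assumes growth is linear, that brokers store batches as received (no recompression), and it ignores index and segment overhead; the function names and sample figures are illustrative:

```python
from typing import Optional


def days_until_full(used_then: float, used_now: float,
                    days_between: float, capacity: float) -> Optional[float]:
    """Linear extrapolation of disk usage; None if usage is flat or shrinking."""
    growth_per_day = (used_now - used_then) / days_between
    if growth_per_day <= 0:
        return None
    return (capacity - used_now) / growth_per_day


def steady_state_bytes(bytes_in_per_sec: float, retention_seconds: float,
                       replication_factor: int) -> float:
    """Cluster-wide footprint once the retention window is fully populated."""
    return bytes_in_per_sec * retention_seconds * replication_factor


# 4.0 TB used a week ago, 4.7 TB today, 8 TB disk.
print(days_until_full(4.0e12, 4.7e12, 7, 8.0e12))

# 10 MB/s cluster-wide ingest, 7-day retention, replication factor 3.
print(steady_state_bytes(10e6, 7 * 86400, 3) / 1e12, "TB")
```

Running the extrapolation on both the 7-day and 30-day windows and comparing the results gives a rough indication of whether growth is accelerating.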

If storage is growing faster than expected, the first lever to check is log retention settings — both time-based (retention.ms) and size-based (retention.bytes). Reducing retention decreases storage pressure but can break consumers that have fallen behind. Document the trade-off before changing retention on production topics.

Alerting strategy for cluster-level monitoring

Not all metrics warrant the same response. Structuring alerts into two tiers avoids alert fatigue and ensures the right response time for each condition.

Critical alerts (page on-call immediately)

| Metric | Condition | Reason |
| --- | --- | --- |
| ActiveControllerCount (sum) | Not exactly 1 | No controller or split-brain; metadata operations are halted |
| OfflinePartitionsCount | > 0 | Partitions are completely unavailable to producers and consumers |
| UnderMinIsrPartitionCount | > 0 | Producers with acks=all are receiving errors |
| UnderReplicatedPartitions | > 0 for > 5 minutes | Sustained replication deficit; data durability is degraded |
| UncleanLeaderElectionsPerSec | > 0 | A non-ISR replica was promoted to leader, indicating data loss risk |

Working Prometheus alert rules for the two most critical conditions:

Warning alerts (alert team channel, investigate within the hour)

| Metric | Condition | Reason |
| --- | --- | --- |
| IsrShrinksPerSec | Sustained above zero | Replicas are falling behind; investigate before it becomes a URP |
| LeaderCount skew | > 20% spread across brokers | Producer and consumer traffic is unevenly distributed |
| BytesInPerSec per broker | > 70% of capacity | Little headroom left; a traffic burst could saturate the broker |
| EventQueueSize | > 100 | Controller is backlogged; administrative operations will be slow |

Alert inhibition

When a broker fails, it typically triggers a cascade: a critical broker-down alert alongside secondary consumer lag and replication warnings. Alertmanager’s inhibition rules let you suppress those secondary warnings while the root-cause alert is active, keeping the on-call view clean:

inhibit_rules:
  - source_match:
      alertname: 'KafkaOfflinePartitions'
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['cluster', 'instance']

Alert fatigue note: ISR shrinks are expected during rolling restarts and broker upgrades. Consider suppressing IsrShrinksPerSec warnings during maintenance windows, or requiring a minimum duration of 10 minutes before the alert fires.

Common Kafka cluster-level issues and how to resolve them

| Symptom | Likely metrics | Root cause | Remediation |
| --- | --- | --- | --- |
| Producers receiving NotEnoughReplicasException | UnderMinIsrPartitionCount > 0, IsrShrinksPerSec elevated | One or more brokers are offline or severely lagging, shrinking the ISR below min.insync.replicas | Investigate the lagging broker: check disk, network, and JVM GC metrics. Restore the broker, or reduce min.insync.replicas temporarily as a last resort |
| Cluster metadata operations timing out | EventQueueSize > 100, LeaderElectionRateAndTimeMs elevated | Controller is overloaded: too many concurrent topic operations, or a large total partition count | Reduce concurrent topic creation or deletion operations; review total partition count; consider a controller bounce if the queue does not drain |
| No active controller | ActiveControllerCount sum = 0 | All brokers have lost controller state, typically because of a ZooKeeper quorum failure (ZK mode) or KRaft quorum loss | In ZK mode: investigate ZooKeeper ensemble health. In KRaft mode: check the metadata log and quorum state on controller-eligible brokers |
| Uneven consumer lag across partitions | LeaderCount skew, BytesInPerSec per broker uneven | Partition leader imbalance: one broker holds a disproportionate share of leaders | Run preferred leader election (kafka-leader-election.sh --election-type preferred --all-topic-partitions) or reassign partitions |
| Disk filling faster than expected | BytesInPerSec elevated, disk utilisation trending up | High produce rate, long retention, or insufficient compression | Review retention settings, check compression codec, evaluate whether a broker addition is needed |
| ISR instability (repeated shrinks and expands) | IsrShrinksPerSec and IsrExpandsPerSec both elevated | Network instability between brokers, or JVM GC pauses causing follower replicas to fall behind the leader for longer than replica.lag.time.max.ms | Check network error rates between brokers; review GC pause durations on the lagging broker; consider increasing replica.lag.time.max.ms if the pauses are short and infrequent |

Best practices for Kafka cluster monitoring

Monitor Kafka clusters with Kpow

Kpow connects to any Kafka cluster using the native Admin Client and Consumer APIs — no JMX Exporter, no sidecar agent, and no broker-side configuration changes required. It surfaces the cluster-level metrics covered in this article — under-replicated partitions, controller state, partition distribution, throughput per broker — on a single dashboard, with replication health and partition distribution views that update in real time.

You can give Kpow a try with a free 30-day trial. Connect it to any Kafka cluster in minutes and deploy via Docker, Helm, or JAR.