Kafka cluster monitoring

factor-house · June 4, 2026 · 17 min read

Kafka clusters are typically monitored from the broker up: process health checks, JVM metrics, disk alerts, per-broker request latencies. That instrumentation is necessary, but it has a blind spot. A broker can report normal metrics in isolation while the cluster as a whole is in a degraded state. Partition leadership can be unbalanced. Replication can be falling behind across the fleet. The controller can fail silently. None of these conditions are visible from a single broker’s perspective.

This article covers cluster-level monitoring: the aggregated, cross-broker view that reveals how the cluster is functioning as a unit. It covers the metrics that matter at this layer, how to collect them across a broker fleet, how to alert on them effectively, and how to diagnose the most common cluster-level failures.

Per-broker internals — request thread pools, JVM heap, disk I/O — are covered in the Kafka broker monitoring article. Consumer-side lag monitoring is covered in the Kafka consumer monitoring article. See the Kafka monitoring guide for the full picture.

Key takeaways

Cluster monitoring is different from broker monitoring: it gives you an aggregated view of replication health, partition distribution, and control plane state across all brokers.
ActiveControllerCount, OfflinePartitionsCount, and UnderReplicatedPartitions are the three cluster-level metrics that most directly signal data availability risk.
In KRaft mode (Kafka 3.3+ GA, default from 4.0), the control plane metrics change — ZooKeeper-era metrics are replaced by KRaft-native equivalents.
A cluster health check script that verifies five or six key metrics is usually enough to catch the most serious problems; the value is in running it consistently and alerting on deviations.
Capacity planning at the cluster level is about trend monitoring, not single-point thresholds — watch bytes-in per broker, disk utilisation rate, and partition count per broker over time.

What is Kafka cluster monitoring?

Cluster monitoring is the practice of observing the aggregated state of all brokers rather than the internals of any single one. The distinction matters in practice.

Broker monitoring tells you whether an individual broker is healthy: its request thread utilisation, JVM heap pressure, disk throughput, and local ISR state. Cluster monitoring tells you whether the cluster as a whole is functioning: whether replication is consistent across the fleet, whether partition leadership is balanced, whether the controller is in a valid state, and whether the cluster has capacity headroom.

Both layers are necessary. Cluster health checks can miss per-broker pathologies that are not yet visible at the aggregate level. A JVM GC pause on one broker, for example, may not immediately register as a replication deficit on the cluster health dashboard. Conversely, per-broker monitoring alone will not surface a partition imbalance, a controller failure, or a sustained ISR shrink rate that only becomes meaningful when measured across the fleet.

The same JMX interface exposes data at both levels. The difference is in how you query and aggregate it.

Key cluster-level metrics

Cluster-level metrics fall into two categories: metrics that are properties of the cluster as a whole (controller state, partition distribution), and metrics that are only meaningful when aggregated across brokers (replication health, throughput). The four groups below cover both.

Control plane health

The controller is the broker responsible for partition leader elections and cluster metadata management. There is always exactly one active controller in a functioning cluster. Deviations are critical.

Metric name	JMX MBean path	Description	Target / alert threshold
`ActiveControllerCount`	`kafka.controller:type=KafkaController,name=ActiveControllerCount`	Number of active controllers on this broker (0 or 1). Must sum to exactly 1 across all brokers.	Alert (critical) if cluster-wide sum is not exactly 1
`OfflinePartitionsCount`	`kafka.controller:type=KafkaController,name=OfflinePartitionsCount`	Number of partitions with no available leader. Reported by the active controller.	Alert (critical) if > 0
`LeaderElectionRateAndTimeMs`	`kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs`	Rate and duration of partition leader elections. Frequent or slow elections indicate controller instability or broker churn.	Alert (warning) if rate is sustained above baseline; alert (critical) if P99 duration is elevated
`EventQueueSize`	`kafka.controller:type=ControllerEventManager,name=EventQueueSize`	Pending administrative events queued on the controller.	Alert (warning) if > 100
`AvgIdleRatio`	`kafka.controller:type=ControllerEventManager,name=AvgIdleRatio`	Fraction of time the controller event thread is idle. A value approaching 0 indicates a saturated controller.	Alert (warning) if approaching 0

KRaft mode: In KRaft mode (Kafka 3.3+, default from 4.0), the following metrics replace ZooKeeper-era control plane telemetry. Verify MBean paths against the Kafka version in use.

Metric name	JMX MBean path	Description	Target / alert threshold
`LastAppliedRecordLagMs`	`kafka.controller:type=KafkaController,name=LastAppliedRecordLagMs`	Delay between metadata commits on the active controller and local application on standby controllers. Elevated values mean metadata changes are propagating slowly.	Alert (warning) if elevated and sustained
`MetadataErrorCount`	`kafka.controller:type=KafkaController,name=MetadataErrorCount`	Count of failed metadata log operations.	Alert (critical) if > 0
`current-state` (quorum state)	`kafka.server:type=raft-metrics,name=current-state`	State of the KRaft quorum member on this broker (Leader, Follower, Candidate).	Alert if no quorum member reports Leader; Follower is the normal state for standby controllers
`commit-latency-avg`	`kafka.server:type=raft-metrics,name=commit-latency-avg`	Average latency for KRaft metadata log commits.	Alert (warning) if elevated

A KRaft quorum of size n (typically an odd number: 3 or 5) requires a strict majority of floor(n/2) + 1 active nodes to elect a leader and accept metadata commits. If the active quorum drops below this threshold, the control plane enters a read-only mode. Brokers continue to process produce and consume requests for existing partitions, but topic creation, partition reassignment, and ISR state modifications are blocked. If a partition leader fails while the quorum is unavailable, client requests to that partition will time out.
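The majority rule above is simple arithmetic, and a few lines make the failure tolerance explicit. This is a minimal sketch; the function names are illustrative and not part of any Kafka API.

```python
def quorum_majority(voters):
    """Smallest number of live voters that still forms a strict majority."""
    if voters < 1:
        raise ValueError("a quorum needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters):
    """How many quorum members can be lost before the majority is gone."""
    return voters - quorum_majority(voters)


for n in (3, 4, 5):
    print(f"{n} voters: majority={quorum_majority(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

A four-node quorum tolerates one failure, the same as a three-node quorum, which is why odd sizes are the usual choice.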

Replication fleet health

These are the most operationally significant cluster-level metrics. They reflect whether data is being replicated as configured across all brokers, not just on one.

Metric name	JMX MBean path	Description	Target / alert threshold
`UnderReplicatedPartitions`	`kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`	Partitions where the current ISR count is less than the replication factor. Sum across all brokers for the cluster total.	Alert (critical) if sum > 0 for more than 5 minutes
`UnderMinIsrPartitionCount`	`kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount`	Partitions with fewer ISRs than `min.insync.replicas`. Producers configured with `acks=all` will receive errors.	Alert (critical) if > 0
`UncleanLeaderElectionsPerSec`	`kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec`	Rate of elections where a non-ISR replica was promoted to leader. Indicates data loss has occurred or is imminent.	Alert (critical) if > 0
`IsrShrinksPerSec`	`kafka.server:type=ReplicaManager,name=IsrShrinksPerSec`	Rate at which replicas are leaving the ISR. Track per broker and as a cluster sum.	Alert (warning) if sustained above zero
`IsrExpandsPerSec`	`kafka.server:type=ReplicaManager,name=IsrExpandsPerSec`	Rate at which replicas are rejoining the ISR.	Monitor alongside `IsrShrinksPerSec` — sustained shrinks paired with expands indicate ISR instability

Note that UnderReplicatedPartitions is technically a per-broker metric, but it only becomes useful as a cluster-level aggregate. A URP count of zero on one broker says nothing about the remaining brokers. Sum it across the fleet.

Partition distribution

These metrics indicate whether work is balanced evenly across the broker fleet. Imbalances affect both throughput and fault tolerance.

Metric name	JMX MBean path	Description	Target / alert threshold
`PartitionCount`	`kafka.server:type=ReplicaManager,name=PartitionCount`	Total partition replicas on this broker. Compare across brokers.	Alert (warning) if the spread between the highest and lowest broker exceeds 20%
`LeaderCount`	`kafka.server:type=ReplicaManager,name=LeaderCount`	Number of partition leaders on this broker. Leaders handle all produce and fetch traffic for their partitions.	Alert (warning) if leader distribution is significantly uneven across brokers

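Uneven leader counts after a broker restart or rolling upgrade are common and typically self-correct via preferred leader election when auto.leader.rebalance.enable is left at its default of true. Uneven partition counts do not self-correct; they require a partition reassignment. Alert only on sustained imbalance.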
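The 20% spread threshold can be computed from per-broker PartitionCount or LeaderCount readings. The sketch below assumes one reasonable definition of spread, (max - min) / max; the function names and sample counts are hypothetical.

```python
def relative_spread(counts_by_broker):
    """(max - min) / max over per-broker counts; 0.0 means perfectly even."""
    if not counts_by_broker:
        return 0.0
    highest = max(counts_by_broker.values())
    lowest = min(counts_by_broker.values())
    return 0.0 if highest == 0 else (highest - lowest) / highest


def is_imbalanced(counts_by_broker, threshold=0.20):
    """True when the spread exceeds the threshold (20% by default)."""
    return relative_spread(counts_by_broker) > threshold


# Hypothetical PartitionCount readings keyed by broker ID.
partition_counts = {1: 410, 2: 395, 3: 300}
print(f"spread={relative_spread(partition_counts):.1%}, "
      f"imbalanced={is_imbalanced(partition_counts)}")
```

To alert only on sustained imbalance, evaluate this over several consecutive samples rather than a single reading.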

Throughput and capacity signals

These metrics inform both capacity planning and short-term incident response.

Metric name	JMX MBean path	Description	Target / alert threshold
`BytesInPerSec`	`kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec`	Bytes received per second per broker. Track per broker and as a cluster total.	Alert (warning) when a broker’s throughput exceeds 70-80% of its network or disk write capacity
`BytesOutPerSec`	`kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec`	Bytes sent per second to clients per broker. Replication traffic to followers is reported separately as `ReplicationBytesOutPerSec`.	Monitor alongside `BytesInPerSec` and `ReplicationBytesOutPerSec`; each produced byte is copied to (replication factor - 1) followers
`MessagesInPerSec`	`kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec`	Messages received per second. Useful when message size varies across topics.	Cross-check against `BytesInPerSec`

Cluster vs broker monitoring: where the line sits

The cluster and broker layers are not always cleanly separable. Several metrics appear in per-broker JMX output but only carry meaning when interpreted at the cluster level.

UnderReplicatedPartitions is the clearest example: it is exposed per broker, but aggregating it across the fleet gives you the total replication deficit. A single broker reporting zero URPs is not informative if another broker is lagging. ActiveControllerCount is only interpretable as a cluster-wide sum — a value of 1 on one broker is normal; a cluster-wide sum of 0 or 2 is critical. Individual broker throughput metrics (BytesInPerSec per broker) are per-broker values that feed cluster-level capacity planning by revealing which brokers are carrying disproportionate load.

For replication and controller metrics, always aggregate. For throughput, collect per-broker and compare across the fleet. The Kafka broker monitoring article covers the internals — request thread saturation, JVM heap, disk flush latency — that complement the cluster view covered here.
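The two rules above (aggregate controller metrics, compare throughput per broker) can be sketched as follows. The function names, the 1.5x-median outlier factor, and the sample values are assumptions for illustration.

```python
from statistics import median


def controller_state(active_counts):
    """Classify the cluster from per-broker ActiveControllerCount readings."""
    total = sum(active_counts.values())
    if total == 1:
        return "ok"
    return "no-controller" if total == 0 else "multiple-controllers"


def throughput_outliers(bytes_in, factor=1.5):
    """Brokers whose BytesInPerSec exceeds factor x the fleet median."""
    if not bytes_in:
        return []
    limit = factor * median(bytes_in.values())
    return sorted(b for b, rate in bytes_in.items() if rate > limit)


# Hypothetical readings keyed by broker ID.
print(controller_state({1: 0, 2: 1, 3: 0}))
print(throughput_outliers({1: 40e6, 2: 42e6, 3: 95e6}))
```

Comparing against the median rather than the mean keeps a single hot broker from shifting the baseline it is measured against.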

Multi-broker observability setup

The collection challenge

JMX is a per-process interface. To get a cluster-wide picture, you need to scrape every broker’s JMX endpoint and aggregate the results centrally. In a three-broker cluster this is manageable manually; in a fleet of dozens, service discovery and automated aggregation become essential.

The most common production approach is Prometheus with the JMX Exporter, because it handles service discovery natively and the aggregation logic lives in PromQL.

Prometheus and JMX Exporter

Run the JMX Exporter as an in-process Java agent on each broker. Running it in-process avoids configuring remote JMX authentication and adds less JVM thread overhead than remote polling:

KAFKA_OPTS="-javaagent:/opt/prometheus/jmx_prometheus_javaagent-0.16.1.jar=7071:/etc/kafka/kafka-jmx-config.yml"

The agent exposes Kafka’s JMX MBeans as Prometheus-formatted metrics on an HTTP endpoint (typically port 7071). Configure Prometheus with a scrape job that discovers all broker endpoints. In Kubernetes, a ServiceMonitor resource handles this automatically; in bare-metal or VM deployments, use static targets in prometheus.yml.
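For the static-target case, a scrape job might look like the sketch below. The hostnames and port follow the examples used elsewhere in this article; the job name, scrape interval, and cluster label value are assumptions to adapt.

```yaml
scrape_configs:
  - job_name: kafka-brokers        # hypothetical job name
    scrape_interval: 30s           # assumed interval; tune to your needs
    static_configs:
      - targets:
          - broker1:7071
          - broker2:7071
          - broker3:7071
        labels:
          cluster: kafka-prod      # hypothetical label for per-cluster aggregation
```

Attaching a cluster label at scrape time makes it possible to aggregate per cluster in PromQL when one Prometheus instance scrapes several clusters.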

Aggregation across brokers happens in PromQL. To get the cluster-wide under-replicated partition count:

sum(kafka_server_replicamanager_underreplicatedpartitions)

Kafka-specific JMX Exporter YAML configuration files are maintained by the community. The Bitnami and confluentinc examples on GitHub are widely used starting points and include pre-built allowlists for the metrics covered in this article.

Other collection approaches

Datadog: Uses JMXFetch via the Agent daemon. Autodiscovery maps JMX ports to integrations automatically. The default cap of 350 metrics per instance means you will need to configure which metrics are collected carefully for larger clusters.

Kpow (Factor House): Uses the native Kafka Admin Client and Consumer APIs rather than JMX — no sidecar agent or broker-side configuration changes required.

If you’re evaluating monitoring tools, the Kafka observability tools comparison article covers the trade-offs in more detail.

Cluster health check script

A short Python script that checks six cluster-level conditions gives you a pass/fail signal suitable for cron jobs, CI pipelines, or on-call runbooks. The value is in running it consistently on a schedule and routing its output to your alerting channel.

What the script checks

ActiveControllerCount sum equals exactly 1
OfflinePartitionsCount equals 0
UnderReplicatedPartitions sum equals 0
UnderMinIsrPartitionCount equals 0
ISR shrink rate is below a configurable threshold
Leader count distribution — no single broker holds more than a configurable percentage of all leaders

Implementation

The Kafka Admin Client (kafka-python or confluent-kafka) exposes cluster metadata directly without requiring JMX access. With kafka-python, KafkaAdminClient.describe_cluster() returns broker and controller state, and describe_topics() returns partition and leader assignments; confluent-kafka exposes the same information through AdminClient.list_topics().

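UnderReplicatedPartitions and UnderMinIsrPartitionCount are not exposed via the Admin API; for those, query the JMX Exporter HTTP endpoint if Prometheus is already deployed, or fall back to jmxterm. The script below implements three of the six checks (active controller, URP sum, and leader skew); the remaining checks follow the same JMX Exporter pattern.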

```python
import sys

import requests
from kafka import KafkaAdminClient

BROKERS = ["broker1:9092", "broker2:9092", "broker3:9092"]
JMX_EXPORTER_HOSTS = ["broker1:7071", "broker2:7071", "broker3:7071"]
LEADER_SKEW_THRESHOLD = 0.20
URP_METRIC = "kafka_server_replicamanager_underreplicatedpartitions"


def check_active_controller(admin):
    # kafka-python returns cluster metadata as a dict; controller_id is -1
    # when the cluster reports no controller.
    cluster_meta = admin.describe_cluster()
    controller_id = cluster_meta.get("controller_id", -1)
    if controller_id is None or controller_id < 0:
        return False, "No active controller"
    return True, f"Controller: broker {controller_id}"


def check_urp_from_jmx(hosts):
    # Sum the per-broker URP gauge exposed by each JMX Exporter endpoint.
    total_urp = 0
    for host in hosts:
        try:
            r = requests.get(f"http://{host}/metrics", timeout=5)
            for line in r.text.splitlines():
                if URP_METRIC in line and not line.startswith("#"):
                    total_urp += float(line.split()[-1])
        except Exception as e:
            # An unreachable exporter is only warned about, not counted.
            print(f" Warning: could not reach {host}: {e}")
    return total_urp


def check_leader_skew(admin):
    # describe_topics() returns one dict per topic with a "partitions" list.
    leader_counts = {}
    for topic in admin.describe_topics():
        for partition in topic["partitions"]:
            leader = partition["leader"]
            if leader < 0:  # -1 means the partition currently has no leader
                continue
            leader_counts[leader] = leader_counts.get(leader, 0) + 1
    if not leader_counts:
        return True, "No partitions"
    # Skew is (max - min) / max across brokers that lead at least one partition.
    counts = list(leader_counts.values())
    skew = (max(counts) - min(counts)) / max(counts)
    ok = skew <= LEADER_SKEW_THRESHOLD
    return ok, f"Leader skew: {skew:.1%} (threshold {LEADER_SKEW_THRESHOLD:.0%})"


def run_health_check():
    admin = KafkaAdminClient(bootstrap_servers=BROKERS)
    results = []

    ok, msg = check_active_controller(admin)
    results.append(("ActiveControllerCount", ok, msg))

    urp = check_urp_from_jmx(JMX_EXPORTER_HOSTS)
    results.append(("UnderReplicatedPartitions", urp == 0, f"URP count: {int(urp)}"))

    ok, msg = check_leader_skew(admin)
    results.append(("LeaderSkew", ok, msg))

    admin.close()

    print("Kafka cluster health check")
    all_ok = True
    for check, passed, detail in results:
        status = "PASS" if passed else "FAIL"
        print(f" [{status}] {check}: {detail}")
        if not passed:
            all_ok = False

    return all_ok


if __name__ == "__main__":
    # Exit code 0 on success and 1 on failure, so cron or CI can act on it.
    sys.exit(0 if run_health_check() else 1)
```
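Checks that read from the JMX Exporter, such as UnderMinIsrPartitionCount and OfflinePartitionsCount, can share one parsing helper. The sketch below sums a metric from Prometheus text-format output by exact name. The metric names shown are assumptions that depend on your exporter rules, and the parser assumes samples carry no trailing timestamp.

```python
def sum_metric(exposition_text, metric_name):
    """Sum every sample of metric_name in Prometheus text-format output."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        # Drop the optional {label="..."} block to get the bare metric name.
        name = name_and_labels.split("{", 1)[0].strip()
        if name == metric_name:
            total += float(value)
    return total


def cluster_total(texts_by_broker, metric_name):
    """Aggregate one metric across the scraped output of every broker."""
    return sum(sum_metric(text, metric_name) for text in texts_by_broker.values())


# Hypothetical exporter output from two brokers.
sample = {
    "broker1": "# TYPE kafka_server_replicamanager_underminisrpartitioncount gauge\n"
               "kafka_server_replicamanager_underminisrpartitioncount 2.0\n",
    "broker2": "kafka_server_replicamanager_underminisrpartitioncount 1.0\n",
}
print(cluster_total(sample, "kafka_server_replicamanager_underminisrpartitioncount"))
```

Matching on the exact metric name, rather than a substring, avoids accidentally counting a differently named series that happens to share a prefix.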

Limitations

The script reflects point-in-time state. A transient ISR shrink during a rolling restart will register as a failure unless you add a wait-and-recheck delay for replication metrics.
It does not replace continuous time-series monitoring — it is a runbook tool, not a substitute for alerting.
JVM and OS metrics are not covered here; those are addressed in the broker monitoring article.
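One way to add the wait-and-recheck delay mentioned in the first limitation is a small retry wrapper around any check. This is a sketch under assumed defaults (three attempts, 30 seconds apart); the helper name is hypothetical.

```python
import time


def passes_with_recheck(check, attempts=3, delay_seconds=30.0, sleep=time.sleep):
    """Return True as soon as check() passes; fail only if every attempt fails."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # give replication time to catch up
    return False


# Example wiring with the health check script's URP function:
#   passes_with_recheck(lambda: check_urp_from_jmx(JMX_EXPORTER_HOSTS) == 0)
```

The sleep function is injectable so the wrapper can be tested without real delays.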

The consumer monitoring article contains a consumer lag script that complements this one. Together they cover the three main health dimensions: cluster-wide replication and control plane state (this article), per-broker internals (broker monitoring), and consumer group lag (consumer monitoring).

Capacity planning using cluster metrics

Capacity planning at the cluster level is about detecting trends early enough to act before they become incidents. The signals below are leading indicators — most warrant investigation and planning rather than immediate paging.

When to add brokers

The primary signals that a new broker is needed:

BytesInPerSec per broker is consistently above 70% of the broker’s network or disk write capacity. A useful rule of thumb is to size for 3x peak traffic to allow for replication overhead and burst headroom. Replication multiplies write load by the replication factor: a cluster with a replication factor of 3 and 1 Gbps of ingest writes roughly 3 Gbps to disk in total, of which 2 Gbps is inter-broker replication traffic.
RequestHandlerAvgIdlePercent is below 20% on multiple brokers over a sustained period, indicating that the request handler thread pool is saturated. This is typically caused by slow disk flushes, high JVM GC pauses, or high concurrent replication load from lagging followers.
Disk utilisation is growing at a rate that will exhaust available space within your retention window, with no further compression or retention policy levers available.
PartitionCount per broker is becoming significantly uneven after broker additions or failures — partition reassignment may be needed before adding a new broker.
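The throughput signal in the first item comes down to two small calculations. Each produced byte is written once on the leader and copied to (replication factor - 1) followers. The function names and sample figures below are illustrative assumptions.

```python
def cluster_write_load(ingest_gbps, replication_factor):
    """Replication traffic and total write load implied by client ingest."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return {
        "replication_gbps": ingest_gbps * (replication_factor - 1),
        "total_write_gbps": ingest_gbps * replication_factor,
    }


def needs_capacity_review(bytes_in_per_sec, capacity_bytes_per_sec, threshold=0.70):
    """True when a broker's ingest exceeds the given share of its capacity."""
    return bytes_in_per_sec / capacity_bytes_per_sec > threshold


print(cluster_write_load(1.0, 3))
# Hypothetical broker: 90 MB/s ingest against a 125 MB/s (1 Gbps) link.
print(needs_capacity_review(90e6, 125e6))
```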

Partition count planning

Partitions are the unit of parallelism in Kafka. Too few limits throughput; too many increases controller overhead and replication cost.

Track total partition count and partition count per broker. The controller’s EventQueueSize rising during periods of high topic creation activity is a signal that the cluster is approaching the limits of what the controller can manage comfortably.

For older Kafka versions (pre-2.6), a commonly cited practical limit is approximately 4,000 partitions per broker, though this varies considerably with hardware and workload. With KRaft, the practical limit is significantly higher — the consolidated metadata management architecture removes the ZooKeeper bottleneck that was the primary constraint in earlier versions. Consult the release notes and benchmark reports for the specific version you are running.

Replication factor auditing

Topics created with replication.factor=1 in a multi-broker cluster represent a silent durability risk. A single broker failure takes those partitions offline with no replication fallback.

Audit replication factors on a schedule using kafka-topics.sh --describe or the Admin API:

```python
from kafka import KafkaAdminClient

EXPECTED_REPLICATION_FACTOR = 3

admin = KafkaAdminClient(bootstrap_servers=["broker1:9092"])
topic_names = list(admin.list_topics())
# describe_topics() returns one dict per topic, each with a "partitions" list.
topics = admin.describe_topics(topic_names)

for t in topics:
    for partition in t["partitions"]:
        replicas = partition["replicas"]
        if len(replicas) < EXPECTED_REPLICATION_FACTOR:
            print(f"{t['topic']} partition {partition['partition']}: "
                  f"replication factor {len(replicas)}")

admin.close()
```

Retention and storage forecasting

Monitor disk utilisation per broker and extrapolate the growth rate from the past 7 and 30 days. BytesInPerSec is the primary driver of storage growth once you account for the replication factor and compression codec.
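The extrapolation can be as simple as a least-squares line through recent disk usage samples. The sketch below assumes daily samples of used bytes; the function names and figures are hypothetical, and a linear fit will understate growth that is accelerating.

```python
def growth_per_day(samples):
    """Least-squares slope of used bytes over (day, used_bytes) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    covariance = sum((day - mean_day) * (used - mean_used) for day, used in samples)
    variance = sum((day - mean_day) ** 2 for day, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one day")
    return covariance / variance


def days_until_full(samples, capacity_bytes):
    """Days until the disk fills at the fitted rate, or None if not growing."""
    slope = growth_per_day(samples)
    if slope <= 0:
        return None
    latest_used = samples[-1][1]
    return max(0.0, (capacity_bytes - latest_used) / slope)


# Hypothetical 7-day history on a 2 TB volume, growing about 20 GB per day.
history = [(day, 1.2e12 + day * 20e9) for day in range(7)]
print(days_until_full(history, 2e12))
```

Running the same calculation over 7-day and 30-day windows and comparing the results shows whether growth is accelerating.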

If storage is growing faster than expected, the first lever to check is log retention settings — both time-based (retention.ms) and size-based (retention.bytes). Reducing retention decreases storage pressure but can break consumers that have fallen behind. Document the trade-off before changing retention on production topics.

Alerting strategy for cluster-level monitoring

Not all metrics warrant the same response. Structuring alerts into two tiers avoids alert fatigue and ensures the right response time for each condition.

Critical alerts (page on-call immediately)

Metric	Condition	Reason
`ActiveControllerCount` (sum)	Not exactly 1	No controller or split-brain — metadata operations are halted
`OfflinePartitionsCount`	> 0	Partitions are completely unavailable to producers and consumers
`UnderMinIsrPartitionCount`	> 0	Producers with `acks=all` are receiving errors
`UnderReplicatedPartitions`	> 0 for > 5 minutes	Sustained replication deficit — data durability is degraded
`UncleanLeaderElectionsPerSec`	> 0	A non-ISR replica was promoted to leader, indicating data loss risk

Working Prometheus alert rules for the two most critical conditions:
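A minimal sketch of such rules, covering ActiveControllerCount and OfflinePartitionsCount. The metric names assume a JMX Exporter configuration that lowercases MBean names in the same style as the URP query earlier in this article; verify them against your own /metrics output. The group name, alert names, and one-minute duration are also assumptions.

```yaml
groups:
  - name: kafka-cluster-critical
    rules:
      - alert: KafkaActiveControllerCountInvalid
        # Cluster-wide sum must be exactly 1 (0 = no controller, 2+ = split-brain).
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka cluster does not have exactly one active controller"
      - alert: KafkaOfflinePartitions
        # Any partition without a leader is unavailable to clients.
        expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "One or more Kafka partitions have no available leader"
```

If every scrape target is down, these expressions return no data and the rules do not fire, so pair them with a target-availability alert. When one Prometheus instance scrapes several clusters, aggregate with sum by (cluster) instead of a plain sum.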

Warning alerts (alert team channel, investigate within the hour)

Metric	Condition	Reason
`IsrShrinksPerSec`	Sustained above zero	Replicas are falling behind; investigate before it becomes a URP
`LeaderCount` skew	> 20% spread across brokers	Producer and consumer traffic is unevenly distributed
`BytesInPerSec` per broker	> 70% of capacity	Little headroom left; a traffic burst could saturate the broker
`EventQueueSize`	> 100	Controller is backlogged; administrative operations will be slow

Alert inhibition

When a broker fails, it typically triggers a cascade: a critical broker-down alert alongside secondary consumer lag and replication warnings. Alertmanager’s inhibition rules let you suppress those secondary warnings while the root-cause alert is active, keeping the on-call view clean:

inhibit_rules:
  - source_match:
      alertname: 'KafkaOfflinePartitions'
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['cluster', 'instance']

Alert fatigue note: ISR shrinks are expected during rolling restarts and broker upgrades. Consider suppressing IsrShrinksPerSec warnings during maintenance windows, or requiring a minimum duration of 10 minutes before the alert fires.

Common Kafka cluster-level issues and how to resolve them

Symptom	Likely metrics	Root cause	Remediation
Producers receiving `NotEnoughReplicasException`	`UnderMinIsrPartitionCount > 0`, `IsrShrinksPerSec` elevated	One or more brokers are offline or severely lagging, shrinking the ISR below `min.insync.replicas`	Investigate the lagging broker: check disk, network, and JVM GC metrics. Restore the broker, or reduce `min.insync.replicas` temporarily as a last resort
Cluster metadata operations timing out	`EventQueueSize > 100`, `LeaderElectionRateAndTimeMs` elevated	Controller is overloaded — too many concurrent topic operations, or a large total partition count	Reduce concurrent topic creation or deletion operations; review total partition count; consider a controller bounce if the queue does not drain
No active controller	`ActiveControllerCount` sum = 0	All brokers have lost controller state — typically a ZooKeeper quorum failure (ZK mode) or KRaft quorum loss	In ZK mode: investigate ZooKeeper ensemble health. In KRaft mode: check the metadata log and quorum state on controller-eligible brokers
Uneven consumer lag across partitions	`LeaderCount` skew, `BytesInPerSec` per broker uneven	Partition leader imbalance — one broker holds a disproportionate share of leaders	Run preferred leader election (`kafka-leader-election.sh --type preferred`) or reassign partitions
Disk filling faster than expected	`BytesInPerSec` elevated, disk utilisation trending up	High produce rate, long retention, or insufficient compression	Review retention settings, check compression codec, evaluate whether a broker addition is needed
ISR instability (repeated shrinks and expands)	`IsrShrinksPerSec` and `IsrExpandsPerSec` both elevated	Network instability between brokers, or JVM GC pauses causing follower replicas to fall behind on fetch requests for longer than `replica.lag.time.max.ms`	Check network error rates between brokers; review GC pause durations on the lagging broker; consider increasing `replica.lag.time.max.ms` if GC pauses are a one-time event

Best practices for Kafka cluster monitoring

Monitor ActiveControllerCount as a cluster-wide sum, not per broker. A per-broker reading of 1 is normal; only the sum tells you whether the cluster has exactly one controller.
Treat UnderReplicatedPartitions as a cluster-level aggregate. Sum it across all brokers. A URP count of zero on one broker says nothing if another broker is lagging.
Separate replication traffic from consumer traffic in your throughput metrics. Track BytesOutPerSec (client fetches) and ReplicationBytesOutPerSec (follower fetches) as separate series so that you can tell whether an egress spike comes from consumers or from replication catch-up.
Set alert durations on replication metrics, not just thresholds. An ISR shrink during a rolling restart is expected; one that persists for five minutes is not.
Audit replication factors on a schedule. Topics created with non-standard replication factors are common after high-velocity development cycles and represent silent durability risk.
In KRaft mode, add LastAppliedRecordLagMs and MetadataErrorCount to your standard cluster health dashboard. These have no ZooKeeper equivalents and are easy to overlook when migrating an existing monitoring setup.
Run a cluster health check script on a cron schedule — every five minutes works well — and route its output to your alerting channel. It catches conditions that continuous metric alerting can miss during gaps in scrape coverage.

Monitor Kafka clusters with Kpow

Kpow connects to any Kafka cluster using the native Admin Client and Consumer APIs — no JMX Exporter, no sidecar agent, and no broker-side configuration changes required. It surfaces the cluster-level metrics covered in this article — under-replicated partitions, controller state, partition distribution, throughput per broker — on a single dashboard, with replication health and partition distribution views that update in real time.

You can give Kpow a try with a free 30-day trial. Connect it to any Kafka cluster in minutes and deploy via Docker, Helm, or JAR.