Kafka broker monitoring

factor-house | June 4, 2026 | 16 min read

Brokers are the operational core of a Kafka cluster. They receive produce requests, serve fetch requests, manage partition replicas, and coordinate leader election. When a broker degrades, every producer and consumer connected to it is affected, often before your monitoring stack surfaces anything useful.

This article covers broker-level monitoring specifically: the JMX metrics to watch, thresholds to alert on, process monitoring scripts, and how to diagnose the most common failure modes. For cluster-wide concerns, consumer lag, and producer delivery guarantees, refer to the separate guides on Kafka cluster monitoring, Kafka consumer monitoring, and Kafka producer monitoring. For a full overview of Kafka observability, see the Kafka monitoring guide.

Key takeaways

Kafka brokers expose metrics via JMX by default; Prometheus users need the JMX Exporter agent to scrape them.
UnderReplicatedPartitions and ActiveControllerCount are the two most critical metrics for cluster health.
Broker monitoring covers three layers: the Kafka process, the JVM, and the host OS. All three require attention.
A lightweight process monitoring script can detect broker failures faster than most monitoring stacks.
Most broker incidents trace to a small set of root causes: disk saturation, leader imbalance, network thread exhaustion, or ISR instability.

What is Kafka broker monitoring?

A Kafka broker handles several concurrent responsibilities: writing incoming messages to disk, serving read requests from consumers, replicating partition data to follower brokers, and participating in leader election. Each of these activities has distinct failure modes.

Broker health is not the same as cluster health. A single degraded broker can produce replica lag, leader imbalance, and consumer fetch failures without the cluster appearing unavailable. Topics remain accessible on unaffected brokers, but the partitions whose leaders or followers are on the degraded broker will silently degrade. Identifying which broker is under pressure, and why, is the purpose of broker-level monitoring.

Monitoring spans three layers:

Kafka process metrics (exposed via JMX): replication state, request throughput, controller status, failure rates.
JVM metrics: heap usage, garbage collection pause duration, open file descriptors.
Host OS metrics: disk capacity, disk I/O throughput, network utilization.

Consumer lag, producer delivery guarantees, and cluster-wide partition distribution are covered in separate articles. This article focuses on the broker process and what runs immediately around it.

How Kafka brokers expose metrics

JMX (default)

Kafka exposes metrics as JMX MBeans by default. Server-side broker metrics use Yammer Metrics internally; native Java clients use Kafka’s own metrics registry. Both project their measurements onto MBeans hosted in the JVM’s MBean server.

Remote JMX access is disabled by default. To enable it, set the JMX_PORT environment variable when starting the broker:

export JMX_PORT=9999

In production, raw JMX access is a security risk. Because JMX supports remote method invocation (RMI), an unauthenticated client can invoke operations on MBeans, including modifying runtime configurations and triggering JVM shutdowns. Use KAFKA_JMX_OPTS to enforce authentication and encryption:

```bash
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=true \
  -Dcom.sun.management.jmxremote.ssl=true \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.rmi.port=9999 \
  -Dcom.sun.management.jmxremote.password.file=/etc/kafka/jmxremote.password \
  -Dcom.sun.management.jmxremote.access.file=/etc/kafka/jmxremote.access"
```

Tools such as jconsole, jmxterm, and kafka-run-class.sh kafka.tools.JmxTool can query JMX directly once access is enabled.
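The password and access files referenced in KAFKA_JMX_OPTS are plain text files in the standard JDK format: one role and password per line in the password file, and one role and access level per line in the access file. The following is a minimal sketch of creating them for a read-only monitoring role. The function name, role name, and secret location are illustrative assumptions, not values from a real deployment.

```bash
#!/bin/bash
# Sketch: create JMX password and access files for a read-only monitoring role.
# The role name, password source, and target directory are illustrative assumptions.

write_jmx_auth_files() {
  local dir="$1" role="$2" password="$3"

  # Password file: one "<role> <password>" pair per line.
  printf '%s %s\n' "${role}" "${password}" > "${dir}/jmxremote.password"
  # Access file: "readonly" lets the role read attributes but not invoke operations.
  printf '%s readonly\n' "${role}" > "${dir}/jmxremote.access"

  # The JVM rejects a password file that other users can read.
  chmod 600 "${dir}/jmxremote.password"
  chmod 644 "${dir}/jmxremote.access"
}

# Example (run as the user that owns the broker process):
#   write_jmx_auth_files /etc/kafka monitor "$(cat /run/secrets/jmx_password)"
```

Granting only readonly access to the monitoring role limits the RMI risk described above, since the role cannot invoke MBean operations.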

Prometheus via JMX Exporter

The standard approach for Prometheus-based stacks is to run the Prometheus JMX Exporter as a Java agent alongside the broker JVM.

Running the exporter in local agent mode (rather than as a standalone HTTP server) is preferred. It avoids the serialization overhead of remote RMI polling and captures additional host-level process metrics including JVM CPU and memory utilization.

The agent translates JMX MBeans to Prometheus metrics and serves them on an HTTP endpoint, typically on port 7071. It requires a YAML configuration file that maps MBean paths to Prometheus metric names. A community-maintained configuration for Kafka is available in the JMX Exporter GitHub repository. Keep this configuration file in version control alongside your broker configuration and review it when upgrading Kafka versions.
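As a rough illustration of agent mode, the exporter is attached with a -javaagent JVM option whose argument combines the jar path, the HTTP port, and the YAML configuration path. The sketch below builds that option string; the jar and config paths are placeholders, and the helper function name is hypothetical.

```bash
#!/bin/bash
# Sketch: build the -javaagent option for the Prometheus JMX Exporter.
# Jar path, port, and config path are placeholders, not values from this article.

jmx_exporter_opt() {
  local jar="$1" port="$2" config="$3"
  # Agent argument format: <jar>=<port>:<path to YAML config>
  printf -- '-javaagent:%s=%s:%s\n' "${jar}" "${port}" "${config}"
}

# Example: append to KAFKA_OPTS before starting the broker, then confirm the endpoint.
#   export KAFKA_OPTS="${KAFKA_OPTS} $(jmx_exporter_opt /opt/jmx/jmx_prometheus_javaagent.jar 7071 /etc/kafka/jmx-exporter.yml)"
#   curl -s http://localhost:7071/metrics | head
```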

Kafka metrics reporters (pluggable)

Kafka supports custom MetricsReporter implementations via the metric.reporters setting in server.properties. This allows metrics to be pushed directly to external systems without going through JMX or Prometheus. Confluent Platform ships a reporter that sends metrics to Confluent Control Center. Vendors such as Datadog and New Relic provide their own reporter implementations as well.

KRaft mode: In KRaft mode (GA since Kafka 3.3), some ZooKeeper-related MBeans are removed and several KRaft-specific MBeans appear in their place, including FencedBrokerCount and LastAppliedRecordLagMs. If you are migrating from ZooKeeper mode, or if you have inherited dashboards built against an older cluster, audit your MBean paths before deploying them against a KRaft cluster.
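One way to start that audit is to search dashboard and alert-rule files for references to ZooKeeper-era MBeans. The sketch below assumes dashboards are stored as files on disk and checks for two MBean types that exist only in ZooKeeper mode (ZooKeeperClientMetrics and SessionExpireListener); the pattern and function name are illustrative and should be extended for your own dashboards.

```bash
#!/bin/bash
# Sketch: list dashboard or alert-rule files that still reference ZooKeeper-era MBeans.
# The pattern covers two MBean types from ZooKeeper mode; extend it as needed.

ZK_MBEAN_PATTERN='ZooKeeperClientMetrics|SessionExpireListener'

find_zk_references() {
  local dir="$1"
  # -r recurse, -l print matching file names only, -i ignore case (exporter metric
  # names are often lowercased), -E extended regex for the alternation.
  grep -rliE "${ZK_MBEAN_PATTERN}" "${dir}" | sort
}

# Example:
#   find_zk_references ./grafana/dashboards
```

Any file this prints contains a panel or rule that will return no data on a KRaft cluster.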

Key metrics to monitor

The sections below cover metrics across all three layers. JMX MBean paths are given for direct JMX access; if you are using Prometheus, metric names follow the pattern produced by the JMX Exporter configuration.

Replication health metrics

These are the highest-priority metrics on any broker. A sustained non-zero value in either of the first two rows means data availability is at risk; a non-zero value in the third row is an early warning.

Metric name	JMX MBean path	Description	Why it matters
`UnderReplicatedPartitions`	`kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`	Partitions where the ISR count is below the replication factor	A sustained non-zero value means the cluster has fewer replicas than configured to fall back on if a leader fails
`UnderMinIsrPartitionCount`	`kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount`	Partitions with fewer in-sync replicas than `min.insync.replicas`	Producers configured with `acks=all` receive `NotEnoughReplicasException` when this is non-zero
`AtMinIsrPartitionCount`	`kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount`	Partitions where ISR count equals `min.insync.replicas` exactly	One more broker going offline pushes these into the under-min state; treat as a warning signal
`IsrShrinksPerSec`	`kafka.server:type=ReplicaManager,name=IsrShrinksPerSec`	Rate at which replicas are leaving the ISR	Sustained shrinks indicate follower replicas are falling behind, typically from disk pressure or GC pauses
`IsrExpandsPerSec`	`kafka.server:type=ReplicaManager,name=IsrExpandsPerSec`	Rate at which replicas are rejoining the ISR	High expand rate alongside shrinks indicates ISR churn rather than a clean recovery

Controller metrics

A healthy Kafka cluster has exactly one active controller at any time. These metrics confirm that invariant holds and that the controller is performing well.

Metric name	JMX MBean path	Description	Why it matters
`ActiveControllerCount`	`kafka.controller:type=KafkaController,name=ActiveControllerCount`	Whether this broker is currently the active controller (0 or 1)	The cluster-wide sum must be exactly 1. Zero means no metadata management; more than 1 indicates a split-brain condition
`OfflinePartitionsCount`	`kafka.controller:type=KafkaController,name=OfflinePartitionsCount`	Partitions with no available leader	Non-zero means those partitions are completely unavailable to producers and consumers
`LeaderElectionRateAndTimeMs`	`kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs`	Frequency and duration of leader elections	Frequent elections trigger consumer rebalances and producer retries; elevated duration suggests controller pressure
`FencedBrokerCount`	`kafka.controller:type=KafkaController,name=FencedBrokerCount`	Brokers the controller has marked as unreachable (KRaft only)	Any fenced broker represents lost partition leadership and requires immediate investigation
`LastAppliedRecordLagMs`	`kafka.controller:type=KafkaController,name=LastAppliedRecordLagMs`	Metadata replication lag relative to the active controller (KRaft only)	Should be 0 on the active controller; values above 1,000ms on standby nodes indicate metadata replication delay
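Because ActiveControllerCount is reported per node, the invariant has to be checked on the sum across the cluster. The sketch below assumes you have already collected one value from every node that reports the metric (how you collect them is up to your tooling) and evaluates the sum. The function name is hypothetical.

```bash
#!/bin/bash
# Sketch: evaluate the cluster-wide ActiveControllerCount invariant.
# Input: one ActiveControllerCount value (0 or 1) per node, however you collect them.

controller_state() {
  local sum=0 value
  for value in "$@"; do
    sum=$((sum + value))
  done

  if [ "${sum}" -eq 1 ]; then
    echo "OK: exactly one active controller"
  elif [ "${sum}" -eq 0 ]; then
    echo "CRITICAL: no active controller"
  else
    echo "CRITICAL: ${sum} active controllers (possible split-brain)"
  fi
  # Return success only when the invariant holds.
  [ "${sum}" -eq 1 ]
}

# Example:
#   controller_state 0 1 0
```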

Request handling metrics

These metrics indicate whether the broker is keeping up with its request load.

Metric name	JMX MBean path	Description	Why it matters
`RequestHandlerAvgIdlePercent`	`kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent`	Average idle percentage of the request handler thread pool	Values below 0.20 indicate saturation on request processing
`NetworkProcessorAvgIdlePercent`	`kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent`	Average idle percentage of the network processor threads	Target above 0.30; below 0.20 indicates network thread exhaustion, often caused by slow consumers
`TotalTimeMs` (Produce)	`kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce`	End-to-end latency for produce requests	P99 target is under 100ms; sustained spikes indicate broker-side pressure or slow disk
`TotalTimeMs` (FetchConsumer)	`kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer`	End-to-end latency for consumer fetch requests	Sustained elevation traces to replica lag or disk read pressure
`RequestQueueSize`	`kafka.network:type=RequestChannel,name=RequestQueueSize`	Requests waiting for a handler thread	Should remain below 10; sustained growth indicates handler thread saturation
`FailedProduceRequestsPerSec`	`kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec`	Rate of failed produce requests	Any non-zero value requires investigation
`FailedFetchRequestsPerSec`	`kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec`	Rate of failed fetch requests	Any non-zero value requires investigation

Latency sub-phases. TotalTimeMs is the sum of five phases: time waiting in the request queue (RequestQueueTimeMs), local processing time on the partition leader (LocalTimeMs), replication wait time for acks=all producers (RemoteTimeMs), time waiting in the response queue (ResponseQueueTimeMs), and time to transmit the response to the client (ResponseSendTimeMs). When P99 TotalTimeMs is elevated, checking each sub-phase narrows the root cause. High RemoteTimeMs points to replication pressure. High LocalTimeMs typically indicates disk write saturation or message format conversion overhead.
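As a rough first pass at that decomposition, the sketch below takes the five sub-phase values and reports the largest contributor. It assumes you have already read the values (for example, the P99 of each sub-phase in milliseconds) from your metrics system; the function name is hypothetical. Percentiles of sub-phases do not add up exactly to the percentile of the total, so treat the result as a pointer for investigation rather than a precise breakdown.

```bash
#!/bin/bash
# Sketch: report which TotalTimeMs sub-phase contributes the most latency.
# Arguments are values in milliseconds, in the order shown in the usage line.

dominant_phase() {
  if [ "$#" -ne 5 ]; then
    echo "usage: dominant_phase <queue> <local> <remote> <response_queue> <response_send>" >&2
    return 2
  fi
  # awk handles fractional millisecond values, which shell arithmetic cannot.
  awk -v q="$1" -v l="$2" -v r="$3" -v rq="$4" -v rs="$5" 'BEGIN {
    name[1] = "RequestQueueTimeMs";  val[1] = q
    name[2] = "LocalTimeMs";         val[2] = l
    name[3] = "RemoteTimeMs";        val[3] = r
    name[4] = "ResponseQueueTimeMs"; val[4] = rq
    name[5] = "ResponseSendTimeMs";  val[5] = rs
    best = 1
    for (i = 2; i <= 5; i++) if (val[i] + 0 > val[best] + 0) best = i
    print name[best]
  }'
}

# Example:
#   dominant_phase 2 5 180.5 1 3
```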

Throughput and I/O metrics

These metrics give a baseline view of broker load and are useful for capacity planning.

Metric name	JMX MBean path	Description	Why it matters
`BytesInPerSec`	`kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec`	Bytes received per second	Sudden spikes can saturate disk write bandwidth or network interfaces
`BytesOutPerSec`	`kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec`	Bytes sent per second to clients	Replication traffic to followers is reported separately as `ReplicationBytesOutPerSec`; track both to see total outbound load
`MessagesInPerSec`	`kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec`	Messages received per second	Useful for understanding message rate independently of message size

Log and disk metrics

Metric name	JMX MBean path	Description	Why it matters
`OfflineLogDirectoryCount`	`kafka.log:type=LogManager,name=OfflineLogDirectoryCount`	Log directories that are offline	Any non-zero value means the broker cannot write to some partitions
`DeadThreadCount`	`kafka.log:type=LogCleaner,name=DeadThreadCount`	Log cleaner threads that have failed silently	Non-zero means compaction has stopped; `uncleanable-bytes` will grow until disk fills
`uncleanable-bytes`	`kafka.log:type=LogCleanerManager,name=uncleanable-bytes`	Bytes in partitions the log cleaner has marked uncleanable (reported per log directory)	A persistent upward trend on compacted topics indicates log cleaner failure
Disk space (OS)	n/a – monitor via node exporter or equivalent	Percentage of disk used on the Kafka log directory	Kafka stops accepting produce requests when disk fills completely
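If no node exporter is available, disk usage on the log directory can be checked directly with df. The sketch below is a minimal example; the 70 and 85 defaults are the warning and critical percentages used in this article, and the function names are hypothetical. Pass whichever directory your log.dirs setting points at.

```bash
#!/bin/bash
# Sketch: classify disk usage of a Kafka log directory against warning/critical thresholds.

disk_used_pct() {
  # df -P gives POSIX single-line output; column 5 is "Capacity", e.g. "42%".
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

classify_disk() {
  local pct="$1" warn="${2:-70}" crit="${3:-85}"
  if [ "${pct}" -gt "${crit}" ]; then
    echo "CRITICAL"
  elif [ "${pct}" -gt "${warn}" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Example (the path is a placeholder):
#   classify_disk "$(disk_used_pct /var/lib/kafka/data)"
```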

JVM metrics

JVM pressure is a common root cause of ISR instability and request latency spikes. If a broker’s JVM pauses long enough during garbage collection, the broker fails to send heartbeats to the controller. This triggers a timeout and forces partition leadership reassignment, which looks from the outside like an ISR shrink or elevated election rate.

Metric name	Source	Description	Why it matters
JVM heap used	`java.lang:type=Memory`	Heap in use as a percentage of max	Sustained above 75% increases GC pressure and pause frequency
GC pause duration	`java.lang:type=GarbageCollector`	Duration of stop-the-world GC events	Pauses above approximately 1 second cause ISR shrinks and heartbeat timeouts; Kafka's default G1 settings target 20ms pauses (`-XX:MaxGCPauseMillis=20`)
GC frequency	`java.lang:type=GarbageCollector`	Number of GC events per minute	High frequency of short pauses often precedes a longer pause event
Open file descriptors	`java.lang:type=OperatingSystem`	Number of open file handles	Kafka opens many handles for log segments; hitting the OS limit causes broker errors

Heap sizing. Kafka uses a hybrid memory model: a small JVM heap handles partition metadata, indexes, and producer state, while the OS page cache holds hot log segment data. Allocating too much to the JVM heap reduces the page cache, which forces consumer reads to disk. On a 64 GB host, a typical configuration is 6-12 GB for the JVM heap and the remainder for the OS page cache. A rough rule of thumb is 1-2 MB of heap per active partition replica hosted on the broker.
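The arithmetic behind that rule of thumb is simple enough to script. The sketch below turns a replica count into a heap range using the 1-2 MB per replica figure, and shows how much memory a given heap leaves for the page cache. Both function names are hypothetical, and the output is a starting point for sizing, not a substitute for observing GC behavior under real load.

```bash
#!/bin/bash
# Sketch: rough heap sizing from the 1-2 MB per partition replica rule of thumb.

heap_estimate_mb() {
  local replicas="$1"
  # Lower bound: 1 MB per replica. Upper bound: 2 MB per replica.
  echo "$((replicas * 1)) $((replicas * 2))"
}

page_cache_gb() {
  local host_gb="$1" heap_gb="$2"
  # Memory not given to the JVM heap stays available to the OS page cache
  # (ignoring other processes and JVM off-heap overhead).
  echo "$((host_gb - heap_gb))"
}

# Example: a broker hosting 4,000 replicas on a 64 GB host with an 8 GB heap.
#   heap_estimate_mb 4000
#   page_cache_gb 64 8
```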

Broker process monitoring script

The JMX metrics above require the broker to be running and reachable. A process-level check catches failures that happen before or outside the metrics stack: a broker crash, a port that is not listening, a startup failure, or a JVM that is hung during initialization. The scripts below complement JMX monitoring rather than replacing it.

Check the broker process is running (jps / ps)

jps (JVM process status, included in the JDK) lists running JVM processes by main class. For Apache Kafka, the main class is kafka.Kafka.

```bash
#!/bin/bash
BROKER_CLASS="kafka.Kafka"

# jps -l prints the fully qualified main class of each running JVM.
if jps -l | grep -q "${BROKER_CLASS}"; then
  echo "Broker process is running"
  exit 0
else
  echo "ERROR: Kafka broker process not found"
  exit 1
fi
```

If jps is not available, fall back to ps:

```bash
#!/bin/bash
# The [k] bracket stops grep from matching its own command line.
if ps aux | grep -q '[k]afka.Kafka'; then
  echo "Broker process is running"
  exit 0
else
  echo "ERROR: Kafka broker process not found"
  exit 1
fi
```

Confluent Platform: The main class for Confluent Server is io.confluent.kafka.server.ConfluentServer. Adjust the grep pattern accordingly.

A running process does not mean the broker is healthy. It may be in an error state or hung waiting on disk or network. Use this check as a first-line detector, not a health indicator.

Check the broker port is accepting connections

A process check confirms the JVM is alive, but not that it is accepting Kafka connections. Check that the listener port (default 9092) is open and accepting connections:

```bash
#!/bin/bash
BROKER_HOST="localhost"
BROKER_PORT=9092
TIMEOUT=5

# -z: only test whether the port is open; -w: give up after TIMEOUT seconds.
if nc -z -w "${TIMEOUT}" "${BROKER_HOST}" "${BROKER_PORT}" 2>/dev/null; then
  echo "Broker port ${BROKER_PORT} is accepting connections"
  exit 0
else
  echo "ERROR: Broker not accepting connections on ${BROKER_HOST}:${BROKER_PORT}"
  exit 1
fi
```

In environments without netcat, use the /dev/tcp bash built-in:

```bash
#!/bin/bash
BROKER_HOST="localhost"
BROKER_PORT=9092

# Redirecting to /dev/tcp/<host>/<port> makes bash open a TCP connection.
if (echo > "/dev/tcp/${BROKER_HOST}/${BROKER_PORT}") 2>/dev/null; then
  echo "Broker port ${BROKER_PORT} is accepting connections"
  exit 0
else
  echo "ERROR: Broker not accepting connections on ${BROKER_HOST}:${BROKER_PORT}"
  exit 1
fi
```

JMX-based health check script

A port check validates TCP connectivity but not broker health. Use kafka-run-class.sh with kafka.tools.JmxTool to query UnderReplicatedPartitions directly. This works without additional tooling if the JDK and Kafka binaries are available.

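```bash
#!/bin/bash
JMX_HOST="localhost"
# Named BROKER_JMX_PORT so it does not clash with the JMX_PORT variable that
# kafka-run-class.sh reads when it launches the tool's own JVM.
BROKER_JMX_PORT=9999
MBEAN="kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"

# JmxTool prints CSV; the last field of the last line is the metric value.
# On recent Kafka releases the class may live at org.apache.kafka.tools.JmxTool.
result=$(env -u JMX_PORT kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url "service:jmx:rmi:///jndi/rmi://${JMX_HOST}:${BROKER_JMX_PORT}/jmxrmi" \
  --object-name "${MBEAN}" \
  --attributes Value \
  --one-time true 2>/dev/null | tail -n 1 | awk -F',' '{print $NF}')

# Treat an empty or non-numeric result as a failed query.
if ! [[ "${result}" =~ ^[0-9]+$ ]]; then
  echo "ERROR: Could not retrieve JMX metric"
  exit 2
fi

if [ "${result}" -gt 0 ]; then
  echo "WARNING: UnderReplicatedPartitions = ${result}"
  exit 1
else
  echo "OK: UnderReplicatedPartitions = 0"
  exit 0
fi
```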

This gives a more meaningful health signal than a port check. A broker can accept TCP connections while being in a degraded replication state.

Kafka Admin API check

kafka-broker-api-versions.sh is a lightweight liveness check that validates the Kafka protocol layer, not just TCP connectivity:

kafka-broker-api-versions.sh --bootstrap-server localhost:9092

If the broker is healthy, this returns the list of supported API versions. If the protocol handshake fails or the broker is not responsive, it exits with an error. Use this as a quick manual check or wrap it in a script for automated monitoring. It is a useful final check after a rolling restart to confirm each broker has rejoined the cluster before proceeding to the next.
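The process, port, and protocol checks in this section work best when run in order, from cheapest to most meaningful, stopping at the first failure so the output names the layer that broke. The sketch below is one way to wire that up. The run_checks helper and the three wrapper functions are hypothetical; the wrappers simply call the commands shown earlier with default host and port values.

```bash
#!/bin/bash
# Sketch: run health checks in order and stop at the first failure.
# Each argument is the name of a command or shell function that exits 0 on success.

run_checks() {
  local check
  for check in "$@"; do
    if ! "${check}" >/dev/null 2>&1; then
      echo "FAIL: ${check}"
      return 1
    fi
    echo "PASS: ${check}"
  done
}

# Hypothetical wrappers around the checks described in this section.
check_process() { ps aux | grep -q '[k]afka.Kafka'; }
check_port()    { nc -z -w 5 localhost 9092; }
check_api()     { kafka-broker-api-versions.sh --bootstrap-server localhost:9092; }

# Example:
#   run_checks check_process check_port check_api
```

During a rolling restart, a loop that waits until run_checks succeeds on the restarted broker is a reasonable gate before moving to the next one.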

Broker monitoring tools

Prometheus and Grafana

The most common self-hosted stack. The Prometheus JMX Exporter scrapes broker JMX metrics, Prometheus stores them as time-series data, and Grafana renders dashboards. Community dashboards for Kafka are available on Grafana Labs, though quality and metric coverage vary. Maintaining a complete and accurate Kafka dashboard requires ongoing attention as cluster topology and Kafka versions change.

Managed monitoring platforms (Datadog, New Relic)

Agent-based collection where vendor agents handle JMX scraping and forwarding. Both Datadog and New Relic provide out-of-the-box Kafka dashboards and alert policies, which reduces setup time compared to the self-hosted Prometheus stack. Cost scales with host count and metrics volume. Default alert thresholds may need adjustment for high-throughput deployments.

Confluent Control Center

Available in Confluent Platform, not Apache Kafka. Provides deep integration with Confluent-specific metrics and a built-in interface for broker health, topic management, and consumer lag. Less useful if you are running vanilla Apache Kafka.

Kpow by Factor House

Kpow is purpose-built for Kafka observability. It surfaces broker health, partition state, ISR status, and throughput metrics without requiring a separate Prometheus stack or Grafana instance, and it runs inside your own network. Try it free for 30 days.

Alerting strategy for broker monitoring

The goal is to page on-call when there is an active data risk and route everything else to a dashboard or asynchronous channel. Static thresholds on percentage-based metrics can generate false alarms during quiet periods: a single failed request in a low-traffic window can push an error rate to 50% and trigger a high-severity alert despite having no real impact. Organize alerts by severity:

Priority	Area	Metrics	Check interval	Route
P0	Immediate cluster health	`OfflinePartitionsCount`, `ActiveControllerCount`	15 seconds	On-call pager
P1	Data safety	`UnderReplicatedPartitions`, `IsrShrinksPerSec`, `FencedBrokerCount`	30 seconds	High-priority chat or pager
P2	Performance signals	`TotalTimeMs` P99, broker throughput, thread idle percentages	1 minute	Slack channel
P3	Capacity planning	CPU, memory pools, disk usage	5 minutes	Email or ticket
P4	Deep diagnostics	Per-partition metrics, thread dumps	As needed	Dashboard only
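One common way to avoid the low-traffic false alarm described above is to require a minimum request volume before evaluating a percentage. The sketch below illustrates the idea with plain integer counts for one evaluation window; the function name and the example numbers are hypothetical.

```bash
#!/bin/bash
# Sketch: error-rate alert with a minimum-volume guard.
# Arguments: failed requests, total requests, minimum requests, maximum error percentage.

error_rate_alert() {
  local failed="$1" total="$2" min_requests="$3" max_pct="$4"

  # Too few requests in the window: a percentage would be noise, so do not alert.
  if [ "${total}" -lt "${min_requests}" ]; then
    echo "SKIP"
    return 0
  fi

  # Integer comparison without division: failed/total > max_pct/100.
  if [ $((failed * 100)) -gt $((max_pct * total)) ]; then
    echo "ALERT"
  else
    echo "OK"
  fi
}

# Example: 1 failure out of 2 requests is 50%, but below the 100-request floor.
#   error_rate_alert 1 2 100 5
```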

Recommended thresholds based on the Apache Kafka documentation and operational guidance:

Metric	Warning threshold	Critical threshold	Notes
`UnderReplicatedPartitions`	> 0 for > 2 minutes	> 0 for > 5 minutes	Short-lived spikes during rolling restarts are expected
`ActiveControllerCount` (cluster sum)	–	!= 1	Any deviation is immediately critical; no grace period
`OfflinePartitionsCount`	> 0	> 0	Alert immediately; affected partitions are unavailable
`RequestHandlerAvgIdlePercent`	< 0.30	< 0.20	Evaluate on a sustained basis, not momentary spikes
`NetworkProcessorAvgIdlePercent`	< 0.30	< 0.20	Same as above
Disk usage	> 70%	> 85%	Kafka stops accepting produce requests when the log directory fills completely
GC pause duration	> 500ms	> 1 second	Stop-the-world pauses above 1 second cause ISR shrinks and heartbeat timeouts
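Thresholds such as "> 0 for > 2 minutes" depend on how long a condition has persisted, not on a single reading. Most alerting systems provide this natively; the sketch below only illustrates the logic for UnderReplicatedPartitions, assuming samples are taken at a fixed interval and passed oldest first. The function names are hypothetical, and the duration is approximate because it is derived from the sample count.

```bash
#!/bin/bash
# Sketch: estimate how long a metric has been non-zero and classify it.
# Usage: sustained_seconds <interval_seconds> <sample>... (oldest sample first)

sustained_seconds() {
  local interval="$1"
  shift
  local count=0 i
  # Walk backwards from the newest sample and count consecutive non-zero values.
  for ((i = $#; i >= 1; i--)); do
    [ "${!i}" -gt 0 ] || break
    count=$((count + 1))
  done
  echo $((count * interval))
}

classify_urp() {
  local seconds="$1"
  if [ "${seconds}" -gt 300 ]; then
    echo "CRITICAL"   # non-zero for more than 5 minutes
  elif [ "${seconds}" -gt 120 ]; then
    echo "WARNING"    # non-zero for more than 2 minutes
  else
    echo "OK"         # zero, or a short-lived spike
  fi
}

# Example: samples every 30 seconds, last three non-zero.
#   classify_urp "$(sustained_seconds 30 0 0 3 2 1)"
```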

Link every production alert to a runbook. When an alert fires, the on-call engineer should be able to identify the root cause and begin remediation without first looking up what the metric means. A useful runbook includes the metric’s JMX MBean source, related broker log entries to check, and step-by-step remediation for the most common causes of that alert.

Common broker issues and how to diagnose them

Symptom	Relevant metrics	Likely root cause	Diagnosis and remediation
`UnderReplicatedPartitions` rising	`UnderReplicatedPartitions`, `IsrShrinksPerSec`	Follower broker is slow or unreachable	Check GC pause times and network connectivity on the lagging follower; check disk I/O on the follower
Produce latency spikes	`TotalTimeMs` (Produce), `LocalTimeMs`, disk I/O	Disk write saturation	Decompose `TotalTimeMs` into sub-phases; check disk throughput; consider partition reassignment to less-loaded brokers
`OfflinePartitionsCount` suddenly non-zero	`OfflinePartitionsCount`, `ActiveControllerCount`	Leader broker crashed with no available replicas	Check broker startup logs; if replication factor was 1, data may be lost
Network thread exhaustion	`NetworkProcessorAvgIdlePercent`	Too many slow consumers or large fetch sizes	Increase `num.network.threads`; investigate slow consumers; consider reducing fetch size on affected consumers
Request handler exhaustion	`RequestHandlerAvgIdlePercent`	Too many concurrent requests or heavy compression	Increase `num.io.threads`; check for compression-related CPU spikes
Broker JVM out of memory	JVM heap, GC events	Heap too small for partition count	Increase broker heap (`-Xmx`); check for large message batches; review partition replica count on that broker
Broker failing to start	`OfflineLogDirectoryCount`	Corrupt log directory or disk failure	Check broker startup logs; run `kafka-log-dirs.sh` to identify the offline directory
ISR churning (shrinks and expands)	`IsrShrinksPerSec`, `IsrExpandsPerSec`	GC pauses or network jitter causing replicas to briefly fall behind	Check GC logs; consider tuning `replica.lag.time.max.ms`; verify network stability between brokers
Disk space growing unexpectedly	`uncleanable-bytes`, `DeadThreadCount`	Log cleaner thread has failed silently	Check whether `DeadThreadCount` is greater than 0; if log cleaner threads have died, a broker restart is required

Kafka broker monitoring best practices

Monitor all three layers: Kafka process metrics, JVM metrics, and host OS metrics. A problem at any layer will eventually surface in the others.
Alert on UnderReplicatedPartitions and ActiveControllerCount before everything else. These are the leading indicators of cluster health degradation.
Set disk usage alerts at 70% and 85%. Kafka stops accepting produce requests when the log directory fills completely and does not degrade gracefully before that threshold.
Use a lightweight process check script as a first-line detector. It can catch broker failures before your metrics pipeline does, particularly in the window between a crash and the first missed metrics scrape.
Keep your JMX Exporter configuration in version control and treat it as part of your Kafka infrastructure. Changes to it affect what your dashboards and alerts can observe.
Do not rely on consumer lag alone as a broker health signal. Consumer lag is a symptom; broker metrics tell you the cause.
In KRaft mode, audit your dashboard configuration during and after the migration. ZooKeeper-related MBeans are removed; KRaft-specific MBeans such as FencedBrokerCount and LastAppliedRecordLagMs appear in their place.
Link every production alert to a runbook that includes the metric’s JMX MBean source, related log entries to check, and remediation steps for the most common causes.

Monitor Kafka brokers with Factor House

Kpow surfaces broker health, partition state, ISR status, and throughput metrics from inside your own network. You get visibility into the metrics covered in this article without running a separate Prometheus stack or maintaining Grafana dashboards. Give it a try with a free 30-day trial.