Kafka broker monitoring

Brokers are the operational core of a Kafka cluster. They receive produce requests, serve fetch requests, manage partition replicas, and coordinate leader election. When a broker degrades, every producer and consumer connected to it is affected, often before your monitoring stack surfaces anything useful.

This article covers broker-level monitoring specifically: the JMX metrics to watch, thresholds to alert on, process monitoring scripts, and how to diagnose the most common failure modes. For cluster-wide concerns, consumer lag, and producer delivery guarantees, refer to the separate guides on Kafka cluster monitoring, Kafka consumer monitoring, and Kafka producer monitoring. For a full overview of Kafka observability, see the Kafka monitoring guide.

Key takeaways

What is Kafka broker monitoring?

A Kafka broker handles several concurrent responsibilities: writing incoming messages to disk, serving read requests from consumers, replicating partition data to follower brokers, and participating in leader election. Each of these activities has distinct failure modes.

Broker health is not the same as cluster health. A single degraded broker can produce replica lag, leader imbalance, and consumer fetch failures without the cluster appearing unavailable. Topics remain accessible on unaffected brokers, but the partitions whose leaders or followers are on the degraded broker will silently degrade. Identifying which broker is under pressure, and why, is the purpose of broker-level monitoring.

Monitoring spans three layers:

  1. Kafka process metrics (exposed via JMX): replication state, request throughput, controller status, failure rates.
  2. JVM metrics: heap usage, garbage collection pause duration, open file descriptors.
  3. Host OS metrics: disk capacity, disk I/O throughput, network utilization.

Consumer lag, producer delivery guarantees, and cluster-wide partition distribution are covered in separate articles. This article focuses on the broker process and what runs immediately around it.

How Kafka brokers expose metrics

JMX (default)

Kafka exposes metrics as JMX MBeans by default. Server-side broker metrics use Yammer Metrics internally; native Java clients use Kafka’s own metrics registry. Both project their measurements onto MBeans hosted in the JVM’s MBean server.

Remote JMX access is disabled by default. To enable it, set the JMX_PORT environment variable when starting the broker:

export JMX_PORT=9999

In production, raw JMX access is a security risk. Because JMX supports remote method invocation (RMI), an unauthenticated client can invoke operations on MBeans, including modifying runtime configurations and triggering JVM shutdowns. Use KAFKA_JMX_OPTS to enforce authentication and encryption:

export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=true \
  -Dcom.sun.management.jmxremote.ssl=true \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.rmi.port=9999 \
  -Dcom.sun.management.jmxremote.password.file=/etc/kafka/jmxremote.password \
  -Dcom.sun.management.jmxremote.access.file=/etc/kafka/jmxremote.access"
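
The password and access files referenced in these options have to exist before the broker starts, and the JVM rejects a password file that is readable by anyone other than its owner. The following is a minimal sketch of creating both files; the function name, the role name, and the password in the example call are placeholders, not values Kafka prescribes.

```shell
#!/bin/bash
# Sketch: create the JMX access and password files referenced by
# KAFKA_JMX_OPTS. Role name and password are placeholders (assumptions).
create_jmx_credentials() {
  local dir="$1" role="$2" password="$3"

  # Access file: one "<role> <readonly|readwrite>" entry per line.
  printf '%s readonly\n' "${role}" > "${dir}/jmxremote.access"

  # Password file: one "<role> <password>" entry per line.
  # The subshell keeps the restrictive umask from leaking to the caller.
  ( umask 077; printf '%s %s\n' "${role}" "${password}" > "${dir}/jmxremote.password" )
  chmod 600 "${dir}/jmxremote.password"
}

# Example (run as the user that owns the broker process):
# create_jmx_credentials /etc/kafka monitor 'change-me'
```

A read-only role is usually enough for metric collection, which also limits what a leaked credential can do.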

Tools such as jconsole, jmxterm, and kafka-run-class.sh kafka.tools.JmxTool can query JMX directly once access is enabled.

Prometheus via JMX Exporter

The standard approach for Prometheus-based stacks is to run the Prometheus JMX Exporter as a Java agent alongside the broker JVM.

Running the exporter in local agent mode (rather than as a standalone HTTP server) is preferred. It avoids the serialization overhead of remote RMI polling and captures additional host-level process metrics including JVM CPU and memory utilization.

The agent translates JMX MBeans to Prometheus metrics and serves them on an HTTP endpoint, typically on port 7071. It requires a YAML configuration file that maps MBean paths to Prometheus metric names. A community-maintained configuration for Kafka is available in the JMX Exporter GitHub repository. Keep this configuration file in version control alongside your broker configuration and review it when upgrading Kafka versions.
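
As a rough illustration of the agent setup, the sketch below builds the -javaagent flag and counts the metric families an endpoint returns. The helper names and the jar and config paths in the example are assumptions; substitute the locations used in your deployment.

```shell
#!/bin/bash
# Sketch: build the -javaagent flag for the Prometheus JMX Exporter.
# Agent argument format: <jar>=<port>:<config file>
jmx_exporter_agent_opt() {
  local jar="$1" port="$2" config="$3"
  printf '%s:%s=%s:%s\n' '-javaagent' "${jar}" "${port}" "${config}"
}

# Count metric families in Prometheus text exposition format read from
# stdin (one "# TYPE" line per family). Prints 0 when there are none.
count_metric_families() {
  grep -c '^# TYPE ' || true
}

# Example (paths are placeholders; the second line needs a running broker):
# export KAFKA_OPTS="$(jmx_exporter_agent_opt /opt/jmx_exporter/jmx_prometheus_javaagent.jar 7071 /etc/kafka/jmx-exporter.yml)"
# curl -s http://localhost:7071/metrics | count_metric_families
```

Comparing the family count before and after a Kafka upgrade is a quick way to notice MBeans that the exporter configuration no longer matches.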

Kafka metrics reporters (pluggable)

Kafka supports custom MetricsReporter implementations via the metric.reporters setting in server.properties. This allows metrics to be pushed directly to external systems without going through JMX or Prometheus. Confluent Platform ships a reporter that sends metrics to Confluent Control Center. Vendors such as Datadog and New Relic provide their own reporter implementations as well.

KRaft mode: In KRaft mode (GA since Kafka 3.3), some ZooKeeper-related MBeans are removed and several KRaft-specific MBeans appear in their place, including FencedBrokerCount and LastAppliedRecordLagMs. If you are migrating from ZooKeeper mode, or if you have inherited dashboards built against an older cluster, audit your MBean paths before deploying them against a KRaft cluster.
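
One way to start that audit is to search exported dashboards and alert rules for ZooKeeper-era MBean names. The sketch below assumes your dashboards are stored as files in a directory; the search pattern is a starting point rather than a complete list of removed MBeans.

```shell
#!/bin/bash
# Sketch: list dashboard or alert-rule files that still reference
# ZooKeeper-era MBeans. The pattern is illustrative, not exhaustive.
ZK_MBEAN_PATTERN='SessionExpireListener|ZooKeeper'

find_zk_mbean_references() {
  local dir="$1"
  # -r: recurse, -l: file names only, -i: ignore case (Prometheus
  # metric names are often lowercased), -E: extended regex
  grep -rliE "${ZK_MBEAN_PATTERN}" "${dir}" 2>/dev/null | sort
}

# Example (directory is a placeholder):
# find_zk_mbean_references ./dashboards
```

Any file this reports is worth opening by hand, since a match only shows that a ZooKeeper-related name appears somewhere in it.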

Key metrics to monitor

The sections below cover metrics across all three layers. JMX MBean paths are given for direct JMX access; if you are using Prometheus, metric names follow the pattern produced by the JMX Exporter configuration.

Replication health metrics

These are the highest-priority metrics on any broker. A non-zero value in the first three rows means data availability is at risk.

| Metric name | JMX MBean path | Description | Why it matters |
| --- | --- | --- | --- |
| UnderReplicatedPartitions | kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | Partitions where the ISR count is below the replication factor | A sustained non-zero value means the cluster has fewer replicas than configured to fall back on if a leader fails |
| UnderMinIsrPartitionCount | kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount | Partitions with fewer in-sync replicas than min.insync.replicas | Producers configured with acks=all receive NotEnoughReplicasException when this is non-zero |
| AtMinIsrPartitionCount | kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount | Partitions where ISR count equals min.insync.replicas exactly | One more broker going offline pushes these into the under-min state; treat as a warning signal |
| IsrShrinksPerSec | kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | Rate at which replicas are leaving the ISR | Sustained shrinks indicate follower replicas are falling behind, typically from disk pressure or GC pauses |
| IsrExpandsPerSec | kafka.server:type=ReplicaManager,name=IsrExpandsPerSec | Rate at which replicas are rejoining the ISR | High expand rate alongside shrinks indicates ISR churn rather than a clean recovery |
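
The two ISR rate metrics are most useful when read together. The sketch below shows one way to turn a pair of sampled rates into a label; the function name and the four labels are illustrative, and the inputs are assumed to be rates your collector has already sampled.

```shell
#!/bin/bash
# Sketch: interpret IsrShrinksPerSec and IsrExpandsPerSec together.
# Inputs are sampled rates (assumed to come from your collector).
classify_isr_activity() {
  local shrinks="$1" expands="$2"
  awk -v s="${shrinks}" -v e="${expands}" 'BEGIN {
    if (s == 0 && e == 0)    print "stable"
    else if (s > 0 && e > 0) print "churn"       # replicas leaving and rejoining
    else if (s > 0)          print "shrinking"   # followers falling behind
    else                     print "recovering"  # replicas rejoining only
  }'
}

# Example:
# classify_isr_activity 0.5 0.4
```

A "churn" result points toward GC pauses or network jitter, while "shrinking" without expands suggests a follower that is not catching up at all.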

Controller metrics

A healthy Kafka cluster has exactly one active controller at any time. These metrics confirm that the invariant holds and that the controller is performing well.

| Metric name | JMX MBean path | Description | Why it matters |
| --- | --- | --- | --- |
| ActiveControllerCount | kafka.controller:type=KafkaController,name=ActiveControllerCount | Whether this broker is currently the active controller (0 or 1) | The cluster-wide sum must be exactly 1. Zero means no metadata management; more than 1 indicates a split-brain condition |
| OfflinePartitionsCount | kafka.controller:type=KafkaController,name=OfflinePartitionsCount | Partitions with no available leader | Non-zero means those partitions are completely unavailable to producers and consumers |
| LeaderElectionRateAndTimeMs | kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs | Frequency and duration of leader elections | Frequent elections trigger consumer rebalances and producer retries; elevated duration suggests controller pressure |
| FencedBrokerCount | kafka.controller:type=KafkaController,name=FencedBrokerCount | Brokers the controller has marked as unreachable (KRaft only) | Any fenced broker represents lost partition leadership and requires immediate investigation |
| LastAppliedRecordLagMs | kafka.controller:type=KafkaController,name=LastAppliedRecordLagMs | Metadata replication lag relative to the active controller (KRaft only) | Should be 0 on the active controller; values above 1,000ms on standby nodes indicate metadata replication delay |
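
Because ActiveControllerCount is only meaningful as a cluster-wide sum, a check has to aggregate the value from every node that reports the MBean. The sketch below assumes you have already collected one "node value" line per node; the function name and the exit codes are illustrative choices.

```shell
#!/bin/bash
# Sketch: verify the cluster-wide ActiveControllerCount invariant.
# Reads "<node> <value>" lines on stdin, one per reporting node.
check_active_controller_sum() {
  awk '
    { sum += $2 }
    END {
      if (sum == 1)      { print "OK: exactly one active controller"; exit 0 }
      else if (sum == 0) { print "CRITICAL: no active controller"; exit 2 }
      else               { print "CRITICAL: " sum " active controllers (possible split-brain)"; exit 2 }
    }'
}

# Example:
# printf 'broker-1 0\nbroker-2 1\nbroker-3 0\n' | check_active_controller_sum
```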

Request handling metrics

These metrics indicate whether the broker is keeping up with its request load.

| Metric name | JMX MBean path | Description | Why it matters |
| --- | --- | --- | --- |
| RequestHandlerAvgIdlePercent | kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | Average idle percentage of the request handler thread pool | Values below 0.20 indicate saturation on request processing |
| NetworkProcessorAvgIdlePercent | kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent | Average idle percentage of the network processor threads | Target above 0.30; below 0.20 indicates network thread exhaustion, often caused by slow consumers |
| TotalTimeMs (Produce) | kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce | End-to-end latency for produce requests | P99 target is under 100ms; sustained spikes indicate broker-side pressure or slow disk |
| TotalTimeMs (FetchConsumer) | kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer | End-to-end latency for consumer fetch requests | Sustained elevation traces to replica lag or disk read pressure |
| RequestQueueSize | kafka.network:type=RequestChannel,name=RequestQueueSize | Requests waiting for a handler thread | Should remain below 10; sustained growth indicates handler thread saturation |
| FailedProduceRequestsPerSec | kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec | Rate of failed produce requests | Any non-zero value requires investigation |
| FailedFetchRequestsPerSec | kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec | Rate of failed fetch requests | Any non-zero value requires investigation |

Latency sub-phases. TotalTimeMs is the sum of five phases: time waiting in the request queue (RequestQueueTimeMs), local processing time on the partition leader (LocalTimeMs), replication wait time for acks=all producers (RemoteTimeMs), time waiting in the response queue (ResponseQueueTimeMs), and time to transmit the response to the client (ResponseSendTimeMs). When P99 TotalTimeMs is elevated, checking each sub-phase narrows the root cause. High RemoteTimeMs points to replication pressure. High LocalTimeMs typically indicates disk write saturation or message format conversion overhead.
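
The decomposition described above can be sketched as a small helper that reports which sub-phase dominates. It assumes you have already read a value in milliseconds for each of the five sub-phase MBeans; the function name is illustrative, and because percentiles do not add up exactly, the reported share is indicative only.

```shell
#!/bin/bash
# Sketch: find which latency sub-phase dominates a slow request type.
# Reads "<phase> <milliseconds>" lines on stdin and prints the largest
# phase with its share of the summed values.
dominant_phase() {
  awk '
    { total += $2; if ($2 > max) { max = $2; name = $1 } }
    END {
      if (total > 0) printf "%s %.0f%%\n", name, 100 * max / total
    }'
}

# Example:
# printf 'RequestQueueTimeMs 2\nLocalTimeMs 10\nRemoteTimeMs 80\nResponseQueueTimeMs 3\nResponseSendTimeMs 5\n' | dominant_phase
```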

Throughput and I/O metrics

These metrics give a baseline view of broker load and are useful for capacity planning.

| Metric name | JMX MBean path | Description | Why it matters |
| --- | --- | --- | --- |
| BytesInPerSec | kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec | Bytes received per second from clients | Sudden spikes can saturate disk write bandwidth or network interfaces |
| BytesOutPerSec | kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec | Bytes sent per second to clients | Covers consumer traffic; replication traffic to followers is reported separately as ReplicationBytesOutPerSec, so track both |
| MessagesInPerSec | kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec | Messages received per second | Useful for understanding message rate independently of message size |

Log and disk metrics

| Metric name | JMX MBean path | Description | Why it matters |
| --- | --- | --- | --- |
| OfflineLogDirectoryCount | kafka.log:type=LogManager,name=OfflineLogDirectoryCount | Log directories that are offline | Any non-zero value means the broker cannot write to some partitions |
| DeadThreadCount | kafka.log:type=LogCleaner,name=DeadThreadCount | Log cleaner threads that have failed silently | Non-zero means compaction has stopped; uncleanable-bytes will grow until disk fills |
| uncleanable-bytes | kafka.log:type=LogCleanerManager,name=uncleanable-bytes (reported per log directory) | Bytes in partitions that the log cleaner cannot compact | A persistent upward trend on compacted topics indicates log cleaner failure |
| Disk space (OS) | n/a; monitor via node exporter or equivalent | Percentage of disk used on the Kafka log directory | Kafka stops accepting produce requests when disk fills completely |

JVM metrics

JVM pressure is a common root cause of ISR instability and request latency spikes. If a broker’s JVM pauses long enough during garbage collection, the broker fails to send heartbeats to the controller. This triggers a timeout and forces partition leadership reassignment, which looks from the outside like an ISR shrink or elevated election rate.

| Metric name | Source | Description | Why it matters |
| --- | --- | --- | --- |
| JVM heap used | java.lang:type=Memory | Heap in use as a percentage of max | Sustained above 75% increases GC pressure and pause frequency |
| GC pause duration | java.lang:type=GarbageCollector | Duration of stop-the-world GC events | Pauses above approximately 1 second cause ISR shrinks and heartbeat timeouts; Kafka's default G1 pause target (MaxGCPauseMillis) is 20ms |
| GC frequency | java.lang:type=GarbageCollector | Number of GC events per minute | High frequency of short pauses often precedes a longer pause event |
| Open file descriptors | java.lang:type=OperatingSystem | Number of open file handles | Kafka opens many handles for log segments; hitting the OS limit causes broker errors |

Heap sizing. Kafka uses a hybrid memory model: a small JVM heap handles partition metadata, indexes, and producer state, while the OS page cache holds hot log segment data. Allocating too much to the JVM heap reduces the page cache, which forces consumer reads to disk. On a 64 GB host, a typical configuration is 6-12 GB for the JVM heap and the remainder for the OS page cache. A rough rule of thumb is 1-2 MB of heap per active partition replica hosted on the broker.
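
The rule of thumb above is simple arithmetic, sketched here for convenience. The function names are illustrative, and the result is a starting estimate to validate against observed GC behavior, not a sizing guarantee.

```shell
#!/bin/bash
# Sketch: apply the 1-2 MB-per-replica rule of thumb from the text.
# Prints "<low> <high>" heap estimates in MB for the hosted replica count.
estimate_heap_mb() {
  local replicas="$1"
  echo "$(( replicas * 1 )) $(( replicas * 2 ))"
}

# Memory left for the OS page cache (GB) after the heap is allocated.
# Ignores other process and OS overhead, so treat it as an upper bound.
page_cache_gb() {
  local host_gb="$1" heap_gb="$2"
  echo "$(( host_gb - heap_gb ))"
}

# Example: a broker hosting 4000 replicas on a 64 GB host
# estimate_heap_mb 4000
# page_cache_gb 64 8
```

For 4,000 replicas the estimate is 4,000 to 8,000 MB of heap, which sits inside or just below the 6-12 GB range quoted for a 64 GB host and leaves roughly 56 GB for the page cache with an 8 GB heap.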

Broker process monitoring script

The JMX metrics above require the broker to be running and reachable. A process-level check catches failures that happen before or outside the metrics stack: a broker crash, a port that is not listening, a startup failure, or a JVM that is hung during initialization. The scripts below complement JMX monitoring rather than replacing it.

Check the broker process is running (jps / ps)

jps (JVM process status, included in the JDK) lists running JVM processes by main class. For Apache Kafka, the main class is kafka.Kafka.

`#!/bin/bash
BROKER_CLASS="kafka.Kafka"

if jps -l | grep -q "${BROKER_CLASS}"; then
  echo "Broker process is running"
  exit 0
else
  echo "ERROR: Kafka broker process not found"
  exit 1
fi`

If jps is not available, fall back to ps:

`#!/bin/bash
if ps aux | grep -q '[k]afka.Kafka'; then
  echo "Broker process is running"
  exit 0
else
  echo "ERROR: Kafka broker process not found"
  exit 1
fi`

Confluent Platform: The main class for Confluent Server is io.confluent.kafka.server.ConfluentServer. Adjust the grep pattern accordingly.

A running process does not mean the broker is healthy. It may be in an error state or hung waiting on disk or network. Use this check as a first-line detector, not a health indicator.

Check the broker port is accepting connections

A process check confirms the JVM is alive, but not that it is accepting Kafka connections. Check that the listener port (default 9092) is open and accepting connections:

`#!/bin/bash
BROKER_HOST="localhost"
BROKER_PORT=9092
TIMEOUT=5

if nc -z -w "${TIMEOUT}" "${BROKER_HOST}" "${BROKER_PORT}" 2>/dev/null; then
  echo "Broker port ${BROKER_PORT} is accepting connections"
  exit 0
else
  echo "ERROR: Broker not accepting connections on ${BROKER_HOST}:${BROKER_PORT}"
  exit 1
fi`

In environments without netcat, use the /dev/tcp bash built-in:

`#!/bin/bash
BROKER_HOST="localhost"
BROKER_PORT=9092

if (echo > "/dev/tcp/${BROKER_HOST}/${BROKER_PORT}") 2>/dev/null; then
  echo "Broker port ${BROKER_PORT} is accepting connections"
  exit 0
else
  echo "ERROR: Broker not accepting connections on ${BROKER_HOST}:${BROKER_PORT}"
  exit 1
fi`

JMX-based health check script

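A port check validates TCP connectivity but not broker health. Use kafka-run-class.sh with kafka.tools.JmxTool to query UnderReplicatedPartitions directly. This works without additional tooling if the JDK and Kafka binaries are available. Newer Kafka releases relocate this tool (to org.apache.kafka.tools.JmxTool), so check which class name your distribution ships and adjust the script accordingly.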

`#!/bin/bash
JMX_HOST="localhost"
JMX_PORT=9999
MBEAN="kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"

# Query the metric once; JmxTool prints CSV with the value in the last column.
result=$(kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url "service:jmx:rmi:///jndi/rmi://${JMX_HOST}:${JMX_PORT}/jmxrmi" \
  --object-name "${MBEAN}" \
  --attributes Value \
  --one-time true 2>/dev/null | tail -n 1 | awk -F',' '{print $NF}')

# Exit 2 (unknown) if nothing came back or the value is not an integer.
case "${result}" in
  ''|*[!0-9]*)
    echo "ERROR: Could not retrieve JMX metric"
    exit 2
    ;;
esac

if [ "${result}" -gt 0 ]; then
  echo "WARNING: UnderReplicatedPartitions = ${result}"
  exit 1
else
  echo "OK: UnderReplicatedPartitions = 0"
  exit 0
fi`

This gives a more meaningful health signal than a port check. A broker can accept TCP connections while being in a degraded replication state.

Kafka Admin API check

kafka-broker-api-versions.sh is a lightweight liveness check that validates the Kafka protocol layer, not just TCP connectivity:

kafka-broker-api-versions.sh --bootstrap-server localhost:9092

If the broker is healthy, this returns the list of supported API versions. If the protocol handshake fails or the broker is not responsive, it exits with an error. Use this as a quick manual check or wrap it in a script for automated monitoring. It is a useful final check after a rolling restart to confirm each broker has rejoined the cluster before proceeding to the next.
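
For the rolling-restart case, the check is usually wrapped in a retry loop so the procedure only moves on once the broker answers. The sketch below is a generic retry helper; the function name, the attempt count, and the sleep interval in the example are assumptions to tune for your environment.

```shell
#!/bin/bash
# Sketch: retry a health-check command until it succeeds or attempts run out.
# Usage: wait_until <max_attempts> <sleep_seconds> <command> [args...]
wait_until() {
  local max_attempts="$1" sleep_seconds="$2"
  shift 2
  local attempt=1
  while [ "${attempt}" -le "${max_attempts}" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "OK after ${attempt} attempt(s)"
      return 0
    fi
    attempt=$(( attempt + 1 ))
    sleep "${sleep_seconds}"
  done
  echo "ERROR: check did not succeed after ${max_attempts} attempt(s)"
  return 1
}

# Example (requires a reachable broker):
# wait_until 30 10 kafka-broker-api-versions.sh --bootstrap-server localhost:9092
```

Passing the check as arguments keeps the helper reusable for the port and JMX checks shown earlier.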

Broker monitoring tools

Prometheus and Grafana

The most common self-hosted stack. The Prometheus JMX Exporter exposes broker JMX metrics over HTTP, Prometheus scrapes and stores them as time-series data, and Grafana renders dashboards. Community dashboards for Kafka are available on Grafana Labs, though quality and metric coverage vary. Maintaining a complete and accurate Kafka dashboard requires ongoing attention as cluster topology and Kafka versions change.

Managed monitoring platforms (Datadog, New Relic)

Agent-based collection where vendor agents handle JMX scraping and forwarding. Both Datadog and New Relic provide out-of-the-box Kafka dashboards and alert policies, which reduces setup time compared to the self-hosted Prometheus stack. Cost scales with host count and metrics volume. Default alert thresholds may need adjustment for high-throughput deployments.

Confluent Control Center

Available in Confluent Platform, not Apache Kafka. Provides deep integration with Confluent-specific metrics and a built-in interface for broker health, topic management, and consumer lag. Less useful if you are running vanilla Apache Kafka.

Kpow by Factor House

Kpow is purpose-built for Kafka observability. It surfaces broker health, partition state, ISR status, and throughput metrics without requiring a separate Prometheus stack or Grafana instance, and it runs inside your own network. Try it free for 30 days.

Alerting strategy for broker monitoring

The goal is to page on-call when there is an active data risk and route everything else to a dashboard or asynchronous channel. Static thresholds on percentage-based metrics can generate false alarms during quiet periods: a single failed request in a low-traffic window can push an error rate to 50% and trigger a high-severity alert despite having no real impact. Organize alerts by severity:

| Priority | Area | Metrics | Check interval | Route |
| --- | --- | --- | --- | --- |
| P0 | Immediate cluster health | OfflinePartitionsCount, ActiveControllerCount | 15 seconds | On-call pager |
| P1 | Data safety | UnderReplicatedPartitions, IsrShrinksPerSec, FencedBrokerCount | 30 seconds | High-priority chat or pager |
| P2 | Performance signals | TotalTimeMs P99, broker throughput, thread idle percentages | 1 minute | Slack channel |
| P3 | Capacity planning | CPU, memory pools, disk usage | 5 minutes | Email or ticket |
| P4 | Deep diagnostics | Per-partition metrics, thread dumps | As needed | Dashboard only |
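
The low-traffic false alarm described at the start of this section can be avoided by requiring a minimum request volume before a percentage threshold is evaluated. The sketch below shows the idea with integer arithmetic; the function name and the numbers in the example are illustrative, not recommended values.

```shell
#!/bin/bash
# Sketch: suppress percentage-based alerts when request volume is too low
# to be meaningful.
# Usage: should_alert <failed> <total> <min_total> <threshold_pct>
# Returns 0 (alert) or 1 (no alert).
should_alert() {
  local failed="$1" total="$2" min_total="$3" threshold_pct="$4"
  # Not enough traffic in the window: do not evaluate the ratio.
  [ "${total}" -ge "${min_total}" ] || return 1
  # Integer form of: failed / total >= threshold_pct / 100
  [ $(( failed * 100 )) -ge $(( threshold_pct * total )) ]
}

# Example: 1 failure out of 2 requests is a 50% error rate, but with a
# minimum of 100 requests per window it does not alert.
# should_alert 1 2 100 5 && echo "alert" || echo "no alert"
```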

Recommended thresholds based on the Apache Kafka documentation and operational guidance:

| Metric | Warning threshold | Critical threshold | Notes |
| --- | --- | --- | --- |
| UnderReplicatedPartitions | > 0 for > 2 minutes | > 0 for > 5 minutes | Short-lived spikes during rolling restarts are expected |
| ActiveControllerCount (cluster sum) | n/a | != 1 | Any deviation is immediately critical; no grace period |
| OfflinePartitionsCount | > 0 | > 0 | Alert immediately; affected partitions are unavailable |
| RequestHandlerAvgIdlePercent | < 0.30 | < 0.20 | Evaluate on a sustained basis, not momentary spikes |
| NetworkProcessorAvgIdlePercent | < 0.30 | < 0.20 | Same as above |
| Disk usage | > 70% | > 85% | Kafka stops accepting produce requests when the log directory fills completely |
| GC pause duration | > 500ms | > 1 second | Stop-the-world pauses above 1 second cause ISR shrinks and heartbeat timeouts |
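
As one example of applying these thresholds outside a full alerting stack, the sketch below classifies disk usage for a log directory using the 70% and 85% limits from the table. The function names and the directory path in the example are assumptions.

```shell
#!/bin/bash
# Sketch: classify log-directory disk usage against the 70% / 85% thresholds.
classify_disk_usage() {
  local used_pct="$1"
  if [ "${used_pct}" -gt 85 ]; then
    echo "CRITICAL"
  elif [ "${used_pct}" -gt 70 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Read the used percentage for a path from POSIX-format df output
# (fifth column of the second line, with the % sign removed).
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Example (the log directory path is a placeholder):
# classify_disk_usage "$(disk_used_pct /var/lib/kafka/data)"
```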

Link every production alert to a runbook. When an alert fires, the on-call engineer should be able to identify the root cause and begin remediation without first looking up what the metric means. A useful runbook includes the metric’s JMX MBean source, related broker log entries to check, and step-by-step remediation for the most common causes of that alert.

Common broker issues and how to diagnose them

| Symptom | Relevant metrics | Likely root cause | Diagnosis and remediation |
| --- | --- | --- | --- |
| UnderReplicatedPartitions rising | UnderReplicatedPartitions, IsrShrinksPerSec | Follower broker is slow or unreachable | Check GC pause times and network connectivity on the lagging follower; check disk I/O on the follower |
| Produce latency spikes | TotalTimeMs (Produce), LocalTimeMs, disk I/O | Disk write saturation | Decompose TotalTimeMs into sub-phases; check disk throughput; consider partition reassignment to less-loaded brokers |
| OfflinePartitions suddenly non-zero | OfflinePartitionsCount, ActiveControllerCount | Leader broker crashed with no available replicas | Check broker startup logs; if replication factor was 1, data may be lost |
| Network thread exhaustion | NetworkProcessorAvgIdlePercent | Too many slow consumers or large fetch sizes | Increase num.network.threads; investigate slow consumers; consider reducing fetch size on affected consumers |
| Request handler exhaustion | RequestHandlerAvgIdlePercent | Too many concurrent requests or heavy compression | Increase num.io.threads; check for compression-related CPU spikes |
| Broker JVM out of memory | JVM heap, GC events | Heap too small for partition count | Increase broker heap (-Xmx); check for large message batches; review partition replica count on that broker |
| Broker failing to start | OfflineLogDirectoryCount | Corrupt log directory or disk failure | Check broker startup logs; run kafka-log-dirs.sh to identify the offline directory |
| ISR churning (shrinks and expands) | IsrShrinksPerSec, IsrExpandsPerSec | GC pauses or network jitter causing replicas to briefly fall behind | Check GC logs; consider tuning replica.lag.time.max.ms; verify network stability between brokers |
| Disk space growing unexpectedly | uncleanable-bytes, DeadThreadCount | Log cleaner thread has failed silently | Check whether DeadThreadCount is greater than 0; if log cleaner threads have died, a broker restart is required |

Kafka broker monitoring best practices

Monitor Kafka brokers with Factor House

Kpow surfaces broker health, partition state, ISR status, and throughput metrics from inside your own network. You get visibility into the metrics covered in this article without running a separate Prometheus stack or maintaining Grafana dashboards. Give it a try with a free 30-day trial.