Recommended Kafka Consumer Metrics
Use these standard Kafka consumer metrics to monitor 5G event streaming performance.
The Consumer Metrics table describes the recommended consumer metrics. The MBean for these metrics is: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+),partition=([-.\w]+).
Name | Description |
---|---|
records-lag | The number of messages a consumer is behind a producer on this partition. |
records-lag-max | The maximum number of messages a consumer is behind a producer, either for a specific partition or across all partitions on this client. |
records-consumed-rate | The average number of records consumed per second for a specific topic or across all topics. |
bytes-consumed-rate | The average number of bytes consumed per second for a specific topic or across all topics. |
fetch-rate | The number of fetch requests per second from a consumer. |
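When metrics are collected over JMX, the MBean name pattern given above can be used to pull out the client, topic, and partition that a sample belongs to. The following is a minimal sketch, assuming the character class in the pattern is meant as `[-.\w]+` and that the key properties appear in the order shown. The object name in the example is hypothetical, and real JMX object names may order or quote the properties differently.

```python
import re

# Pattern for the per-partition fetch-manager MBean named above.
# Assumption: key properties appear in exactly this order and are unquoted.
MBEAN_PATTERN = re.compile(
    r"kafka\.consumer:type=consumer-fetch-manager-metrics,"
    r"client-id=(?P<client_id>[-.\w]+),"
    r"topic=(?P<topic>[-.\w]+),"
    r"partition=(?P<partition>[-.\w]+)"
)


def parse_mbean_name(name):
    """Return the client-id, topic, and partition as a dict, or None if no match."""
    match = MBEAN_PATTERN.fullmatch(name)
    return match.groupdict() if match else None


# Hypothetical object name used only for illustration.
example = (
    "kafka.consumer:type=consumer-fetch-manager-metrics,"
    "client-id=cdr-consumer-1,topic=cdr.events,partition=0"
)
print(parse_mbean_name(example))
```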
Records Lag
The records-lag metric is the calculated difference between the current log offset for a consumer and the current log offset for a producer. Consistently high lag values might indicate overloaded consumers, in which case both provisioning more consumers and splitting topics across more partitions might help increase throughput and reduce lag.
The records-lag-max metric is the maximum observed value of records-lag.
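The relationship between the two lag metrics can be illustrated with a short sketch. It assumes lag is the log end offset minus the consumer position for each partition, and it takes the maximum across partitions at a single point in time; the client itself reports the maximum over a sampling window. The function names and offsets below are hypothetical.

```python
def records_lag(log_end_offset, consumer_position):
    """Lag for one partition: records written but not yet consumed."""
    return max(log_end_offset - consumer_position, 0)


def records_lag_max(lags_by_partition):
    """Largest per-partition lag, or 0 when no partitions are assigned."""
    return max(lags_by_partition.values(), default=0)


# Hypothetical offsets: partition -> (log end offset, consumer position)
offsets = {0: (1500, 1500), 1: (2300, 2100), 2: (900, 850)}
lags = {p: records_lag(end, pos) for p, (end, pos) in offsets.items()}

print(lags)                   # {0: 0, 1: 200, 2: 50}
print(records_lag_max(lags))  # 200
```

In this example, partition 1 is the one holding the consumer back, which is the partition to inspect first when the maximum lag stays high.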
Consumed Rate
The records-consumed-rate and bytes-consumed-rate metrics are measures of consumer network throughput. A sudden drop in the rate of records consumed (records-consumed-rate) may indicate a failing consumer, but if its network throughput (bytes-consumed-rate) remains constant, the consumer may simply be consuming records that are larger in size and fewer in number. Observing traffic volume over time, in the context of other metrics, is important for diagnosing anomalous network usage.
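The reasoning above, comparing the two rates before concluding that a consumer is failing, can be sketched as a small check. The function name, labels, and thresholds (`drop_ratio`, `steady_tolerance`) are hypothetical and would need tuning against real traffic.

```python
def classify_throughput(prev_records, cur_records, prev_bytes, cur_bytes,
                        drop_ratio=0.5, steady_tolerance=0.1):
    """Compare two samples of records-consumed-rate and bytes-consumed-rate.

    drop_ratio: fraction by which the record rate must fall to count as a drop.
    steady_tolerance: allowed relative change for the byte rate to count as constant.
    """
    if prev_records <= 0 or prev_bytes <= 0:
        return "no-baseline"

    records_dropped = cur_records < prev_records * (1 - drop_ratio)
    bytes_steady = abs(cur_bytes - prev_bytes) <= steady_tolerance * prev_bytes

    if not records_dropped:
        return "normal"
    if bytes_steady:
        # Fewer records but the same byte volume: records are probably larger.
        return "larger-records"
    # Both rates fell: the consumer itself may be failing.
    return "possible-failure"


print(classify_throughput(1000, 400, 5_000_000, 4_900_000))  # larger-records
print(classify_throughput(1000, 400, 5_000_000, 2_000_000))  # possible-failure
print(classify_throughput(1000, 950, 5_000_000, 4_800_000))  # normal
```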
Fetch Rate
The fetch rate of a consumer can be a good indicator of overall consumer health. A minimum fetch rate approaching zero might signal an issue on the consumer. The minimum fetch rate is usually nonzero, so a decrease in this value might indicate consumer failure.
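One way to act on this is to watch the minimum fetch rate over a short window of recent samples. The sketch below assumes fetch-rate samples are already collected into a list; the function name, the window size, and the `floor` threshold are hypothetical values chosen for illustration.

```python
def fetch_rate_near_zero(samples, window=5, floor=0.1):
    """Return True when the minimum fetch rate in the last `window` samples is below `floor`."""
    recent = samples[-window:]
    return bool(recent) and min(recent) < floor


# Hypothetical fetch-rate samples, in fetch requests per second.
healthy = [4.0, 3.8, 4.1, 3.9, 4.0]
degrading = [4.0, 3.5, 1.2, 0.3, 0.02]

print(fetch_rate_near_zero(healthy))    # False
print(fetch_rate_near_zero(degrading))  # True
```

A check like this is best read together with records-lag-max, since a falling fetch rate alongside rising lag is a stronger sign of trouble than either metric alone.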
Examples
Figure 1 shows a Grafana dashboard displaying the expected behavior of the ASN.1 Streamer application consuming CDR records, as reflected in consumer metrics.
Figure 1 highlights the following events.
- The load test begins and the consumer starts fetching and consuming records.
- The load test ends. The rates of record and byte consumption have been relatively constant.
- The consumer finishes processing all records. There is no reported lag in the consumption of records.
Figure 2 shows a Grafana dashboard displaying the behavior of the ASN.1 SFTP Sink when it is overloaded with more records than it can fetch and process.
Figure 2 highlights the following events.
- The load test begins.
- The fetch rate drops and stabilizes. Records begin to lag.
- The load test stops.
- The ASN.1 Sink application starts to recover.
The load test shows the fetch rate dropping to almost zero, corresponding to a rising records lag maximum. After the load test completes, the consumer takes time to catch up.