Recommended Kafka Consumer Metrics

Use these standard Kafka consumer metrics to monitor 5G event streaming performance.

Table 1 describes the recommended consumer metrics. The MBean for these metrics is: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+),partition=([-.\w]+).

Table 1. Consumer Metrics
Name                   Description
records-lag            The number of messages a consumer is behind a producer on this partition.
records-lag-max        The maximum number of messages a consumer is behind a producer, either for a specific partition or across all partitions on this client.
records-consumed-rate  The average number of records consumed per second for a specific topic or across all topics.
bytes-consumed-rate    The average number of bytes consumed per second for a specific topic or across all topics.
fetch-rate             The number of fetch requests per second from a consumer.
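
When metric names are collected from JMX, it can help to split the MBean name into its client ID, topic, and partition. The following is a minimal sketch, assuming the character classes in the pattern above are [-.\w]+ and that the key properties appear in the order shown. The function name parse_fetch_manager_mbean and the sample client ID and topic are hypothetical, and real JMX object names may order or quote their properties differently.

```python
import re

# Pattern taken from the MBean name above; named groups are added for readability.
MBEAN_PATTERN = re.compile(
    r"kafka\.consumer:type=consumer-fetch-manager-metrics,"
    r"client-id=(?P<client_id>[-.\w]+),"
    r"topic=(?P<topic>[-.\w]+),"
    r"partition=(?P<partition>[-.\w]+)"
)


def parse_fetch_manager_mbean(name):
    """Return the client ID, topic, and partition as a dict, or None if no match."""
    match = MBEAN_PATTERN.fullmatch(name)
    if match is None:
        return None
    return match.groupdict()


# Hypothetical MBean name used only for illustration.
example = (
    "kafka.consumer:type=consumer-fetch-manager-metrics,"
    "client-id=cdr-consumer-1,topic=cdr.events,partition=0"
)
print(parse_fetch_manager_mbean(example))
```

The partition is returned as a string because the pattern allows any word character, not only digits.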

Records Lag

The records-lag metric is the difference between the consumer's current offset on a partition and the offset of the most recent record that the producer has written to that partition (the log end offset). Consistently high lag values might indicate overloaded consumers. In that case, provisioning more consumers and splitting topics across more partitions might help increase throughput and reduce lag.

The records-lag-max metric is the maximum observed value of records-lag.
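
The lag calculation can be illustrated with a short sketch. It assumes that lag for one partition is the log end offset minus the consumer position, and it simplifies records-lag-max to the largest lag across partitions at a single instant. The function names and offset values are hypothetical and are not taken from a real cluster.

```python
def records_lag(log_end_offset, consumer_position):
    """Lag for one partition: records written but not yet read by the consumer."""
    return max(log_end_offset - consumer_position, 0)


def records_lag_max(lags_by_partition):
    """Largest lag across the given partitions (0 if there are none)."""
    return max(lags_by_partition.values(), default=0)


# Illustrative offsets keyed by partition number.
log_end_offsets = {0: 1500, 1: 1200, 2: 900}
consumer_positions = {0: 1450, 1: 700, 2: 900}

lags = {
    partition: records_lag(log_end_offsets[partition], consumer_positions[partition])
    for partition in log_end_offsets
}
print(lags)                   # {0: 50, 1: 500, 2: 0}
print(records_lag_max(lags))  # 500
```

In this example, partition 1 dominates the maximum, which is the kind of skew that adding consumers or partitions is meant to relieve.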

Consumed Rate

The records-consumed-rate and bytes-consumed-rate metrics measure consumer network throughput. A sudden drop in the rate of records consumed (records-consumed-rate) may indicate a failing consumer. However, if the network throughput (bytes-consumed-rate) remains constant, the consumer may instead be consuming records that are larger in size and fewer in number. Observing traffic volume over time, in the context of other metrics, is important for diagnosing anomalous network usage.
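
The reasoning above can be sketched as a simple comparison of two consecutive samples of each rate. This is an illustrative heuristic only: the function name classify_consumption_change and the drop_ratio threshold of 0.5 are assumptions, not values recommended by Kafka or by this documentation.

```python
def classify_consumption_change(prev_records, curr_records,
                                prev_bytes, curr_bytes, drop_ratio=0.5):
    """Compare two samples of records-consumed-rate and bytes-consumed-rate.

    A rate counts as "dropped" when the current value falls below
    drop_ratio times the previous value (hypothetical threshold).
    """
    records_dropped = curr_records < prev_records * drop_ratio
    bytes_dropped = curr_bytes < prev_bytes * drop_ratio
    if records_dropped and bytes_dropped:
        return "possible consumer failure"
    if records_dropped:
        return "fewer, larger records"
    return "no significant drop"


# Records per second fell sharply, but bytes per second stayed nearly constant.
print(classify_consumption_change(1000, 300, 5_000_000, 4_800_000))
```

A real alerting rule would normally evaluate a longer window and other metrics, such as records-lag, before reaching a conclusion.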

Fetch Rate

The fetch rate of a consumer can be a good indicator of overall consumer health. A minimum fetch rate approaching zero might signal an issue on the consumer. The minimum fetch rate is usually nonzero, so a decrease in this value might indicate consumer failure.
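
A check for a fetch rate that stays near zero might look like the following sketch. The function name fetch_rate_alert, the floor of 0.1 fetches per second, and the requirement of three consecutive low samples are all assumptions chosen for illustration and should be tuned to the observed baseline.

```python
def fetch_rate_alert(samples, floor=0.1, min_consecutive=3):
    """Return True if the most recent samples are all below the floor.

    samples: fetch-rate values in chronological order (fetches per second).
    floor and min_consecutive are hypothetical tuning parameters.
    """
    if len(samples) < min_consecutive:
        return False
    return all(rate < floor for rate in samples[-min_consecutive:])


# The rate falls toward zero and stays there, so the check fires.
print(fetch_rate_alert([2.0, 1.5, 0.05, 0.02, 0.0]))
# A single dip followed by recovery does not fire the check.
print(fetch_rate_alert([2.0, 0.05, 1.8, 0.02, 0.01]))
```

Requiring several consecutive low samples avoids alerting on a single brief dip.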

Examples

Figure 1 shows a Grafana dashboard displaying the expected behavior when the ASN.1 Streamer application consumes CDR records, as reflected in consumer metrics.

Figure 1. ASN.1 Streamer Application Consumer Metrics

Figure 1 highlights the following events.

  1. The load test begins and the consumer starts fetching and consuming records.
  2. The load test ends. The rates of consumption of records and bytes have remained relatively constant.
  3. The consumer finishes processing all records. There is no reported lag in the consumption of records.

Figure 2 shows a Grafana dashboard displaying the behavior of the ASN.1 SFTP Sink when it is overloaded by being sent more records than it can fetch and process.

Note: In this case, the ASN.1 Streamer application is not optimized for production. The absolute values shown below are not representative of performance.
Figure 2. ASN.1 Sink Overload

Figure 2 highlights the following events.

  1. The load test begins.
  2. The fetch rate drops and stabilizes. Records begin to lag.
  3. The load test stops.
  4. The ASN.1 Sink application starts to recover.

The load test shows the fetch rate dropping to almost zero, corresponding to a rising records lag maximum. After the load test completes, the consumer takes time to catch up.