Recommended Kafka Producer Metrics

Use these standard Kafka producer metrics to monitor 5G event streaming performance.

Table 1 describes the recommended producer metrics. The MBean for these metrics is kafka.producer:type=producer-metrics,client-id=([-.\w]+).
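As an illustration of how the MBean name can be matched when filtering collected metrics, the following minimal sketch extracts the client-id from a name. It assumes the client-id group is the character class of word characters, dots, and hyphens ([-.\w]+). The helper name client_id_from_mbean and the sample client-id are hypothetical.

```python
import re

# Assumed pattern: the client-id may contain word characters, dots, and hyphens.
MBEAN_PATTERN = re.compile(
    r"kafka\.producer:type=producer-metrics,client-id=([-.\w]+)"
)

def client_id_from_mbean(name):
    """Return the client-id from a producer-metrics MBean name, or None."""
    match = MBEAN_PATTERN.fullmatch(name)
    return match.group(1) if match else None

# Hypothetical client-id, for illustration only.
print(client_id_from_mbean(
    "kafka.producer:type=producer-metrics,client-id=asn1-streamer-1"
))
```

Names that belong to other MBean types, such as consumer metrics, return None and can be skipped.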

Table 1. Producer Metrics
Name                  Description
response-rate         The average number of responses received per second.
request-rate          The average number of requests sent per second.
request-latency-avg   The average request latency, in milliseconds.
outgoing-byte-rate    The average number of outgoing bytes sent per second.
io-wait-time-ns-avg   The average length of time the I/O thread spent waiting for a socket, in nanoseconds.
batch-size-avg        The average number of bytes sent per partition per request.

Response Rate

For producers, the response-rate metric reports the rate of responses received from brokers. Brokers respond to producers when the data has been received. Depending on the producer's acks setting (request.required.acks in the legacy producer), received can mean one of three things:

  • The producer does not wait for an acknowledgment from the broker (acks=0).
  • The leader has written the message to its local log (acks=1).
  • The leader has received confirmation from all in-sync replicas that the message has been written (acks=all).

Producer data is not available for consumption until the required number of acknowledgments has been received. To diagnose low response rates, check the acks setting in your producer configuration. Choosing the right value for acks depends entirely on the use case. The tradeoff is between availability and consistency.
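The acknowledgment levels can be summarized in a small lookup, which may be useful when annotating dashboards or validating configuration files. This is a sketch only: the helper describe_acks is hypothetical, and it assumes the setting is expressed with the values 0, 1, and all (with -1 treated as an alias for all, as in the Java producer's acks setting).

```python
# Meaning of each acknowledgment level, keyed by the configured value.
ACK_MEANINGS = {
    "0": "producer does not wait for any acknowledgment",
    "1": "leader has written the record to its local log",
    "all": "all in-sync replicas have acknowledged the record",
}

def describe_acks(value):
    """Return a short description of an acknowledgment setting."""
    key = str(value).strip().lower()
    if key == "-1":  # -1 is an alias for "all"
        key = "all"
    if key not in ACK_MEANINGS:
        raise ValueError(f"unsupported acks value: {value!r}")
    return ACK_MEANINGS[key]

print(describe_acks("all"))
```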

Request Rate

The request-rate metric reports the rate at which producers send requests to brokers. What counts as a healthy request rate varies by use case. Watch for peaks and drops to ensure continuous service availability. If rate limiting is not enabled, traffic spikes can cause brokers to slow down as they process a rapid influx of data.
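One simple way to watch for peaks and drops is to compare each request-rate sample with the mean of the samples just before it. The following is a minimal sketch, not a recommended alerting rule: the function name, the window of 5 samples, the 50% tolerance, and the sample values are all assumptions chosen for illustration.

```python
from statistics import mean

def flag_rate_anomalies(samples, window=5, tolerance=0.5):
    """Flag samples that deviate from the trailing mean by more than
    the given fraction. Returns a list of (index, value, kind)."""
    flagged = []
    for i in range(window, len(samples)):
        baseline = mean(samples[i - window:i])
        if baseline == 0:
            continue  # no meaningful relative change from a zero baseline
        change = (samples[i] - baseline) / baseline
        if change > tolerance:
            flagged.append((i, samples[i], "peak"))
        elif change < -tolerance:
            flagged.append((i, samples[i], "drop"))
    return flagged

# Hypothetical request-rate samples (requests per second).
rates = [100, 102, 98, 101, 99, 250, 100, 30]
print(flag_rate_anomalies(rates))
# prints [(5, 250, 'peak'), (7, 30, 'drop')]
```

Because a spike raises the trailing mean for the next few samples, a short window makes the check less sensitive immediately after a peak. Tune the window and tolerance to the traffic pattern of your deployment.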

Request Latency Average

The request-latency-avg metric is the average time, in milliseconds, between the call to KafkaProducer.send() and the receipt of a response from the broker.

Producers do not necessarily send each message as soon as it is created. The linger.ms value for the producer determines the maximum wait time before sending a message batch. This can allow collection of a larger batch of messages before sending them in a single request. The default value of linger.ms is zero milliseconds. Setting this to a higher value can increase latency, but it can also help improve throughput as the producer can send multiple messages without incurring network overhead for each one.

Latency has a strong correlation with throughput. If you increase linger.ms to improve throughput, watch request latency to ensure it does not rise beyond an acceptable limit. Modifying the value of batch.size in your producer configuration can lead to significant gains in throughput. Determining an optimal batch size is largely use-case dependent, but in general, increase batch size if you have available memory.

Note: Small batches involve more network round trips, which can reduce throughput.
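The effect of batch size on the number of round trips can be estimated with simple arithmetic. The sketch below assumes a single partition, fixed-size messages, batches that always fill completely, and no per-record overhead or compression. These are simplifications, and the function name and sample numbers are hypothetical.

```python
import math

def requests_needed(message_count, message_bytes, batch_bytes):
    """Estimate how many full batches are needed to send all messages."""
    # At least one message is sent per batch, even if it exceeds batch_bytes.
    per_batch = max(1, batch_bytes // message_bytes)
    return math.ceil(message_count / per_batch)

# 10,000 messages of 200 bytes each, at two candidate batch sizes.
for batch_bytes in (16_384, 65_536):
    print(batch_bytes, requests_needed(10_000, 200, batch_bytes))
# prints:
# 16384 124
# 65536 31
```

Under these assumptions, quadrupling the batch size cuts the number of batches to about a quarter. In practice the gain depends on whether enough data arrives within linger.ms to fill the larger batches.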

Outgoing Byte Rate

As with Kafka brokers, watch the outgoing-byte-rate metric of Kafka producers to track producer network throughput. Observing traffic volume over time is essential for determining whether you must make network infrastructure changes. It also provides a perspective on the production rate of producers, making it easier to identify sources of excessive traffic.
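To relate outgoing-byte-rate to network capacity, the byte rate can be converted to a fraction of link bandwidth. This is a minimal sketch: the function name is hypothetical, the link speed must be supplied by you, and the figures below (50 MB/s on a 1 Gbps link) are illustrative only.

```python
def link_utilization(outgoing_bytes_per_sec, link_bits_per_sec):
    """Return the fraction of link capacity used by the given byte rate."""
    if link_bits_per_sec <= 0:
        raise ValueError("link_bits_per_sec must be positive")
    return (outgoing_bytes_per_sec * 8) / link_bits_per_sec

# Hypothetical values: 50 MB/s of producer traffic on a 1 Gbps link.
utilization = link_utilization(50_000_000, 1_000_000_000)
print(f"{utilization:.0%}")
# prints 40%
```

Summing the byte rates of all producers that share a link before calling the function gives the combined utilization.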

I/O Wait Time

If producers are producing more data than they can send, they end up waiting for network resources. But if producers are not rate-limited or reaching bandwidth maximums, issues become harder to identify. Because disk access tends to be the slowest segment of any processing task, checking the io-wait-time-ns-avg metric on your producers is a good place to start.

The io-wait-time-ns-avg metric reports the average length of time, in nanoseconds, that the I/O thread spent waiting for a socket to be ready for reads or writes. Excessive wait times might indicate that producers are unable to get data fast enough. If you are using traditional hard drives for storage, consider SSDs instead.

Batch Size

To use network resources more efficiently, Kafka producers group messages into batches before sending them. The producer waits to accumulate the amount of data defined by batch.size (16 KB by default), for up to the maximum time specified by linger.ms (0 milliseconds by default). The batch-size-avg metric shows fluctuations in the average batch size. If batches sent by a producer are consistently smaller than the value of batch.size, any time your producer spends lingering is wasted waiting for more data that never arrives. Consider reducing your linger.ms setting if the value of batch-size-avg is lower than your configured batch.size.
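The comparison between batch-size-avg and batch.size can be expressed as a fill ratio. The sketch below is one possible way to turn that ratio into a tuning hint. The function names and the 0.5 threshold are assumptions, not values taken from Kafka or from this guide.

```python
def batch_fill_ratio(batch_size_avg, batch_size):
    """Return batch-size-avg as a fraction of the configured batch.size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return batch_size_avg / batch_size

def linger_hint(batch_size_avg, batch_size, linger_ms, low_fill=0.5):
    """Suggest reducing linger.ms when batches stay mostly empty."""
    ratio = batch_fill_ratio(batch_size_avg, batch_size)
    # Lingering only costs time if linger.ms is non-zero.
    if linger_ms > 0 and ratio < low_fill:
        return "consider reducing linger.ms"
    return "no change suggested"

# Hypothetical reading: 4 KB average batches against a 16 KB batch.size.
print(batch_fill_ratio(4096, 16384))
print(linger_hint(4096, 16384, linger_ms=20))
# prints:
# 0.25
# consider reducing linger.ms
```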

Examples

Figure 1 shows a Grafana dashboard displaying production of ASN.1-encoded records by the ASN.1 Streamer application as reflected in producer metrics.

Figure 1. Production Metrics Displayed in Grafana

Figure 1 highlights the following events:

  1. The load test starts and the consumer starts fetching and consuming records.
  2. The load test stops. The rates of consumption of records and bytes have been relatively constant.
  3. The consumer stops processing all records.