Monitoring Kafka performance metrics

This post is Part 1 of a 3-part series about monitoring Kafka. Part 2 is about collecting operational data from Kafka, and Part 3 details how to monitor Kafka with Datadog.

What is Kafka?

Kafka is a distributed, partitioned, replicated log service developed by LinkedIn and open sourced in 2011. In essence, it is a massively scalable pub/sub message queue architected as a distributed transaction log. It was created to provide “a unified platform for handling all the real-time data feeds a large company might have”.1

There are a few key differences between Kafka and other queueing systems like RabbitMQ, ActiveMQ, or Redis’s Pub/Sub:

  1. As mentioned above, it is fundamentally a replicated log service.
  2. It does not use AMQP or any other pre-existing protocol for communication. Instead, it uses a custom binary TCP-based protocol.
  3. It is very fast, even in a small cluster.
  4. It has strong ordering semantics and durability guarantees.
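To make the "log service" idea more concrete, here is a minimal in-memory sketch of a partitioned, append-only log. It is a toy model only, not Kafka's API or implementation: the `PartitionedLog` class, its method names, and the key-based partitioning rule are all assumptions made for illustration, and replication and the wire protocol are left out entirely.

```python
class PartitionedLog:
    """Toy model of a partitioned, append-only log (hypothetical, not Kafka's API)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        # One append-only list of messages per partition.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a message and return its (partition, offset)."""
        # Assumed partitioning rule: a deterministic function of the key, so
        # messages with the same key always land in the same partition and
        # keep their relative order.
        partition = sum(key.encode("utf-8")) % self.num_partitions
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1

    def read(self, partition, offset):
        """Return every message in a partition from `offset` onward."""
        # Reading does not remove anything: the log is only ever appended to.
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=2)
p, first = log.append("user-1", "login")
_, second = log.append("user-1", "logout")
print(log.read(p, first))
```

In this sketch, ordering is tracked per partition through monotonically increasing offsets, and a reader can re-read from any earlier offset because consuming a message never deletes it. That is the main contrast with a traditional queue, where a message typically disappears once it has been delivered.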

Despite being pre-1.0, Kafka is production-ready and is used by a large number of high-profile companies, including LinkedIn, Yahoo, Netflix, and Datadog.