This post is Part 1 of a 3-part series about monitoring Kafka. Part 2 is about collecting operational data from Kafka, and Part 3 details how to monitor Kafka with Datadog.
What is Kafka?
Kafka is a distributed, partitioned, replicated log service developed by LinkedIn and open sourced in 2011. In essence, it is a massively scalable pub/sub message queue architected as a distributed transaction log. It was created to provide “a unified platform for handling all the real-time data feeds a large company might have”.1
There are a few key differences between Kafka and other queueing systems like RabbitMQ, ActiveMQ, or Redis’s Pub/Sub:
- As mentioned above, it is fundamentally a replicated log service.
- It does not use AMQP or any other pre-existing protocol for communication. Instead, it uses a custom binary TCP-based protocol.
- It is very fast, even in a small cluster.
- It has strong ordering semantics and durability guarantees.
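To make the "log" idea more concrete, here is a minimal in-memory sketch of a partitioned, append-only log. It is a toy model, not Kafka's client API: the `ToyLog` class, its `publish` and `poll` methods, and the CRC32-based partition choice are all hypothetical names and simplifications chosen for illustration, and replication is not modeled at all. What it does show is that order is preserved within a partition, and that reading only advances a consumer group's own position, so several groups can read the same messages independently.

```python
from collections import defaultdict
from zlib import crc32


class ToyLog:
    """A toy in-memory partitioned log (illustration only, not Kafka's API)."""

    def __init__(self, num_partitions):
        # Each partition is an append-only list; the list index acts as the offset.
        self.partitions = [[] for _ in range(num_partitions)]
        # Each consumer group tracks its own next offset for every partition.
        self.positions = defaultdict(lambda: [0] * num_partitions)

    def publish(self, key, value):
        # Messages with the same key always land in the same partition,
        # so their relative order is preserved. CRC32 is a stand-in hash.
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def poll(self, group, partition):
        # Reading advances only this group's position; the log is unchanged.
        start = self.positions[group][partition]
        messages = self.partitions[partition][start:]
        self.positions[group][partition] = len(self.partitions[partition])
        return messages


log = ToyLog(num_partitions=2)
partition, _ = log.publish("user-1", "login")
log.publish("user-1", "click")
log.publish("user-1", "logout")

# Two independent consumer groups read the same partition.
first = log.poll("billing", partition)
second = log.poll("analytics", partition)
```

Both `first` and `second` hold the three messages in publish order, because consuming does not remove anything from the log. A traditional queue that deletes a message once it is delivered would hand the second reader nothing.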
Despite being pre-1.0 (the current version is 0.9.0.1), it is production-ready and is used in production at a large number of high-profile companies, including LinkedIn, Yahoo, Netflix, and Datadog.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics