As custodians and builders of the streaming platform at Grab operating at massive scale (think terabytes of data ingress each hour), the Coban team’s mission is to provide a NoOps, managed platform for seamless, secure access to event streams in real-time, for every team at Grab.
Streaming systems are often at the heart of event-driven architectures, and what starts as a need for a simple message bus for asynchronous processing of events quickly evolves into one that requires a more sophisticated stream processing paradigms. Earlier this year, we saw common patterns of event processing emerge across our Go backend ecosystem, including:
Filtering and mapping stream events of one type to another
Aggregating events into time windows and materializing them back to the event log or to various types of transactional and analytics databases
Generally, a class of problems surfaced which could be elegantly solved through an event sourcing1 platform with a stream processing framework built over it, similar to the Keystone platform at Netflix2.
This article details our journey building and deploying an event sourcing platform in Go, building a stream processing framework over it, and then scaling it (reliably and efficiently) to service over 300 billion events a week.
Message queues like Apache Kafka are a common component of distributed systems. This blog post will look at several different strategies for improving performance when working with message queues.
Kafka consists of topics which have one or more partitions.
Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
With this structured commit log, each consumer follows the same basic steps:
A consumer is assigned a particular topic-partition (either manually or automatically via a consumer group)
The previous offset is read so that the consumer will begin where it last left off
Messages are consumed from Kafka
Messages are processed in some way
The processed message offset is committed back to Kafka
Other types of message queues (like AMQP) have a similar flow – messages are consumed, processed and acknowledged. Generally we rely on idempotent message processing – that is the ability to process the same message twice with no ill effect – and err on the side of only committing if we’re certain we’ve done what we need to. This gives us durability and guarantees that every message will be processed, even if our consumer process crashes.
This post is Part 1 of a 3-part series about monitoring Kafka. Part 2 is about collecting operational data from Kafka, and Part 3 details how to monitor Kafka with Datadog.
What is Kafka?
Kafka is a distributed, partitioned, replicated, log service developed by LinkedIn and open sourced in 2011. Basically it is a massively scalable pub/sub message queue architected as a distributed transaction log. It was created to provide “a unified platform for handling all the real-time data feeds a large company might have”.1