Kafka - Architecture
What is Kafka?
Kafka is an event-streaming platform designed to process high volumes of data in real time. Developed at LinkedIn and open-sourced in 2011, it has become the infrastructural backbone of companies like Netflix, Twitter, and Spotify.
Why do we need Kafka?
In today’s data-driven world, tracking information like user clicks, recommendations, and shopping carts can be invaluable for a company’s growth. With these analytics, companies can make the product improvements needed to boost user engagement and conversion rates. However, on sites with millions of daily users, collecting and analyzing this data is nontrivial.
Kafka was designed to streamline this operation, acting as a robust tool that maintains efficient, real-time processing capabilities with incredible quantities of data. For instance, as of late 2019, LinkedIn’s Kafka deployments were managing more than 7 trillion messages per day.
How does Kafka work?
Kafka provides a structured architecture through which you can track and analyze events in your application. This design is fairly nuanced and there are many moving parts, so the focus of this article will be to demystify the Kafka architecture.
The “bulletin board” analogy
To keep things simple and concrete, let’s start off by thinking of Kafka as a simple bulletin board. It allows you to keep track of the various messages being posted throughout the day. Individuals can interact with the bulletin board in two ways:
- An individual can “post” a message onto the bulletin board
- An individual can “read” a message from the bulletin board
Let’s call those who post messages “producers”, and those who read messages “consumers”.
Topics
More specifically, our bulletin board can be broken down into various categories, which we'll call "topics". Producers choose which topics to post messages to, and consumers choose which topics to read messages from.
In a non-technical setting, these topics may be sports, politics, fashion, and food. For a more technical use case, each topic may represent an event that we want to track: say, posts, messages, network, and profile. We want to record information about what posts users are making, what messages they're sending and receiving, who they're networking with, and which profiles they're viewing. Here's a picture so you can visualize our bulletin board (in other words, our Kafka system) so far.
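The bulletin-board analogy can be sketched in a few lines of code. This is a toy in-memory model, not the real Kafka client API: each topic is just a list, the `produce` and `consume` helpers are made up for illustration, and the sample messages are hypothetical.

```python
from collections import defaultdict

# Toy model of the "bulletin board": each topic maps to a list of messages.
# (Illustrative only -- not the real Kafka API.)
board = defaultdict(list)

def produce(topic, message):
    """A producer posts a message to a topic."""
    board[topic].append(message)

def consume(topic):
    """A consumer reads the messages currently in a topic."""
    return list(board[topic])

produce("posts", "Excited to start my new job!")
produce("profile", "viewed: jane-doe")
produce("posts", "Check out my GitHub repo")

print(consume("posts"))
# ['Excited to start my new job!', 'Check out my GitHub repo']
```

Note that producers and consumers never talk to each other directly; the board (the Kafka system) sits in the middle and decouples them.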
Breaking down topics
Now, let’s zoom in on one of these topics. For this article, we’ll choose “Posts”. We’re going to dig a bit deeper to see how each topic works under the hood.
Within our "Posts" topic, Kafka will store events in a queue. This ensures that we can retain the correct order of incoming messages. However, with large volumes of data, a single queue can quickly get overwhelmed. So, Kafka splits the data across multiple queues, each of which is called a "partition". This idea is similar to data sharding/partitioning in system design. In our example, let's say posts whose title starts with A-M will go in partition 0, and posts whose title starts with N-Z will go in partition 1. Here's a diagram to make this more concrete:
Each partition will store a series of events in order. In Partition 0 (A-M), we can see the chronological breakdown of posts whose title starts with A-M. First, someone posted about getting a job. Next, someone posted about their GitHub repo, and so on. Underneath each queue, you can see the indices of its entries; these are referred to as "offsets".
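The partitioning scheme above can be sketched as follows. Again, this is an illustrative model rather than Kafka's actual partitioner (real Kafka typically routes by hashing a message key); the A-M/N-Z rule and the sample titles come from the running example.

```python
# Toy model of one topic split into two partitions, routed by the first
# letter of the post title (A-M -> partition 0, N-Z -> partition 1).
# An event's index within its partition is its "offset".
partitions = {0: [], 1: []}

def partition_for(title):
    """Route a title to a partition by its first letter."""
    return 0 if title[0].upper() <= "M" else 1

def append(title):
    """Append an event to its partition; return (partition, offset)."""
    p = partition_for(title)
    partitions[p].append(title)
    return p, len(partitions[p]) - 1  # offset = index within the partition

print(append("Got a new job!"))   # (0, 0)
print(append("My GitHub repo"))   # (0, 1)
print(append("New puppy pics"))   # (1, 0)
```

Notice that ordering is only guaranteed *within* a partition: the job post and the GitHub post keep their relative order in partition 0, but nothing orders them against events in partition 1.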
This design allows producers to seamlessly post messages to each topic, in addition to making it easy for consumers to read data in the correct order. Now that we’ve covered the architecture, here’s a picture that shows the high-level data flow through a Kafka system:
A few notes on precision
It's worth refining our language when talking about consumers. It is true that our consumers are reading events from various topics. However, to be more precise, consumers "subscribe" to specific topics, which allows them to read the events from those topics. There are two finer details worth highlighting:
- The arrows are pointing from the consumers to the "bulletin board", not the other way around. Kafka does not push messages out to consumers. Instead, consumers pull the information they need from Kafka.
- Kafka does not store any information about what messages a particular consumer has already seen. Instead, whenever a consumer is going to read data from a topic, it will provide a specific offset (in other words, index) that tells Kafka where to start reading from.
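Both details above can be captured in a short sketch. As before, this is a hypothetical model, not the real consumer API: the `Consumer` class and `read_from` helper are invented for illustration, and the event names are made up.

```python
# Toy model: the broker stores events but keeps NO per-consumer state.
# Each consumer tracks its own offset and passes it in on every read,
# pulling events rather than having them pushed.
topic = ["job update", "github repo", "conference talk", "new certification"]

def read_from(offset):
    """Return all events at or after the given offset."""
    return topic[offset:]

class Consumer:
    def __init__(self):
        self.offset = 0  # the consumer, not the broker, remembers its position

    def poll(self):
        events = read_from(self.offset)
        self.offset += len(events)  # advance past everything just read
        return events

c = Consumer()
print(c.poll())  # all four events
print(c.poll())  # [] -- nothing new since the last poll
```

Because the broker holds no per-consumer state, a consumer can also rewind by resetting its offset and re-reading old events, something a push-based system could not easily offer.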
Finally, let’s try to be more precise about this “bulletin board”. The Kafka architecture is composed of “brokers”, which are just servers containing one or more topics, similar to the bulletin boards in our analogy. A Kafka “cluster” is simply a group of brokers. You can think of this as distributing a large number of messages across several different bulletin boards.
Summary
- Kafka is an event-streaming platform that acts as a "bulletin board", allowing producers to post messages and consumers to read them (by subscribing to topics)
- Each topic stores data in various partitions (or queues) to preserve the order of information
- Kafka’s structured approach to processing information enables companies to monitor event-driven data to inform product and engineering decisions
One final note: since real Kafka deployments are complex, there is always room to go deeper on redundancy (replication across brokers) and efficiency (partitioning and tuning).