Monday, February 20, 2017

Apache Kafka - An Overview

Excerpted from:

  1. Message Oriented Middleware (MOM) such as Apache Qpid, RabbitMQ, Microsoft Message Queue, and IBM MQ Series were used for exchanging messages across various components. While these products are good at implementing the publisher/subscriber pattern (Pub/Sub), they are not specifically designed for dealing with large streams of data originating from thousands of publishers. Most of the MOM software have a broker that exposes Advanced Message Queuing Protocol (AMQP) protocol for asynchronous communication.
  2. Kafka is designed from the ground up to deal with millions of firehose-style events generated in rapid succession. It guarantees low latency, “at-least-once”, delivery of messages to consumers. Kafka also supports retention of data for offline consumers, which means that the data can be processed either in real-time or in offline mode.
  3. Kafka is designed to be a distributed commit log. Much like relational databases, it can provide a durable record of all transactions that can be played back to recover the state of a system. 
  4. Kafka provides redundancy, which ensures high availability of data even when one of the servers faces disruption.
  5. Multiple event sources can concurrently send data to a Kafka cluster, which will reliably gets delivered to multiple destinations.
  6. Key concepts:
  • Message - Each message is a key/value pair. Irrespective of the data type, Kafka always converts messages into byte arrays.
  • Producers - or publisher clients that produce data
  • Consumers - are subscribers or readers that read the data. Unlike subscribers in MOM, Kafka consumers are stateful, which means they are responsible for remembering the cursor position, which is called as an offset. The consumer is also a client of Kafka cluster. Each consumer may belong to a consumer group, which will be introduced in the later sections.
    The fundamental difference between MOM and Kafka is that the clients will never receive messages automatically. They have to explicitly ask for a message when they are ready to handle.
  • Topics - logical collection of messages. Data sent by producers are stored in topics. Consumers subscribe to a specific topic that they are interested in.
  • Partition - Each topic is split into one or more partitions. They are like shards and Kafka may use the message key to automatically group similar messages into a partition. This scheme enables Kafka to dynamically scale the messaging infrastructure. Partitions are redundantly distributed across the Kafka cluster. Messages are written to one partition but copied to at least two more partitions maintained on different brokers with the cluster.
  • Consumer groups - consumers belong to at least one consumer group, which is typically associated with a topic. Each consumer within the group is mapped to one or more partitions of the topic. Kafka will guarantee that a message is only read by a single consumer in the group. Each consumer will read from a partition while tracking the offset. If a consumer that belongs to a specific consumer group goes offline, Kafka can assign the partition to an existing consumer. Similarly, when a new consumer joins the group, it balances the association of partitions with the available consumers.
    It is possible for multiple consumer groups to subscribe to the same topic. For example, in the IoT use case, a consumer group might receive messages for real-time processing through an Apache Storm cluster. A different consumer group may also receive messages from the same topic for storing them in HBase for batch processing.
    The concept of partitions and consumer groups allows horizontal scalability of the system.
  • Broker - Each Kafka instance belonging to a cluster is called a broker. Its primary responsibility is to receive messages from producers, assigning offsets, and finally committing the messages to the disk. Based on the underlying hardware, each broker can easily handle thousands of partitions and millions of messages per second.
    The partitions in a topic may be distributed across multiple brokers. This redundancy ensures the high availability of messages.
  • Cluster - A collection of Kafka broker forms the cluster. One of the brokers in the cluster is designated as a controller, which is responsible for handling the administrative operations as well as assigning the partitions to other brokers. The controller also keeps track of broker failures.
  • Zookeeper - Kafka uses Apache ZooKeeper as the distributed configuration store. It forms the backbone of Kafka cluster that continuously monitors the health of the brokers. When new brokers get added to the cluster, ZooKeeper will start utilizing it by creating topics and partitions on it.
Kafka in docker - 

No comments:

15 sorting algorithms visualized in 5 minutes, with awesome arcade sounds

15 sorting algorithms visualized in 5 minutes, with awesome arcade sounds from r/programming