Apache Kafka

January 29, 2015 - BigData / Book notes

Introduction

In this blog post, I document useful notes and guidelines on Apache Kafka. If you have questions or comments, please don’t hesitate to e-mail me.

Background

Kafka is “a solution… to deal with real-time volumes of information and route it to multiple consumers quickly” [1]

Key characteristics of Apache Kafka:

  • Designed to support millions of messages per second
  • Constant-time (O(1)) performance even with increased data loads
  • Real-time focus – immediate consumption of produced messages
  • Distribution of message consumption over a cluster of machines
  • Consumers track their own position (offset) in the message stream, allowing them to “roll back” and re-read old messages

Kafka is optimized for scenarios that need multiple data pipelines, i.e. a relay station passing different kinds of messages between a wide range of producers and consumers while maintaining ordering per message type [2]

The idea of the partition is central in Kafka: a partition is an ordered log holding a subset of a given topic’s messages, and within a consumer group each partition is consumed by a single consumer, which is what preserves message ordering. This is not to be confused with replication, which exists for recovery after broker failures. [3]
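As a sketch of how per-key ordering falls out of partitioning, a keyed partitioner can be approximated as a hash of the message key modulo the partition count. (The helper below is illustrative, not the Kafka API; Kafka’s default partitioner hashes the key in a similar spirit.)

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    # Hash the message key and map it onto one of the topic's partitions.
    # Because the hash is deterministic, every message with the same key
    # lands in the same partition, so per-key ordering is preserved.
    return zlib.crc32(key) % num_partitions

# All messages keyed "user-42" go to the same partition, in order:
assert choose_partition(b"user-42", 3) == choose_partition(b"user-42", 3)
```

Messages with different keys may land in different partitions, which is exactly why ordering is only guaranteed per partition, not per topic.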

The consumer group for a particular Kafka topic has two variants:

  • All consumers share the same consumer group name – each message is consumed by only one consumer in the group; the remaining consumers act as standbys and take over if the active consumer fails. This is useful for scalability and fault tolerance.
  • All consumers have distinct consumer group names – every message is broadcast to every consumer.
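The two variants can be illustrated with a small simulation of the delivery rule – within a group, one consumer receives the message, while every distinct group receives its own copy. (This is a sketch of the semantics only, not Kafka’s actual assignment logic.)

```python
from collections import defaultdict

def deliver(message, consumers):
    """Simulate Kafka's delivery rule for one message: within a consumer
    group only one consumer receives it, while every distinct group
    receives its own copy."""
    groups = defaultdict(list)
    for name, group in consumers:
        groups[group].append(name)
    # Deterministically pick the first consumer registered in each group.
    return {group: members[0] for group, members in groups.items()}

# Variant 1: all consumers share a group -> only one consumer gets the message
same_group = [("c1", "g"), ("c2", "g"), ("c3", "g")]
assert deliver("m1", same_group) == {"g": "c1"}

# Variant 2: every consumer has its own group -> the message is broadcast
own_groups = [("c1", "g1"), ("c2", "g2"), ("c3", "g3")]
assert deliver("m1", own_groups) == {"g1": "c1", "g2": "c2", "g3": "c3"}
```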

Kafka “only provides a total order over messages within a partition, not between different partitions in a topic… if you require a total order over messages this can be achieved with a topic that has only one partition, though this will mean only one consumer process.” [4]

When using Kafka across data centers, consider using message compression to reduce bandwidth usage [5]
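In the 0.8-era producer this is a one-line configuration change; the property name below matches the old (Scala) producer config – newer clients renamed it compression.type:

```properties
# Enable GZIP compression on the producer (0.8-era property name)
compression.codec=gzip
```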

There are two replication modes in Kafka: [6]

  1. Synchronous replication – a message sent from a producer is first stored by the lead (replica) broker and forwarded to all other replica brokers. Only once all replica brokers have acknowledged receipt does the lead broker acknowledge the message to the producer, which can then move on to the next message.
  2. Asynchronous replication – a message sent from a producer is stored by the lead replica broker and acknowledged to the producer immediately. The key difference is that the other replica brokers do not need to store the message before the producer can move on to the next message.
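From the producer’s side, the choice between the two modes is governed by the acknowledgement setting; in the 0.8-era producer config this property is request.required.acks (newer clients call it acks):

```properties
# -1: wait until all in-sync replicas have stored the message (synchronous)
#  1: acknowledge once the lead replica has stored it (asynchronous w.r.t. followers)
#  0: don't wait for any acknowledgement at all
request.required.acks=-1
```

Synchronous replication trades producer throughput for durability; asynchronous replication risks losing messages that the lead broker acknowledged but had not yet replicated when it failed.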

Cluster configuration

Steps to setting up a clustered (multi-node, multi-broker) Kafka configuration:

  1. Download Apache Kafka to every node in your cluster; assume it is installed at $KAFKA_HOME
  2. Add a configuration file under $KAFKA_HOME/config for each broker. Copy the contents of $KAFKA_HOME/config/server.properties into something like $KAFKA_HOME/config/server-X.properties
  3. For each configuration file:
    1. Change the broker.id property to a unique integer identifying that broker
    2. Set the zookeeper.connect property to your ZooKeeper ensemble. NOTE: it’s good practice to append a directory name to the ZooKeeper host list so that ZooKeeper can be shared with other applications, e.g. host:port/kafka
    3. For data-heavy real-time applications, consider setting log.retention.hours=1 and log.cleaner.enable=true to limit how much log data accumulates on disk
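Putting the steps above together, one broker’s configuration file might look like the following (the broker id, port, paths, and hostnames are illustrative values):

```properties
# $KAFKA_HOME/config/server-1.properties
broker.id=1
port=9093
log.dirs=/tmp/kafka-logs-1
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka
log.retention.hours=1
log.cleaner.enable=true
```

Each additional broker gets its own copy of this file with a distinct broker.id, port, and log.dirs.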

References

1. Garg, Nishant. “Introducing Kafka.” Apache Kafka. Birmingham: Packt, 2013. Print.
2. Shapira, Gwen, and Jeff Holoman. “Apache Kafka for Beginners.” Cloudera Engineering Blog. Cloudera, 12 Sept. 2014. Web. 29 Jan. 2015. <http://blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/>.
3. “Kafka 0.8.1 Documentation.” Apache Kafka. Apache Software Foundation. Web. 29 Jan. 2015. <http://kafka.apache.org/documentation.html>.
4. “Kafka 0.8.1 Documentation: Consumers.” Apache Kafka. Apache Software Foundation. Web. 29 Jan. 2015. <http://kafka.apache.org/documentation.html>.
5. Narkhede, Neha. “Compression.” Apache Kafka Wiki. Apache Software Foundation, 1 Jan. 2011. Web. 29 Jan. 2015. <https://cwiki.apache.org/confluence/display/KAFKA/Compression>.
6. Garg, Nishant. “Kafka Design.” Apache Kafka. Birmingham: Packt, 2013. Print.
