Apache Kafka: A Comprehensive Overview

What is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform for processing high volumes of data in real time. Kafka is mainly used for building real-time streaming data pipelines and applications. It provides capabilities for data integration and processing, including durable message storage and delivery of the same data to multiple consumers.

Why Use Kafka When We Have Spark Structured Streaming?

While Spark Structured Streaming handles real-time stream processing, Kafka is still needed for:

  1. Data Integration / Collection: Kafka is highly efficient at collecting large volumes of data (millions of records per second) from various sources and making it available to downstream systems.
  2. Data Processing: Kafka also offers built-in processing capabilities via Kafka Streams, which can be used to process and analyze data in real time.

Socket vs Kafka: Why is Socket Not Used in Production?

  • A socket is a combination of an IP address and a port, typically used for point-to-point real-time communication.
  • Drawbacks of Sockets:
    • Not a Replayable Source: Once data is consumed, it cannot be retrieved again. In contrast, Kafka retains data for 7 days by default (and the retention period is configurable).
    • No Buffering: Sockets lack buffering capabilities, which can result in data loss if producers are faster than consumers.
  • Kafka, on the other hand, decouples producers from consumers and retains data durably, which makes it scalable and reliable for production environments.

Core Concepts of Apache Kafka

  1. Producer: The application that produces data to Kafka.

    • Example: A social media platform like Twitter producing tweets to a Kafka topic.
  2. Consumer: The application that consumes data from Kafka.

    • Example: A Spark Streaming application processing Kafka data.
  3. Broker: A single Kafka server (node) that stores data and serves producer and consumer requests. Multiple brokers together form a Kafka Cluster.

  4. Cluster: A collection of Kafka brokers. For example, a 10-node cluster means there are 10 brokers working together.

  5. Topic: Kafka topics act as data storage containers, similar to tables in a relational database.

    • Example: tweets_data, banking_data, employee_data.
  6. Partitions: Subdivisions of a topic that spread its data across different machines. Large datasets are divided into partitions to improve scalability and parallelism.

  7. Partition Offset: A unique, sequential number assigned to each message within a partition, identifying the message and preserving its order.

  8. Consumer Groups: Multiple consumers grouped together to process data in parallel. Kafka divides a topic's partitions among the consumers in the group so that no single consumer becomes a bottleneck (see the sketch after this list).

    • Note: Within a consumer group, the number of active consumers cannot exceed the number of partitions; any extra consumers sit idle.
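
To make consumer groups concrete, here is a minimal sketch using the confluent_kafka Python client (the broker address, topic, and group name are placeholders). Running several copies of this script with the same group.id causes Kafka to split the topic's partitions among them:

from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': '<server-address>',
    'group.id': 'demo-consumer-group',   # members with the same group.id share partitions
    'auto.offset.reset': 'earliest',     # start from the oldest retained message
})
consumer.subscribe(['<topic-name>'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)  # fetch the next message, or None on timeout
        if msg is None:
            continue
        if msg.error():
            print('Consumer error: {}'.format(msg.error()))
            continue
        print('partition={} offset={} value={}'.format(
            msg.partition(), msg.offset(), msg.value().decode('utf-8')))
finally:
    consumer.close()  # commit offsets and leave the group cleanly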

Kafka as a Data Integration Platform

Kafka excels in data integration by allowing multiple consumers to access the same dataset, unlike traditional messaging systems where data is lost after consumption. Kafka ensures high availability and scalability with its distributed architecture.

  • Producer: Sends data to a topic.
  • Consumer: Consumes data from the topic.
  • Partitioning: Kafka partitions data for scalability across multiple machines.

Kafka Cluster Architecture

Kafka cluster architecture consists of:

  • Brokers: Individual machines running Kafka.
  • Topics: Logical grouping of data (e.g., tweets, logs).
  • Partitions: Subdivisions of topics for horizontal scalability.

Practical Steps for Kafka Practice

1. Create a Cluster

To create a Kafka cluster, you can use platforms like Confluent Cloud.

2. Create a Topic

Use the following command to create a Kafka topic:

kafka-topics --create --topic <topic-name> --partitions <num> --replication-factor <num> --bootstrap-server <server-address>
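
For example, assuming a single local broker at localhost:9092 (with one broker, the replication factor cannot exceed 1), a three-partition topic could be created like this:

kafka-topics --create --topic tweets_data --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092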

3. Produce a Message

Kafka allows message production via both UI and CLI.

  • UI: Go to the Messages tab, select Produce New Message, and input your message.
  • CLI: Use the following command to produce a message (older Kafka versions use --broker-list instead of --bootstrap-server):
kafka-console-producer --bootstrap-server <server-address> --topic <topic-name>
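
The console producer can also attach a key to each message, which becomes relevant once data is partitioned by key (as in the retail example later). Assuming a local broker and the illustrative topic from above, each input line is split into key and value on the chosen separator:

kafka-console-producer --bootstrap-server localhost:9092 --topic tweets_data --property "parse.key=true" --property "key.separator=:"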

4. Consume a Message

To consume messages from a topic, use the following command:

kafka-console-consumer --bootstrap-server <server> --topic <topic-name> --from-beginning
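
For example, with an assumed local broker, the following replays the topic from the start and prints each message's key alongside its value:

kafka-console-consumer --bootstrap-server localhost:9092 --topic tweets_data --from-beginning --property "print.key=true"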

Programmatic Approach to Implement a Use Case Using Kafka

Kafka Producer Example: Retail Store

In a retail use case (e.g., Walmart), where each purchase generates a transaction, we can simulate a real-time data stream using Kafka:

  1. Define the Key: The customer ID can be used as the Kafka message key; Kafka's default partitioner hashes the key, so all transactions from the same customer go to the same Kafka partition (preserving per-customer ordering).
  2. Kafka Producer Code:
    from confluent_kafka import Producer
    
    # Delivery callback: invoked once per message to report success or failure.
    def acked(err, msg):
        if err is not None:
            print('Message delivery failed: {}'.format(err))
        else:
            print('Message delivered to {} [{}]'.format(msg.topic(), msg.partition()))
    
    producer = Producer({'bootstrap.servers': '<server-address>'})
    
    # Sample message: keying by customer ID routes every transaction from the
    # same customer to the same partition.
    producer.produce('<topic-name>', key='customer_id', value='transaction_data', callback=acked)
    producer.flush()  # block until the message is delivered
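
In practice, each transaction would be serialized before being produced. The following sketch (with made-up field names and values, purely for illustration) sends a few JSON-encoded transactions keyed by customer ID, reusing the acked callback defined above:

import json
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': '<server-address>'})

# Hypothetical transactions -- in a real pipeline these would arrive from
# point-of-sale systems.
transactions = [
    {'customer_id': 'C001', 'amount': 49.99, 'store': 'Dallas'},
    {'customer_id': 'C002', 'amount': 15.50, 'store': 'Austin'},
    {'customer_id': 'C001', 'amount': 8.25, 'store': 'Dallas'},
]

for txn in transactions:
    producer.produce(
        '<topic-name>',
        key=txn['customer_id'],                  # same customer -> same partition
        value=json.dumps(txn).encode('utf-8'),   # serialize the record as JSON bytes
        callback=acked,
    )
    producer.poll(0)  # serve delivery callbacks while producing

producer.flush()  # block until all queued messages are delivered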
    

Kafka Consumer Example: Retail Store

  1. Read from Kafka:

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder \
        .appName("KafkaConsumerExample") \
        .getOrCreate()
    
    df = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "<server-address>") \
        .option("subscribe", "<topic-name>") \
        .load()
    
    # Note: .show() is not supported on streaming DataFrames. Kafka's key and
    # value columns arrive as binary, so cast them to strings before processing:
    df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    
  2. Write to Storage (e.g., Delta Table):

    # The checkpoint location lets the query restart exactly where it left off.
    df.writeStream \
        .format("delta") \
        .outputMode("append") \
        .option("checkpointLocation", "/path/to/checkpoint") \
        .toTable("orders")  # starts the streaming query into the Delta table
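
If the producer writes JSON transactions like the sketch earlier, the stream can be parsed into typed columns before writing. The schema below is an assumption matching those illustrative fields:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Assumed schema for the illustrative transaction JSON.
schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("store", StringType()),
])

parsed = (
    df.selectExpr("CAST(value AS STRING) AS json")
      .select(from_json(col("json"), schema).alias("txn"))
      .select("txn.*")  # flatten the struct into top-level columns
)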
    

Key Points Summary

  • Kafka vs Socket: Kafka is scalable, fault-tolerant, and retains data, whereas sockets lack buffering and are not replayable.
  • Kafka Topics and Partitions: Kafka enables horizontal scalability via partitions, and messages within partitions are ordered by offsets.
  • Producer and Consumer: The producer publishes data to topics, while consumers read from those topics. Kafka supports multiple consumers consuming the same data.
  • Consumer Groups: Kafka allows multiple consumers in a group to scale data processing efficiently.

