Data Processing in Spark: Batch vs Stream Processing

Overview

Data processing can be broadly classified into two types: Batch Processing and Stream Processing. Each approach suits different data characteristics and latency requirements.

1. Batch Processing

Batch processing involves handling historical data collected over time and processing it in groups or batches at a scheduled interval. It is typically used when the data arrives in large chunks at predictable intervals, such as daily or hourly data feeds.
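
For example, a nightly batch job might read the previous day's files in a single pass and write an aggregated report. Below is a minimal PySpark sketch, assuming CSV input with store_id and amount columns; the paths and column names are illustrative.

# Minimal batch sketch: read one day's files in a single run and aggregate them
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as _sum

spark = SparkSession.builder.appName("DailySalesBatch").getOrCreate()

daily_orders = spark.read.option("header", "true").option("inferSchema", "true").csv("data/orders/2024-01-01/")
daily_totals = daily_orders.groupBy("store_id").agg(_sum("amount").alias("total_sales"))
daily_totals.write.mode("overwrite").parquet("data/reports/2024-01-01/")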

2. Stream/Real-time Processing

Stream processing, on the other hand, deals with the continuous processing of data as it becomes available. Records are processed as soon as they arrive, making it ideal for scenarios that require low-latency, on-the-fly computation.

Batch vs Stream Processing

Feature              | Batch Processing                          | Stream Processing
Data Arrival         | Historical data in batches                | Continuous data in real-time
Use Case             | Periodic processing like daily reports    | Real-time analytics like monitoring and alerting
Processing Frequency | Scheduled intervals (e.g., daily, hourly) | Continuous, real-time
Latency              | High latency between processing cycles    | Low latency (near real-time)

Key Differences: MapReduce vs Apache Spark

  • MapReduce is the traditional approach to processing data in a distributed manner. It is designed for batch workloads and cannot handle real-time streaming data.
  • Apache Spark, on the other hand, provides a single framework for both batch and stream processing: it can process historical data (batch) as well as continuously arriving data (streaming).

Spark offers Structured Streaming, a module specifically designed to handle real-time streaming data, overcoming MapReduce's limitations.

Two Ways to Handle Streaming Data in Spark

  1. RDD-based Streaming (Spark Streaming / DStream API)

    • Lower-level constructs, suitable for fine-grained control.
    • More complicated, but flexible for custom processing logic (a brief DStream sketch follows this list).
  2. DataFrame/SparkSQL-based Streaming (Spark Structured Streaming)

    • Higher-level constructs, making the code cleaner and more expressive.
    • Preferred approach for most use cases, as it simplifies real-time data processing.
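
For contrast, here is roughly what a socket word count looks like in the older RDD-based DStream API (now superseded by Structured Streaming); the Structured Streaming version of the same word count appears later in this post. The host and port are placeholders.

# RDD-based approach (Spark Streaming / DStream API): lower level and more manual
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamWordCount")
ssc = StreamingContext(sc, batchDuration=5)            # fixed 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                        # print each micro-batch's counts
ssc.start()
ssc.awaitTermination()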

Example Use-Case: Spark Structured Streaming

Use-Case: Find Trending Hashtags

  • Producer: A continuous data stream generated by an application (e.g., Twitter).
  • Consumer: A Spark Structured Streaming application that processes the data.

Step 1: Set Up a Dummy Producer

To simulate the real-time scenario, we use a socket to continuously stream data.

  • Open a terminal and use the command nc -lk 9998 to listen on port 9998.

Step 2: Start the Consumer (Spark Structured Streaming)

In another terminal, run the Spark Structured Streaming application to consume data.

Note: Each time a line of data is entered in the producer terminal, the consumer picks it up in the next micro-batch and the updated results appear in near real time.
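
A minimal consumer sketch for this use-case, assuming the producer from Step 1 is sending plain lines of text on port 9998 and that a hashtag is simply any token starting with "#":

# Consumer sketch: count hashtags arriving on the socket started with `nc -lk 9998`
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("TrendingHashtags").getOrCreate()

tweets = spark.readStream.format("socket").option("host", "localhost").option("port", 9998).load()

words = tweets.select(explode(split(col("value"), " ")).alias("word"))
hashtags = words.filter(col("word").startswith("#"))
trending = hashtags.groupBy("word").count().orderBy(col("count").desc())

# Complete mode is required here because the result is sorted after an aggregation
query = trending.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()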

Challenges of Stream Processing

  • Back Pressure: Data arrives faster than it can be processed, so micro-batches queue up and the job falls behind (see the rate-limiting sketch after this list).
  • Late Arriving Records: Records that arrive after the batch or window they belong to has already been processed (due to network delays, for instance).
  • Job Failures: Failures mid-stream must not lose or double-count data, so fault tolerance is essential.
  • Incremental Processing: To keep latency low, only new or changed data is processed each cycle rather than recomputing all previous results.
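
One common way to keep micro-batches from piling up is to cap how much data each trigger reads, for example with the maxFilesPerTrigger option on a file source (Kafka offers an analogous maxOffsetsPerTrigger). A sketch, with an illustrative schema and input directory:

# Rate-limit each micro-batch so a trigger is not overwhelmed by a backlog of files
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("RateLimitedStream").getOrCreate()

sales_schema = StructType().add("store_id", StringType()).add("amount", DoubleType())

sales = (spark.readStream
              .schema(sales_schema)
              .option("maxFilesPerTrigger", 10)        # read at most 10 new files per micro-batch
              .json("data/sales_inputdir"))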

Example Use-Case: Retail Sales Monitoring

Consider a retail chain with 500 stores and 5000 billing counters. The business requirement is to calculate total sales every 15 minutes. With streaming, real-time calculations could face challenges such as back pressure and late-arriving records.

Addressing Challenges

  • Back Pressure: If a job is still processing Batch 1 when Batch 2 arrives, Batch 2 queues up and results fall behind schedule.
  • Late Arriving Records: A record generated at 10:08 am but arriving at 10:20 am belongs to a window that may already have been computed.

In Spark Structured Streaming, these challenges are managed with checkpoint-based fault tolerance, incremental (micro-batch) processing, stateful operations, and watermarks for late data, as sketched below.
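
For the retail example, a hedged sketch of the 15-minute aggregation: a watermark lets a record stamped 10:08 am but arriving at 10:20 am still be folded into its window, as long as it is no later than the watermark delay. The source, schema, and column names are illustrative.

# 15-minute tumbling windows over an event-time column, tolerating records up to 15 minutes late
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, sum as _sum
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("RetailSales15Min").getOrCreate()

sales_schema = (StructType()
                .add("store_id", StringType())
                .add("amount", DoubleType())
                .add("event_time", TimestampType()))

sales = spark.readStream.schema(sales_schema).json("data/sales_inputdir")

windowed_totals = (sales
    .withWatermark("event_time", "15 minutes")                    # accept records up to 15 minutes late
    .groupBy(window("event_time", "15 minutes"), "store_id")
    .agg(_sum("amount").alias("total_sales")))

query = (windowed_totals.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "checkpointdir_sales")
         .start())
query.awaitTermination()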


Spark Structured Streaming - Word Count Example

3 Major Steps in a Spark Streaming Application:

  1. Read the Data
  2. Process the Data
  3. Write the Data

Code Example: Word Count

# Step 1: Create a Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Step 2: Read the Data (from socket in this case)
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

# Step 3: Processing Logic - Split lines into words and count them
from pyspark.sql.functions import explode, split
words = lines.select(explode(split(lines.value, " ")).alias("word"))
wordCounts = words.groupBy("word").count()

# Step 4: Write the Data to Console (complete mode, because the query contains an aggregation)
query = wordCounts.writeStream.outputMode("complete").format("console").option("checkpointLocation", "checkpointdir1").start()
query.awaitTermination()

Execution Steps:

  1. Producer: Start Netcat (nc -lk 9999) to simulate data stream.
  2. Consumer: Run the Spark Structured Streaming code.
  3. Processing: Data is processed in micro-batches and displayed on the console in real-time.

Input Sources for Streaming Data

  • File Source: Used to read files in a directory as a stream (supports text, CSV, JSON, ORC, PARQUET).
  • Kafka Source: Reads data from Kafka topics (see the sketch after this list).
  • Socket Source: Reads data from a socket (used mostly for testing).
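
A hedged sketch of reading from a Kafka topic; the broker address and topic name are placeholders, and the spark-sql-kafka connector package must be available on the classpath:

# Kafka source sketch: the value column arrives as binary and is usually cast to string first
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("KafkaReader").getOrCreate()

kafka_df = (spark.readStream
                 .format("kafka")
                 .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder broker
                 .option("subscribe", "orders")                         # placeholder topic
                 .option("startingOffsets", "latest")
                 .load())

messages = kafka_df.select(col("key").cast("string"), col("value").cast("string"))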

Example: Reading Data from a JSON File Source

# Define the schema of the incoming JSON files
# (field names are illustrative; only order_status is referenced in the query below)
from pyspark.sql.types import StructType, IntegerType, StringType
orders_schema = StructType().add("order_id", IntegerType()).add("order_status", StringType())

# Reading JSON data from a directory as a stream
lines = spark.readStream.format("json").schema(orders_schema).option("path", "data/inputdir").load()

# Processing Logic (createOrReplaceTempView returns None, so register the view without assigning it)
lines.createOrReplaceTempView("orders")
completed_orders = spark.sql("SELECT * FROM orders WHERE order_status = 'COMPLETE'")

# Writing to the Sink (CSV format)
query = completed_orders.writeStream.format("csv").outputMode("append").option("path", "data/outputdir").option("checkpointLocation", "checkpointdir1").start()
query.awaitTermination()

Output Modes in Spark Streaming

  1. Complete Mode: Writes the entire updated result table after every micro-batch (requires an aggregation in the query).
  2. Update Mode: Writes only the rows that were updated or inserted since the last trigger (see the sketch after this list).
  3. Append Mode: Writes only newly added rows and never modifies earlier output; aggregations are not supported unless a watermark is defined.
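
For the word-count aggregation from the earlier example, the chosen mode determines what each micro-batch writes. Reusing that wordCounts DataFrame, the same query in update mode would emit only the words whose counts changed in the current trigger:

# Same aggregation as the word count above, but only rows updated in this micro-batch are written
query = (wordCounts.writeStream
                   .outputMode("update")        # "complete" would rewrite the full result table each trigger
                   .format("console")
                   .option("checkpointLocation", "checkpointdir2")
                   .start())
query.awaitTermination()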

Fault Tolerance with Checkpointing

Spark Structured Streaming supports fault tolerance via checkpoints. These checkpoints store metadata and processed data states, allowing Spark to recover from failures without reprocessing everything from scratch.

  • Checkpoint Directory: Keeps the state of the job.
    • On failure, Spark can resume from the last checkpoint, avoiding reprocessing.

State Store

State stores manage state across multiple micro-batches in stateful transformations (e.g., groupBy). These operations require maintaining state between batches to compute final aggregates.

  • Stateful Transformations: Operations like groupBy where intermediate results need to be saved.

Advanced Features: Tables and Data Streams

From Spark 3.1 onwards, Spark can read from and write to tables as streaming DataFrames, as sketched after the list below.

  • DataStreamReader.table() - Reads data from tables as streaming data frames.
  • DataStreamWriter.toTable() - Writes streaming data frames to tables.
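
A hedged sketch of these table APIs, assuming an existing SparkSession named spark and a source table backed by a streaming-capable format (for example, Delta or plain data files); the table names are placeholders:

# Read an existing table as a stream and continuously write results into another table
events = spark.readStream.table("raw_events")                     # placeholder source table

query = (events.writeStream
               .option("checkpointLocation", "checkpointdir_table")
               .toTable("processed_events"))                      # placeholder target table
query.awaitTermination()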

Summary

  • Batch Processing is useful for handling large chunks of historical data at scheduled intervals.
  • Stream Processing enables real-time processing of continuously arriving data.
  • Apache Spark provides powerful tools for both batch and stream processing, with structured streaming handling real-time data efficiently.
  • Fault Tolerance is achieved through checkpointing, ensuring recovery from failures.
  • Stateful Transformations are supported via the state store for operations like aggregations.

This document covers the key aspects of Spark's streaming capabilities, including batch vs. stream processing, structured streaming, and practical examples with code.

