How Apache Spark works (short summary)

Why Spark was made

Before Spark came along, MapReduce was the processing brain that sat on top of Hadoop.

Apache Spark original research paper

The typical flow for MapReduce is:

  • Read data from disk.
  • Apply some processing.
  • Write intermediate data to disk.
  • Do some more processing.
  • Show the final results.


Now suppose some processing has to be repeated, changing a variable across the whole dataset each time. MapReduce will start over by reading from disk, so if you run the processing 100 times, it does the disk round trip 100 times, and roughly twice that if you count the intermediate writes as well.

Spark was built for exactly these cases, where the same processing is applied to a dataset with varying inputs.

Two typical use cases where Spark shines are:

Iterative jobs
Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.
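
In Spark the dataset is loaded once, cached in memory, and then reused across iterations. A minimal sketch of this pattern (assuming an existing SparkContext named sc; the file path and the simple gradient-descent update are made up for illustration):

// Parse the points once and keep them in memory across iterations.
val points = sc.textFile("hdfs://.../points.csv")          // hypothetical "x,y" lines
  .map { line =>
    val Array(x, y) = line.split(",").map(_.toDouble)
    (x, y)
  }
  .cache()                                                  // no re-read from disk per iteration
val n = points.count()                                      // first action materializes the cache

var w = 0.0                                                 // parameter being optimized
for (_ <- 1 to 100) {
  // Each iteration is a job over the cached RDD, not a fresh disk scan.
  val gradient = points.map { case (x, y) => 2 * x * (w * x - y) }.sum() / n
  w -= 0.1 * gradient
}
println(s"Fitted w = $w")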

Interactive analysis
Hadoop is often used to perform ad-hoc exploratory queries on big datasets, through SQL interfaces such as Pig and Hive. Ideally, a user would be able to load a dataset of interest into memory across a number of machines and query it repeatedly.
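
With Spark that looks like the following sketch (assuming an existing SparkContext named sc and a hypothetical log file):

// Load the dataset into cluster memory once.
val logs = sc.textFile("hdfs://.../access.log").cache()     // hypothetical path

// The first action materializes the cache; later ad-hoc queries
// run against the in-memory copy instead of going back to HDFS.
val total     = logs.count()
val errors    = logs.filter(_.contains("ERROR")).count()
val notFounds = logs.filter(_.contains(" 404 ")).count()
println(s"$errors errors and $notFounds 404s out of $total lines")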

To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs).
An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.
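
The lineage is simply the chain of transformations that produced an RDD, and it can be inspected; a small sketch (assuming an existing SparkContext named sc):

val nums    = sc.parallelize(1 to 1000000)          // base RDD, built from a Scala range
val squares = nums.map(n => n.toLong * n)           // derived from nums via map
val evens   = squares.filter(_ % 2 == 0)            // derived from squares via filter

// Prints the chain of transformations Spark would replay
// to rebuild any lost partition of `evens`.
println(evens.toDebugString)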

Spark provides two main abstractions for parallel programming:

  • resilient distributed datasets, and
  • parallel operations on these datasets (invoked by passing a function to apply on a dataset).

These operations build on familiar functional programming concepts such as map, flatMap and filter (see the word-count sketch below).
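
For instance, a classic word count is just these operations chained together (a sketch, assuming an existing SparkContext named sc and a hypothetical input path):

val lines  = sc.textFile("hdfs://.../input.txt")               // hypothetical path
val words  = lines.flatMap(_.split("\\s+"))                    // flatMap: one line -> many words
val pairs  = words.filter(_.nonEmpty).map(word => (word, 1))   // filter out blanks, map to (word, 1)
val counts = pairs.reduceByKey(_ + _)                          // add up the 1s per word, in parallel
counts.take(10).foreach(println)                               // an action triggers the job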

In addition, Spark supports two restricted types of shared variables that can be used in functions running on the cluster.

Variables in Spark

  • Broadcast variables: read-only variables that are shipped to the workers.
  • Accumulators: variables that workers can only “add” to using an associative operation (see the sketch below).
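
A minimal sketch of both, using the newer accumulator API from Spark 2.x rather than the one in the original paper (assuming an existing SparkContext named sc; the lookup table and input values are made up for illustration):

// Broadcast variable: a read-only lookup table shipped to the workers.
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

// Accumulator: workers can only add to it; the driver reads the final value.
val unknownCodes = sc.longAccumulator("unknown country codes")

val codes = sc.parallelize(Seq("IN", "US", "XX", "IN"))
val names = codes.map { code =>
  countryNames.value.getOrElse(code, {
    unknownCodes.add(1)                    // the associative "add" from a worker
    "unknown"
  })
}

names.collect().foreach(println)           // an action actually runs the job
println(s"Unknown codes seen: ${unknownCodes.value}")
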
Example Spark code


// Load a text file from HDFS as an RDD of lines (path elided in the original).
val file = spark.textFile("hdfs://...")
// Keep only the lines that contain the word ERROR.
val errs = file.filter(_.contains("ERROR"))
// Map every error line to the number 1 ...
val ones = errs.map(_ => 1)
// ... and add them up: the total number of ERROR lines.
val count = ones.reduce(_ + _)


Please share your views and comments below.

Thank You.