Hadoop 2.3's Centralized Cache feature compared to Spark RDD

Hadoop 2.3 has two notable new features:
  • Support for Heterogeneous Storage Hierarchy in HDFS (HDFS-2832)
  • In-memory Cache for data resident in HDFS via Datanodes (HDFS-4949)
This post is about the Centralized Cache Management feature in HDFS (HDFS-4949).

It lets you tell HDFS, at the start of your job, to cache a particular directory in memory on the DataNodes. Applications like Hive and Impala can then read that data directly from memory. This complements the existing Short Circuit Read (SCR) feature, which lets SCR-aware applications read local replicas directly from disk, bypassing the DataNode.

Sample command:

$ hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication <replication>] [-ttl <time-to-live>]
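
Note that the directive has to reference an existing cache pool (created beforehand with hdfs cacheadmin -addPool <pool-name>). For applications that would rather do this programmatically, here is a rough Scala sketch of the equivalent calls against the DistributedFileSystem API; the pool name hive-pool and the path /warehouse/sales are just placeholders for the example.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem
import org.apache.hadoop.hdfs.protocol.{CacheDirectiveInfo, CachePoolInfo}

object CacheDirectiveSketch {
  def main(args: Array[String]): Unit = {
    // Assumes fs.defaultFS points at an HDFS cluster with DataNode caching
    // enabled (dfs.datanode.max.locked.memory > 0).
    val conf = new Configuration()
    val dfs  = FileSystem.get(conf).asInstanceOf[DistributedFileSystem]

    // Same effect as: hdfs cacheadmin -addPool hive-pool
    dfs.addCachePool(new CachePoolInfo("hive-pool"))

    // Same effect as:
    //   hdfs cacheadmin -addDirective -path /warehouse/sales -pool hive-pool
    val directive = new CacheDirectiveInfo.Builder()
      .setPath(new Path("/warehouse/sales"))   // placeholder path
      .setPool("hive-pool")                    // placeholder pool name
      .build()
    val id = dfs.addCacheDirective(directive)
    println(s"Added cache directive with id $id")
  }
}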


General flow of execution

In brief: a client registers a cache directive with the NameNode; the NameNode then instructs the DataNodes holding the relevant blocks (via heartbeat responses) to mmap and mlock those blocks into off-heap memory, and cache-aware readers can serve them straight from memory.
Compared with this implementation, Spark's RDD model is still superior: an RDD tracks the lineage of the transformations that produced it, and intermediate results can be kept in memory and recomputed from that lineage if lost. In other words, Spark can write intermediate data to RAM and work faster.
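
For contrast, here is a small Spark sketch in Scala of what that looks like in practice; the HDFS path is again a placeholder. The intermediate RDD produced by filter and map is cached in memory, the second job reuses it without touching HDFS, and the recorded lineage lets Spark rebuild any lost partitions.

import org.apache.spark.{SparkConf, SparkContext}

object RddCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-cache-sketch"))

    // Lineage: textFile -> filter -> map. Spark records these steps,
    // so a lost in-memory partition can be rebuilt from the source data.
    val raw    = sc.textFile("hdfs:///warehouse/sales")   // placeholder path
    val parsed = raw.filter(_.nonEmpty).map(_.split(',')).cache()

    // The first action reads from HDFS and fills the cache;
    // the second runs entirely against the in-memory RDD.
    val rows      = parsed.count()
    val multiCol  = parsed.map(_.length).filter(_ > 1).count()
    println(s"rows=$rows, rows with more than one field=$multiCol")

    sc.stop()
  }
}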

The current HDFS centralized cache feature only boosts read performance; writes see no benefit.

So I guess a few more improvements are still needed before Hadoop can beat Spark on performance.

I am very excited to see how downstream systems like Pig, Hive, and Impala will use this feature to process things faster. I am sure things will keep getting better in Hadoop over the next few releases.


Please share your views and comments below.

Thank You.