Data integrity in Hadoop

Hadoop maintains data integrity by regularly computing and verifying checksums of the data it stores.

Starting with Hadoop version 0.23.1, the default checksum algorithm has been changed to CRC32C, a more efficient variant of the CRC32 used in previous versions. You can read more about this change in JIRA HDFS-2130.

Now, coming to how it works:


The property io.bytes.per.checksum controls how many bytes of data each CRC checksum covers. In newer versions of Hadoop this property has been renamed to dfs.bytes-per-checksum. Its default value is 512 bytes, and it should not be larger than dfs.stream-buffer-size.
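For example, here is a minimal sketch (assuming a standard Hadoop client classpath) that reads these two settings from the cluster configuration. The 512 and 4096 passed to getInt() are only fallback values for illustration, not anything this code sets on the cluster.

import org.apache.hadoop.conf.Configuration;

public class ChecksumChunkSize {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // dfs.bytes-per-checksum is the newer name of io.bytes.per.checksum;
        // the second argument is only a fallback used when the property is unset
        int bytesPerChecksum = conf.getInt("dfs.bytes-per-checksum", 512);
        int streamBufferSize = conf.getInt("dfs.stream-buffer-size", 4096);

        System.out.println("bytes per checksum : " + bytesPerChecksum);
        System.out.println("stream buffer size : " + streamBufferSize);
    }
}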

A DataNode calculates the checksum for the data it receives and raises a ChecksumException if something is wrong.
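The same exception type (org.apache.hadoop.fs.ChecksumException) also surfaces on the client side when a read fails verification. Here is a rough sketch of catching it; the path /user/demo/sample.txt is just a placeholder.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWithChecksumCheck {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt");   // placeholder path

        byte[] buffer = new byte[4096];
        try (FSDataInputStream in = fs.open(file)) {
            int bytesRead;
            while ((bytesRead = in.read(buffer)) > 0) {
                // every chunk returned here has already passed checksum verification
            }
        } catch (ChecksumException e) {
            // thrown when the data read does not match its stored CRC
            System.err.println("Corrupt block detected: " + e.getMessage());
        }
    }
}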

Whenever a client reads data from a DataNode, it reports back that the checksum has been verified as correct, and the DataNode updates the checksum log it keeps, recording when the checksum for that block was last verified.

Every DataNode also regularly runs a DataBlockScanner, which verifies the health of the blocks stored on it.

Whenever a corrupted block is found, the NameNode schedules re-replication of that block from a healthy replica so that the required replication factor (3 by default, meaning each block has 3 copies in HDFS) is restored.
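This re-replication happens inside the NameNode, so there is nothing to call from client code, but we can at least check the configured default and the replication factor recorded for a given file. A small sketch (again, the path is a placeholder):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");   // placeholder path

        // cluster-wide default replication factor (3 unless overridden in config)
        int defaultReplication = conf.getInt("dfs.replication", 3);

        // replication factor recorded for this particular file
        FileStatus status = fs.getFileStatus(file);

        System.out.println("configured default : " + defaultReplication);
        System.out.println("file replication   : " + status.getReplication());
    }
}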

While reading a file from HDFS using the FileSystem API, we can also tell Hadoop that we do not want to verify checksums (for some reason) during the transfer, as shown in the sketch after the API reference below.

Hadoop API

/hadoop-0.23.1/share/doc/hadoop/api/org/apache/hadoop/fs/FileSystem.html#setVerifyChecksum(boolean)

void setVerifyChecksum(boolean verifyChecksum)
        Set the verify checksum flag.
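A small usage sketch putting this together. The file path is just a placeholder; everything else is the standard FileSystem API shown above.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWithoutChecksum {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());

        // skip checksum verification for reads made through this FileSystem instance
        fs.setVerifyChecksum(false);

        Path file = new Path("/user/demo/sample.txt");   // placeholder path
        FSDataInputStream in = fs.open(file);
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}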



Please share your views and comments below.

Thank You.