Oozie data load assurance

While using Oozie workflows to import data into the cluster, for load assurance we should make sure that the workflow is actually importing data into the cluster regularly.

For file-based loads

After each workflow run, use

long fs:dirSize(String path)

to check whether the workflow has imported anything or not.

If not, shoot an email using the Email action and log the failure.
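
As a rough sketch (not a drop-in piece of any real workflow; ${importDir}, the e-mail address and the node names are just placeholders), the check-and-notify part of workflow.xml could look something like this:

<!-- Sketch only: decision node that runs after the import action.
     fs:dirSize returns the total size in bytes of the files under the path. -->
<decision name="check-import">
    <switch>
        <case to="end">${fs:dirSize(importDir) gt 0}</case>
        <default to="notify-no-data"/>
    </switch>
</decision>

<action name="notify-no-data">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>data-team@example.com</to>
        <subject>Load assurance alert for ${wf:id()}</subject>
        <body>Workflow ${wf:id()} did not import any data into ${importDir}</body>
    </email>
    <ok to="end"/>
    <error to="kill"/>
</action>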

For Sqoop-based loads

Sqoop-based loads don't support counters yet (as of Oozie 3.2);

see OOZIE-1012: Sqoop jobs are unable to utilize Hadoop Counters.

Use the following to find out whether something new has been created or not:

Get the list of all files in the Sqoop import directory, check for files whose modification timestamp is later than the start of the Oozie job, and then check their size.

The corresponding API calls are

FileStatus stat = fileSys.getFileStatus(file1);
long mtime1 = stat.getModificationTime(); // HDFS exposes modification time; FileStatus has no creation time
long size1 = stat.getLen();               // size of the file in bytes

You can also count the number of records imported and match it against the source using sqoop eval.

If the Sqoop job imports files into a new directory every time, then simple logic like fs:dirSize on that directory is also enough.

Further, log the results to some central database for tracking daily loads :)

------

Was it helpful? Do you know a better way?

Share your thoughts below in the comments. Thanks for reading.

Import Hadoop Code to Eclipse

Setting up Eclipse for Hadoop development is a bit time-consuming, with no clear instructions available online.

Just trying to share what I know :)

Ideally you should do this on a Linux machine. I used Windows just to show the steps (PS: Ubuntu, I still love you).

Also read BUILDING.txt in the source code to know more about everything you need to build.

Here we go :)

Download the m2e plugin in Eclipse

Download the m2e Subversive connector

You should also have

SVNKit 1.3.5

Native JavaHL 1.6

While this import is being done, if Eclipse asks you to allow it to download something, let it :) Keep Eclipse happy :)

Open Eclipse

Click Import

Maven > Import existing project from SVN


Select the SCM type as SVN from the drop-down menu (if you don't see this option, you need to install the connector as mentioned above).

Enter the URL as

http://svn.apache.org/repos/asf/hadoop/common/trunk/

 


Click Next

Choose the destination directory (optional)

Hit the Import button and relax for some time :)

After some time the checkout will complete and the import wizard will move on to its next screen.

Click Next on the remaining screens, and the import is done.

After this, Eclipse will take some time to build your workspace, resolve dependencies and do all that magical Maven stuff :) The guy who wrote Maven was wonderful :)

A long list of Hadoop projects will show up in the Eclipse project list on the left :)

Do a quick refresh. That's it, you are ready to play with the Hadoop code.

From time to time, update the project so that you always stay current with the bleeding-edge trunk code :) and know what those awesome people at Apache are doing.

Was it helpful? Do you know a better way?

Share your thoughts below in the comments. Thanks for reading.

There are 2 problems at the moment, but right now I don't have time to check them. Just writing them down for my reference.

1) Some build issues and Java compatibility.

2) Code is not showing up as Java packages but as folders.

Update on the issues above: the problem is that Eclipse is not able to detect that particular project as a Java project. I saw that the pom.xml artifact jdk.tools:jdk.tools:jar:1.6 is missing. Fixing this should resolve it, so both 1) and 2) are related :)
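
One workaround that is commonly suggested for this missing jdk.tools artifact (I haven't verified it against this particular build yet) is to declare the JDK's tools.jar as a system-scoped dependency in the pom:

<!-- Commonly suggested workaround (not verified here): map the missing
     jdk.tools artifact to the local JDK's tools.jar.
     ${java.home} points to the JRE inside the JDK, hence the ../lib path. -->
<dependency>
    <groupId>jdk.tools</groupId>
    <artifactId>jdk.tools</artifactId>
    <version>1.6</version>
    <scope>system</scope>
    <systemPath>${java.home}/../lib/tools.jar</systemPath>
</dependency>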

Ubuntu repository: no address associated with hostname

While I was setting up this new machine, I often got this error while doing an update:

no address associated with hostname

Workaround

System Settings > Network > Wired > Options > IPv4 Settings

Choose Method: Automatic (DHCP) addresses only
DNS servers: 8.8.8.8

We are using Google's public DNS server :)

You can read more about this message here
http://ubuntuforums.org/showthread.php?t=1475399&page=2

HDFS Federation in Cluster

HDFS Federation is one of the new features introduced in Hadoop recently.

The instructions on the Apache website

http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/Federation.html

guide us through configuring HDFS Federation.

Some additional configuration is required to tell the clients where to find the data.

In core-site.xml

fs.defaultFS
 viewfs://ClusterID

The above means that fs.defaultFS should point to the viewfs mount table named after your cluster ID (use the same name in the mount-table properties below).

Additionally, we have to tell the clients which NameNode to go to for which path.

fs.viewfs.mounttable.ClusterID.link./PATH
 hdfs://NN-host1:port/PATH

The above means: for this path under this cluster ID, go to this NameNode and path.

e.g.

fs.viewfs.mounttable.MyNewClusterID.link./fruits
 hdfs://10.10.20.10:8020/fruits
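
Putting it together, the client-side core-site.xml would look roughly like this (just a sketch; the cluster name MyNewClusterID, the /fruits path and the NameNode address are the placeholders from the example above, and the mount-table name must match the authority used in fs.defaultFS):

<!-- Sketch of the client-side core-site.xml; cluster name, path and
     NameNode address are placeholders taken from the example above. -->
<property>
    <name>fs.defaultFS</name>
    <value>viewfs://MyNewClusterID</value>
</property>

<property>
    <!-- /fruits in the client's view maps to /fruits on this NameNode -->
    <name>fs.viewfs.mounttable.MyNewClusterID.link./fruits</name>
    <value>hdfs://10.10.20.10:8020/fruits</value>
</property>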

For the rest, follow the instructions given on the Apache website.

Besides this, there are a few other good tutorials about Federation:

http://blog.cloudera.com/resource/hadoop-world-2011-presentation-video-hdfs-federation/
http://www.slideshare.net/huguk/hdfs-federation-hadoop-summit2011