Why is my Hive / Sqoop job failing?

First, find out a few basics about the cluster configuration from your administrator.

A sample conversation could be:

How many nodes does the cluster have, and what is their configuration?

The answer could be:

Each node has 120 GB of RAM. Out of that, the memory we can ask for our jobs is about 80 GB.
Each datanode has 14 CPU cores (we have 5 datanodes right now); the maximum we can ask for processing from each datanode is 8 cores.

The rest is left for other processes like the OS, Hadoop daemons and monitoring services.

When you run any job in the MapReduce world, you get a minimum of 2 GB of RAM per task, and the maximum any task can ask for is 80 GB (see the capacity above).

Given that we are running big jobs for one-off loads, tell the system to give your job more RAM (requested in increments of 1024 MB).

Besides RAM, you can also specify how many CPU cores you want.

The maximum number of cores a given node can provide for processing is 8.

This can be controlled via the following parameters.
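For reference, these are the standard MapReduce-on-YARN properties that control per-task memory and cores. The values below are only illustrative and should be sized to what your cluster can actually give you:

# RAM per map / reduce task, in MB (ask in increments of 1024)
mapreduce.map.memory.mb=4096
mapreduce.reduce.memory.mb=8192
# JVM heap inside each container, usually kept a bit below the container size
mapreduce.map.java.opts=-Xmx3277m
mapreduce.reduce.java.opts=-Xmx6554m
# CPU cores per map / reduce task (max 8 per node on the cluster above)
mapreduce.map.cpu.vcores=2
mapreduce.reduce.cpu.vcores=4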

Hive jobs


Typically, if your job fails while inserting via a Hive query, check whether you need to tune any memory parameters. Hive insert jobs are reduce jobs.

Since you are inserting a large amount of data in one go, you may run into memory overrun issues.

Always check the logs; the reason the job is failing is always mentioned there.
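For example, a minimal sketch of bumping reducer memory from inside the Hive session before a big insert; the numbers and table names are placeholders:

set mapreduce.reduce.memory.mb=5120;
-- keep the JVM heap below the container size
set mapreduce.reduce.java.opts=-Xmx4096m;

INSERT INTO TABLE target_table
SELECT * FROM staging_table;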

Sqoop jobs

Sqoop spawns map-only jobs.

So if a Sqoop job is not moving through, the usual indicator is the following:

A memory issue on our side.

So just add the following parameters to the command:

-Dmapreduce.map.memory.mb=5120 -Dmapreduce.map.speculative=false

Tune the 5120 value above based on your need.
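For context, here is a sketch of where those -D flags sit in a full Sqoop import. Note that the -D options must come right after the tool name, before the other arguments; the connection string, table and target directory are placeholders:

sqoop import \
  -Dmapreduce.map.memory.mb=5120 \
  -Dmapreduce.map.java.opts=-Xmx4096m \
  -Dmapreduce.map.speculative=false \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username myuser -P \
  --table MY_TABLE \
  --target-dir /data/landing/my_table \
  -m 8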

Where to see logs and job status

You can see the status of your job, and its logs, in the Resource Manager UI.
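If you prefer the command line, the same information can be pulled with the yarn CLI once you have the application id from the RM page (the id below is made up):

# job status
yarn application -status application_1417852922122_0001
# aggregated container logs, if log aggregation is enabled
yarn logs -applicationId application_1417852922122_0001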


You can also log in to Ambari to see what value has been set as the default for a given property.


Ask your administrator for a username and password with read-only access.

Find out what the current default values are.
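If you also have shell access to an edge node, you can grep the deployed client configs directly. The paths below assume the usual /etc/hadoop/conf layout:

grep -A 1 'yarn.scheduler.maximum-allocation-mb' /etc/hadoop/conf/yarn-site.xml
grep -A 1 'mapreduce.map.memory.mb' /etc/hadoop/conf/mapred-site.xml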


What if my job is not even accepted by the cluster? :)

You are asking for resources which the cluster does not have, which means the request is crossing the maximum limit of the cluster. So check what your job is really asking for versus what the cluster can provide.
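The ceiling you are hitting is usually the YARN scheduler maximum allocation; a container request bigger than these values is rejected outright. The yarn-site.xml entries to compare your ask against look like this (values are illustrative, matching the 80 GB / 8 core example above):

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>81920</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>8</value>
</property>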

Why is my job being killed?

If your job crosses the resource limit it originally asked for from the RM, YARN will kill your job.

You will see something like the following in the logs:

Killing container....

Remember Google is your friend

The moment you see your job has failed, take the error from the logs above and search it on Google; try to find which parameters people have suggested changing for that job.



Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/maven/cli/MavenCli : Unsupported major.minor version 51.0

I got the below error when running Maven:

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/maven/cli/MavenCli : Unsupported major.minor version 51.0


"Unsupported major.minor version 51.0" means the class was compiled for Java 7 or later, so Maven was being run with an older JVM. The fix is to point JAVA_HOME at a newer JDK:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home
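You can confirm that Maven has picked up the new JDK before re-running the build:

java -version
mvn -version    # the "Java version:" line should now report the 1.8 JDK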

Git clone via ssh tunnel

Follow the steps mentioned in the SSH Tunnel post below to configure the SSH tunnel.
Then configure git to use that proxy:
git config --global http.proxy 'socks5://'

git config --global https.proxy 'socks5://'
Now you can do the git clone
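Putting it together, here is a sketch that assumes the tunnel from the SSH Tunnel notes below is listening on localhost:9999; the repository URL is just an example:

git config --global http.proxy 'socks5://localhost:9999'
git config --global https.proxy 'socks5://localhost:9999'
git clone https://github.com/apache/sqoop.git
# remove the proxy settings once you are done
git config --global --unset http.proxy
git config --global --unset https.proxy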

SSH Tunnel

I spent a lot of time troubleshooting an SSH tunnel problem, so I thought I would jot down the notes for my future reference.
A dynamic SSH tunnel can be set up with a simple command:
ssh -vvvvv -D port username@remotehost
ssh -vvvvv -D 9999 jagatsingh@
Now in Firefox use the SOCKS 5 proxy setting localhost 9999.
Keep in mind to unselect all the other proxy options, e.g. HTTP, HTTPS, etc.
I wasted a lot of time on this.
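A quick way to check that the tunnel is alive, without touching the browser settings, is to point curl at the SOCKS port (assuming curl is installed):

curl --socks5-hostname localhost:9999 http://example.com/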

channel 2: open failed: administratively prohibited: open failed
debug2: channel 2: zombie
Resolution steps
Check on the remote host that you have enabled TCP forwarding in sshd_config:
AllowTCPForwarding yes
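On the remote host you can check the current setting and restart sshd so the change takes effect. The config path is the usual OpenSSH default, and the restart command assumes systemd; on some distros the service is called ssh instead of sshd:

sudo grep -i 'AllowTcpForwarding' /etc/ssh/sshd_config
sudo systemctl restart sshd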

The webpage below is a good reference.

Parallel download for files: a wget alternative


On a Mac we can use:

brew install aria2

aria2c -x 16 -s 16 http://hortonassets.s3.amazonaws.com/2.2/Sandbox_HDP_2.2_VirtualBox.ova

This will spawn 16 parallel connections (-x caps the connections per server, -s the number of pieces the download is split into).



Hadoop conf files in Pivotal Hadoop

Pivotal stores its files in a different location than the default.

The actual binaries can be found under this path:




The conf files can be found under this path:




Open source job dependency tools

I was looking for open source alternatives for job dependency management.

A few things I found:

Taskforest is a simple but expressive open-source job scheduler that allows you to chain jobs/tasks and create time dependencies. It uses text config files to specify task dependencies.


schedulix is the Open Source Enterprise Job Scheduling System, which meets the complex requirements of modern IT process automation.


Some other tools popular in the Hadoop world:


Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows.


Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.

Luigi allows you to run batch jobs with complex scheduling via Python code.
By default it supports running Hadoop, MySQL, Scalding, Spark, etc. jobs.
You can see the list of available configurations here