### Test if a file or directory exists in HDFS

If you try to use the usual shell test syntax, like

if [ hadoop fs -test -d /dev/pathToTest -ne 0 ]; then
echo "Creating directory"
fi

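then it does not work. Inside [ ... ] the words hadoop, fs, -test and so on are only string arguments to the test; the hadoop command is never run.

The reason is that the result of the test is reported in the shell exit code, not as textual output
from the command, just like the UNIX /usr/bin/test command.

The correct usage is shown below: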

if hadoop fs -test -d /dev/pathToTest ; then
echo "Directory exists"
else
echo "Creating directory"
fi

if hadoop fs -test -e /dev/pathToTest/file.html ; then
echo "File exists"
else
echo "File does not exist"
fi
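The else branch above only prints a message. A minimal sketch that actually creates the missing directory could look like the following. It assumes the `hadoop` CLI is on the PATH and a Hadoop 2.x or later release (for `-mkdir -p`); `ensure_hdfs_dir` is a hypothetical helper name, not part of Hadoop.

```shell
# Create an HDFS directory only when it is missing.
# ensure_hdfs_dir is a hypothetical helper name; it assumes `hadoop` is on PATH.
ensure_hdfs_dir() {
    dir="$1"
    # `hadoop fs -test -d` exits 0 when the path exists and is a directory
    if hadoop fs -test -d "$dir"; then
        echo "Directory exists: $dir"
    else
        echo "Creating directory: $dir"
        # -p (assumed Hadoop 2.x or later) also creates missing parent directories
        hadoop fs -mkdir -p "$dir"
    fi
}
```

It would be called as `ensure_hdfs_dir /dev/pathToTest`. The branch is driven purely by the exit code of `hadoop fs -test`, which is the point of the post.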

### Install Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File

This post is part of a series of posts that explain how to enable Kerberos on a Hadoop cluster.

To use AES-256 encryption in Kerberos, you must install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File on every host in your cluster.

The process is straightforward:

Extract the zip file.

Copy the files

local_policy.jar
US_export_policy.jar

to the following location on every machine:

JAVA_HOME/jre/lib/security

Adjust the path above to match your configuration.

Verify that aes256-cts:normal is present in the supported_enctypes field of the kdc.conf or krb5.conf file.

After changing the kdc.conf file, you'll need to restart both the KDC and the kadmin server for those changes to take effect.
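The copy and verify steps can be sketched as two small shell functions. This is only an illustration under stated assumptions: `install_jce_policy` and `has_aes256_enctype` are hypothetical helper names, the first argument is wherever you extracted the zip, and the check assumes `supported_enctypes` sits on a single line of the config file.

```shell
# Both function names below are hypothetical helpers, not part of Java or Kerberos.

# Copy the two policy jars from the extracted directory into a JRE's security folder.
install_jce_policy() {
    src_dir="$1"      # directory where the zip was extracted
    java_home="$2"    # typically the value of $JAVA_HOME on this host
    for jar in local_policy.jar US_export_policy.jar; do
        cp "$src_dir/$jar" "$java_home/jre/lib/security/" || return 1
    done
}

# Succeed only if aes256-cts:normal appears on a supported_enctypes line.
has_aes256_enctype() {
    grep -E '^[[:space:]]*supported_enctypes[[:space:]]*=.*aes256-cts:normal' "$1" > /dev/null
}
```

For example, `install_jce_policy /tmp/jce "$JAVA_HOME"` followed by `has_aes256_enctype /path/to/kdc.conf && echo "AES-256 enabled"`, where both paths are placeholders for your own locations. The first function still has to be run on every host.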

Benchmarking a Hadoop cluster is an ongoing process as we use it inside the organization.

The main thing we don't know when we buy a new cluster is how this new powerhouse of machines will behave under different sets of workloads.

Intel, which is also working on its own flavor of Hadoop, has a product (HiBench) for benchmarking cluster performance against different types of workloads.

Micro Benchmarks:

1. Sort (sort)

This workload sorts its text input data, which is generated using the Hadoop RandomTextWriter example.

2. WordCount (wordcount)

This workload counts the occurrences of each word in the input data, which is generated using the Hadoop RandomTextWriter example. It is representative of another typical class of real-world MapReduce jobs: extracting a small amount of interesting data from a large data set.

3. TeraSort (terasort)

TeraSort is a standard benchmark created by Jim Gray. Its input data is generated by the Hadoop TeraGen example program.

HDFS Benchmarks:

4. Enhanced DFSIO (dfsioe)

Enhanced DFSIO tests the HDFS throughput of the Hadoop cluster by generating a large number of tasks performing writes and reads simultaneously. It measures the average I/O rate of each map task, the average throughput of each map task, and the aggregated throughput of the HDFS cluster.

Web Search Benchmarks:

5. Nutch indexing (nutchindexing)

Large-scale search indexing is one of the most significant uses of MapReduce. This workload tests the indexing sub-system in Nutch, a popular open source (Apache project) search engine. The workload uses automatically generated Web data whose hyperlinks and words both follow the Zipfian distribution with corresponding parameters. The dictionary used to generate the Web page texts is the default Linux dictionary file /usr/share/dict/linux.words.

6. PageRank (pagerank)

This workload contains an implementation of the PageRank algorithm on Hadoop (a search engine ranking benchmark included in pegasus 2.0). It uses automatically generated Web data whose hyperlinks follow the Zipfian distribution.

Machine Learning Benchmarks:

7. Mahout Bayesian classification (bayes)

Large-scale machine learning is another important use of MapReduce. This workload tests the Naive Bayesian (a popular classification algorithm for knowledge discovery and data mining) trainer in Mahout 0.7, which is an open source (Apache project) machine learning library. The workload uses automatically generated documents whose words follow the Zipfian distribution. The dictionary used for text generation is also the default Linux file /usr/share/dict/linux.words.

8. Mahout K-means clustering (kmeans)

This workload tests the K-means (a well-known clustering algorithm for knowledge discovery and data mining) clustering in Mahout 0.7. The input data set is generated by GenKMeansDataset based on the Uniform distribution and the Gaussian distribution.

Data Analytics Benchmarks:

9. Hive Query Benchmarks (hivebench)

This workload was developed based on the SIGMOD 09 paper "A Comparison of Approaches to Large-Scale Data Analysis" and HIVE-396. It contains Hive queries (Aggregation and Join) performing the typical OLAP queries described in the paper. Its input is also automatically generated Web data with hyperlinks following the Zipfian distribution.

### Google Search only university domains

Google amazes me every time with its power and intelligence.

Some time back I was searching for a topic and wanted to see results only from university sites. I just wanted to know what was happening in higher education in that field.

A quick search does it:

~human science site:.edu

~ tells Google to include results for queries similar to human science.
site:.edu restricts the search to .edu domains, that is, university sites.

### Export MySQL data to csv

MySQL data can be exported to CSV and many other formats using MySQL Workbench.

You can do it from the command line as follows.

Go to the place where you installed Workbench, for example

C:\Program Files\MySQL\MySQL Workbench 5.2 CE\utilities>

Run the command below

The command above does the following:

Creates one CSV file per table
Writes the output in CSV format
Exports both data and table definitions

It creates the CSV files in the utilities folder.

You can read more about the command at the link above. You can easily specify the output display format. Permitted format values are sql, grid, tab, csv, and vertical. The default is sql.

### How to learn Japji Sahib

Waheguru ji

I have been trying to learn Japji Sahib for some time now, but I get confused with the order of the pauris.

So this is what I have planned. I am also going to post how this effort progresses.

Divide and Conquer

This rule has been used for ages to solve every kind of task: break it into smaller ones. Guru ji has already divided the task into pauris for us, so I am going to make sure I remember each of them correctly. While I look at one line, I will hide the next line from view and try to recite it by heart. Then I will scroll down a little to see whether I got the next line right. I will scroll only one line at a time, comparing what is written with what I recited, correcting my mistakes, and repeating if required. In this way I will try to make sure I learn every individual pauri with correct recitation.

You can do the above steps online at

http://www.sikhiwiki.org/index.php/Learn_all_of_Japji_Sahib#Salok

Click on the compiled Banis section.

Click on Japji Sahib and then show yourself only one line at a time, as discussed above.

Note in your notebook where you make mistakes, so that next time you are extra careful at those places.

Giving hint to myself

I used this method in my school studies to remember very long answers (this is relevant if you studied in India). If you have worked in theater plays, you must have learned this art too: you remember your dialogue from the ending of the previous speaker's dialogue.

So I planned to arrange and learn the first lines of all the pauris, so that I have a hint about which pauri comes next. I have written down the first lines of all of them. If you want the sheet I made, you can see it here online.

It starts from Pauris

Listen to audio of Japji Sahib all the time

For many years I have been listening to the audio by Tarlochan Singh. His voice is familiar in all Indian houses. (His voice is embedded deep in my heart, with memories from as far back as when I was 8 years old: Rehraas Sahib was played at my neighbour's place every evening, and those memories are still fresh in my mind.) You can get it from YouTube or download it from the Internet; if you want it, just write to me and I can share it. I have copied it to my mobile and put Japji Sahib on repeat, so it starts again whenever it finishes. We learn things fast by listening, which is why, as you must have noticed, we remember the full lyrics of songs (although, to be honest, I cannot). Listening gives the mind one more input source besides reading with the eyes.

Last and most important

Learn Meanings of Japji Sahib

I have downloaded the English translation of Japji Sahib so that I know what is written. There is a common saying in Punjabi, "ratta jaldi bhul janda hai" (what is learned by rote is soon forgotten), so I want to learn by understanding rather than just mugging it up. Guru Nanak Dev Ji has also written:

Vidhya vichari ta paropkari

We should do everything with understanding and reflection.

One more idea which i have is

If you want to mix steps 2 and 3, I would recommend the YouTube video below. It shows the meaning as well, and you can listen to the audio at the same time.

Waheguru ji, bless us so that Gurbani and the meaning of Gurbani come to dwell in our minds.

Do post your experiences and suggestions. I will keep you updated.

### Oozie daylight saving example

The Oozie documentation explains how to handle daylight saving time; here is a short example.

Timezone : Set the time zone of the place whose local time the job should follow.

For example, I want a job to run every day at 5:00 PM Sydney time. Since Sydney observes daylight saving time, I would define the coordinator job as

Start time : GMT time corresponding to my start time (let’s say the time matching Sydney 5:00 PM), Start time = 2012-10-03T07:00Z
End time : GMT time corresponding to my end time, End time = 2012-10-07T07:00Z
Timezone : Australia/Sydney
Frequency : ${coord:days(1)}

Always use a function to specify the frequency instead of hard-coding minutes.
Look up the time zone ID for the region you want to use. Hint: see http://www.java2s.com/Tutorial/Java/0120__Development/GettingallthetimezonesIDs.htm

With the configuration above, your job will keep running at the right local time in regions that observe daylight saving time.
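To see why a hard-coded frequency in minutes drifts, it helps to look at the GMT instant that 5:00 PM in Sydney maps to on either side of the daylight saving switch. The sketch below assumes GNU `date` and an installed tz database; `to_utc` is a hypothetical helper name.

```shell
# Print the UTC instant for a wall-clock time in a given time zone.
# to_utc is a hypothetical helper; the TZ="..." prefix inside -d is GNU date syntax.
to_utc() {
    zone="$1"
    local_time="$2"
    date -u -d "TZ=\"$zone\" $local_time" '+%Y-%m-%dT%H:%MZ'
}

to_utc Australia/Sydney '2012-10-03 17:00'   # a day before the switch
to_utc Australia/Sydney '2012-10-08 17:00'   # a day after the switch
```

Under those assumptions the two calls should print 2012-10-03T07:00Z and 2012-10-08T06:00Z: the same local time is one hour earlier in GMT once daylight saving starts. A fixed 1440-minute frequency would keep firing at 07:00Z, which is 6:00 PM in Sydney after the switch.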

Hope this helps

### Null behaviour in Sqoop and Hive

By default, Sqoop imports NULL values as the string null.

So any record with a null in its data shows up in HDFS like this:

|N|null|Jagat Singh|BigData

The issue with this kind of import is that we cannot write Hive queries such as "show me all records where the column is not null", because Hive sees the text null rather than a real NULL.

To understand why, keep in mind that Hive's default representation for NULL is

\N

So for Hive to treat a value as NULL, it must be imported as \N.
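Sqoop's `--null-string` and `--null-non-string` import options control what is written for NULL values. A minimal sketch of a wrapper that always sets them to Hive's default follows; `import_with_hive_nulls` is a hypothetical name, and the connection options in the usage line are placeholders.

```shell
# import_with_hive_nulls is a hypothetical wrapper around `sqoop import`.
# Pass your own --connect, --table and other import options as arguments.
import_with_hive_nulls() {
    # The doubled backslash is intentional: Sqoop unescapes '\\N' and writes \N.
    sqoop import \
        --null-string '\\N' \
        --null-non-string '\\N' \
        "$@"
}
```

For example: `import_with_hive_nulls --connect jdbc:mysql://dbhost/sales --table customers --target-dir /data/customers`, where the host, database, table and directory are made up for illustration. Generic Hadoop options such as `-D` have to come right after `import`, so this wrapper is not suitable for passing them.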