Hadoop Pig Local Mode Tutorial

The example below explains how to start programming in Pig.

I followed the book Programming Pig.

This post assumes that you have already installed Pig on your computer. If you need help, you can read the tutorial on installing Pig.

So let's get started and write our first Pig program, using the same code example given in chapter 2 of the book.

Download the code examples from GitHub (link below):

https://github.com/alanfgates/programmingpig

Pig can run in local mode and in MapReduce mode.

Local mode means that Pig reads the source data from your computer's local filesystem rather than from HDFS. Relative paths in a script are resolved against the directory where you started Pig, so the simplest approach is to go to the directory that holds the data and run the Pig script from there.

I downloaded the code examples from the link above.

Now I go to the data directory, where all the data files are.

# cd /home/hadoop/Downloads/PigBook/data

Change the path depending on where you copied the code on your computer.

Now let's start Pig in local mode:

# pig -x local

The -x local option tells Pig: dear Pig, let's work locally on this computer.

The output is similar to the following:

2012-03-11 11:44:13,346 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/Downloads/PigBook/data/pig_1331446453340.log
2012-03-11 11:44:13,720 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop filesystem at: file:///

Pig then enters the grunt> shell. Grunt is Pig's interactive shell, where you can type Pig Latin statements and run scripts.

Let's list all the files in the data directory:

grunt> ls

The output is shown below:


file:/home/hadoop/Downloads/PigBook/data/webcrawl<r 1>    255068
file:/home/hadoop/Downloads/PigBook/data/baseball<r 1>    233509
file:/home/hadoop/Downloads/PigBook/data/NYSE_dividends<r 1>    17027
file:/home/hadoop/Downloads/PigBook/data/NYSE_daily<r 1>    3194099
file:/home/hadoop/Downloads/PigBook/data/README<r 1>    980
file:/home/hadoop/Downloads/PigBook/data/pig_1331445409976.log<r 1>    823

It shows the list of files in that folder (data).

Let's run a program. Chapter 2 has one Pig script.

In the PigBook/examples/chap2 folder there is a script named average_dividend.pig.

The code of the script is as follows:

dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
grouped   = group dividends by symbol;
avg       = foreach grouped generate group, AVG(dividends.dividend);
store avg into 'average_dividend';

In plain English, the code above says the following:

Load the NYSE_dividends file, which contains the fields exchange, symbol, date and dividend.

Group the records in that file by symbol.

Calculate the average dividend for each symbol.

Store the averages in the average_dividend folder.
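To make these steps concrete, here is a rough Python sketch of the same computation. It is not how Pig executes the script, only an illustration of what the script works out. The sample rows are invented, the function name average_dividend_by_symbol is my own, and the sketch assumes the input is tab-separated text, which is what Pig's load expects by default.

```python
from collections import defaultdict

# Made-up sample rows laid out like NYSE_dividends:
# exchange, symbol, date, dividend (tab-separated). Not real data.
SAMPLE = (
    "NYSE\tAAA\t2009-01-15\t0.10\n"
    "NYSE\tAAA\t2009-04-15\t0.30\n"
    "NYSE\tBBB\t2009-02-01\t0.50\n"
)


def average_dividend_by_symbol(text):
    """Group rows by symbol and average the dividend column."""
    grouped = defaultdict(list)  # plays the role of: group dividends by symbol
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        exchange, symbol, date, dividend = line.split("\t")
        grouped[symbol].append(float(dividend))
    # plays the role of: foreach grouped generate group, AVG(dividends.dividend)
    return {symbol: sum(vals) / len(vals) for symbol, vals in grouped.items()}


if __name__ == "__main__":
    for symbol, avg in sorted(average_dividend_by_symbol(SAMPLE).items()):
        print(f"{symbol}\t{avg}")
```

The Pig script does the same thing in four lines, and Pig takes care of turning it into a job.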

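Result

Run the script from the data directory, for example by typing exec ../examples/chap2/average_dividend.pig at the grunt> prompt. After a lot of processing, the end of the output looks like this: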


Input(s):
Successfully read records from: "file:///home/hadoop/Downloads/PigBook/data/NYSE_dividends"

Output(s):
Successfully stored records in: "file:///home/hadoop/Downloads/PigBook/data/average_dividend"

Job DAG:
job_local_0001


2012-03-11 11:47:10,994 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

 

To check the output, go to the average_dividend directory, which is created inside the data directory (remember, we started Pig in this directory).

It contains one MapReduce part file, part-r-00000, which holds the final results.
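If you want to look at the results from a program rather than by opening the file, a small Python sketch like the one below could read them back. It assumes the output was written with Pig's default storage, which puts one record per line with tab-separated fields. The helper name read_pig_output is my own and not part of Pig.

```python
import glob
import os


def read_pig_output(directory):
    """Return (symbol, average) pairs from every part-* file in directory."""
    rows = []
    for path in sorted(glob.glob(os.path.join(directory, "part-*"))):
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue  # skip blank lines
                symbol, average = line.split("\t")
                # An empty field is treated as a missing value (assumption).
                rows.append((symbol, float(average) if average else None))
    return rows


if __name__ == "__main__":
    # Run this from the data directory, next to average_dividend.
    for symbol, average in read_pig_output("average_dividend"):
        print(symbol, average)
```

Reading every part-* file, not just part-r-00000, means the same helper still works if a later run writes more than one part file.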

That's it. Pig Latin has done all the magic behind the scenes.

Coming next: running Pig Latin in MapReduce mode.

 

 
