### Sample data for practice with Hadoop

Go to http://www.infochimps.com/datasets

Filter by the free data sets available.

Choose the data type that interests you.

### Pig Editor for eclipse

I was not able to configure PigPen, which is said to be a more feature-rich editor than PigEditor.

I will keep trying to get hold of PigPen as well.

In case you want to use PigEditor, here is what you have to do:

In the Eclipse update site dialog, enter the PigEditor update URL

Let it install

Restart Eclipse

Create a New General project

It will ask you to allow XUnit use with the project; say yes and you are done.

### Pig Return codes


| Value | Meaning | Comment |
|-------|---------|---------|
| 0 | Success | |
| 1 | Failure | Will retry again |
| 2 | Failure | |
| 3 | Partial failure | Used with multiquery |
| 4 | Illegal arguments | |
| 5 | IOException thrown | UDF raised an exception |
| 6 | PigException thrown | Python UDF raised an exception |
| 7 | ParseException thrown | Can happen after variable parsing if variable substitution is being done |
| 8 | Throwable thrown | Unexpected exception |

### Apache Pig Introduction Tutorial

Apache Pig is a platform to analyze large data sets.

In simple terms: you have lots and lots of data on which you need to do some processing or analysis. One way is to write MapReduce code and run that processing over the data.

The other way is to write Pig scripts, which are in turn converted into MapReduce code that processes your data.

Pig consists of two parts

• Pig latin language
• Pig engine

Pig Latin is a scripting language that lets you describe how data from one or more inputs should be read, how it should be processed, and where it should be stored.

The flows can be simple or complex, with processing applied in between, and data can be picked from multiple inputs. We can say that Pig Latin describes a directed acyclic graph where the edges are data flows and the nodes are operators that process the data.

The job of the engine is to execute the data flow written in Pig Latin, in parallel, on the Hadoop infrastructure.
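To make the dataflow idea concrete, here is the kind of load, group, aggregate pipeline a Pig Latin script describes, sketched in plain Python. The rows and field names below are made-up sample data, not from any real file; Pig would run the equivalent logic as MapReduce jobs over data in HDFS.

```python
# Illustrative sketch of a load -> group -> aggregate dataflow,
# the shape of pipeline that Pig Latin describes declaratively.
from collections import defaultdict

# "load ... as (exchange, symbol, date, dividend)" -- sample rows
rows = [
    ("NYSE", "CPO", "2009-12-30", 0.14),
    ("NYSE", "CPO", "2009-09-28", 0.14),
    ("NYSE", "CCS", "2009-10-28", 0.41),
]

# "group rows by symbol"
groups = defaultdict(list)
for exchange, symbol, date, dividend in rows:
    groups[symbol].append(dividend)

# "foreach grouped generate group, AVG(dividend)"
averages = {symbol: sum(vals) / len(vals) for symbol, vals in groups.items()}
print(averages)  # {'CPO': 0.14, 'CCS': 0.41}
```

In Pig you only declare these steps; the engine decides how to parallelize them across the cluster.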

Why is Pig required when we can code everything in MR?

Pig provides all the standard data processing operations, such as sort, group, join, filter, order by, and union, right inside Pig Latin. In MR we would have to do lots of manual coding.

Pig optimizes Pig Latin scripts while translating them into MR jobs, so it creates an optimized version of the MapReduce code to run on Hadoop.

It takes much less time to write a Pig Latin script than to write the corresponding MR code.

Where Pig is useful:

• Transactional ETL data pipelines (the most common use)
• Research on raw data
• Iterative processing

### Oracle Date mapped to TimeStamp while importing with Sqoop


The current version of Sqoop, 1.4.1, maps the Oracle DATE type to Timestamp, since the Oracle JDBC driver does this. Read the discussion below.

http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-faq-090281.html#08_01

How to solve this

While importing with sqoop, pass driver-specific arguments as in the example below.

$ sqoop import -D mapDateToTimestamp=false --connect jdbc:oracle:thin:@//db.example.com/foo --table bar

Setting the property mapDateToTimestamp to false makes the driver revert to the default 9i-10g behavior and map DATE to Date.

### Installing Pig ( Apache Hadoop Pig)

Apache Pig can be downloaded from http://pig.apache.org

Download the latest release from its website and unzip the downloaded tar file.

Set the environment variables in your system:

export PIG_HOME="/home/hadoop/software/pig-0.9.2"
export PATH=$PATH:$PIG_HOME/bin

Point PIG_HOME to the place where you unzipped Pig and add its bin directory to your path.

If you plan to run Pig on a Hadoop cluster then one additional variable needs to be set:

export PIG_CLASSPATH="/home/hadoop/software/hadoop-1.0.1/conf"

It tells Pig where to look for hdfs-site.xml and the other Hadoop configuration files.

Restart your computer. That's it; now let's test the installation. On the command prompt type

# pig -h

It should show the help for Pig and its various commands.

Done :) Next you can read about how to run your first Pig script in local mode, or about the various Pig running modes.

### Hadoop Pig Local mode Tutorial

The example below explains how to start programming in Pig. I followed the book Programming Pig. This post assumes that you have already installed Pig on your computer; if you need help you can read the tutorial to install Pig.

So let's get started writing our first Pig program, using the code example given in chapter 2 of the book.

Download the code examples from the GitHub website (link below):

https://github.com/alanfgates/programmingpig

Pig can run in local mode and mapreduce mode. Local mode means that the source data is picked from a directory that is local to your computer. So to run a program you go to the directory where the data is and then run a Pig script to analyze it.

I downloaded the code examples from the above link, and now I go to the data directory where all the data is present.
# cd /home/hadoop/Downloads/PigBook/data

Change the path depending upon where you copied the code on your computer.

Now let's start Pig in local mode:

# pig -x local

-x local says: Dear Pig, let's work locally on this computer. The output is similar to:

2012-03-11 11:44:13,346 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/Downloads/PigBook/data/pig_1331446453340.log
2012-03-11 11:44:13,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop filesystem at: file:///

It enters the grunt> shell; grunt is the shell in which Pig scripts are written.

Let's list all the files present in the data directory:

grunt> ls

Output is shown below:

file:/home/hadoop/Downloads/PigBook/data/webcrawl<r 1> 255068
file:/home/hadoop/Downloads/PigBook/data/baseball<r 1> 233509
file:/home/hadoop/Downloads/PigBook/data/NYSE_dividends<r 1> 17027
file:/home/hadoop/Downloads/PigBook/data/NYSE_daily<r 1> 3194099
file:/home/hadoop/Downloads/PigBook/data/README<r 1> 980
file:/home/hadoop/Downloads/PigBook/data/pig_1331445409976.log<r 1> 823

It shows the list of files present in that folder (data).

Let's run a program. In chapter 2 there is one Pig script.
Go to the PigBook/examples/chap2 folder; there is a script named average_dividend.pig. The code of the script is as follows:

dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
grouped = group dividends by symbol;
avg = foreach grouped generate group, AVG(dividends.dividend);
store avg into 'average_dividend';

In plain English the above code says the following: load the NYSE_dividends file, which contains the fields exchange, symbol, date, and dividend; group the records in that file by symbol; calculate the average dividend; and store the results in the average_dividend folder.

Result

After lots of processing the output looks like:

Input(s):
Successfully read records from: "file:///home/hadoop/Downloads/PigBook/data/NYSE_dividends"
Output(s):
Successfully stored records in: "file:///home/hadoop/Downloads/PigBook/data/average_dividend"
Job DAG:
job_local_0001
2012-03-11 11:47:10,994 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

To check the output go to the average_dividend directory, which is created within the data directory (remember we started Pig in this directory). There is one MR part file, part-r-00000, that has the final results.

That's it; Pig Latin has done all the magic behind the scenes. Coming next: running Pig Latin in mapreduce mode.

### org.apache.hbase#hbase;0.92.0-SNAPSHOT: not found

If you are using Sqoop 1.4.1 and you try to build it, you can get the error:

org.apache.hbase#hbase;0.92.0-SNAPSHOT: not found

This is because HBase 0.92.0 has been released. Just make the following changes in build.xml and run the build again:

https://reviews.apache.org/r/4169/diff/

### Sqoop free form query example

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --target-dir importOnlyEmpName -e 'Select Name from Employee_Table where $CONDITIONS' --m 1

The free form query is given after -e or --query. We can write the query in single quotes or double quotes.
Just read the notes below from the official documentation.

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --target-dir importOnlyEmpName -e "Select Name from Employee_Table where (employee_Name='David' OR Salary>'2000') AND \$CONDITIONS" --m 1

Example of Sqoop free form query with where clause

The above query selects just the name from the table Employee_Table, which also has other columns besides name.
Importance of $CONDITIONS in a free form query

It's worth noting the importance of $CONDITIONS in a free form query (this thread explains it well; the information below comes from there).

If you run a parallel import, the map tasks will execute your query with different values substituted in for $CONDITIONS. For example, one mapper may execute "select bla from foo WHERE (id >= 0 AND id < 10000)", and the next mapper may execute "select bla from foo WHERE (id >= 10000 AND id < 20000)", and so on. Sqoop does not parse your SQL statement into an abstract syntax tree, which would allow it to modify your query without textual hints. You are free to add further constraints like you suggested in your initial example (read the thread), but the literal string "$CONDITIONS" does need to appear in the WHERE clause of your query so that Sqoop can textually replace it with its own refined constraints.

Setting -m 1 is the only way to force a non-parallel import. You still need $CONDITIONS in there, because Sqoop queries the database for column type information in the client before executing the import job, but does not want actual rows returned to the client. So it executes your query with $CONDITIONS set to '1 = 0' to ensure that it receives type information, but no records.
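The mechanics are plain textual substitution. Here is a Python sketch of what Sqoop effectively does with the query template; the split boundaries are made-up values, since Sqoop derives real ones from the split column's min/max:

```python
# Sketch of how Sqoop textually substitutes $CONDITIONS per map task.
template = "SELECT Name FROM Employee_Table WHERE $CONDITIONS"

# Illustrative (lo, hi) ranges, one per mapper.
splits = [(0, 10000), (10000, 20000)]
queries = [
    template.replace("$CONDITIONS", f"(id >= {lo} AND id < {hi})")
    for lo, hi in splits
]
for q in queries:
    print(q)

# Before launching the job, Sqoop runs the template once with
# $CONDITIONS set to '1 = 0' to fetch column metadata without rows.
metadata_query = template.replace("$CONDITIONS", "1 = 0")
```

This is why the literal string $CONDITIONS must survive your shell quoting: without it there is nothing for Sqoop to replace.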
Notes from the Sqoop documentation

If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to stop your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"

The facility of using a free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries, such as queries that have sub-queries or joins leading to ambiguous projections, can lead to unexpected results.

### Sqoop --target-dir example

Example of an import using Sqoop into a target directory in HDFS:

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --m 1

The above command imports the data present in Employee_Table in the sqoop database into the directory named employeeImportAll.

After the import is done we can check that the data is present.

Just see the output of each of the 3 commands, one by one.

hadoop@jj-VirtualBox:~$ hadoop fs -ls
hadoop@jj-VirtualBox:~$ hadoop fs -ls /user/hadoop/employeeImportAll
hadoop@jj-VirtualBox:~$ hadoop fs -cat /user/hadoop/employeeImportAll/part-m-00000

All the results are present as a comma separated file.

### ERROR tool.ImportTool: Error during import: No primary key could be found for table

12/03/05 23:44:31 ERROR tool.ImportTool: Error during import: No primary key could be found for table Employee_Table. Please specify one with --split-by or perform a sequential import with '-m 1'.

Sample query on which I got this error:

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll

Explanation

While performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses the splitting column for this. By default Sqoop identifies the primary key column (if present) in a table and uses it as the splitting column.

The low and high values of the splitting column are retrieved from the database, and the map tasks operate on evenly sized components of the total range.

For example , if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
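The arithmetic behind those ranges can be sketched as follows. This mirrors the example above, not Sqoop's actual internals; the function name is just for illustration:

```python
def split_ranges(lo, hi, num_tasks):
    """Divide the inclusive range [lo, hi] into num_tasks contiguous
    half-open (start, end) ranges, as in the example above. The last
    range extends to hi + 1 so the maximum value is included."""
    size = (hi - lo) // num_tasks
    bounds = [lo + i * size for i in range(num_tasks)] + [hi + 1]
    return list(zip(bounds, bounds[1:]))

print(split_ranges(0, 1000, 4))
# [(0, 250), (250, 500), (500, 750), (750, 1001)]
```

Each (lo, hi) pair becomes the WHERE clause of one map task's query, so the four tasks together cover every id exactly once.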

Solution

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --m 1

Just add --m 1; it tells Sqoop to do a sequential import with one mapper.

Another solution is to tell Sqoop to use a particular column as the split column:

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --split-by columnName

### Spring Hadoop

Spring Hadoop provides support for writing Apache Hadoop applications that benefit from the features of Spring, Spring Batch and Spring Integration.

### Features

• Extension to Spring Batch to support creating an end-to-end data pipeline solution
• Simplified reading and writing to HDFS using Spring's resource abstraction
• Spring Batch tasklets for MapReduce and Streaming jobs
• Integration with Cascading, HBase, Hive and Pig

### Cloudera Certified Administrator for Apache Hadoop (CCAH) exam topics and syllabus

Update : 6 April 2013

To earn a CCAH certification, candidates must pass an exam designed to test a candidate’s fluency with the concepts and skills required in the following areas:

If you are interested in Developer exam then you should read other post

http://jugnu-life.blogspot.in/2012/03/cloudera-certified-developer-for-apache.html

Details for the Admin exam are below, along with where to prepare from.

Number of Questions: 60
Time Limit: 90 minutes
Passing Score: 70%
Languages: English, Japanese
English Release Date: November 1, 2012
Japanese Release Date: December 1, 2012
Price: USD $295, AUD 285, EUR 225, GBP 185, JPY 25,500

#### 1. HDFS (38%)

###### Objectives

• Describe the function of all Hadoop Daemons
• Describe the normal operation of an Apache Hadoop cluster, both in data storage and in data processing.
• Identify current features of computing systems that motivate a system like Apache Hadoop.
• Classify major goals of HDFS Design
• Given a scenario, identify appropriate use case for HDFS Federation
• Identify components and daemon of an HDFS HA-Quorum cluster
• Analyze the role of HDFS security (Kerberos)
• Describe file read and write paths

###### Section Study Resources

#### 2. MapReduce (10%)

###### Objectives

• Understand how to deploy MapReduce v1 (MRv1)
• Understand how to deploy MapReduce v2 (MRv2 / YARN)
• Understand basic design strategy for MapReduce v2 (MRv2)

###### Section Study Resources

• Apache YARN docs (note: we don't control apache.org links and as of 11 February 2013, they have been experiencing downtime. You may get a 404 error.)
• CDH4 YARN deployment docs

#### 3. Hadoop Cluster Planning (12%)

###### Objectives

• Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster.
• Analyze the choices in selecting an OS
• Understand kernel tuning and disk swapping
• Given a scenario and workload pattern, identify a hardware configuration appropriate to the scenario
• Cluster sizing: given a scenario and frequency of execution, identify the specifics for the workload, including CPU, memory, storage, disk I/O
• Disk Sizing and Configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing requirements in a cluster
• Network Topologies: understand network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario

###### Section Study Resources

• Hadoop Operations: Chapter 4

#### 4. Hadoop Cluster Installation and Administration (17%)

###### Objectives

• Given a scenario, identify how the cluster will handle disk and machine failures.
• Analyze a logging configuration and logging configuration file format.
• Understand the basics of Hadoop metrics and cluster health monitoring.
• Identify the function and purpose of available tools for cluster monitoring.
• Identify the function and purpose of available tools for managing the Apache Hadoop file system.

###### Section Study Resources

• Hadoop Operations, Chapter 5

#### 5. Resource Management (6%)

###### Objectives

• Understand the overall design goals of each of Hadoop schedulers.
• Understand the role of HDFS quotas.
• Given a scenario, determine how the FIFO Scheduler allocates cluster resources.
• Given a scenario, determine how the Fair Scheduler allocates cluster resources.
• Given a scenario, determine how the Capacity Scheduler allocates cluster resources.

###### Section Study Resources

#### 6. Monitoring and Logging (12%)

###### Objectives

• Understand the functions and features of Hadoop's metric collection abilities
• Analyze the NameNode and JobTracker Web UIs
• Interpret a log4j configuration
• Understand how to monitor the Hadoop Daemons
• Identify and monitor CPU usage on master nodes
• Describe how to monitor swap and memory allocation on all nodes
• Identify how to view and manage Hadoop's log files
• Interpret a log file

###### Section Study Resources

#### 7. The Hadoop Ecosystem (5%)

###### Objectives

• Understand Ecosystem projects and what you need to do to deploy them on a cluster.

###### Section Study Resources

### Cloudera Certified Developer for Apache Hadoop Syllabus exam topics and contents (CCDH)

Cloudera Certified Developer for Apache Hadoop (CCDH)

Update : 6 April 2013

Cloudera has added exam learning resources on the website; please read this link for the latest.
http://university.cloudera.com/certification/prep/ccdh.html
http://jugnu-life.blogspot.in/2012/05/cloudera-hadoop-certification-now.html

Syllabus, exam contents:

http://university.cloudera.com/certification.html

To earn a CCDH certification, candidates must pass an exam designed to test a candidate's fluency with the concepts and skills required in the following areas:

If you are interested in the Administrator exam then you should read the other post:

http://jugnu-life.blogspot.in/2012/03/cloudera-certified-administrator-for.html

The exam syllabus for Developer and study sources are mentioned below.

#### 1. Core Hadoop Concepts (CCD-410: 25% | CCD-470: 33%)

###### Objectives

• Recognize and identify Apache Hadoop daemons and how they function both in data storage and processing under both CDH3 and CDH4.
• Understand how Apache Hadoop exploits data locality, including rack placement policy.
• Given a big data scenario, determine the challenges to large-scale computational models and how distributed systems attempt to overcome various challenges posed by the scenario.
• Identify the role and use of both MapReduce v1 (MRv1) and MapReduce v2 (MRv2 / YARN) daemons.

###### Section Study Resources

#### 2. Storing Files in Hadoop (7%)

###### Objectives

• Analyze the benefits and challenges of the HDFS architecture
• Analyze how HDFS implements file sizes, block sizes, and block abstraction.
• Understand default replication values and storage requirements for replication.
• Determine how HDFS stores, reads, and writes files.
• Given a sample architecture, determine how HDFS handles hardware failure.

###### Section Study Resources

• Hadoop: The Definitive Guide, 3rd edition: Chapter 3
• Hadoop Operations: Chapter 2
• Hadoop in Practice: Appendix C: HDFS Dissected

#### 3. Job Configuration and Submission (7%)

###### Objectives

• Construct proper job configuration parameters
• Identify the correct procedures for MapReduce job submission.
• How to use various commands in job submission

###### Section Study Resources

• Hadoop: The Definitive Guide, 3rd Edition: Chapter 5

#### 4. Job Execution Environment (10%)

###### Objectives

• Given a MapReduce job, determine the lifecycle of a Mapper and the lifecycle of a Reducer.
• Understand the key fault tolerance principles at work in a MapReduce job.
• Identify the role of Apache Hadoop Classes, Interfaces, and Methods.
• Understand how speculative execution exploits differences in machine configurations and capabilities in a parallel environment and how and when it runs.

###### Section Study Resources

• Hadoop in Action: Chapter 3
• Hadoop: The Definitive Guide, 3rd Edition: Chapter 6

#### 5. Input and Output (6%)

###### Objectives

• Given a sample job, analyze and determine the correct InputFormat and OutputFormat to select based on job requirements.
• Understand the role of the RecordReader, and of sequence files and compression.

###### Section Study Resources

• Hadoop: The Definitive Guide, 3rd Edition: Chapter 7
• Hadoop in Action: Chapter 3
• Hadoop in Practice: Chapter 3

#### 6. Job Lifecycle (18%)

###### Objectives

• Analyze the order of operations in a MapReduce job.
• Analyze how data moves through a job.
• Understand how partitioners and combiners function, and recognize appropriate use cases for each.
• Recognize the processes and role of the sort and shuffle process.

###### Section Study Resources

• Hadoop: The Definitive Guide, 3rd Edition: Chapter 6
• Hadoop in Practice: Techniques in section 6.4
• Two blog posts from Philippe Adjiman's Hadoop Tutorial Series

#### 7. Data processing (6%)

###### Objectives

• Analyze and determine the relationship of input keys to output keys in terms of both type and number, the sorting of keys, and the sorting of values.
• Given sample input data, identify the number, type, and value of emitted keys and values from the Mappers as well as the emitted data from each Reducer and the number and contents of the output file(s).

###### Section Study Resources

• Hadoop: The Definitive Guide, 3rd Edition: Chapter 7 on Input Formats and Output Formats
• Hadoop in Practice: Chapter 3

#### 8. Key and Value Types (6%)

###### Objectives

• Given a scenario, analyze and determine which of Hadoop's data types for keys and values are appropriate for the job.
• Understand common key and value types in the MapReduce framework and the interfaces they implement.

###### Section Study Resources

• Hadoop: The Definitive Guide, 3rd Edition: Chapter 4
• Hadoop in Practice: Chapter 3

#### 9. Common Algorithms and Design Patterns (7%)

###### Objectives

• Evaluate whether an algorithm is well-suited for expression in MapReduce.
• Understand implementation and limitations and strategies for joining datasets in MapReduce.
• Analyze the role of DistributedCache and Counters.

###### Section Study Resources

• Hadoop: The Definitive Guide, 3rd Edition: Chapter 8
• Hadoop in Practice: Chapters 4, 5, 7
• MapReduce Algorithms tutorial video. Note: uses the old API.
• Hadoop in Action: Chapter 5.2

#### 10. The Hadoop Ecosystem (8%)

###### Objectives

• Analyze a workflow scenario and determine how and when to leverage ecosystem projects, including Apache Hive, Apache Pig, Sqoop and Oozie.
• Understand how Hadoop Streaming might apply to a job workflow.

###### Section Study Resources

### Hadoop Certification in India or Outside USA

If you are waiting to write the Cloudera exam for Hadoop certification, then there is good news for you. Cloudera is going to organize exams through Pearson VUE centers starting 1 May 2012, throughout the world.

Exams: Cloudera Certified Administrator for Apache Hadoop (CCAH) and Cloudera Certified Developer for Apache Hadoop (CCDH)
Start date: 1 May 2012
Testing center: Pearson VUE
Exam fees: $295 US

The other good news is that it is no longer necessary to attend training before writing an exam in the Hadoop world, so the huge training fee can be avoided if we study on our own. (At least I cannot afford that USD 1600 training cost; it is huge. Cloudera people, are you listening? At 1 USD = 50 INR, USD 1600 is a lot.)

This is great news for many people in India or around the world outside USA who wanted to write certification exam.

More details at official press release below.

If you are also planning to write exam like me , lets plan and study together.

How are you preparing for them?

Which technologies are you working with these days? I am working with MR, Hive and Sqoop, and following Tom White's Hadoop book.

Do you have an idea about the contents and syllabus of the exam?

I have written about the contents of the exam here in two blog posts

http://jugnu-life.blogspot.in/2012/03/cloudera-certified-developer-for-apache.html

### Setting up development environment for Sqoop

The post at the official Sqoop wiki explains the process of setting up a development environment for Sqoop well.

My setup is the following:

• Ubuntu 11.10
• Ant 1.8
• Subclipse (SVN plugin for Eclipse)
• Make, already present in Ubuntu
• AsciiDoc, downloaded from the Ubuntu software repository, the easy part :)
• Java 1.6, already there in my system

All set :)

### Snappy compressions library

Snappy is a compression/decompression library built in C++.

The main advantage of Snappy is its high speed in compressing and decompressing data.

### Integrating Pig and Accumulo

Accumulo

Accumulo is a sorted, distributed key/value store based on Google's BigTable design, providing expressive, cell-level access labels. It is built on top of Apache Hadoop, ZooKeeper, and Thrift.

http://www.covert.io/post/18605091231/accumulo-and-pig

The above post explains use of Pig and Accumulo together.

### Sqoop import with where clause

If you are following on from the previous Sqoop import tutorial (http://jugnu-life.blogspot.in/2012/03/sqoop-import-tutorial.html), let's try a conditional import from an RDBMS with Sqoop.

$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret -m 1

The sqoop command above would import all the rows present in the table Customer. Let's say that the Customer table is something like this:

| CustomerName | DateOfJoining |
|--------------|---------------|
| Adam | 2012-12-12 |
| John | 2002-1-3 |
| Emma | 2011-1-3 |
| Tina | 2009-3-8 |

Now let's say we want to import only those customers who joined after 2005-1-1. We can modify the sqoop import as:

$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret --where "DateOfJoining > '2005-1-1' "

This would import only 3 records from above table.
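Using the sample table above, a quick Python check of the same condition shows which rows survive the --where filter:

```python
from datetime import date

# Sample Customer table from above: name -> DateOfJoining
customers = {
    "Adam": date(2012, 12, 12),
    "John": date(2002, 1, 3),
    "Emma": date(2011, 1, 3),
    "Tina": date(2009, 3, 8),
}

cutoff = date(2005, 1, 1)  # --where "DateOfJoining > '2005-1-1'"
imported = [name for name, joined in customers.items() if joined > cutoff]
print(imported)  # ['Adam', 'Emma', 'Tina']
```

John joined in 2002, so only the other three records are imported.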

Happy sqooping :)

### Sqoop installation tutorial

Sqoop is a tool used to import and export data between an RDBMS and HDFS.

It can be downloaded from the Apache website. As of writing this post Sqoop is an incubator project with Apache, but it should become a full project in the near future.

Sqoop is a client tool; you are not required to install it on all nodes of the cluster. The best practice is to just install it on the client (or an edge node of the cluster). The data transfer is direct between the cluster and the database, in case you are worried about traffic between the machine where you install Sqoop and the database.

Installation steps

http://sqoop.apache.org/

The installation is fairly simple to start off with for development purposes.

Extract it in some folder

Specify SQOOP_HOME and add Sqoop to your PATH variable so that you can run the sqoop commands directly.

For example, I downloaded Sqoop to a directory and my environment variables look like this:

export PATH=$PATH:$SQOOP_HOME/bin

Sqoop can connect to various types of databases.

For example, it can talk to MySQL, Oracle and Postgres databases. It uses JDBC to connect to them, so Sqoop needs the JDBC driver for each database it connects to.

The JDBC driver jar for each database can be downloaded from the net. For example, the MySQL jar is present at the link below.

Download the MySQL Connector/J jar and store it in the lib directory present in the Sqoop home folder.

Thats it.

Just test your installation by typing

$ sqoop help

You should see the list of commands with their use in Sqoop.

Happy sqooping :)

### Sqoop import tutorial

This tutorial explains how to use sqoop to import data from an RDBMS to HDFS. The tutorial is divided into multiple posts to cover the various functionalities offered by sqoop import.

The general syntax for import is:

$ sqoop-import (generic-args) (import-args)


| Argument | Description |
|----------|-------------|
| --connect <jdbc-uri> | Specify JDBC connect string |
| --connection-manager <class-name> | Specify connection manager class to use |
| --driver <class-name> | Manually specify JDBC driver class to use |
| --hadoop-home <dir> | Override $HADOOP_HOME |
| --help | Print usage instructions |
| -P | Read password from console |
| --password <password> | Set authentication password |
| --username <username> | Set authentication username |
| --verbose | Print more information while working |
| --connection-param-file <filename> | Optional properties file that provides connection parameters |

Example run:

$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret -m 1

When we run this sqoop command it tries to connect to the MySQL database named CompanyDatabase with username root, password mysecret, and with one map task.

One more thing to notice is the use of localhost as the database address; if you are running your Hadoop cluster in distributed mode then you should give the full hostname and IP of the database.

### Hadoop installation tutorial

The purpose of this post is to explain how to install Hadoop on your computer. It assumes that you have a Linux based system available for use; I am doing this on an Ubuntu system.

Before you begin, create a separate user named hadoop in the system and do all these operations as that user.

This document covers the Steps to
1) Configure SSH
2) Install JDK

#sudo apt-get update

You can directly copy the commands from there and run in your system

Hadoop requires that the various systems present in the cluster can talk to each other freely. Hadoop uses SSH to prove identity for connections.

#sudo apt-get install openssh-server openssh-client
#ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
#cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

#sudo chmod go-w $HOME $HOME/.ssh
#sudo chmod 600 $HOME/.ssh/authorized_keys
#sudo chown $(whoami) $HOME/.ssh/authorized_keys

#ssh localhost
Say yes

It should open connection with SSH
#exit
This will close the SSH

Java 1.6 is mandatory for running hadoop

#sudo mkdir /usr/java
#cd /usr/java

Install java
#sudo chmod o+w jdk-6u31-linux-i586.bin
#sudo chmod +x jdk-6u31-linux-i586.bin
#sudo ./jdk-6u31-linux-i586.bin

Extract it into some folder ( say /home/hadoop/software/20/ )

Go to the conf directory in the hadoop folder, open core-site.xml, and add the following property inside the empty configuration tags

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost</value>
</property>
</configuration>

Similarly do for

conf/hdfs-site.xml:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

conf/mapred-site.xml:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>

Environment variables

In the hadoop-env.sh file, change JAVA_HOME to the location where you installed Java, e.g.

export JAVA_HOME=/usr/java/jdk1.6.0_31

Configure the environment variables for JDK , Hadoop as follows

Go to the ~/.profile file in the current user's home directory

You can change the variable paths if you have installed hadoop and java at some other locations

export JAVA_HOME="/usr/java/jdk1.6.0_31"
export PATH=$PATH:$JAVA_HOME/bin
export PATH=$PATH:$HADOOP_INSTALL/bin

Format the HDFS:

hadoop namenode -format

Now start the daemons:

hadoop@jj-VirtualBox:~$ start-dfs.sh
starting namenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-namenode-jj-VirtualBox.out
localhost: starting datanode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-datanode-jj-VirtualBox.out
localhost: starting secondarynamenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-secondarynamenode-jj-VirtualBox.out
hadoop@jj-VirtualBox:~$ start-mapred.sh

Open the browser and point to page

localhost:50030
localhost:50070

It would open the status page for hadoop

That's it; this completes the installation of Hadoop. Now you are ready to play with it.

### http://localhost:50070/dfshealth.jsp crash

I often get the problem that http://localhost:50070/dfshealth.jsp crashes and doesn't show anything.

I am running the pseudo-distributed configuration.

One of the temporary solutions I found online was to format the DFS again, but this is very frustrating.

Also in

http://localhost:50030/jobtracker.jsp

In the Jobtracker history I get the following message:

HTTP ERROR 500

Problem accessing /jobhistoryhome.jsp. Reason:

INTERNAL_SERVER_ERROR

http://localhost:50030/jobhistoryhome.jsp

I see similar problem was also observed here

Solution

If you look carefully at the namenode log, we have the error:

org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-hadoop/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.

This says that the following properties are not properly set.

Normally this is due to the machine having been rebooted and /tmp being cleared out. You do not want to leave the Hadoop name node or data node storage in /tmp for this reason. Make sure you properly configure dfs.name.dir and dfs.data.dir to point to directories
outside of /tmp and other directories that may be cleared on boot.

The quick setup guide is really just to help you start experimenting with Hadoop. For setting up a cluster for any real use, you'll want to
follow the next guide - Cluster Setup -

So here is what I did: in hadoop-site.xml I added the following two properties, and now it's working fine.

<property>
<name>dfs.name.dir</name>