Kiji Maven Setup

To develop applications using KijiSchema and Kiji MapReduce, add the following to your Maven configuration.

If you have custom changes in your Maven settings.xml, add the following in the relevant sections. Otherwise you can download the settings.xml provided on the Kiji website.

 

<profile>
     <repositories>
       <repository>
         <id>kiji-repos</id>
         <name>kiji-repos</name>
         <url>https://repo.wibidata.com/artifactory/kiji</url>
       </repository>
     </repositories>
     <pluginRepositories>
       <pluginRepository>
         <snapshots>
           <enabled>true</enabled>
         </snapshots>
         <id>kiji-plugins</id>
         <name>kiji-plugins</name>
         <url>https://repo.wibidata.com/artifactory/kiji</url>
       </pluginRepository>
     </pluginRepositories>
     <id>kiji-profile</id>
   </profile>

 

  <activeProfiles>
  <activeProfile>kiji-profile</activeProfile>
</activeProfiles>

 

Changes to pom.xml

 

Add the following dependencies:

<dependency>
      <groupId>org.kiji.schema</groupId>
      <artifactId>kiji-schema</artifactId>
      <version>1.0.0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.kiji.mapreduce</groupId>
      <artifactId>kiji-mapreduce</artifactId>
      <version>1.0.0-rc5</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.kiji.platforms</groupId>
      <artifactId>kiji-cdh4-platform</artifactId>
      <version>1.0.0</version>
      <scope>provided</scope>
    </dependency>   

 

For org.kiji.platforms

Please read the README below; you need to choose the right version depending on your Hadoop cluster:

https://github.com/kijiproject/kiji-platforms/blob/master/README.md
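
With these dependencies in place, a minimal read against a Kiji table looks roughly like the sketch below. Treat it as a hedged illustration based on the published KijiSchema quickstart rather than definitive code: the "users" table and the "info:name" column are made-up examples, and exact method names may differ between release candidates.

import org.kiji.schema.Kiji;
import org.kiji.schema.KijiDataRequest;
import org.kiji.schema.KijiRowData;
import org.kiji.schema.KijiTable;
import org.kiji.schema.KijiTableReader;
import org.kiji.schema.KijiURI;

public class KijiReadExample {
  public static void main(String[] args) throws Exception {
    // Open the default Kiji instance (assumes ZooKeeper on localhost:2181).
    Kiji kiji = Kiji.Factory.open(
        KijiURI.newBuilder("kiji://localhost:2181/default").build());
    try {
      // "users" is a hypothetical table created beforehand via the schema shell.
      KijiTable table = kiji.openTable("users");
      try {
        KijiTableReader reader = table.openTableReader();
        try {
          // Request the most recent value of the info:name column for one row.
          KijiDataRequest request = KijiDataRequest.create("info", "name");
          KijiRowData row = reader.get(table.getEntityId("user-001"), request);
          System.out.println("name = " + row.getMostRecentValue("info", "name"));
        } finally {
          reader.close();
        }
      } finally {
        table.release();
      }
    } finally {
      kiji.release();
    }
  }
}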

Installing Kiji Schema and Shell

Make sure that Hadoop and HBase are up and running.

If you need help installing Hadoop and HBase, see the following posts:

http://jugnu-life.blogspot.com.au/2012/03/hadoop-installation-tutorial.html

http://jugnu-life.blogspot.com.au/2013/03/hbase-pseudo-mode-install.html

Download Kiji Schema and extract it to some location

https://github.com/kijiproject/kiji-schema

Set the following variables:

export KIJI_HOME="/home/jj/software/wibi/kiji/kiji-schema-1.0.0-rc5"
export PATH=$PATH:$KIJI_HOME/bin

Install Kiji system tables

$ kiji install

 

jj@jj-VirtualBox:~$ kiji install
Warning: $HADOOP_HOME is deprecated.

Creating kiji instance: kiji://localhost:2181/default/
Creating meta tables for kiji instance in hbase...
13/03/30 18:03:37 INFO org.kiji.schema.KijiInstaller: Installing kiji instance 'kiji://localhost:2181/default/'.
13/03/30 18:03:43 INFO org.kiji.schema.KijiInstaller: Installed kiji instance 'kiji://localhost:2181/default/'.
Successfully created kiji instance: kiji://localhost:2181/default/

 

Installing Kiji Schema Shell

Download from

https://github.com/kijiproject/kiji-schema-shell

export KIJI_SHELL_HOME="/home/jj/software/wibi/kiji/kiji-schema-shell-1.0.0-rc5"
export PATH=$PATH:$KIJI_SHELL_HOME/bin

Start the Kiji shell with:

 

jj@jj-VirtualBox:~$ kiji-schema-shell
Warning: $HADOOP_HOME is deprecated.

Kiji schema shell v1.0.0-rc5
Enter 'help' for instructions (without quotes).
Enter 'quit' to quit.
DDL statements must be terminated with a ';'
schema>

Congrats, you have installed KijiSchema successfully. Let's play :)

Handling schema changes and evolution in Hadoop

In Hadoop, if you use Hive and try to have different schemas for different partitions, you cannot have a field inserted in the middle.

If fields are added at the end, you can use Hive natively.

However, things break if a field is inserted in the middle.

There are a few ways to handle schema evolution and changes in Hadoop.

Use Avro

For the flat schema of database tables (or files), generate an Avro schema. This Avro schema can be used anywhere in code, or mapped into Hive using the AvroSerDe:

https://cwiki.apache.org/Hive/avroserde-working-with-avro-from-hive.html
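
As a hedged sketch of the "generate an Avro schema and use it" idea, the Java example below builds a small flat schema in code and writes one record to an Avro container file, which is the kind of file the AvroSerDe can sit on top of; the employee fields are invented for illustration.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroFlatSchemaExample {
  public static void main(String[] args) throws Exception {
    // A flat schema describing a hypothetical "employee" table.
    String schemaJson = "{\"type\":\"record\",\"name\":\"employee\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"int\"},"
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"dept\",\"type\":[\"null\",\"string\"],\"default\":null}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    // Build one record and write it to an Avro container file.
    GenericRecord rec = new GenericData.Record(schema);
    rec.put("id", 1);
    rec.put("name", "alice");
    rec.put("dept", "engineering");

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("employee.avro"));
    writer.append(rec);
    writer.close();
  }
}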

I am exploring various JSON APIs and approaches that can be used for this.

http://www.infoq.com/articles/AVROSchemaJAXB

Nokia has released code to generate Avro schemas from XML:

https://github.com/Nokia/Avro-Schema-Generator

Okay, my problem statement and solution are simple.

The ideas in my mind are:

  1. Store the schema details of the table in some database
  2. Read the database field details and generate an Avro schema
  3. Store it at some location in HDFS, e.g. /schema/tableschema
  4. Map Hive to use this Avro schema location in HDFS
  5. If some change comes to the schema, update the database and the system will generate a new Avro schema
  6. Push the new schema to HDFS
  7. Hive will use the new schema without breaking old data, so the system supports schema changes and evolution for data in Hadoop

Most NoSQL databases have a similar approach; check the Oracle links below.

http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/avroschemas.html

The Oracle NoSQL solution manages the schema information and changes in its KVStore:

http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/provideschema.html

 

Use ORC

https://github.com/hortonworks/orc

The folks at Hortonworks are working on a new file format, ORC, which like Avro stores the schema within the data:

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html

The "versioned metadata" means that the ORC file's metadata is stored in ProtoBufs so that we can add (or remove) fields to the metadata. That means that for some changes to ORC file format we can provide both forward and backward compatibility.

ORC files, like Avro files, are self-describing: they include the type structure of the records in the file's metadata. It will take more integration work with Hive to make schemas very flexible with ORC.

Jackson Tutorial

Jackson is a Java API for working with JSON.

http://wiki.fasterxml.com/JacksonHome

Some useful links

http://www.mkyong.com/java/how-to-convert-java-object-to-from-json-jackson/

http://wiki.fasterxml.com/JacksonInFiveMinutes
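
For quick reference, a minimal round trip with Jackson's ObjectMapper looks like the sketch below. The import shown is for Jackson 2.x (com.fasterxml); the older Jackson 1.x ObjectMapper lives in org.codehaus.jackson.map but has the same methods. The User class is just a toy example.

import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonExample {
  // Simple POJO for the demo; public fields (or getters/setters) are picked up
  // by Jackson's default data binding, and the no-arg constructor is implicit.
  public static class User {
    public String name;
    public int age;
  }

  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();

    // Java object -> JSON string
    User u = new User();
    u.name = "jj";
    u.age = 30;
    String json = mapper.writeValueAsString(u);
    System.out.println(json);              // {"name":"jj","age":30}

    // JSON string -> Java object
    User back = mapper.readValue(json, User.class);
    System.out.println(back.name);         // jj
  }
}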

GSON tutorials

A few GSON tutorial links I found.

Optional: the first link below is an introduction to JSON itself.

http://www.w3schools.com/json/json_syntax.asp

https://sites.google.com/site/gson/gson-user-guide

http://www.mkyong.com/java/how-do-convert-java-object-to-from-json-format-gson-api/

http://www.javacreed.com/simple-gson-example/

http://camelcode.org/overview/Java-JSON-tutorials.htm#Google%20Gson
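
And the equivalent round trip with Gson, again with a toy User class:

import com.google.gson.Gson;

public class GsonExample {
  static class User {
    String name;
    int age;
    User(String name, int age) { this.name = name; this.age = age; }
  }

  public static void main(String[] args) {
    Gson gson = new Gson();

    // Object -> JSON
    String json = gson.toJson(new User("jj", 30));
    System.out.println(json);                       // {"name":"jj","age":30}

    // JSON -> Object (Gson serializes/deserializes fields regardless of visibility)
    User back = gson.fromJson(json, User.class);
    System.out.println(back.name);                  // jj
  }
}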

Install RHadoop on Hadoop Cluster

The instructions below can be used to install the RHadoop rmr2 and rhdfs packages on a Hadoop cluster. I have just a single-node cluster, but it really doesn't matter; the same instructions apply if you have more machines.

Install R on the machine by following the instructions at

http://jugnu-life.blogspot.com.au/2013/02/install-r-on-ubuntu_24.html

Let's start installing RHadoop.

RHadoop packages are available at

https://github.com/RevolutionAnalytics

I have cloned the git repos for the packages; this makes it easy to do upgrades later. So you have two choices here.

Option 1) Download the tar.gz files for rmr2, rhdfs and quickcheck

Option 2) Clone the git repo for each of them

Let's go with the first one.

Download the packages from the following locations:

https://github.com/RevolutionAnalytics/rhdfs/blob/master/build/rhdfs_1.0.5.tar.gz

https://github.com/RevolutionAnalytics/rmr2/blob/master/build/rmr2_2.1.0.tar.gz

https://github.com/RevolutionAnalytics/quickcheck/blob/master/build/quickcheck_1.0.tar.gz

The links might have changed by the time you read this, so pardon me and grab the latest ones.

 

I assume you have already installed R on your machine; it needs to be installed on all nodes in your cluster, and these RHadoop packages also need to be installed on all of the nodes.

 

Export the variables needed

Change the locations below depending on where Hadoop is installed

 

sudo gedit /etc/environment

Add the following

# Variable added for RHadoop Install
HADOOP_CMD=/home/jj/software/hadoop-1.0.4/bin/hadoop
HADOOP_CONF=/home/jj/software/hadoop-1.0.4/conf
HADOOP_STREAMING=/home/jj/software/hadoop-1.0.4/contrib/streaming/hadoop-streaming-1.0.4.jar

Tell R about Java

R is at times not able to figure out a few Java settings, so let's help it:

 


 

In the command below, change the /home/jj/software/java/jdk1.6.0_43 paths to match your Java location:

 

$ sudo R CMD javareconf JAVA=/home/jj/software/java/jdk1.6.0_43/bin/java JAVA_HOME=/home/jj/software/java/jdk1.6.0_43 JAVAC=/home/jj/software/java/jdk1.6.0_43/bin/javac JAR=/home/jj/software/java/jdk1.6.0_43/bin/jar JAVAH=/home/jj/software/java/jdk1.6.0_43/bin/javah

Updating Java configuration in /usr/lib/R
Done.

 

Check the cluster is up and happy

hadoop fs -ls /

jj@jj-VirtualBox:~/software/R/RHadoop$ hadoop fs -ls /
Warning: $HADOOP_HOME is deprecated.

Found 3 items
drwxr-xr-x   - jj supergroup          0 2013-03-29 09:59 /hbase
drwxr-xr-x   - jj supergroup          0 2013-03-09 21:57 /home
drwxr-xr-x   - jj supergroup          0 2013-03-09 17:35 /user

 

Install rJava

 

Start R with:

$ sudo R --save

 

We are just starting R with sudo; the --save flag tells R to save the workspace when we quit.

 

>install.packages('rJava')

It will ask you to choose a CRAN mirror; select something near you and let the install happen.

After it's done, verify that it's there :)

> library()

 

It will show something like

Packages in library ‘/usr/local/lib/R/site-library’:

rJava                   Low-level R to Java interface

Packages in library ‘/usr/lib/R/library’:

Quit R

> q()

All set

 

Install rhdfs now

Go to the location where you downloaded the tar.gz files and execute the following command:

 

jj@jj-VirtualBox:~/software/R/RHadoop/rhdfs/build$ ls
rhdfs_1.0.5.tar.gz

 

 

$ sudo HADOOP_CMD=/home/jj/software/hadoop-1.0.4/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz

Check

> library('rhdfs')
Loading required package: rJava

HADOOP_CMD=/home/jj/software/hadoop-1.0.4/bin/hadoop

Be sure to run hdfs.init()
> hdfs.init()
> hdfs.ls('/')
  permission owner      group size          modtime   file
1 drwxr-xr-x    jj supergroup    0 2013-03-29 09:59 /hbase
2 drwxr-xr-x    jj supergroup    0 2013-03-09 21:57  /home
3 drwxr-xr-x    jj supergroup    0 2013-03-09 17:35  /user
>

We are able to see HDFS files in R

So all done for rhdfs

 

Install rmr2

 

$ sudo apt-get install -y pdfjam

 

> install.packages(c( 'RJSONIO', 'itertools', 'digest','functional', 'stringr', 'plyr'))

Download the Rcpp and reshape2 packages from:

http://cran.r-project.org/web/packages/reshape2/index.html

http://cran.r-project.org/web/packages/Rcpp/index.html

 

sudo R CMD INSTALL Rcpp_0.10.3.tar.gz

sudo R CMD INSTALL reshape2_1.2.2.tar.gz

sudo R CMD INSTALL quickcheck_1.0.tar.gz
sudo R CMD INSTALL rmr2_2.1.0.tar.gz

 

Done

More reading

http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/

SSH Putty tools and tips

Some useful SSH tools and tips

Use MTPuTTY for tabbed PuTTY SSH sessions; it has lots of features.

http://www.ttyplus.com/multi-tabbed-putty/

Make SequenceFiles on disk

forqlift is a tool that can:

  • Create SequenceFiles from files on your local filesystem
  • Extract the contents of a SequenceFile back to the filesystem
  • Convert popular archive formats, tar (including tar.bz2 and tar.gz) and zip, to and from SequenceFile format

http://www.exmachinatech.net/01/forqlift/
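
If you would rather do the same thing from code instead of using forqlift, a minimal sketch with the Hadoop 1.x SequenceFile API is below; the paths are examples, and packing a whole file as one value only makes sense for reasonably small files.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LocalFileToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Key = original file name, value = raw file bytes.
    Path out = new Path("/user/jj/files.seq");
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      File local = new File("/tmp/sample.txt");      // example local input file
      byte[] data = new byte[(int) local.length()];
      DataInputStream in = new DataInputStream(new FileInputStream(local));
      in.readFully(data);
      in.close();
      writer.append(new Text(local.getName()), new BytesWritable(data));
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}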

Process XML data in Hadoop

To read XML files:

Mahout has an XmlInputFormat; see the links below to read more:

https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java

http://xmlandhadoop.blogspot.com.au/2010/08/xml-processing-in-hadoop.html

Pig has XMLLoader

http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/piggybank/storage/XMLLoader.html
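
Going back to the Mahout XmlInputFormat linked above, here is a rough sketch of wiring it into a job. The xmlinput.start / xmlinput.end keys and the <record> tags are my reading of that class, so treat them as assumptions and check the linked source for your Mahout version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class XmlJob {
  // Each map() call receives the text of one <record>...</record> block.
  public static class RecordMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      // value holds one complete XML fragment; parse it with your favourite XML API.
      context.write(new Text("record"), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell XmlInputFormat which tags delimit one record.
    conf.set("xmlinput.start", "<record>");
    conf.set("xmlinput.end", "</record>");

    Job job = new Job(conf, "xml-processing");
    job.setJarByClass(XmlJob.class);
    job.setInputFormatClass(XmlInputFormat.class);
    job.setMapperClass(RecordMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}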

Import export of Data to HDFS

There are various tools and methods to import data into HDFS. Depending on the type of data and where it is located, you can use the following tools.

Import from Database

Sqoop
http://sqoop.apache.org/
This tool can import data from various databases; custom connectors are also available for fast import and export of data.

Import of file-based loads

Chukwa
http://incubator.apache.org/chukwa/

Scribe
https://github.com/facebook/scribe
Collects logs to one place and then pushes them to HDFS

Flume
http://flume.apache.org/
Has various source and sink classes which can be used to push files to HDFS

HDFS File Slurper
https://github.com/alexholmes/hdfs-file-slurper
A basic tool for import and export

Regular tools
Use an automation tool like cron or Autosys to push files to HDFS at some location (a programmatic equivalent using the Hadoop FileSystem API is sketched after this list):
# hadoop fs -copyFromLocal src dest
# hadoop fs -copyToLocal src dest

Use Oozie
Use the Oozie SSH action to log in to the machine and then execute the above two copy commands
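
If you prefer doing the copy from code (for example inside a scheduled Java job) rather than from the shell, a small sketch with the Hadoop FileSystem API is below; the paths are examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);

    // Equivalent of: hadoop fs -copyFromLocal /tmp/data.csv /data/incoming/data.csv
    fs.copyFromLocalFile(new Path("/tmp/data.csv"), new Path("/data/incoming/data.csv"));

    // Equivalent of: hadoop fs -copyToLocal /data/incoming/data.csv /tmp/back.csv
    fs.copyToLocalFile(new Path("/data/incoming/data.csv"), new Path("/tmp/back.csv"));

    fs.close();
  }
}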

Import export from HBase to HDFS

Use HBase export utility class
http://hbase.apache.org/book/ops_mgt.html#export

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> 
 
Import to HBase
 
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
 
The HBase Export utility writes data in SequenceFile format, so you need to convert it if you want plain text.

Alternatively, you can use MapReduce to read from HBase directly and write to HDFS in plain text or any other format, as in the sketch below.
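
Here is a rough sketch of that MapReduce approach against the HBase 0.94-era API; the table name mytable and the cf:col column are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HBaseToHdfsText {
  // Emits "rowkey <tab> value" lines for one column of each row.
  public static class ExportMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context context)
        throws java.io.IOException, InterruptedException {
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
      if (value != null) {
        context.write(new Text(Bytes.toString(row.get())), new Text(Bytes.toString(value)));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase-to-hdfs");
    job.setJarByClass(HBaseToHdfsText.class);

    Scan scan = new Scan();
    scan.setCaching(500);          // read rows in batches for better throughput
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, ExportMapper.class, Text.class, Text.class, job);

    job.setNumReduceTasks(0);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("/user/jj/mytable-export"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}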
 
 

HBase pseudo mode install

If you have not done so already, install Hadoop by following

http://jugnu-life.blogspot.com.au/2012/03/hadoop-installation-tutorial.html

Let's get HBase working.

Download the HBase and ZooKeeper tarballs from the Apache website.

Extract them to some location and set the environment variables (in, say, the .profile in your home directory):

export HBASE_HOME="/home/jj/software/hbase-0.94.5"
export PATH=$PATH:$HBASE_HOME/bin



export ZOOKEEPER_HOME="/home/jj/software/zookeeper-3.4.5"
export PATH=$PATH:$ZOOKEEPER_HOME/bin



HBase settings

Check DNS settings

jj@jj-VirtualBox:~$ cat /etc/hosts
127.0.0.1    localhost
127.0.0.1    jj-VirtualBox


Check that both entries point to the same IP; by default in Ubuntu they do not.
HBase expects the loopback IP address to be 127.0.0.1. Ubuntu and some other distributions default to 127.0.1.1, and this will cause problems for you.

In hbase-env.sh

export JAVA_HOME="/home/jj/software/java/jdk1.6.0_43"

Changes in hbase-site.xml

Add the following properties:

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>

In HDFS

Create the hbase directory in HDFS:

$ hadoop fs -mkdir /hbase


Zookeeper settings

In the conf directory of ZooKeeper:

Rename zoo_sample.cfg to zoo.cfg

Change the dataDir path:

dataDir=/home/jj/software/hadoopData/zookeeper

We are ready to test


Start Hadoop

$ start-all.sh


Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /home/jj/software/hadoop-1.0.4/libexec/../logs/hadoop-jj-namenode-jj-VirtualBox.out
localhost: starting datanode, logging to /home/jj/software/hadoop-1.0.4/libexec/../logs/hadoop-jj-datanode-jj-VirtualBox.out
localhost: starting secondarynamenode, logging to /home/jj/software/hadoop-1.0.4/libexec/../logs/hadoop-jj-secondarynamenode-jj-VirtualBox.out
starting jobtracker, logging to /home/jj/software/hadoop-1.0.4/libexec/../logs/hadoop-jj-jobtracker-jj-VirtualBox.out
localhost: starting tasktracker, logging to /home/jj/software/hadoop-1.0.4/libexec/../logs/hadoop-jj-tasktracker-jj-VirtualBox.out

Check Hadoop pages

localhost:50030
localhost:50070

All fine?

Let's start HBase.

HBase automatically starts ZooKeeper as well, so there is no need to start it yourself.

$ start-hbase.sh


localhost: starting zookeeper, logging to /home/jj/software/hbase-0.94.5/bin/../logs/hbase-jj-zookeeper-jj-VirtualBox.out
starting master, logging to /home/jj/software/hbase-0.94.5/logs/hbase-jj-master-jj-VirtualBox.out
localhost: starting regionserver, logging to /home/jj/software/hbase-0.94.5/bin/../logs/hbase-jj-regionserver-jj-VirtualBox.out


Check HBase pages

Master: localhost:60010

Region server: localhost:60030

All done :)
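
As an optional extra check from code, the small client sketch below just lists the tables in the cluster; it assumes hbase-site.xml (or at least the ZooKeeper quorum setting) is on the classpath and uses the HBase 0.94-era HBaseAdmin API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ListHBaseTables {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath, so it knows where ZooKeeper is.
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      for (HTableDescriptor table : admin.listTables()) {
        System.out.println(table.getNameAsString());
      }
    } finally {
      admin.close();
    }
  }
}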

Stuck somewhere?

Post below

Which process is using a port

$ netstat -lnp | grep portNo

Note the process id and process name

To see the full path of the process, use the following command, replacing processID with the actual process ID

ls -l /proc/processID/exe

Example

ls -l /proc/1222/exe

Configure hiveserver2

Please configure Hive with MySQL first before starting hiveserver2

Follow this post

http://jugnu-life.blogspot.com.au/2012/05/hive-mysql-setup-configuration.html

You can also use some other database, like Oracle.

Add the following settings in hive-site.xml

<property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
  <value>zookeeper_host1,zookeeper_host2,zookeeper_host3</value>
</property>


<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>
 
<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>

From the command prompt, start hiveserver2:

$ hiveserver2

The hiveserver2 binary is in the bin folder of the Hive directory, so you can go to that folder and run it from there if you installed Hive via tarball.

How to check

Now just start beeline and see if things are working fine.

The username and password are not required if you haven't configured any LDAP settings for hiveserver2.

$ /usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000> SHOW TABLES;
show tables;
+-----------+
| tab_name  |
+-----------+
+-----------+
No rows selected (0.238 seconds)
0: jdbc:hive2://localhost:10000>
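
The same check can be done from Java over JDBC using the HiveServer2 driver; a minimal sketch, with the same host, port and empty credentials as above, is below (hive-jdbc and its dependencies must be on the classpath).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveServer2Check {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver class.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    Connection con =
        DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("SHOW TABLES");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    rs.close();
    stmt.close();
    con.close();
  }
}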


The strategy behind the Google Reader phase-out

If I look at Google's strategy of bringing people who use Google Reader over to G+, then it makes sense for them. For example, we all spend lots of time on Facebook; if Facebook had an inbuilt RSS reader like Google Reader, people would in turn spend more time inside FB and also share the posts they read with their friend network. Although there are apps which do this, many don't explore FB apps.

Okay, coming back to Google's decision to phase out Reader. If the whole strategy is to make G+ the default place for people to work in the Google ecosystem, then Reader should soon be embedded there, so that I can share with my circles what I am reading and Google+ also becomes happy with the increased traffic. The major issue with G+ is the lack of colors :) which Google has not understood till now. People don't like dull whites with lots of hidden, blurred buttons that make them think about where to click :) Are you doing any usability study for G+? The world outside Google is not geek; they want colorful stuff. #googlereader

Google, are you listening? :)

Find the current shell in Linux

~$ echo $SHELL

/bin/bash

The above shows it is using bash.

Decision Tree

Stuff useful for learning Decision Trees.

Intro from Wikipedia

http://en.wikipedia.org/wiki/Decision_tree

 

Decision Trees: Chapter 3 of the book Machine Learning by Tom Mitchell.

Online lectures (highly recommended) at:

http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml

 

Decision Tree Applet

http://webdocs.cs.ualberta.ca/~aixplore/learning/DecisionTrees/index.html

This applet explains in a great way how the selection policy for the root node affects the decision tree. You can play with each of those policies and see how the outcome varies. (A small worked example of information gain, the selection criterion used by ID3, follows the links below.)

 

http://en.wikipedia.org/wiki/ID3_algorithm

http://en.wikipedia.org/wiki/C4.5_algorithm
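
To make the selection policy concrete, the snippet below computes entropy and information gain for a boolean-labelled split, which is the criterion ID3 uses to pick the root attribute; the counts are toy values.

public class InformationGain {
  // Entropy of a boolean-labelled set with `pos` positive and `neg` negative examples.
  static double entropy(int pos, int neg) {
    int total = pos + neg;
    if (total == 0 || pos == 0 || neg == 0) {
      return 0.0;
    }
    double p = (double) pos / total;
    double n = (double) neg / total;
    return -p * (Math.log(p) / Math.log(2)) - n * (Math.log(n) / Math.log(2));
  }

  // Information gain of splitting a parent set into subsets (one per attribute value).
  // subsets[i][0] = positives in subset i, subsets[i][1] = negatives in subset i.
  static double informationGain(int parentPos, int parentNeg, int[][] subsets) {
    double gain = entropy(parentPos, parentNeg);
    int total = parentPos + parentNeg;
    for (int[] s : subsets) {
      int size = s[0] + s[1];
      gain -= ((double) size / total) * entropy(s[0], s[1]);
    }
    return gain;
  }

  public static void main(String[] args) {
    // Toy example: 9 positive / 5 negative examples, split by an attribute with two values.
    int[][] split = { {6, 2}, {3, 3} };
    System.out.println("Gain = " + informationGain(9, 5, split));
  }
}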

Maven Plugins Where to define and Configure

There are two kinds of plugins, build plugins and reporting plugins:

  • Build plugins are executed during the build and should be configured in the <build/> element.
  • Reporting plugins are executed during site generation and should be configured in the <reporting/> element.

 

Specify the version of each build plugin in the <build><pluginManagement/></build> element (generally in a parent POM).

For reporting plugins, specify each version in the <reporting><plugins/></reporting> element (and also in <build><pluginManagement/></build>).

Plugin behaviour is configured using the <configuration> element.

Configurations inside the <executions> tag differ from those that are outside <executions> in that they cannot be used from a direct command-line invocation; they are only applied when the lifecycle phase they are bound to is invoked. Alternatively, if you move a configuration section outside of the executions section, it will apply globally to all invocations of the plugin.

 

Read this to complete your understanding:

http://maven.apache.org/pom.html
References
http://maven.apache.org/guides/mini/guide-configuring-plugins.html
http://maven.apache.org/xsd/maven-4.0.0.xsd

A graphical interface for HBase region servers

Hannibal is a tool to help monitor and maintain HBase-Clusters that are configured for manual splitting.

It helps us answer the following questions:

  1. How well are regions balanced over the cluster?
  2. How well are the regions split for each table?
  3. How do regions evolve over time?

To install Hannibal, follow the steps below; it won't take long.



https://github.com/sentric/hannibal


1)

Download the latest version

$ git clone https://github.com/sentric/hannibal.git

$ cd hannibal

2)

Edit your .profile or /etc/environment (your choice) to include the following property.

Change the version depending on your HBase version (for example 0.90 or 0.94):

export HANNIBAL_HBASE_VERSION=0.94


3)

Copy the hbase-site.xml from your HBase conf directory to Hannibal's conf directory

4) Build the project

It will take some time, as it downloads dependencies from the internet

$ ./build

After it finishes, it shows a Success message

5) Start the server

$ ./start

It will take some time for the server to start, and then you can monitor it at

http://localhost:9000

You can configure the port in case you already have something running there.

Please note that history data about regions is only collected while the application is running; it will need to run for some time until the region detail graphs fill up.

Happy Hadooping :)