How Google Mesa works (short summary)

Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google's Internet advertising business.

Mesa leverages common Google infrastructure and services, such as Colossus (Google’s next-generation distributed file system), BigTable, and MapReduce.

http://research.google.com/pubs/pub42851.html

Characteristics and Goals


  • To achieve storage scalability and availability, data is horizontally partitioned and replicated.
  • To achieve consistent and repeatable queries during updates, the underlying data is multi-versioned.
  • To achieve update scalability, data updates are batched, assigned a new version number, and periodically (e.g., every few minutes) incorporated into Mesa.
  • To achieve update consistency across multiple data centers, Mesa uses a distributed synchronization protocol based on Paxos.


How it is different from existing Google tools


  • Megastore, Spanner, and F1 are all intended for online transaction processing. They provide strong consistency across geo-replicated data, but they do not support the peak update throughput needed by clients of Mesa.
  • Mesa does leverage BigTable and the Paxos technology underlying Spanner for metadata storage and maintenance.


What to learn

Schema changes for a large number of tables can be performed dynamically and efficiently without affecting the correctness or performance of existing applications.

How it works


  • It uses aggregations in tables that are based on associative and commutative functions
  • While the new version of the data is being calculated, the old version is used to serve the applications
  • When all calculations are over, the version is incremented and users issue queries against the new version
  • Upstream systems generate updated data in batches
  • The committer assigns each update batch a new version number and publishes all metadata associated with the update (e.g., the locations of the files containing the update data) to the versions database, a globally replicated and consistent data store built on top of the Paxos consensus algorithm.
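
The versioning idea in the list above can be sketched as a toy in-memory table. This is only an illustration under stated assumptions: SUM is used as the aggregation function, and `VersionedTable`, `commit`, and `query` are hypothetical names, not Mesa's API.

```python
class VersionedTable:
    """Toy multi-versioned table (hypothetical, not Mesa's API).

    Each committed batch is a delta {key: value}. A query at version n
    aggregates the deltas of versions 1..n, so readers pinned to an old
    version keep seeing the same result while new batches arrive.
    """

    def __init__(self, aggregate=lambda a, b: a + b):
        # The aggregation function must be associative and commutative
        # (SUM here), so deltas can be combined in any grouping or order.
        self.aggregate = aggregate
        self.batches = []          # batches[i] is the delta for version i + 1
        self.current_version = 0

    def commit(self, batch):
        """Store one update batch and assign it the next version number."""
        self.batches.append(dict(batch))
        self.current_version = len(self.batches)
        return self.current_version

    def query(self, key, version=None):
        """Aggregate the value of `key` over all deltas up to `version`."""
        if version is None:
            version = self.current_version
        result = None
        for delta in self.batches[:version]:
            if key in delta:
                value = delta[key]
                result = value if result is None else self.aggregate(result, value)
        return result
```

A query that passes `version=1` keeps returning the same answer after later batches are committed, which is the "consistent and repeatable queries" property described earlier.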

Schema changes handling


The method Mesa uses to perform online schema changes is to
(i) make a separate copy of the table with data stored in the new schema version at a fixed update version,
(ii) replay any updates to the table generated in the meantime until the new schema version is current, and
(iii) switch the schema version used for new queries to the new schema version as an atomic controller BigTable metadata operation.

Older queries may continue to run against the old schema version for some amount of time before the old
schema version is dropped to reclaim space.
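
A rough sketch of the three steps, assuming a table is just a list of update batches held in a Python dict. The names `online_schema_change`, `state`, and `convert` are made up for this illustration, and a plain assignment stands in for the atomic metadata operation.

```python
def online_schema_change(state, fixed_version, convert):
    """Toy version of the copy / replay / switch steps (hypothetical names).

    state: {"active": schema_version, "tables": {schema_version: [batch, ...]}}
    fixed_version: number of update batches included in the initial copy
    convert: function that rewrites one batch into the new schema
    """
    old_name = state["active"]
    old_batches = state["tables"][old_name]
    new_name = old_name + 1

    # (i) Copy the table in the new schema at a fixed update version.
    new_batches = [convert(batch) for batch in old_batches[:fixed_version]]

    # (ii) Replay the updates that arrived after the fixed version until
    # the new copy has caught up with the old one.
    while len(new_batches) < len(old_batches):
        new_batches.append(convert(old_batches[len(new_batches)]))

    # (iii) Switch new queries to the new schema version. In this sketch a
    # single assignment stands in for the atomic metadata operation.
    state["tables"][new_name] = new_batches
    state["active"] = new_name
    return new_name
```

The old table is left in `state["tables"]` on purpose, so queries that started against the old schema version can finish before it is dropped.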

Multi Operating system bootable USB

 

I was looking to install multiple Linux distributions on my USB drive and I found this tool

http://www.pendrivelinux.com/yumi-multiboot-usb-creator/

It asks which OS to install

Then for the ISO path

Then at boot time it asks which OS to boot.

See the screen shots below.

 

YUMI - Multiboot USB Creator

YUMI - Multiboot Boot Menu

Large scale Real time bidding for advertisement


Behavioural Targeting

  • Based on short-term and long-term activities of users
  • e.g. user searched for a new car
  • e.g. user searched for a new movie

Typical advertisement pipeline

image

Simple HBase data model

  • Single column family
  • Column qualifier: <date><hour>:<type><value>

image
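
A small sketch of how a column qualifier in the `<date><hour>:<type><value>` layout could be built and filtered. The exact encoding is not given in the slides, so the `YYYYMMDD` date, the two-digit 24-hour field, and the helper names `make_qualifier` and `columns_since` are all assumptions.

```python
from datetime import datetime


def make_qualifier(ts, event_type, value):
    """Build '<date><hour>:<type><value>'.

    Assumed encoding: YYYYMMDD for <date> and HH (24-hour) for <hour>.
    """
    return ts.strftime("%Y%m%d%H") + ":" + event_type + value


def columns_since(columns, since):
    """Keep only the columns whose <date><hour> prefix is at or after `since`.

    The prefix is fixed-width, so comparing it as a string gives the same
    order as comparing the timestamps.
    """
    start = since.strftime("%Y%m%d%H")
    return {q: v for q, v in columns.items() if q.split(":", 1)[0] >= start}
```

With this layout, short-term activity can be read by filtering on a recent prefix, while long-term activity uses the whole row.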
Challenges

  • User profile freshness
      • Staggered freshness
      • Update after every few hours
  • Scaling
      • Partition by geo location
      • Apps interactions
  • Pipeline failure
  • HBase scaling issues
      • Salting of keys for even spread
      • Optimal pre-splits, for reading as well
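
Key salting and pre-splitting can be sketched as below. This is a generic illustration rather than the scheme from the talk: the bucket count, the `NN|` prefix format, and the function names `salted_key` and `split_points` are assumptions.

```python
import hashlib


def salted_key(row_key, buckets=16):
    """Prefix the row key with a stable salt so that keys spread evenly.

    The salt is derived from a hash of the key, so the same key always
    maps to the same bucket and can still be looked up directly.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    salt = int.from_bytes(digest[:4], "big") % buckets
    return "%02d|%s" % (salt, row_key)


def split_points(buckets=16):
    """Region boundaries for pre-splitting: one region per salt bucket."""
    return ["%02d|" % b for b in range(1, buckets)]
```

The trade-off is on the read side: a scan over a range of original keys has to be issued once per bucket, which is why the number of buckets and the pre-splits need to be chosen with reads in mind as well.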
   
http://www.slideshare.net/Hadoop_Summit/how-did-you-know-this-ad-would-be-relevant-for-me

Now see how Yahoo solves a similar problem
image
http://www.slideshare.net/Hadoop_Summit/interactive-analytics-in-human-time
Yahoo makes use of Druid
http://druid.io/druid.html

Lambda Architecture implementation with Summingbird

Solving large scale similarity problems in Bigdata


Handling large scale similarity problems

Examples of problems

  • Similarity between two texts
  • Similarity between two persons
  • Similarity between two shopping carts

How to measure
http://en.wikipedia.org/wiki/Similarity_measure

  • Jaccard coefficient
    http://en.wikipedia.org/wiki/Jaccard_index
  • Cosine similarity
    http://en.wikipedia.org/wiki/Cosine_similarity
  • Inner product of two vectors
  • Radial basis function kernel
    http://en.wikipedia.org/wiki/Radial_basis_function_kernel
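
For reference, the measures listed above can be written in a few lines of plain Python. This is a single-machine sketch using the textbook definitions; it does not address the scale problem itself, and the `sigma` default is an arbitrary choice.

```python
import math


def jaccard(a, b):
    """Jaccard coefficient of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def cosine(x, y):
    """Cosine similarity: the inner product divided by the vector norms."""
    norm = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
    return dot(x, y) / norm if norm else 0.0


def rbf_kernel(x, y, sigma=1.0):
    """Radial basis function kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))
```

For two shopping carts, `jaccard({"milk", "bread"}, {"milk", "eggs"})` compares the shared items against all items in either cart.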
Other real world examples

  • A bank, using the data provided while opening an account
Links

http://asterix.ics.uci.edu/fuzzyjoin/
http://www.slideshare.net/Hadoop_Summit/similarity-at-scale-35988496

Removing the git merge commit if no commit has been made to master


$ cd gitlearn/
$ ls
$ git init
Initialized empty Git repository in /Users/jaggija/dev_home/code/open/gitlearn/.git/

Create some text file in master

$ pwd
/Users/jaggija/dev_home/code/open/gitlearn
$ vi readme.txt
$ git add readme.txt 
$ git commit -m "Initial commit"
[master (root-commit) 5202907] Initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 readme.txt

Create a new branch and edit the text file

$ git checkout -b "new branch"
fatal: 'new branch' is not a valid branch name.
$ git checkout -b "newbranch"
Switched to a new branch 'newbranch'
$ git checkout newbranch
Already on 'newbranch'
$ vi readme.txt 
$ git add readme.txt 
$ git commit -m "new branch"
[newbranch 0ec90b8] new branch
 1 file changed, 2 insertions(+)


See the log


$ git log
commit 0ec90b8fd3174fe0ad8d0e538f5c7a162a0b2216
Author: Jagat Singh <jaggija@cba.com.au>
Date:   Mon Jul 21 10:17:32 2014 +1000

    new branch

commit 52029078b1635437aec297321c5d23274267e7d6
Author: Jagat Singh <jaggija@cba.com.au>
Date:   Mon Jul 21 10:16:23 2014 +1000

    Initial commit


Switch to master and do the merge


$ git checkout master
Switched to branch 'master'

$ git merge --no-ff newbranch
Merge made by the 'recursive' strategy.
 readme.txt | 2 ++
 1 file changed, 2 insertions(+)


$ git log
commit 4778269d6e3a8be0a68b904653f349b0252736b0
Merge: 5202907 0ec90b8
Author: Jagat Singh <jaggija@cba.com.au>
Date:   Mon Jul 21 10:19:43 2014 +1000

    Merge branch 'newbranch'

commit 0ec90b8fd3174fe0ad8d0e538f5c7a162a0b2216
Author: Jagat Singh <jaggija@cba.com.au>
Date:   Mon Jul 21 10:17:32 2014 +1000

    new branch

commit 52029078b1635437aec297321c5d23274267e7d6
Author: Jagat Singh <jaggija@cba.com.au>
Date:   Mon Jul 21 10:16:23 2014 +1000

    Initial commit


See the graph

$ git log --oneline --graph
*   4778269 Merge branch 'newbranch'
|\  
| * 0ec90b8 new branch
|/  
* 5202907 Initial commit


Let's remove the merge commit by resetting master back to the initial commit

$ git reset --hard 5202907
HEAD is now at 5202907 Initial commit
$ git status
On branch master
nothing to commit, working directory clean

Do a fast-forward merge

$ git merge --ff-only newbranch
Updating 5202907..0ec90b8
Fast-forward
 readme.txt | 2 ++
 1 file changed, 2 insertions(+)

See the log again

$ git log --oneline --graph
* 0ec90b8 new branch
* 5202907 Initial commit
$ git status
On branch master
nothing to commit, working directory clean
$

Done


Coding standard guidelines with Bigdata projects


With lots of technologies and tools coming into the picture, it becomes
extremely important that all team members follow the coding guidelines.

I am documenting the standards which should be followed.

R

Google R style guide
https://google-styleguide.googlecode.com/svn/trunk/Rguide.xml

Related: RLint: Reformatting R Code to Follow the Google Style Guide
RLint is a simple utility that reformats your code to follow the Google R style guide
https://code.google.com/p/google-rlint/
http://research.google.com/pubs/archive/42577.pdf

Hive

We made our own internal standard to follow.

Python

http://google-styleguide.googlecode.com/svn/trunk/pyguide.html
Google style Vim settings for Python
http://google-styleguide.googlecode.com/svn/trunk/google_python_style.vim

Scala

http://docs.scala-lang.org/style/

http://www.scalastyle.org/
It has plugins for sbt, Eclipse, etc.

Java

http://google-styleguide.googlecode.com/svn/trunk/javaguide.html

Eclipse settings
http://google-styleguide.googlecode.com/svn/trunk/eclipse-java-google-style.xml

IntelliJ settings for the Google style guide
http://google-styleguide.googlecode.com/svn/trunk/intellij-java-google-style.xml

Shell scripts

http://google-styleguide.googlecode.com/svn/trunk/shell.xml



Scala in Action Chapter 3 Classes and Objects notes

class MongoDBClient(val host: String, val port: Int)

val means these values are immutable.

Primary constructor - It runs when an object is created, either directly or through a call from an overloaded constructor.

If you use var, Scala creates both getters and setters.
If you use val, Scala creates only getters. Remember that a val cannot be changed, so there are no setters.

When both val and var are missing, the instance values are treated as private and are not accessible to anyone outside the class.

Section 3.1

Download mongodb from http://www.mongodb.org/downloads

Extract it to some location

Start the mongodb

cd C:\Jagat\tools\mongodb-win32-i386-2.4.9\
mkdir db_data
bin\mongod.exe --dbpath db_data

http://127.0.0.1:28017/

This will show the admin page.

The server listens for client connections on port 27017.

Section 3.2

Classes and Objects

Construct the class with default values for host and port

class MongoClient(val host: String, val port: Int) {
  def this() = this("127.0.0.1", 27017)
}

The first statement in an overloaded constructor has to be a call to either another overloaded constructor or the primary constructor.

To do some other operation before you invoke the constructor, we use companion objects.

Add the MongoDB Java driver jar to the classpath of the Scala REPL session

scala> :cp C:\Jagat\tools\scala-eclipse\workspace\scalaination\lib\mongo-java-driver-2.12.0.jar
Added 'C:\Jagat\tools\scala-eclipse\workspace\scalaination\lib\mongo-java-driver-2.12.0.jar'.  Your new classpath is:

Section 3.3

Packaging

In Scala you can have nested packages

package A {
  package B {
  }
}

You can also use Java-style packaging, with the package declared at the top of the file.

Unlike Java, the Scala package structure does not have to match the folder structure in the file system. But when we compile the classes, the required folder structure is generated automatically, as the JVM needs it. Remember that under the hood Scala runs on top of the JVM.

To add a jar to the classpath when compiling from the command line

scalac -classpath my.jar MyNewClass.scala

Section 3.4

Scala imports

You can add an import statement at any point in the code.

An import is lexically scoped: it is visible from the point where it appears to the end of the enclosing block.

The Scala equivalent of Java's * wildcard is _

To import everything from a package use

import mypackage._

If you declare classes or objects without any package, they all belong to the empty package. They cannot be imported into any other package, but members of the empty package can see each other.

Rename a class on import to avoid name conflicts

import java.sql.{Date => SqlDate}

Hide a class; the Date class below cannot be used

import java.sql.{Date => _}

Section 3.5

Objects and companion objects

There are no static members in Scala.

Instead, Scala supports the concept of companion objects and companion classes.

abstract class Role {

  def canAccess(page: String): Boolean
}

class Root extends Role {
  override def canAccess(page: String) = page != "Admin"

}

object Role {

  def apply(roleName: String) = roleName match {
    case "root" => new Root
    case "analyst" => new Root
  }
}

Note the object Role and the class Role in the above code: they are a companion object and its companion class.

Package objects

Package objects allow you to define something in a central place which can be used by all the members of the package.

They are generally defined in a file named package.scala, in the package that corresponds to it.

package object ch3 {
 
  val minAge = 18

}

The value above can be used anywhere in the package named ch3.



Scala in Action Chapter 1 Why scala notes

Mixin: A class that provides certain functionality that can be inherited by a subclass, but is not meant to be instantiated by itself. It can be seen as an interface with implemented methods.

Self type: A mixin usually does not depend on any methods or fields of the class that it is mixed into. Sometimes it is useful to use those; declaring that dependency is known as a self type.

Type abstraction
  • Parametrization
  • Abstract members

Concepts

  • Referential transparency: An expression can be replaced by the value it evaluates to without changing the behaviour of the program
  • Higher order functions: Functions that take functions as input or return functions
  • Lexical closures
  • Pattern matching
  • Single assignment (val)
  • Lazy evaluation
  • Type inference
  • Tail call recursion
  • List comprehension
  • Monadic effects
  • No side effects

Types of languages

Static
Variables have types, values have types, and types are checked at compile time
Dynamic
Values have types, variables don't have types

Type inference: The compiler tries to infer the type at compile time

Macros: Functions loaded at compile time and used by the compiler. Compile-time metaprogramming