Tuesday, June 8, 2010

Searching and Ad hoc Reporting on Hadoop using Hive

Recently I was assigned a task to evaluate the computing capabilities of Hadoop (Apache's distributed computing framework) and to use it along with Hive (its data warehousing interface) to provide searching and ad hoc reporting on it through Intellicus.

I had already done a bit of RnD on this front before the task was assigned to me officially. That helped me understand the nuances of the technology, so I could start working on this right away.

After a small session on Hadoop and Hive with Sanjay and Sunil from the iLabs team, we had clearly identified the gaps we needed to work on to meet the laid down requirements.

Before I move ahead, let me try and explain some of the buzzwords that might be handy when you read and try to understand the content and context of this blog.

Hadoop - An open source distributed computing framework from Apache. It uses Map/Reduce for distributed computation and HDFS for data storage and retrieval.

Hive - The data warehousing interface for Hadoop, used for searching and ad hoc querying of the stored data.

Map/Reduce - A problem solving technique, originally devised by Google, that breaks a problem into smaller parts and then computes the final result by combining the intermediate results obtained from those smaller parts. It works on the philosophy that "moving data to the computation is always costlier than moving the computation logic to the data itself".

HDFS - The Hadoop Distributed File System is basically an abstraction for storing huge amounts of data in distributed form on multiple nodes while providing a single logical view of that storage. It uses data replication to provide fault tolerance. Data is divided into smaller blocks and stored on Data Nodes instead of a single machine.
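
To make the Map/Reduce idea a bit more concrete, here is a toy, single-machine Java sketch of my own (it uses no Hadoop APIs at all): partial results are computed per block of data, and the small intermediate results are then combined.

import java.util.Arrays;
import java.util.List;

public class MapReduceIdea
{
    public static void main(String[] args)
    {
        // A toy data set split into "blocks", the way HDFS would split a large file
        List<int[]> blocks = Arrays.asList(new int[]{4, 8, 15}, new int[]{16, 23}, new int[]{42});

        // "Map" phase: each block is processed independently (on a real cluster,
        // ideally on the node that already holds that block of data)
        int[] partialSums = new int[blocks.size()];
        for (int i = 0; i < blocks.size(); i++)
        {
            for (int value : blocks.get(i))
            {
                partialSums[i] += value;
            }
        }

        // "Reduce" phase: combine the small intermediate results into the final answer
        int total = 0;
        for (int partial : partialSums)
        {
            total += partial;
        }
        System.out.println("Total = " + total); // prints Total = 108
    }
}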

Now let me discuss how we went ahead with the implementation for the same.

We started with the Apache JDBC driver that comes with the Hive client installation and used it to communicate with the Hive server. The Hive server typically runs on the same machine that also runs the Hadoop server.

After starting the Hadoop server, which includes the NameNode and DataNodes (both for HDFS) and the JobTracker (for Map/Reduce jobs), on my server machine (Ubuntu 10.04 server), I configured and started the Hive server on the same machine to enable its communication with my Hadoop installation.

The Hive server is basically responsible for accepting requests through the JDBC/ODBC interfaces, converting them into Map/Reduce jobs that work on the data stored in HDFS, and providing the results back to the caller in the form of a result set.

Though the data is always stored on HDFS as flat files, Hive does create a logical RDBMS layer over those flat files to give a relational view of the data stored through Hive. This allows the user to perform searching and querying on that data in much the same way one would on any other relational database like Oracle, MySQL etc.
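
Just to give an idea of what this looks like from the client side, here is a minimal sketch of querying Hive over JDBC. I am assuming the driver class name org.apache.hadoop.hive.jdbc.HiveDriver and the default Hive server port 10000 for the version we used, and the table page_views is purely for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch
{
    public static void main(String[] args) throws Exception
    {
        // Register the Hive JDBC driver (class name assumed for the Hive version in use)
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // The Hive server is assumed to be listening on its default port 10000
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // An ad hoc query; Hive compiles this into one or more Map/Reduce jobs on the cluster
        ResultSet rs = stmt.executeQuery("SELECT page, COUNT(1) FROM page_views GROUP BY page");
        while (rs.next())
        {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }

        rs.close();
        stmt.close();
        con.close();
    }
}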

Now, going back to how we used the Apache JDBC driver: let me tell you that it comes with only a very limited set of the implemented methods required to make a driver JDBC compatible. Though in Intellicus we do not use all the JDBC methods, we still required a good number of implementations to be added to make it compatible for our usage. Hence I worked along with Sunil to identify those gaps, and he helped me with the implementations.

Once we had completed those implementations, we started using the modified driver with Intellicus. Some of the gaps that we worked on were:

1. Implementing the DatabaseMetaData methods.
2. Correcting the logic for returning the column names when the query is executed as a Map/Reduce job.
3. Other trivial issues related to the creation of Statement and ResultSet objects in the Hive JDBC driver.

Sunil and Sanjay from Impetus iLabs are also planning to contribute these enhancements back to the open source project, and we will then probably get these changes as part of the next release of Hive.

After completing these changes, it was pretty simple to integrate Hive within Intellicus, as our architecture allows us to integrate with any JDBC compatible data source in a very simple way. We just wrote one class and it was all done.

We did have some problems handling the date datatype, as Hive does not have built-in support for it, but we were able to get through this because it supports the long data type. So we started converting date type operations into unix-timestamp (long) operations before submitting our queries to the Hive server. This is done intentionally so that we can use the capabilities of Hadoop for computing over the data in a distributed environment instead of bringing in any performance bottlenecks at our end.
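
As a rough illustration of the kind of rewrite involved (the actual transformation inside Intellicus is more involved, and the table and column names here are hypothetical), a date filter can be turned into a comparison on a long column holding epoch seconds before the query is handed to Hive:

import java.text.SimpleDateFormat;
import java.util.Date;

public class DateFilterRewriteSketch
{
    // Convert a user-entered date literal into epoch seconds so the predicate
    // can be evaluated against a BIGINT column on the Hive side
    static long toEpochSeconds(String yyyyMmDd) throws Exception
    {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        Date d = fmt.parse(yyyyMmDd);
        return d.getTime() / 1000L;
    }

    public static void main(String[] args) throws Exception
    {
        // Hypothetical table "orders" whose order_ts column stores epoch seconds as a long
        String hql = "SELECT order_id FROM orders WHERE order_ts >= " + toEpochSeconds("2010-06-01");
        System.out.println(hql);
        // The rewritten query is then submitted through the Hive JDBC driver as usual,
        // so the heavy lifting still happens on the Hadoop cluster.
    }
}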

Now that the hard part is done... we need to further explore the possibilities of using Hadoop's capabilities to meet the custom business needs of our customers. I hope we will come through that challenge as well.

Until next time keep reading.....

Sunday, March 7, 2010

Java Tip for Object Serialization

Ever faced a dilemma over how you could control the Java serialization mechanism? Recently, I was in a similar situation where one of our third party classes actually broke its backward compatibility by changing its serial version UID, which resulted in a failure to read serialized objects of that class.

One of the threads on the Java forums helped me find a solution to the above problem. I thought it would be worth sharing it with you all.

The problem was addressed by writing a class that extends the ObjectInputStream class and overrides the readClassDescriptor method so that it returns the latest (local) descriptor of the class being deserialized.

Here is a sample code that explains it.

import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;
import java.util.ArrayList;

public class CustomObjectInputStream extends ObjectInputStream
{
    /**
     * Collection of classes for which this hook should come into the picture.
     */
    private static final ArrayList<String> hookedClasses = new ArrayList<String>();

    static
    {
        hookedClasses.add("org.apache.xerces.dom.AttributeMap");
    }

    public CustomObjectInputStream(InputStream in) throws IOException
    {
        super(in);
    }

    /*
     * (non-Javadoc)
     * @see java.io.ObjectInputStream#readClassDescriptor()
     */
    protected ObjectStreamClass readClassDescriptor() throws IOException, ClassNotFoundException
    {
        // Initially the stream's own descriptor
        ObjectStreamClass resultClassDescriptor = super.readClassDescriptor();

        Class localClass;
        try
        {
            localClass = Class.forName(resultClassDescriptor.getName());
        }
        catch (ClassNotFoundException e)
        {
            // Class.forName throws rather than returning null when the class is missing locally
            Logger.error("No local class for " + resultClassDescriptor.getName());
            return resultClassDescriptor;
        }

        // Leave classes we have not explicitly hooked untouched
        if (!hookedClasses.contains(localClass.getName()))
        {
            return resultClassDescriptor;
        }

        ObjectStreamClass localClassDescriptor = ObjectStreamClass.lookup(localClass);
        if (localClassDescriptor != null)
        { // only if the class implements Serializable
            final long localSUID = localClassDescriptor.getSerialVersionUID();
            final long streamSUID = resultClassDescriptor.getSerialVersionUID();
            if (streamSUID != localSUID)
            { // check for serialVersionUID mismatch
                final StringBuffer s = new StringBuffer(
                        "WARNING: Overriding serialized class version mismatch for class: "
                                + localClass.getName());
                s.append(" local serialVersionUID = ").append(localSUID);
                s.append(" stream serialVersionUID = ").append(streamSUID);
                Logger.debug(s.toString()); // Logger here is our application's own logging utility

                // Use the local class descriptor for deserialization
                resultClassDescriptor = localClassDescriptor;
            }
        }
        return resultClassDescriptor;
    }
}
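
And here is a minimal sketch of how the stream can be used in place of a plain ObjectInputStream (the file name legacy.ser is just for illustration):

import java.io.FileInputStream;
import java.io.ObjectInputStream;

public class DeserializeWithHook
{
    public static void main(String[] args) throws Exception
    {
        // legacy.ser is a hypothetical file written with the old version of the class
        FileInputStream fis = new FileInputStream("legacy.ser");
        ObjectInputStream ois = new CustomObjectInputStream(fis);
        try
        {
            Object restored = ois.readObject();
            System.out.println("Deserialized: " + restored.getClass().getName());
        }
        finally
        {
            ois.close();
        }
    }
}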

Wednesday, July 22, 2009

Cloud on Ubuntu 9.04 using Xen and Eucalyptus

Working on Linux has always fascinated me..... and yet again it proved to be a bliss.

Lately I worked on configuring a cloud using Ubuntu 9.04 along with Xen and Eucalyptus. I started on a pretty novice note, but slowly things fell into place as the days went by. Each new milestone gave a sense of self satisfaction and the proud feeling of having made my day. These things do a lot to make you feel special.

Anyways, a lot has been said about how I feel, but now let's talk about what I did and how I did it....


I got a call from Ankit (my lead) saying that we needed to do some RnD on how we could deploy Intellicus (our product) on a cloud and what challenges we should expect in doing so.

This immediately led me to hop over to my NMG guys and check what help I could expect from them in this respect. My biggest concern was to get hold of the machines that I would need to complete this task. And as usual my buddy from NMG, Abhishek Sarkar, helped me with all the logistics that I needed to get started.

I started reading articles and white papers related to the topic, and initially could hardly find anything that would give me a head start.

Even though little was materializing during my initial phase of RnD, I tried to put in extra effort to see how I could get this through. Poking Abhishek again and again, I used to discuss things with him to get a clear understanding of concepts like the hypervisor in Linux and its role in creating a cloud.

After two days of effort, I decided to take help from our iLabs team (Mansij, Akash and Abhishek), who had already explored these areas.

These guys helped me figure out what kind of platform I should use and the hardware I needed to accomplish what I wanted.

They suggested I go with Ubuntu 9.04 GA, which includes support for Eucalyptus 1.5 and KVM as the hypervisor. The machine I was told to acquire should have at least 4 gigs of RAM to avoid any road blocks later during my exercise.

I instantly forwarded the request to Ankit (my lead) and Rajesh (my manager) to help me acquire these resources. Within the next few days I had two machines at my disposal: one with 1 gig of RAM for the cloud controller and another with 4 gigs for the nodes.

I started off with the help of Akash and Abhishek, using their experience to get over this. But things are never as smooth as you expect them to be.

Now before I move forward, let me talk a little about what cloud computing is and the other relevant buzzwords in the cloud computing jargon. Pardon me if you have already seen these definitions on the net, as I kind of felt it would be better to take them from the Internet rather than explaining them on my own.

Cloud
A cloud is a physical place, perhaps owned and controlled by some other entity, that contains computing resources which are available pretty much on demand for a price.

Node
A machine which is part of the cloud network and lends its resources to that cloud.

Hypervisor
A technique for emulating one machine so that it acts as multiple machines with heterogeneous environments. This is similar to VMware, which we all might be aware of. The two most popular ways to achieve this on Linux are Xen and KVM.

Eucalyptus
An open source framework that lets you simulate a cloud setup in your own network, similar to Amazon EC2 (Amazon Elastic Compute Cloud).

Cloud Controller
A module of Eucalyptus that helps manage the cloud.

Node Controller
A module of Eucalyptus that helps manage the nodes in a given cloud.

Kernel
The kernel is the central component of most computer operating systems. Its responsibilities include managing the system's resources (the communication between hardware and software components).

Ramdisk
A RAM disk is a portion of RAM which is being used as if it were a disk drive. RAM disks have fixed sizes, and act like regular disk partitions. Access time is much faster for a RAM disk than for a real, physical disk. However, any data stored on a RAM disk is lost when the system is shut down or powered off. RAM disks can be a great place to store temporary data.

Machine Image
The image file that contains the configuration and the application that you want to deploy on the cloud.


Now that you are aware of these buzzwords, which I will refer to again and again further in this blog, let's move ahead.

Installing Ubuntu 9.04 along with KVM support itself took around 2 days, and yet we were not able to get it working.

Finally I had to work on it alone, as I could not expect them to help me with more of their time since they were already busy with their own assignments. After 2 more days I realised that the hardware I had acquired does not support KVM. For KVM to work, your processor needs to belong to the Intel VT or AMD-V family. Not to my surprise, mine belonged to neither :(.

Then I had to fall back to the other hypervisor option currently supported by Eucalyptus, which is Xen. Using the Amazon EC2 command line tools I started working on creating a machine image and deploying it on the configured cloud. But even that failed to work for me, and with Eucalyptus being relatively new I could hardly find anything that could have taken me through.

Even help from Akash and Abhishek could not get this working, although they had already dirtied their hands with these tools. But when things are not meant to work, they don't. I hope everyone would agree with this.

Anyways, I finally decided to take it head on and started reading about the Xen hypervisor from scratch. I planned to take it step by step.

Now I will write the Steps that I followed to achieve my goal:

1. Install Ubuntu 9.04 and Set up Eucalyptus

  • Install Ubuntu 9.04 afresh on the CC machine and the Node Controller machine.
  • Download eucalyptus-cc and eucalyptus-cloud on the CC machine and install them. You can use apt-get install eucalyptus-cc eucalyptus-cloud to install them.
  • Download the Node Controller package on the Node Controller machine (the one with 4 gigs of RAM) and install it using apt-get install eucalyptus-nc.
2. Install Xen on the Node Controller machine.

3. Create an Image using the Xen Image Creation Utility

  • Once you have successfully installed Xen, it is time for you to build and test an independent Xen image using vmbuilder - an image creation utility that can build Xen-based images.
  • This can be done by typing: vmbuilder xen ubuntu --firstboot ./firstboot \
    --mirror=http://ipAddressofaptproxymachine:9999/ubuntu -o --tmpfs=- \
    #--ip 192.168.152.211 \
    #--mask 255.255.255.0 \
    #--net 192.168.152.0 \
    #--gw 192.168.152.1 \
    #--dns 192.168.150.42 --dns 192.168.150.43 \
    --addpkg vim --rootsize=10240 --arch i386 --verbose --debug
  • This command will install the vim package into the image, and the size of the image will be 10 GB. I have also commented out a few more parameters to show that they can optionally be provided during image creation; otherwise the image will use the default network settings of your network.
  • The firstboot file is used to execute any script that you want to run during the first boot of this image.
  • In my case I wanted to install openssh-server (apt-get install openssh-server) so that I could log in to the image using SSH.
  • Once this image is successfully created, it is always recommended to test it independently, without using Eucalyptus. This is to ensure that Xen on the Node Controller and the image creation utility are working as expected. Moreover, reducing the number of moving parts while you work on something new like this is always a good approach - 'work a little and test a little'. I have learned that the hard way anyways :(
  • Before you can actually test this image, some prerequisite steps need to be performed. You need to download and install the bridge-utils package (apt-get install bridge-utils).
  • After doing that you need to start the bridge; use the network script in the /etc/xen/scripts folder for that.
  • Once the bridge is up and you have created the image successfully, it could still be another herculean effort to get it working in a single go. It was for me at least.
  • Being new to Xen, I missed a lot of things that I needed to take care of.
  • The default mount devices for root and swap space that the image will look for need to be corrected in /etc/xen-tools/xen-tools.conf: the disk device needs to be set to sda and the serial device to tty1. If you fail to do so, as I did in my initial days on this task, you can still fix it by modifying the conf file created while creating the image.
  • One thing that I am still unable to figure out is that when I create an image, it does not include the modules from the /lib/modules/myKernel folder. So I have to copy this folder manually by mounting the image to a temporary folder. The command for this came to me only after a lot of effort I had already put in, but it finally came from Abhishek Sarkar, who told me how we could get through this problem.
  • Once these things are done, you can test your image using the xm create and xm console commands. Use ssh to log in to the machine you launch.
4. Configuring Eucalyptus for Cloud deployment
  • If everything goes as expected, you can now go ahead and deploy the image on the cloud, but before that you will have to do some configuration on the cloud controller and the node controller.
  • Go to /etc/eucalyptus on the cloud controller and modify eucalyptus.conf to provide the address of the node machine.
  • Go to /etc/eucalyptus on the node controller and modify eucalyptus.conf to set the hypervisor to xen.
  • Once done, restart all three components: eucalyptus-cc, eucalyptus-cloud and eucalyptus-nc.
  • After the restart, log in to https://cloudControllermachineIp:8443/ and put in your credentials to manage the cloud.
  • Also configure the cloud space and RAM settings to suit your needs.
  • Once you fill in the requested details you will then be able to download the X.509 certificates that enable you to access your cloud from any Linux terminal using the Amazon EC2 tools.
5. Deploying the Image on Cloud

  • Once you are done with the cloud configuration, you need to create the kernel, ramdisk and machine image bundles using the kernel, ramdisk and image files.
  • Configure your local machine to use the Amazon EC2 tools after downloading the ec2-ami and ec2-api command line tools from the following location: http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=88
  • Once you have downloaded them, unzip the downloaded X.509 certificates and move them to the ~/.euca folder on the local machine.
  • Execute source ~/.euca/eucarc to set up these credentials for your cloud communications.
  • Once done with this, unzip the EC2 tools and add their directories to your environment path using the following commands: export EC2_HOME=/root/eucalyptus/tools/ec2-api-tools-1.3-30349
    export EC2_AMITOOLS_HOME=/root/eucalyptus/tools/ec2-ami-tools-1.3-26357
    export PATH=$PATH:$EC2_HOME/bin
    export PATH=$PATH:$EC2_AMITOOLS_HOME/bin
  • Now our environment is ready to create and deploy the bundles for the ramdisk, kernel and the machine image.
  • Use the following commands for doing that:
Creating the kernel bundle
mkdir intellikernel
ec2-bundle-image -i /boot/vmlinuz-2.6.28-13-server -d ./intellikernel --kernel true
ec2-upload-bundle -b intellikernel -m ./intellikernel/vmlinuz-2.6.28-13-server.manifest.xml
EKI=`ec2-register intellikernel/vmlinuz-2.6.28-13-server.manifest.xml | awk '{print $2}'`
echo $EKI

Creating the Ramdisk bundle
mkdir intelliramdisk
ec2-bundle-image -i /boot/initrd.img-2.6.28-13-server -d ./intelliramdisk --ramdisk true
ec2-upload-bundle -b intelliramdisk -m intelliramdisk/initrd.img-2.6.28-13-server.manifest.xml
ERI=`ec2-register intelliramdisk/initrd.img-2.6.28-13-server.manifest.xml | awk '{print $2}'`
echo $ERI


Creating the Image Bundle
mkdir intelliimage
ec2-bundle-image -i ubuntu-xen/root.img -d ./intelliimage --kernel $EKI --ramdisk $ERI
ec2-upload-bundle -b intelliimage -m ./intelliimage/root.img.manifest.xml
EMI=`ec2-register intelliimage/root.img.manifest.xml | awk '{print $2}'`
echo $EMI

Importing the keypair
ec2-add-keypair intellikey > ~/.euca/intellikey.priv

chmod 0600 ~/.euca/intellikey.priv

Launching the Deployed EMI
ec2-run-instances $EMI -k intellikey


Logging in to the launched EMI
ssh -i ~/.euca/intellikey.priv root@ipaddressofThemachine

Once done with all this, you are all set to use your application, which has now been deployed on a private cloud.

As I completed this with great satisfaction, it also led me to lend a helping hand to the iLabs team while they were planning to create a cloud in their own environment. They needed to do that as they had a presentation on it which they wanted to give to their ADoE.

Now we regularly exchange ideas with each other to extend the horizons of our learning and help each other.

Next I am looking forward to bringing in a load balancer and having multiple instances of my application be part of the same cloud. I hope I will complete that task as well and come back with some more thoughts on it.


Until then keep clouding :D....