Tuesday, June 8, 2010

Searching and Ad hoc Reporting on Hadoop using Hive

Recently I was assigned a task to evaluate and integrate Hadoop (the Apache distributed computing framework) and use it along with Hive (its data warehousing interface) to provide searching and ad hoc reporting over it through Intellicus.

I had already done a bit of R&D on this front before the task was officially assigned to me. That helped me understand the nuances of the technology, so I could start working on it right away.

After a short session on Hadoop and Hive with Sanjay and Sunil from the iLabs team, we clearly identified the gaps we needed to work on to meet the laid-down requirements.

Before I move ahead, let me try to explain some of the buzzwords that might be handy as you read and try to understand the content and context of this blog.

Hadoop - An open source distributed computing framework from Apache. It uses Map/Reduce for distributed computation and HDFS for data storage and retrieval.

Hive - The data warehousing interface for Hadoop, used for searching and ad hoc querying on the stored data.

Map/Reduce - A problem-solving technique originally devised at Google that breaks a problem into smaller parts and then computes the final result by combining the intermediate results obtained from solving those smaller parts. It works on the philosophy that "moving data for computation is always costlier than providing the computation logic to the data itself".
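To make the idea concrete, here is a minimal word-count sketch written against Hadoop's Java Map/Reduce API. It is only an illustration, not part of our integration; the input and output paths are placeholders passed on the command line.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: runs close to the node holding each block of input
        // and emits (word, 1) pairs -- the "smaller parts" of the problem.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.length() > 0) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce step: combines the intermediate pairs into the final counts.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // HDFS input/output paths are placeholders for the example.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Each mapper works on the block of input stored near it, and the reducers then collate the intermediate pairs into the final counts, which is exactly the "compute where the data lives" philosophy described above.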

HDFS - Hadoop Distributed File System, which provides an abstraction for storing huge data in distributed form across multiple nodes while presenting a single logical view of that storage. It uses data replication to provide fault tolerance, and data is divided into smaller blocks and stored on DataNodes instead of on a single machine.
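As a small, hedged illustration of that single logical view, this sketch writes a file through the HDFS Java API; the NameNode address and file path are assumptions for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address. The blocks of this file may
            // physically live on many DataNodes, but we address the file
            // through one logical path.
            conf.set("fs.default.name", "hdfs://localhost:9000");
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/reports/raw/sample.log");
            // The file is replicated per the cluster's dfs.replication setting.
            FSDataOutputStream out = fs.create(file);
            out.writeBytes("2010-06-08\tlogin\tuser42\n");
            out.close();

            System.out.println("Replication factor: "
                    + fs.getFileStatus(file).getReplication());
            fs.close();
        }
    }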

Now let me discuss how we went ahead with the implementation.

We started by using the Apache JDBC driver that ships with the Hive client installation and used it to communicate with the Hive server, which typically runs on the same machine as the Hadoop server.
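For reference, opening a connection with that driver looks roughly like this. The host, port, and database are placeholders; 10000 is the Hive server's default port, and org.apache.hadoop.hive.jdbc.HiveDriver is the driver class shipped with the Hive client.

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class HiveConnect {
        public static void main(String[] args) throws Exception {
            // Register the Hive JDBC driver shipped with the Hive client.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

            // Placeholder host/port; 10000 is the Hive server's default port.
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            System.out.println("Connected to Hive server.");
            con.close();
        }
    }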

After starting the Hadoop daemons, which include the NameNode and DataNodes (both for HDFS) and the JobTracker (for Map/Reduce jobs), on my server machine (Ubuntu 10.04 Server), I configured and started the Hive server on the same machine so it could communicate with my Hadoop installation.

The Hive server is basically responsible for accepting requests through the JDBC/ODBC interfaces, converting them into Map/Reduce jobs that work on the data stored in HDFS, and returning the results to the caller in the form of a resultset.
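A simple aggregate query shows the round trip. The GROUP BY below forces Hive to compile and run a Map/Reduce job, and the job's output comes back as ordinary JDBC rows; the table and column names are made up for the example.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();

            // The GROUP BY makes Hive plan and run a Map/Reduce job;
            // the rows below are that job's output.
            ResultSet rs = stmt.executeQuery(
                    "SELECT event_type, COUNT(1) FROM access_log GROUP BY event_type");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
            rs.close();
            stmt.close();
            con.close();
        }
    }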

Though the data is always stored on HDFS as flat files, Hive creates a logical RDBMS layer over those files to give a relational view of the data stored through Hive. This allows the user to search and query the data in much the same way one would on any other relational database like Oracle, MySQL etc.
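For example, declaring a table and pointing it at an existing flat file is pure metadata work; Hive moves the file into its warehouse directory, but the bytes are not rewritten into any proprietary format. The table name, schema, and HDFS path below are assumptions for the sketch, and given the driver's limited method coverage (discussed next), treat this as a sketch rather than a guaranteed code path.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveDdl {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();

            // The "table" is only metadata; the rows stay in flat files on HDFS.
            stmt.execute("CREATE TABLE access_log ("
                    + "event_time BIGINT, event_type STRING, user_id STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

            // Point the table at an existing flat file; the data is not transformed.
            stmt.execute("LOAD DATA INPATH '/reports/raw/sample.log' "
                    + "INTO TABLE access_log");

            stmt.close();
            con.close();
        }
    }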

Now, going back to how we used the Apache JDBC driver: it comes with only a very limited set of the implemented methods required to make a driver JDBC compatible. Though Intellicus does not use all the JDBC methods, we still needed a good number of implementations to be added before the driver was usable for us. So I worked with Sunil to identify those gaps, and he helped me with the implementations.

Once we had completed those implementations, we started using the modified driver with Intellicus. Some of the gaps we worked on were:

1. Implementing the DatabaseMetaData methods (a sketch of the idea follows this list).
2. Correcting the logic for returning column names when a query is executed as a Map/Reduce job.
3. Other trivial issues related to the creation of Statement and ResultSet objects in the Hive JDBC driver.
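To give a flavour of the first gap: a reporting tool needs at least a table list from java.sql.DatabaseMetaData so users can browse the data source. The sketch below shows one way such a list can be produced from Hive's own SHOW TABLES command; it is a hypothetical illustration of the idea, not the actual patch we wrote.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: one way to back DatabaseMetaData.getTables()
    // with Hive's own "SHOW TABLES" command.
    public class HiveTableLister {

        private final Connection connection;

        public HiveTableLister(Connection connection) {
            this.connection = connection;
        }

        // Roughly what a getTables() implementation has to produce:
        // one entry per table visible to the current Hive session.
        public List<String> listTables() throws SQLException {
            List<String> tables = new ArrayList<String>();
            Statement stmt = connection.createStatement();
            ResultSet rs = stmt.executeQuery("SHOW TABLES");
            while (rs.next()) {
                tables.add(rs.getString(1));
            }
            rs.close();
            stmt.close();
            return tables;
        }
    }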

Sunil and Sanjay from Impetus iLabs are also planning to contribute these enhancements back to the open source project, and we will then probably be able to get these changes as part of the next Hive release.

After completing these changes, it was pretty simple to integrate Hive with Intellicus, as our architecture lets us integrate with any JDBC-compatible data source in a very simple way. We just wrote one class and it was all done.

Though we had some problems handling the date datatype, since Hive has no built-in support for it, we were able to get through this because Hive does support the long data type. So we started converting date operations into unix-timestamp (long) operations before submitting our queries to the Hive server. We do this intentionally so that the comparisons are still computed by Hadoop in the distributed environment, instead of our end becoming a performance bottleneck.
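Here is a hedged sketch of that kind of rewrite; the helper, the column names, and the date format are illustrative, not our actual code. A date literal from a report filter becomes an epoch-seconds comparison on the long column, so the filtering still happens inside the Map/Reduce job rather than in our engine.

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.TimeZone;

    // Illustrative helper: turn a date literal from a report filter into
    // a unix-timestamp (seconds) predicate on Hive's long column.
    public class DatePredicateRewriter {

        private static final SimpleDateFormat FORMAT =
                new SimpleDateFormat("yyyy-MM-dd");
        static {
            FORMAT.setTimeZone(TimeZone.getTimeZone("UTC"));
        }

        public static long toUnixSeconds(String dateLiteral) throws ParseException {
            return FORMAT.parse(dateLiteral).getTime() / 1000L;
        }

        public static void main(String[] args) throws ParseException {
            // The filter event_date >= '2010-06-01' becomes a long comparison
            // that Hive can evaluate inside the Map/Reduce job itself.
            String predicate = "event_time >= " + toUnixSeconds("2010-06-01");
            System.out.println("WHERE " + predicate); // WHERE event_time >= 1275350400
        }
    }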

Now that the hard part is done... we need to further explore the possibilities of using Hadoop's capabilities to meet the custom business needs of our customers. I hope we will come through that challenge as well.

Until next time keep reading.....

4 comments:

  1. Also read this press release:

    http://www.marketwire.com/press-release/Pioneers-Ad-Hoc-Reporting-Platform-Support-Hadoop-Framework-Large-Data-Analysis-via-1270707.htm
  2. Very informative and self-explanatory blog!!!
    Keep posting and enlightening us :)
  3. Hi Sajal, I am working on something similar to what you did and have to modify the Apache JDBC driver too. Do you think it is possible for me to get a copy of the modifications you made to the Hive JDBC driver? Thanks
  4. Hi, is it required to run the Hive server on the same machine where Hadoop is configured? Is it possible to set up the Hive server on a separate machine and configure it to call the Hadoop server during startup?

    Kindly reply.