
Introduction
Development of Hive started around 08/2007, and the open source release happened around 08/2008 (see https://issues.apache.org/jira/b... ). Much more development in Hive happened after this period than during it - and it continues to be in active development at Facebook and in the open source community.

Hive started off as a layer on top of Hadoop Streaming. Developers really loved scripting languages, and there was no chance in hell we were going to coax them all into writing Java map-reduce. The starting data set contained a heavily nested data structure serialized using Thrift. It represented pretty much everything we knew about a Facebook user at that time. Again - developers *loved* it. The problem for us was to provide an easy way to access this data set from the streaming framework.

History
The first baby steps of Hive fell out from this. A 'select' expression could be supplied (on top of Hadoop Streaming) - the expression allowed access to nested/indexed elements inside Thrift objects. The resulting values were serialized and sent to mappers. The first metastore was hard coded (there were a few well known tables - with known deserializers, Thrift classes, delimiters etc.). Yeah - life was fun those days. Also - there were no code reviews.

Soon enough - other people were adding stuff. A join expression syntax and a group by syntax were soon layered on over Hadoop Streaming. The metastore was moved out of the code and stored as flat files. At this point - Hive was in production and being used in important projects (Ads/Feed). In March of 2008 - we presented Hive at the Hadoop Summit and promised to make an open source version available soon.

It was quite clear by this point, though, that the current madness could not continue. We had duplicated significant code for each of these operators and could not stand the code-base. We were frequently ashamed of walking people through Hive syntax (basically a warren of command line switches) - although the users didn't seem to mind! It was also clear that we had to move computation out of Python/PHP and into inbuilt operators that were more efficient.

Versions
I had written, as part of working with the Feed team, what became a rather complicated MR job to rank friends by mutual friends. In doing so I had pretty much used every Hadoop trick in the bag (partitioners, separate map and reduce sorting keys, comparators, in-memory hash tables and so on) and realized how hard it was to write an optimal MR job (particularly on large data sets). Assembling data into complex data structures was also painful. I really wanted to see these types of operators exposed in a high level declarative form so that the average user would never have to go through this. Fortunately - our team had Oracle veterans well versed in the art of SQL.

Soon enough - we were working on what became known internally as Hive-2 - a clean start all over again on Hive. Hive-2 eventually became the open source Hive that was first contributed as a Hadoop contrib project and later became an Apache project. While we were working on this - we would maintain, minimally, Hive-1 (as the old version was known) - since it was already in production. We were also starting to hit serious scale in terms of the number of Hadoop machines. But fortunately the team added some stellar folks - and development happened rapidly. In true Facebook culture - every person on the Hive team contributed to production projects and operations, while working on Hive-2 on the side.

One feature that was most requested in Hive-1 - and that was (intentionally) never built - was the 'where' clause. If ever there were questions about what goodness Hive-2 would bring (and why so many engineers were working on it) - we could always cite the 'where' clause :-).

We also had something called the Eric test (with obvious reference to someone famous in the Hadoop community). Would Eric be able to download and play with Hive immediately? It was a goal towards simplicity - and it led to interesting choices. Hive had to release with an embedded metastore. It was not acceptable to have to bring up a MySQL db, create a schema and so on - just to try out Hive. Similarly - one should just be able to point at some files on local disk - and run SQL on them (via Map-Reduce using Hadoop's local mode). A real Hadoop cluster was not required to try out Hive. No additional Hadoop installation was required either - one just had to point Hive at an existing Hadoop install - and off it would go.

Hive-2 was released internally at Facebook before being contributed to open source. Making sure that Hive-2 could work on Hive-1 tables was a complete nightmare - the heroic effort of one engineer. But it mostly _just worked_. Hive-2 quickly displaced Hive-1. Hive-1 was surprisingly hard to kill though. Hive-1 jobs continued to last into 2011 (I think), to the grief of everyone tasked with supporting this ecosystem. And even today - traces of Hive-1 code live on in the current Hive codebase.

We kind of lied our way into making Hive a Hadoop contrib project. It was clear to us that this was a major project and development (and bugs) would continue for eons. But posing as a small (and limited-impact) contrib project in Hadoop had the tremendous benefit of distribution with the standard Hadoop repository. And that's how Hive entered the (open) world.

Hive - A Petabyte Scale Data Warehouse using Hadoop
A number of engineers from Facebook are speaking at the Yahoo! Hadoop Summit today about the ways we are using Hadoop and Hive for analytics. Hive is an open source, petabyte-scale data warehousing framework based on Hadoop that was developed by the Data Infrastructure Team at Facebook. In this blog post we'll talk more about Hive, how it has been used at Facebook, and its unique architecture and capabilities.

Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis and business intelligence applications used by analysts across the company, a number of Facebook products are also based on analytics. These products range from simple reporting applications like Insights for the Facebook Ad Network, to more advanced kinds such as Facebook's Lexicon product. As a result, a flexible infrastructure that caters to the needs of these diverse applications and users, and that also scales up in a cost effective manner with the ever increasing amounts of data being generated on Facebook, is critical. Hive and Hadoop are the technologies that we have used to address these requirements at Facebook.

When we started at Facebook in 2007 all of the data processing infrastructure was built around a data warehouse using a commercial RDBMS. The data that we were generating was growing very fast - as an example we grew from a 15TB data set in 2007 to a 2PB data set today. The infrastructure at that time was so inadequate that some daily data processing jobs were taking more than a day to process, and the situation was just getting worse with every passing day. We had an urgent need for infrastructure that could scale along with our data, and it was at that time that we started exploring Hadoop as a way to address our scaling needs. The fact that Hadoop was an open source project that was already used at petabyte scale and provided scalability using commodity hardware was a very compelling proposition for us. Moreover, the same jobs that had taken more than a day to complete could now be completed within a few hours using Hadoop.

However, using Hadoop was not easy for end users, especially for those who were not familiar with map/reduce. End users had to write map/reduce programs for simple tasks like getting raw counts or averages. Hadoop lacked the expressibility of popular query languages like SQL, and as a result users ended up spending hours (if not days) writing programs for typical analysis. It was very clear to us that in order to really empower the company to analyze this data more productively, we had to improve the query capabilities of Hadoop. Bringing this data closer to users is what inspired us to build Hive. Our vision was to bring the familiar concepts of tables, columns, partitions and a subset of SQL to the unstructured world of Hadoop, while still maintaining the extensibility and flexibility that Hadoop enjoyed. Hive was open sourced in August 2008 and since then has been used and explored by a number of Hadoop users for their data processing needs.

Right from the start, Hive was very popular with all the users internally at Facebook. Today we regularly run thousands of jobs on the cluster, with hundreds of users using this system for a wide variety of applications. The Hive/Hadoop cluster at Facebook stores more than 2PB of uncompressed data and routinely loads 15TB of data daily. It is heavily used for simple summarization jobs, business intelligence, machine learning and many other applications.

In the following sections we provide more details about Hive architecture and capabilities.

Data Model
Hive organizes data in tables and partitions. As an example, advertisement impressions can be stored in an ad_impressions table which is partitioned by date, e.g. a 2009-05-01 partition for May 1, 2009 data and a 2009-04-30 partition for April 30, 2009 data. The data for a particular date goes into the partition for that date. A good partitioning scheme allows Hive to prune data while processing a query, and that has a direct impact on how fast a result of the query can be produced, e.g. queries on the impressions for a single day do not have to process data for other days. A good partitioning scheme also helps in managing the data and opens up possibilities for archiving older data to cheaper storage while keeping the more recent data on the main cluster.

File System

Behind the scenes, Hive stores partitions and tables as directories in the Hadoop Distributed File System (HDFS). In the previous example the table ad_impressions could be mapped to /warehouse/ad_impressions, while each of the partitions could be mapped to /warehouse/ad_impressions/ds=2009-05-01 and /warehouse/ad_impressions/ds=2009-04-30, where ds (date stamp) is a partitioning column. The partitioning scheme can have multiple columns as well, in which case each partitioning column maps to a level within the directory name space. Within each partition the data can be bucketed into individual files. The bucketing can be random or hashed on a particular column. Setting up a good bucketing scheme helps in fast sampling of the data, which can be used for quick analysis and data verification on sampled data.
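The partition-pruning and bucketing behavior described above can be sketched in a few lines of Python. This is a conceptual toy, not Hive's implementation: the directory names follow the ds=YYYY-MM-DD convention from the example, and the hash function used for bucketing is an arbitrary stand-in for whatever Hive actually uses.

```python
import hashlib

# Hypothetical partition directories for the ad_impressions table,
# following the ds=YYYY-MM-DD layout described above.
partitions = [
    "/warehouse/ad_impressions/ds=2009-04-29",
    "/warehouse/ad_impressions/ds=2009-04-30",
    "/warehouse/ad_impressions/ds=2009-05-01",
]

def prune(partitions, wanted_ds):
    """Keep only the directories whose ds= value matches the query's
    date filter, so a single-day query never reads other days' data."""
    return [p for p in partitions if p.endswith("ds=" + wanted_ds)]

def bucket_of(value, num_buckets):
    """Assign a row to a bucket file by hashing a column value;
    sampling one bucket then reads roughly 1/num_buckets of the data."""
    h = int(hashlib.md5(value.encode()).hexdigest(), 16)
    return h % num_buckets

print(prune(partitions, "2009-05-01"))
# ['/warehouse/ad_impressions/ds=2009-05-01']
print(bucket_of("user_42", 32))
```

The key point is that pruning happens purely on directory names, before any data is read, which is why a good partitioning scheme has such a direct impact on query latency.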

The columns in the tables and partitions can be either primitive types such as integers, doubles, floats and strings, or can be complex types such as maps, lists and structures. Moreover, Hive can store and query arbitrary user defined types through its extensibility mechanisms as described later in this post.

System Architecture and Components
Hive comprises the following major components:


 * Metastore: To store the metadata.
 * Query compiler and execution engine: To convert SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop.
 * SerDe and ObjectInspectors: Programmable interfaces and implementations of common data formats and types.
 * UDF and UDAF: Programmable interfaces and implementations for user defined functions (scalar and aggregate functions).
 * Clients: A command line client similar to the MySQL command line, and a web UI.

The metastore stores the information about the tables, partitions, the columns within the tables, the SerDes used to serialize and deserialize from partitions, the locations of the data for these units within HDFS, bucketization attributes etc. It also provides a basic key-value store which can be used to store any kind of metadata with these objects.

The Query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following query

SELECT * FROM t WHERE t.c = 'xyz'

is converted into a single map-only job, while a query of the form

SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)

is converted into a single map/reduce job that computes the join.
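To make the single-job join concrete, here is a toy Python sketch of a reduce-side join over two tiny in-memory tables. It is not Hive's generated code, just the shape of the map/reduce job the compiler emits: mappers tag each row with its source table and key it on the join column, and each reduce group cross-products the two sides.

```python
from collections import defaultdict

# Toy rows: t1(c1, c2) and t2(c1, c3).
t1 = [("a", 1), ("a", 2), ("b", 3)]
t2 = [("a", "x"), ("c", "y")]

def map_phase(t1, t2):
    """Tag each row with its source table and emit it keyed on the
    join column c1, as the generated mappers would."""
    for c1, c2 in t1:
        yield c1, ("t1", c2)
    for c1, c3 in t2:
        yield c1, ("t2", c3)

def reduce_phase(pairs):
    """All rows sharing a c1 value meet in one reduce group, where the
    cross product of the two sides yields the joined rows."""
    groups = defaultdict(lambda: {"t1": [], "t2": []})
    for key, (tag, val) in pairs:
        groups[key][tag].append(val)
    out = []
    for key, g in groups.items():
        for left in g["t1"]:
            for right in g["t2"]:
                out.append((key, left, right))
    return out

print(reduce_phase(map_phase(t1, t2)))
# [('a', 1, 'x'), ('a', 2, 'x')]
```

Keys that appear on only one side ('b' and 'c' here) produce no output, matching inner-join semantics.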

More complex plans are also possible, e.g. in some instances a group by of the form

SELECT t1.c1, COUNT(1) FROM t1 GROUP BY t1.c1

may be converted into two map/reduce jobs in case there is skew in the data on c1. The first map/reduce job computes partial aggregates by randomly distributing data, while the second one takes the results from the first job and distributes them by hashing on c1 to compute the final aggregation.
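The two-stage plan can be simulated in Python. This is a sketch of the idea rather than Hive's planner output: stage 1 scatters rows randomly so no single reducer receives all copies of the hot key, producing partial counts, and stage 2 merges those partials by key.

```python
import random
from collections import Counter

rows = ["a"] * 1000 + ["b"] * 3   # heavy skew on value 'a'

def stage1(rows, num_reducers=4):
    """First job: distribute rows randomly so no single reducer gets
    all of the skewed key, and compute partial counts per reducer."""
    shards = [Counter() for _ in range(num_reducers)]
    for r in rows:
        shards[random.randrange(num_reducers)][r] += 1
    return shards

def stage2(shards):
    """Second job: route the partial counts by key (here simply merged
    in one place) and sum them into the final aggregate."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return dict(total)

print(stage2(stage1(rows)))
# counts come out exact despite the random scatter: {'a': 1000, 'b': 3}
```

Because addition is associative, scattering the hot key across reducers in stage 1 changes nothing about the final counts, only the load balance.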

Hive also supports more advanced query constructs like FROM clause sub queries and UNION ALL. The compiler also creates smart plans through a number of optimizations such as:


 * Predicate pushdown
 * Combining multiple map/reduce jobs into single map/reduce jobs
 * Column pruning
 * Map-side aggregations
 * Map-side joins

Once the plan is generated the compiler passes it to the execution engine that executes the map/reduce jobs, launching them to the Hadoop cluster in the order of their dependencies.
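The dependency-ordered launching can be sketched as a small scheduling loop. The plan below is a hypothetical example (the job names and dependency structure are invented for illustration), expressed as a map from each job to its prerequisites.

```python
# Hypothetical plan: job -> list of prerequisite jobs.
plan = {
    "job1": [],                 # e.g. scan + partial aggregate
    "job2": ["job1"],           # final aggregate
    "job3": ["job1"],           # a second consumer of job1's output
    "job4": ["job2", "job3"],   # joins the two results
}

def launch_order(plan):
    """Repeatedly launch every job whose prerequisites have finished,
    as the execution engine does with its map/reduce stages."""
    done, order = set(), []
    while len(done) < len(plan):
        ready = [j for j, deps in sorted(plan.items())
                 if j not in done and all(d in done for d in deps)]
        for j in ready:
            order.append(j)
            done.add(j)
    return order

print(launch_order(plan))
# ['job1', 'job2', 'job3', 'job4']
```

Jobs that become ready in the same pass (job2 and job3 here) have no mutual dependency and could in principle run concurrently.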

Together all these components work to provide a powerful framework and language to manage and query data. Apart from the query components the language framework also provides normal DDL capabilities like creating, deleting and modifying tables and partitions and describing and showing tables as well.

Data Formats, User Defined Types and User Defined Functions

Hive is flexible enough to parse different data formats and data types through its SerDe and ObjectInspector Java interfaces. It provides some built in implementations of these interfaces for commonly occurring data formats such as delimited text files, CSV files and some efficient binary formats. However, if the data is not in any of these formats (e.g. it may be XML data or in some proprietary format that is used by other applications), it is easy to write a conforming implementation class of the SerDe interface, drop it into a jar at a known location and instruct Hive to use the implemented class to serialize and deserialize the data stored in that particular partition.
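Conceptually, the deserialization half of a SerDe turns a raw stored record into a row whose fields the engine can inspect. The following Python sketch illustrates that contract for a delimited text format; it is an analogy only, not the real Java SerDe interface, and the class and column names are invented.

```python
# Conceptual sketch of what a delimited-text deserializer does:
# turn one raw line into a row object with named, inspectable columns.
class DelimitedTextDeserializer:
    def __init__(self, column_names, delimiter="\t"):
        self.column_names = column_names
        self.delimiter = delimiter

    def deserialize(self, raw_line):
        """Split one stored record into named columns."""
        values = raw_line.rstrip("\n").split(self.delimiter)
        return dict(zip(self.column_names, values))

# Hypothetical columns for the ad_impressions example.
serde = DelimitedTextDeserializer(["user_id", "ad_id", "ts"])
print(serde.deserialize("42\t1001\t2009-05-01"))
# {'user_id': '42', 'ad_id': '1001', 'ts': '2009-05-01'}
```

A SerDe for XML or a proprietary format would implement the same deserialize contract with different parsing logic, which is exactly why plugging one in does not touch the rest of the engine.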

The same kind of flexibility is available for rich data sets. With the built in implementations of the ObjectInspector Java interface, it is fairly easy to come up with ObjectInspectors for user defined data types, e.g. a Point data type that comprises two floating point numbers can be programmed as a PointObjectInspector class that implements how this data type is represented in Hive's processing engine. Additionally, the user can program and register their own user defined functions to work on such data types and use them in SQL, e.g. on the PointObjectInspector, the user can define functions like get_X and get_Y to return the x and y co-ordinates of a point and then use these functions in a query such as

SELECT get_X(T.point), get_Y(T.point) FROM points_table T;

where the points_table has a single point column which is of the PointObjectInspector type and is stored in a file using the PointSerDe.
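The behavior of the get_X/get_Y example can be mimicked in Python; this is a conceptual stand-in for the Java ObjectInspector machinery, with the Point type and table contents invented for illustration.

```python
from collections import namedtuple

# Stand-in for the Point type from the example above; the real Hive
# version would be backed by a PointObjectInspector in Java.
Point = namedtuple("Point", ["x", "y"])

def get_X(p):
    """Scalar UDF: return the x co-ordinate of a point."""
    return p.x

def get_Y(p):
    """Scalar UDF: return the y co-ordinate of a point."""
    return p.y

# Toy contents of points_table, one point column per row.
points_table = [Point(1.0, 2.0), Point(3.5, -0.5)]

# Equivalent of: SELECT get_X(T.point), get_Y(T.point) FROM points_table T;
print([(get_X(p), get_Y(p)) for p in points_table])
# [(1.0, 2.0), (3.5, -0.5)]
```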

The ability to program and register custom functions (both simple functions and aggregate functions) also allows users to extend Hive in areas where functionality may not have been provided out of the box. It is fairly easy to code up such function implementations using the UDF (User Defined Function) and UDAF (User Defined Aggregate Function) Java interfaces, package these implementations into a jar at a known location and instruct Hive to pick those classes up to be used in SQL statements, e.g. while a today function - that gives the current date - is not provided by Hive out of the box, it is very easy to code this up and register it with Hive so that it can be used in queries. This is shown in the following example.

CREATE TEMPORARY FUNCTION today AS 'yourpackage.UdfTodayImplementation';

SELECT * FROM test_tab T WHERE T.dt < today;

All these abilities make Hive a very flexible and extensible system when it comes to supporting different data formats, data types or functions.

Extensibility and Writing Your Own Map/Reduce

Hive also provides a lot of flexibility in supporting computations that may not completely fit the SQL processing model. Machine learning model generation, building different kinds of indexes and many other applications are not easily expressible in SQL and Hive is able to support such applications by providing language constructs that allow the user to plug in their own transformation scripts in an SQL statement. This is done through the MAP, REDUCE, TRANSFORM, DISTRIBUTE BY, SORT BY and CLUSTER BY keywords in the SQL extensions that the Hive Query Language supports. As an example, the canonical word count example on documents stored in a table can be programmed using map/reduce and Hive as follows:

CREATE TABLE docs(contents STRING);

FROM (
  MAP docs.contents USING 'tokenizer_script' AS word, cnt
  FROM docs
  CLUSTER BY word
) map_output
REDUCE map_output.word, map_output.cnt USING 'count_script' AS word, cnt;

In the example above the tokenizer_script and count_script can be implemented in any language - PHP, Python, shell script, Java, C - anything really. While the tokenizer_script emits tuples for every word that it finds in contents, the count_script adds up all the 1s for each word to emit a word count. The CLUSTER BY word clause in the statement tells Hive to distribute the map output to the reducers by hashing on word.
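A plausible Python pair of such scripts, written as in-process functions rather than the stdin/stdout programs a real streaming job would use (the names and document contents are illustrative only):

```python
from collections import Counter

def tokenizer_script(lines):
    """Map side: emit a (word, 1) tuple for every word in each
    document's contents, as the streaming tokenizer would."""
    for line in lines:
        for word in line.split():
            yield word, 1

def count_script(pairs):
    """Reduce side: CLUSTER BY word guarantees all tuples for a word
    reach the same reducer, which just sums the 1s."""
    counts = Counter()
    for word, cnt in pairs:
        counts[word] += cnt
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog"]
print(count_script(tokenizer_script(docs)))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real streaming job each script would read tab-separated lines on stdin and write them on stdout, with Hive handling the shuffle between the two.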

Such flexibility allows Hive to be used for non-SQL kinds of analysis as well, and has been used extensively at Facebook for a variety of such tasks.

Interoperability
A hallmark of a successful system is how well it interoperates with other tools and systems. In order to address that need we have built a JDBC driver (and are currently also building an ODBC driver) for Hive. Though these are still experimental, with these drivers in place Hive can be made to interoperate with other analytics tools and repositories in a standards compliant manner. With the ODBC driver, Hive can be made to talk to BI engines like MicroStrategy, thereby further exposing the power of data and analytics to users through tools that they are familiar with.

Conclusion
As an open source project Hive has had a number of external contributions as well - including an open source web UI and a columnar storage format, to name a few. We continue to look for contributions from the community and are committed to building a strong community around Hive. The Hive wiki at Apache is available at http://wiki.apache.org/hadoop/Hive and contains a lot of information about the Hive open source community and more details about the project.

Over the last one and a half years Facebook's data team has built a number of other systems in order to make scalable data processing and analytics easier. While Hive is open source, the other technologies are not yet available in the public domain, but they have played an equally important role in making analytics on large data sets a reality for our users. These include a UI for Hive called HiPal that is used extensively within Facebook for ad hoc analysis, Databee, which is used to manage dependencies between different data processing tasks, and some data discovery and scheduling tools. Together these technologies provide powerful tools for users to analyze data and build analytics based applications on top of this data.

Reference
How much time did it take to develop Hive at Facebook? Quora. (n.d.). Retrieved from http://www.quora.com/How-much-time-did-it-take-to-develop-Hive-at-Facebook

Facebook, Hadoop and Hive - DeveloperZen. (2009, June 16). Retrieved from http://www.developerzen.com/2009/06/16/facebook-hadoop-and-hive/

Hive - A Petabyte Scale Data Warehouse using Hadoop. Facebook. (n.d.). Retrieved from http://www.lamsade.dauphine.fr/~litwin/cours98/Doc-cours-clouds/Hive - A Petabyte Scale Data Warehouse using Hadoop _ Facebook.pdf