Cross-posted from my company blog.
We’ve been using PIG for analytics and for processing data used on our site for some time now. PIG is a high-level language for building data analysis programs that run across a distributed Hadoop cluster.
When it came time to update our runtime data storage for the site, it was natural for us to consider using HBase to achieve horizontal scalability. HBase is a distributed, versioned, column-oriented store based on Hadoop. One of the great advantages of using HBase is the ability to integrate it with our existing PIG data processing. In this post I will introduce you to the basics of working with HBase from your PIG scripts.
Getting Started
Before getting into the details of using HBaseStorage there are a couple of environment variables you will need to make sure are set so that HBaseStorage can work correctly.
export HBASE_HOME=/usr/lib/hbase
export PIG_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$PIG_CLASSPATH"
First, you will need to let HBaseStorage know where to find the HBase configuration, hence the HBASE_HOME environment variable. Second, PIG_CLASSPATH needs to be extended to include the classpath for loading HBase. If you are using PIG 0.8.x, the variation is to extend HADOOP_CLASSPATH instead:
export HADOOP_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$HADOOP_CLASSPATH"
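If you want to sanity-check the result, the same hbase script that builds the classpath can show you what is on it. A quick, optional check:

# List the jars HBase adds to the classpath; you should see the
# hbase and zookeeper jars that HBaseStorage depends on.
${HBASE_HOME}/bin/hbase classpath | tr ':' '\n' | grep -i -e hbase -e zookeeper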
Hello World
Let’s write a simple script to load some data from a file and write it out to an HBase table. To begin, use the shell to create your table:
jhoover@jhoover2:~$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.3-cdh3u1, r, Mon Jul 18 08:23:50 PDT 2011
hbase(main):002:0> create 'sample_names', 'info'
0 row(s) in 0.5580 seconds
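Before moving on, you can confirm the table and its ‘info’ column family look right (output omitted here):

hbase(main):003:0> describe 'sample_names'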
Next, we’ll put some simple data in a file ‘sample_data.csv’:
1, John, Smith
2, Jane, Doe
3, George, Washington
4, Ben, Franklin
Then we’ll write a simple script to extract this data and write it into fixed columns in HBase. Note that HBaseStorage treats the first field of each tuple as the HBase row key, so only the remaining fields get column mappings:
-- Load the CSV; the first field (listing_id) becomes the HBase row key.
raw_data = LOAD 'sample_data.csv' USING PigStorage(',') AS (
    listing_id: chararray,
    fname: chararray,
    lname: chararray );

-- Map fname and lname into the 'info' column family.
STORE raw_data INTO 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:fname info:lname');
Then run the pig script locally:
jhoover@jhoover2:~/hbase_sample$ pig -x local hbase_sample.pig
…
Success!
Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local_0001 raw_data MAP_ONLY hbase://sample_names,
Input(s):
Successfully read records from: "file:///autohome/jhoover/hbase_sample/sample_data.csv"
Output(s):
Successfully stored records in: "hbase://sample_names"
Job DAG:
job_local_0001
You can then see the results of your script in the hbase shell:
hbase(main):001:0> scan 'sample_names'
ROW COLUMN+CELL
1 column=info:fname, timestamp=1356134399789, value= John
1 column=info:lname, timestamp=1356134399789, value= Smith
2 column=info:fname, timestamp=1356134399789, value= Jane
2 column=info:lname, timestamp=1356134399789, value= Doe
3 column=info:fname, timestamp=1356134399789, value= George
3 column=info:lname, timestamp=1356134399789, value= Washington
4 column=info:fname, timestamp=1356134399789, value= Ben
4 column=info:lname, timestamp=1356134399789, value= Franklin
4 row(s) in 0.4850 seconds
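One thing to notice in the scan output: the values carry leading spaces, because PigStorage split on commas only and the space after each comma stayed in the field. Since HBaseStorage also works as a loader, you can read the table back into PIG and clean that up. Here is a sketch; the -loadKey option asks HBaseStorage to return the row key as the first field, and TRIM ships as a builtin in PIG 0.9+ (on earlier versions you would use the piggybank equivalent):

-- Read the table back; -loadKey prepends the row key to each tuple.
names = LOAD 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:fname info:lname', '-loadKey true')
    AS (listing_id: chararray, fname: chararray, lname: chararray);

-- Strip the stray leading spaces that came in from the CSV.
clean = FOREACH names GENERATE
    listing_id, TRIM(fname) AS fname, TRIM(lname) AS lname;
DUMP clean;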
Sample Code
You can download the sample code from this blog post here.
Next: Column Families
PIG 0.9.0 adds some new functionality for treating entire column families as maps. I’ll follow up with some examples, along with some UDFs we wrote to support that.
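As a rough preview of what that looks like, assuming PIG 0.9’s HBaseStorage: asking for a whole column family with the 'info:*' syntax returns each row’s family as a single PIG map, keyed by column qualifier. A sketch (the relation names here are just illustrative):

-- Load the entire 'info' column family as a map.
rows = LOAD 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:*', '-loadKey true')
    AS (listing_id: chararray, info: map[]);

-- Pull individual columns back out of the map by qualifier.
fnames = FOREACH rows GENERATE listing_id, info#'fname' AS fname;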