October 19, 2011

HBaseStorage and PIG

Cross posted from my company blog post


We’ve been using PIG for analytics and for processing data for use in our site for some time now. PIG is a high level language for building data analysis programs that can run across a distributed Hadoop cluster. It has allowed us to scale up our data processing while decreasing the amount of time it takes to run jobs.

When it came time to update our runtime data storage for the site, it was natural for us to consider using HBase to achieve horizontal scalability. HBase is a distributed, versioned, column-oriented store based on Hadoop. One of the great advantages of using HBase is the ability to integrate it with our existing PIG data processing. In this post I will introduce you to the basics of working with HBase from your PIG scripts.

Getting Started


Before getting into the details of using HBaseStorage, there are a couple of environment variables you need to set so that it can work correctly.

export HBASE_HOME=/usr/lib/hbase
export PIG_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$PIG_CLASSPATH"

First, you will need to let HBaseStorage know where to find the HBase configuration, thus the HBASE_HOME environment variable. Second, the PIG_CLASSPATH needs to be extended to include the classpath for loading HBase. If you are using PIG 0.8.x there is a slight variation:

export HADOOP_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$HADOOP_CLASSPATH"


Hello World


Let’s write a simple script to load some data from a file and write it out to an HBase table. To begin, use the shell to create your table:

jhoover@jhoover2:~$ hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.90.3-cdh3u1, r, Mon Jul 18 08:23:50 PDT 2011

hbase(main):002:0> create 'sample_names', 'info'
0 row(s) in 0.5580 seconds

Next, we’ll put some simple data in a file ‘sample_data.csv’:

1, John, Smith
2, Jane, Doe
3, George, Washington
4, Ben, Franklin

Then we’ll write a simple script to extract this data and write it into fixed columns in HBase. HBaseStorage uses the first field of each tuple as the HBase row key and maps the remaining fields, in order, to the columns named in its argument:

raw_data = LOAD 'sample_data.csv' USING PigStorage( ',' ) AS (
    listing_id: chararray,
    fname: chararray,
    lname: chararray );

STORE raw_data INTO 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:fname info:lname');

Then run the pig script locally:

jhoover@jhoover2:~/hbase_sample$ pig -x local hbase_sample.pig

Success!

Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local_0001 raw_data MAP_ONLY hbase://sample_names,

Input(s):
Successfully read records from: "file:///autohome/jhoover/hbase_sample/sample_data.csv"

Output(s):
Successfully stored records in: "hbase://sample_names"

Job DAG:
job_local_0001

You can then see the results of your script in the hbase shell:

hbase(main):001:0> scan 'sample_names'
ROW COLUMN+CELL
1 column=info:fname, timestamp=1356134399789, value= John
1 column=info:lname, timestamp=1356134399789, value= Smith
2 column=info:fname, timestamp=1356134399789, value= Jane
2 column=info:lname, timestamp=1356134399789, value= Doe
3 column=info:fname, timestamp=1356134399789, value= George
3 column=info:lname, timestamp=1356134399789, value= Washington
4 column=info:fname, timestamp=1356134399789, value= Ben
4 column=info:lname, timestamp=1356134399789, value= Franklin
4 row(s) in 0.4850 seconds
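
Reading the data back out is just as easy. Here’s a minimal sketch of the reverse direction; the '-loadKey true' option asks HBaseStorage to return the row key as the first field of each tuple (check that your PIG version supports this option):

-- load rows back from HBase; the row key arrives as the first field
names = LOAD 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:fname info:lname', '-loadKey true') AS (
    listing_id: chararray,
    fname: chararray,
    lname: chararray );

DUMP names;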

Sample Code


You can download the sample code from this blog post here.

Next: Column Families


In PIG 0.9.0 we get some new functionality for treating entire column families as maps. I’ll post some examples, as well as some UDFs we wrote to support that, next.

April 14, 2011

How to Implement Geospatial Search Using Solr

Cross posted from my company blog post


When searching for a person on WhitePages.com, geographic location plays an important role. Often, a user won’t know the exact location of the person they are searching for, but they will know approximately where the person lives or works. Here is how we work behind the scenes to connect the dots so that our users can find the most accurate and up-to-date contact information for the people they are intending to locate.


Radius Expansion



A great approach to expanding place names is to use radius expansion. Instead of looking for people in Seattle and the adjacent towns, we look for people within some number of miles of the center of Seattle. However, when working with really big data sets in Solr like we do, radius searching can present some problems.


One approach is to use a bounding box to create a range-based query based on upper and lower bounds for latitude and longitude fields. The problem with this approach is that Lucene may have to scan some very large document sets. For example, a latitudinal range for a San Diego query might also include Dallas-Fort Worth. As Lucene performs this query, it will create one document set for the latitudinal range and one for the longitudinal range. It must then intersect these geographic sets along with the matching document sets for other query terms. To improve query latency and maximize our cache effectiveness, we want to minimize the size of these document sets as much as possible.
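
To make that concrete, here is roughly what such a bounding-box filter would look like as a Solr filter query. The lat/lng field names and the box around San Diego are hypothetical, assuming numerically indexed latitude and longitude fields:

fq=lat:[32.53 TO 33.11] AND lng:[-117.45 TO -116.80]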




We needed to come up with an index that allows us to scan a smaller set of documents than using the range-based approach above. Fortunately, Local Lucene includes some great tools that can be used to do this.


Cartesian Tiers



Cartesian tiers map locations to a series of boxes at increasing levels of precision. At each tier level, the boxes get smaller and more precise. We can then query for a number of specific Cartesian tier boxes based on the level of precision we require.




The spatial library provided with Lucene comes with code to plot any number of tiers for a given latitude and longitude. Additionally, the library can calculate the best-fit tier and a list of tier box IDs given a latitude, longitude and radius. Given a tier box ID, we can use the Lucene index to list the documents tagged with that box ID. This gives you everything you need to build a query based on fields containing tier IDs. At WhitePages we do this using a custom QParserPlugin based on the one provided by IBM.
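
For reference, plotting box IDs with the Lucene spatial contrib looks something like the sketch below. The class and method names are from the 2.9/3.x contrib as best I recall (including the library’s famously misspelled DEFALT_FIELD_PREFIX constant), so treat this as an illustration rather than a definitive recipe:

import org.apache.lucene.spatial.tier.projections.CartesianTierPlotter;
import org.apache.lucene.spatial.tier.projections.IProjector;
import org.apache.lucene.spatial.tier.projections.SinusoidalProjector;

public class TierExample {
    public static void main(String[] args) {
        IProjector projector = new SinusoidalProjector();

        // bestFit() suggests a tier level for a given search radius in miles.
        CartesianTierPlotter bootstrap =
            new CartesianTierPlotter(0, projector, CartesianTierPlotter.DEFALT_FIELD_PREFIX);
        int tier = bootstrap.bestFit(25);

        // Compute the box ID for a document's location at that tier; this is
        // the value you would index in the corresponding _tier_N field.
        CartesianTierPlotter plotter =
            new CartesianTierPlotter(tier, projector, CartesianTierPlotter.DEFALT_FIELD_PREFIX);
        double boxId = plotter.getTierBoxId(47.723896, -122.227071);
        System.out.println(plotter.getTierFieldName() + " = " + boxId);
    }
}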


These tiers allow us to minimize the number of potential matching documents quite easily. However, without creating many granular tiers, we can’t use them for scoring matches based on their distance from the search location. We can’t create highly accurate tiers without blowing up the size of the index.


Geohashing



To solve this we use the geohashing support provided by Local Lucene. A geohash is a latitude/longitude geocode system that encodes a precise location as a short string. Geohash strings consist of lowercase letters and digits: the interleaved latitude and longitude bits are encoded in base 32, using the digits 0-9 and the lowercase letters other than a, i, l and o.




The even bits encode the longitude and the odd bits encode the latitude. Each additional pair of bits further divides the space in half along each dimension. For example, to represent the location:


Latitude: 47.723896

Longitude: -122.227071


Encoding each coordinate as a binary fraction of its full range, we obtain the bit strings:


Longitude: 001010010001010100111101101000

Latitude: 110000111101111110111101100000


This could then be represented by the geohash (precision 12):


Geohash: c23p6xugyg40
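
To make the mechanics concrete, here is a small Ruby sketch of the standard geohash encoding (the textbook algorithm, not the Local Lucene implementation itself):

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lng, precision = 12)
  lat_range = [-90.0, 90.0]
  lng_range = [-180.0, 180.0]
  hash = ""
  even = true  # even bits encode longitude, odd bits latitude
  bits = 0
  ch = 0
  while hash.length < precision
    range, value = even ? [lng_range, lng] : [lat_range, lat]
    mid = (range[0] + range[1]) / 2
    if value >= mid
      ch = (ch << 1) | 1
      range[0] = mid
    else
      ch <<= 1
      range[1] = mid
    end
    even = !even
    bits += 1
    if bits == 5  # every 5 bits becomes one base-32 character
      hash << BASE32[ch]
      bits = 0
      ch = 0
    end
  end
  hash
end

puts geohash_encode(47.723896, -122.227071)  # => c23p6xugyg40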


An important property of this encoding is that a shorter geohash describes a region containing every location whose geohash it prefixes. Trading away precision can therefore decrease index size.


We can then implement a custom query that decodes the geohash to return a score based on the distance. Again, at WhitePages we do this using a custom QParserPlugin based on the IBM spatial samples.


Alternate Geospatial Approaches



There are several other systems and approaches out there that could potentially be used. For example, a geohash has the useful property that you can decrease its precision by simply taking a prefix of the geohash string. The geohash for the lat/long [47.723896, -122.227071] is “c23p6xugyg40” at a precision of 12. At a precision of 8, the prefix “c23p6xug” represents a projection that includes this location within an approximately 38m x 38m region. It is possible to come up with an algorithm that, given a lat/long and a distance, provides the minimal set of geohash prefixes to search for. Such an algorithm would eliminate the need for Cartesian tiers and let us perform filtering and scoring in one step. Note, however, that using Cartesian tiers as a filter allows us to effectively cache tier filters for commonly searched areas. For instance, the various tiers around New York will generally always be cached for us in production.


Another notable system is the Military Grid Reference System, which is the geocoordinate system used by NATO militaries for locating points on the earth. There is currently no support for this built into Solr/Lucene, though it could certainly be added in a similar way to the current Cartesian tier and geohash support.

March 25, 2010

Configuring REST for ActiveMQ

I recently decided I wanted to try out the REST interfaces for ActiveMQ. My assumption going into this was that they were implemented as a transport connector, much like STOMP and OpenWire. Adding to my confusion was the fact that there is a transport connector called "http". In fact, if you check out the samples, you will see that the demo application uses the REST interface via a MessageServlet in the demo webapp. Likewise, AjaxServlet implements the AJAX interface used by amq.js.

To use these servlets you will need to embed them in a servlet container. The default ActiveMQ configuration does this by importing a Jetty config:

 <import resource="jetty.xml"/> 


The Jetty bean config includes a demo webapp:

<beans>
  <jetty xmlns="http://mortbay.com/schemas/jetty/1.0">
    <connectors>
      <nioConnector port="8161"/>
    </connectors>
    <handlers>
      <webAppContext contextPath="/demo" resourceBase="${activemq.base}/webapps/demo" logUrlOnStart="true"/>
    </handlers>
  </jetty>
</beans>


Then in the WEB-INF/web.xml for your web application, you simply include the MessageServlet:

<servlet>
  <servlet-name>MessageServlet</servlet-name>
  <servlet-class>org.apache.activemq.web.MessageServlet</servlet-class>
  <load-on-startup>1</load-on-startup>
</servlet>
<servlet-mapping>
  <servlet-name>MessageServlet</servlet-name>
  <url-pattern>/rest/*</url-pattern>
</servlet-mapping>
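
With that in place you can smoke-test the interface with curl. Something like the following should work; the URL assumes the webapp is deployed at the /demo context as above, and the body parameter convention is from the ActiveMQ REST docs as I understand them:

# send a message to queue TEST
curl -d "body=hello" "http://localhost:8161/demo/rest/TEST?type=queue"

# receive one message from the same queue
curl "http://localhost:8161/demo/rest/TEST?type=queue"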


As you can see it's pretty easy from there to customize where you map your MessageServlet, choose your port, etc. You'll note also that ActiveMQ includes a bunch of other useful administrative webapps in the Jetty instance.

October 26, 2009

Mac OSX Terminal Settings for a sane shell

I just switched from a two-year-old MBP to a new 17" one. First off, this thing is amazing. It has a bigger footprint, but weighs about the same. Not surprising, since it is so much thinner. The battery life is outrageous: I've gone most of a day at work without a charger. The screen is gorgeous and very easy to read. So that's all great.

However, I had to recreate all my settings from my last one. Since I had to dig these up from scratch again, I'll record them here so that I don't have to do it again in the future. Maybe this is useful for someone else as well:

Launch Terminal.app and open Preferences. Go to the Keyboard tab and select "Use option as meta key". Next, enter the following keycodes:
  • control cursor down - \033Ob
  • control cursor left - \033Od
  • control cursor right - \033Oc
  • control cursor up - \033Oa
  • end - \033[4~
  • page down - \033[6~
  • page up - \033[5~
  • shift cursor down - \033[b
  • shift cursor left - \033[d
  • shift cursor right - \033[c
  • shift cursor up - \033[a
  • shift end - \033[F
  • shift home - \033[H
  • shift page down - \033[6~
  • shift page up - \033[5~
Finally, go to "Advanced" and select "Delete sends Ctrl-H".

This isn't totally sane though. I don't know the right codes for "shift page up" (above I've just mapped it to plain page up) and I'm too lazy to go spend a few hours messing with terminfo.

April 3, 2008

Animated GIFs and externally hosted images in Flash

We're about to change the way some of our Flash works. One of the problems that comes up is how to load an animated GIF into one of our Flash applications. Well, it turns out someone built a (not free) component for this:

How to load animated GIFs using Adobe Flex 2.0

Next problem, we need to display images that don't have a friendly crossdomain.xml letting us access them. Someone has solved this too (free):

Ignoring crossdomain.xml policy with AS3

Hurray

April 1, 2008

Berkeley DB Replication & Ruby on Leopard

I've been messing around with Berkeley DB & Ruby this evening as an experiment. I wanted to try out their replication support along with master election. This may wind up being a faster way of doing master election than the Paxos library I wrote for Ruby. Anyway, getting replication enabled in Ruby BDB on Leopard wasn't straightforward. The MacPorts versions of Berkeley DB don't have threads enabled, so the replication manager isn't there. So...

First, build DB 4.6 with threads:

Download db-4.6.21 from Oracle, then:

tar -zxf db-4.6.21.tar.gz
cd db-4.6.21/build_unix
../dist/configure --prefix=/usr/local/BerkeleyDB.4.6 --enable-cxx --disable-tcl --enable-pthread_api
make
sudo make install

Then build the Ruby bdb extension:

wget ftp://moulon.inra.fr/pub/ruby/bdb.tar.gz
tar -zxf bdb.tar.gz
cd bdb-0.6.2
ruby extconf.rb --with-db-dir=/usr/local/BerkeleyDB.4.6 --with-db-version=-4.6
make
sudo make install
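
As a quick sanity check that the extension links against your freshly built library (BDB::VERSION is, if I'm remembering the constant right, the version string of the underlying Berkeley DB):

ruby -rbdb -e 'puts BDB::VERSION'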

You should now have a working Ruby/Berkeley DB setup with replication support. Hurray.

August 10, 2007

Why use parens in Ruby?

Here is another Ruby gotcha. Ruby allows you to skip the parentheses on method calls. These two lines are equivalent:

set_table_name "users"
set_table_name("users")

In some cases this makes your code mildly more readable. In practice, though, it is dangerous. Consider this code:

if arr1.include? "foo" && !arr2.include? "bar" then
  # do something
end

Ruby's parsing rules say that an unparenthesized argument swallows the entire && expression, so the && is evaluated prior to calling arr1.include?. So first you evaluate:

"foo" && !arr2.include? "bar"

Which always returns true or false. Then you evaluate:

if arr1.include? true
  # do something
end

Which is clearly not what you intended. This code is correct:

if arr1.include?("foo") && !arr2.include?("bar") then
  # do something
end
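
You can watch the gotcha bite in a self-contained snippet:

arr1 = ["foo", false]  # note: also contains false
arr2 = ["bar"]

# Without parens, Ruby parses this as arr1.include?("foo" && !arr2.include?("bar")),
# which is arr1.include?(false) here -- so it prints even though "bar" is in arr2.
if arr1.include? "foo" && !arr2.include?("bar") then
  puts "surprise!"
end

# With parens, the condition is false and nothing prints.
if arr1.include?("foo") && !arr2.include?("bar") then
  puts "never reached"
end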

As a rule, you should never call a method without parenthesizing its arguments. Even when the call isn't part of any boolean logic, parenthesize anyway: someone else may come along later and change your code without realizing this gotcha exists. Code defensively; use parens.