Cross posted from my company blog post
When searching for a person on WhitePages.com, geographic location plays an important role. Often, a user won’t know the exact location of the person they are searching for, but they will know approximately where the person lives or works. Here is how we work behind the scenes to connect the dots so that our users can find the most accurate and up-to-date contact information for the people they are intending to locate.
Radius Expansion
A great approach to expanding place names is to use radius expansion. Instead of looking for people in Seattle and the adjacent towns, we look for people within some number of miles of the center of Seattle. However, when working with really big data sets in Solr like we do, radius searching can present some problems.
One approach is to use a bounding box to create a range-based query based on upper and lower bounds for latitude and longitude fields. The problem with this approach is that Lucene will have to potentially scan some very large document sets. For example, a latitudinal range for a San Diego query might also include Dallas-Fort Worth. As Lucene performs this query, it will create one document set for the latitudinal range and one for the longitudinal range. It must then intersect these geographic sets along with the matching document sets for other query terms. To improve query latency and maximize our cache effectiveness, we want to minimize the size of these document sets as much as possible.
We needed to come up with an index that allows us to scan a smaller set of documents than using the range-based approach above. Fortunately, Local Lucene includes some great tools that can be used to do this.
Cartesian Tiers
Cartesian tiers map locations to a series of boxes at increasing levels of precision. At each tier level, the box gets smaller and more precise. We can then query for a number of specific Cartesian tier boxes based on the level of precision we required.
The spatial library provided with Lucene comes with code to plot any number of tiers for a given latitude and longitude. Additionally, the library can calculate best fit and a list of tier box IDs given latitude, longitude and radius. Based on a tier box ID, we can use the Lucene index to list the documents in the index that have been tagged with that box ID. This gives you everything you need to build a query based on fields containing tier IDs. At WhitePages we do this using a custom QParserPlugin based on the one provided by IBM.
These tiers allow us to minimize the number of potential matching documents quite easily. However, without creating many granular tiers, we can’t use them for scoring matches based on their distance from the search location. We can’t create highly accurate tiers without blowing up the size of the index.
Geohashing
To solve this we use the geohashing support provided by Local Lucene. A geohash is a latitude/longitude geocode system that encodes a precise location within a string encoding. Geohash strings consist of lowercase letters and numbers, encoding the latitude and longitude, using base 32 as defined by the table below.
The even bits indicate the longitude and the odd bits indicate the latitude. Each additional pair of bits, further divides the space by a factor of two. For example, to represent the location:
Latitude: 47.723896
Longitude: -122.227071
Longitude: -122.227071
Using binary, we would obtain the binary strings:
Longitude: 001010010001010100111101101000
Latitude: 110000111101111110111101100000
Latitude: 110000111101111110111101100000
This could then be represented by the geohash (precision 12):
Geohash: c23p6xugyg40
An important property of this algorithm is that shorter strings will contain the locations of longer strings. Trading off with precision can decrease index size.
We can then implement a custom query that decodes the geohash to return a score based on the distance. Again, at WhitePages we do this using a custom QParserPluging based on the IBM spatial samples.
Alternate Geospatial Approaches
There are several other systems and approaches out there that could potentially be used. For example, a geohash has the unique property that you can decrease the precision of the geohash by simply looking at a subset of the geohash string. The geohash for the lat/long [47.723896, -122.227071] is “c23p6xugyg40” with a precision of 12. At a precision of 8, the geohash “c23p6xug” represents a projection that includes this location within an approximately 38m x 38m region. It is possible to come up with an algorithm that given a lat/long and distance, it will provide the minimal set of geohash strings to search for. This algorithm would allow us to eliminate the need for Cartesian tiers and we could perform filtering and scoring in one step. Note however, that using Cartesian tiers as a filter allows us to effectively cache tier filters for commonly searched areas. For instance, the various tiers around New York will generally always be cached for us in production.
Another notable system is the Military Grid Reference System, which is the geocoordinate system used by NATO militaries for locating points on the earth. There is currently no support for this built into Solr/Lucene though it could certainly be added in a similar way to the current Cartesian tier and geohash support.
1 comment:
Interesting article - thanks for putting it up.
If I understand this correctly, the above approach essentially assumes a completely flat database.
How many search requests might be expected on say a 4 core dedicated server for a 10 million record in say, 1 minute?
Post a Comment