The Happy Technologist Interesting Geekdom


Efficient Geographic Coverage in SQL Server

Part 0: The Problem...

A fairly common problem in the real world is to make sure you have enough of something -- hospitals, gas stations, water towers -- to accommodate the need in a particular area.  In Health care, where I finally got to tackle this problem in the real world, this is called "Network Adequacy".  In most places, the metric is basically the percentage of insured people that are within a certain distance of medical providers.  The requirements change based on the provider type too... it may be important to have 90% of the population within 60 miles of a hospital, but you'd want an urgent care or primary care facility within, say, 20 miles of that 90%.  The problem is compounded when you have to take into account the specialty of the providers.  Being within 20 miles of a dentist, but 100 miles of a pediatric doctor or an OB/GYN is not sufficient.

It's important to note that the distance is typically computed in actual travel time; it doesn't help to be directly across the river from a hospital if you don't have a bridge or a boat available.  That version of the problem is wildly complicated but can be solved by tapping a travel time API such as google maps.  These are complicated calculations; determining the distance between two points (a member and a doctor) in one second is probably doing pretty well.

But we have to do this for hundreds of thousands of members and tens of thousands of providers, for each provider type and specialty.  That's millions upon millions of seconds... we don't have that sort of time.  So we have to find a way to quickly pare down the results.  The first and most obvious way is to eliminate the "drive time" requirement and focus first on the as-the-bird-flies distance... making some assumptions based on fastest interstate speeds, we know at least that we can't do BETTER than a straight line, so at least we can limit the number of driving paths we have to calculate, and this is something we can do in pure SQL.

Part I: The Naive but State of the Art Solution

Most modern relational databases have geospatial features that allow you to "quickly" (simply at least) perform calculations such as the distance between two points without worrying much about the math, and the math is quite complex.  There are plenty of references for this sort of math, but even assuming the earth is a sphere (it's not -- oil companies and pilots that need less than a 0.25% error must treat the earth as an ellipsoid), there is some advanced trigonometry involved.

SQL Server (since that's what we're working with at the moment, but other databases have similar capabilities), has a specific datatype called "geography" to help handle this.  You can call

geography::POINT(Latitude, Longitude, SRID)

to generate the point.  The SRID is the part that tells SQL Server if it should use an elliptical earth or a spherical one, or one of any number of minor variations.  More than you ever wanted to know about this is available on a white paper here. We use SRID 4326, which is pretty standard.

If you have an address, you can get the Latitude and Longitude through many means, but for bulk US addresses I must recommend the people at the open source Data Science Toolkit.

Anyway, once you have it set up it's fairly trivial to geocode all the points and store them in a database.  At that point, you can create a spatial index on the geography column and life is good.  Then, for each member, we run a nearest neighbor algorithm on the providers based on these geography columns and we can relatively painlessly count people that are within so many miles of a voodoo priest, or whatever we're going for.

The dataset we're using now runs about as efficiently as SQL Server will do natively with this query:

select MEMBER_ID
from MemberTable m
cross apply (select top(1) PROVIDER_ID from ProviderTable p with(index(geopoint_index))
       where p.geopoint.STDistance(m.geopoint) <= (1609.34 * 10)
       and p.ProviderType = 'PCP'

This returns a list of only the members that have a PCP provider within 10 miles. This works OK, but not great. A few things about the query:

  • It does not require a hint, but we put one in for clarity.  Sometimes it is faster WITHOUT the hint, if the ProviderType filter prunes enough results to limit the number of comparisons enough that the STDistance calculation can work without the index
  • The 1609.34 * 10 is converting meters to miles.  We want Primary Care Physicians within 10 miles in this example.
  • We can't pre-calculate this -- i.e. store a nearest neighbor for each member -- because the filters and distances change based on need, and we need at least 20 something combinations that are subject to frequent change.  It's doable, but it takes database space and it's not hugely elegant.
  • I say "not great" because the run time on any given query can be around 10 minutes with our data-set and hardware, even if we're using the best indexes.  There's much discussion about the black art of spatial index tuning, and it's fascinating, but we're pretty confident we eked out the last bit of speed with a naive query.

Part II:  The next best optimization on the web...

Someone has already published a very clever optimization that improves upon this, and I'd be remiss if I didn't sing its praises somewhat.  The basic code looks like this:

DECLARE @start FLOAT = 1000;
WITH NearestPoints AS
   SELECT TOP(1) WITH TIES *, T.g.STDistance(@x) AS dist
   FROM Numbers JOIN T WITH(INDEX(spatial_index))
   ON T.g.STDistance(@x) < @start*POWER(2,Numbers.n)
   FROM NearestPoints
   ORDER BY n, dist

It's based on the idea that SQL Server's indexes are better at limiting the search for nearby neighbors if the distances are small, so this optimized SQL starts with a small distance and basically loops over increasing distances (exponentially) until it finds a match.   This is all due to how SQL Server indexes these geographies... the full article is available from Microsoft, but the short version is that it splits the world into a grid and splits each section of that grid into grids and so on, four times.  Each point is then indexed by which section of each grid it belongs.  That way, if grids are, say, 1000 meters on a side, and you want to find points within 500 meters of a spot, you can narrow your search to only a few grids nearby the spot.  Here's the visual from Microsoft:

 This is an improvement, but it has a few issues for our application:

  • Our requirement that we filter on provider type and specialty mean that the query still spins on mismatched results... this affects filter selection; if we choose the spatial index then we lose the ability to ignore providers of unwanted types, which is very expensive, but if we choose an index based on type and specialty, we lose the performance improvement of the STDistance algorithm due to the geospatial index.
  • We still can't pre-calculate and cache everything without adding a large number of new columns to account for the varied distances, types, and specialties.
  • The query doesn't do well if it doesn't find a match, and takes some tweaking to not go beyond a specific distance.  That is, we could encounter situations where someone isn't within any reasonable distance of a doctor, or that there's even no doctor in our network of a given specialty.  This query will spin for a while (until it runs out of numbers, I suppose), and eventually return NULL.  This could be easily tweaked, but it's still not as efficient as we need.

Part III:  Dreaming Up Something Better

The balance we finally struck is based on something I came up with while I was asleep a few days ago (or at least, either while I was going to bed or waking up that night in some twilight sleep).  I'm super excited about it, which is obviously why I made you wade through paragraphs of pre-material to get here.  Anyway, it comes down to triangulating and comparing distances the way we learned in Boy Scouts.

I selected three more-or-less-arbitrary points.  One in the south-western most point of Kentucky, one in the northernmost, and one  in the easternmost.  For each member and provider I built three columns containing the distance, in miles, to each of these three points respectively.

Three points around Kentucky.

Three points around Kentucky.

Now our query becomes:

select MEMBER_ID
from MemberTable m
cross apply (select top(1) PROVIDER_ID from ProviderTable p with(index(geopoint_index))
           abs(m.distance_1 - p.distance_1) < 10
       and abs(m.distance_2 - p.distance_2) < 10
       and abs(m.distance_3 - p.distance_3) < 10
       and p.geopoint.STDistance(m.geopoint) <= (1609.34 * 10)
       and p.ProviderType = 'PCP'

What's going on here?  Well, if you didn't take an orienteering class in Boy Scouts, or if you've misplaced your geometry text, here's the simple version:  you can tell where any point is by its distance from three other points (as long as those three points aren't in a straight line, that can cause a bit of confusion).  That is, if I know a member is 100 miles from point A, 150 miles from point B, and 200 miles from point C, I know exactly where they are.  This won't work with one point, because they could be anywhere in a circle, and it normally won't work with two points because the circles around two points meet twice (unless you're exactly half way between them).

You can see this in the image below... this shows circles roughly 15 miles in width that measure the distance from each blue point to the red dot near the middle of KY.  The three circles only intersect in one point (indicating the person at that point).  Note that there are a few places outside Kentucky where two circles intersect, which is precisely why we need three.

Triangulating a person in Kentucky

Triangulating a person in Kentucky

While the triangulation of a single point is perfect, comparing distances is imprecise.  If you enlarge the image you can see that the pattern where all three circles cross is an odd curved hexagon.  If a provider were in one of the corners of that hexagon it would pass the three-distance test but still be more than 10 miles away.  This is why we need the STDistance function at the end of the query.  Still, the triangulation quickly eliminates most providers and it never eliminates a point that is actually within distance.  The actual run time here is an order of magnitude faster than the other approaches we tried.  Here's the rundown for this version:

  • Pre-calculating the three distances is very fast (three distance calculations for each member and provider -- much less than calculating each and every distance between the two groups), takes up a known amount of space in the database, and only needs to be changed for addresses that change or new records (compared to pre-calculating a nearest neighbor, where every record would need to be re-evaluated if a new provider were added).  This is manageable.
  • Geospatial tuning is not needed.  We added an index to the distance columns in each table, but those are traditional indexes; this query runs quickly with no geospatial index (because very few STDistance calls are actually made).  Some day, when you can create a geospatial index WITH other columns, this may be very different.
  • This works well for our problem... it may going to be as fast for a single nearest neighbor calculation, but for network adequacy, which is somewhat different than nearest neighbor, this is an improvement.

I'm super pumped about this anyway.  It's probably not as exciting to you, but this is the sort of thing that makes me feel good. 🙂



Two Years with the Heritage Health Prize

ED: I spoke to a reporter yesterday for a half hour or so, discussing the final stretch of the Heritage Health Prize data mining competition I've been a competitor in for the past couple of years. Her article came out today and is posted here: 3-Million-Health-Puzzler-Draws-to-a-Close. I'm quoted as saying only: "They set the bar too high". I probably said that; I said a lot of things, and I don't want to accuse Cheryl of misquoting me (she was quite nice, and her article is helpful, well written, and correct), but I feel like a lot of context was missed on my comment, so I'm just going to write an article of my own that helps explain my perspective... I've been meaning to blog more anyway. 🙂

On April 4th 2011, a relatively unknown company called "Kaggle" opened a competition with a $3 Million bounty to the public. The competition was called the "Heritage Health Prize", and it was designed to help healthcare providers determine which patients would benefit most from preventive care, hopefully saving the patients from a visit to the hospital, and saving money at the same time. And not just a little money either ... the $3 Million in prize money pales in comparison to the billions of dollars that could be saved by improving preventive care. The Trust for America's Health estimates that spending $10 in preventive care per person could save $16 billion per year, which is still just the tip of the iceberg for soaring health care prices in the United States.


Big Data: It’s not the size that matters

Many an article has been spent defining "Big Data"... everyone agrees that "Big Data" must be, well, large, and made up of data. There may be seemingly new ways of handling big data:tools such as Hadoop and R (my personal favorite) and concepts like No-SQL databases, and an explosion of data due to new collection tools: faster and more prolific sensors, higher quality video, and social websites. Large companies with the wherewithal to build petabyte and larger data centers are learning to collect and mine this data fairly effectively, and that's all very exciting -- there's a wealth of knowledge to be gleaned from all this data. But what about the rest of us?

The thing is, it's not really a matter of collecting and hoarding a large amount of data yourself. It's how you use and take advantage of the data that you do have available that is at the core of these new trends.


Datamining, Privacy, and Ethics

[I'm trying to write shorter blog posts these days -- let's see how that goes]

There was a lot of chatter recently around about how Target (the shopping chain) has used data mining to identify pregnant shoppers in an effort to woo them as loyal customers. This is a prime example of things that are of direct interest to me: data mining, privacy, and the ethics surrounding the vast amount of knowledge we can compile about everything today, so I thought I'd share my perspective.

First off, the NYT article should not have been a surprise to anyone familiar with data. I've worked very closely with data mining teams on large retailers, insurance companies, and government agencies, and they uncover correlations all the time that lead to spooky predictability. The classic example of this is correlated sales of diapers and beer (from

A number of convenience store clerks, the story goes, noticed that men often bought beer at the same time they bought diapers. The store mined its receipts and proved the clerks' observations correct. So, the store began stocking diapers next to the beer coolers, and sales skyrocketed.

One common interpretation was that a new father was sent out in the night to get much needed diapers, which put him in the mood to buy a six-pack.  Of course, that last part is purely subjective, but that's the story.

The article goes on to call this a "myth," but even if the specific case isn't verifiable, the decades-old example is on point for what it describes: Everyone™ is trying to make money by learning about predictable patterns, then exploiting those patterns to achieve their goals.  This has been going on for thousands of years at a very human level in sales: vendors put up shops in high traffic areas, they're careful what they put in plain view to attract customers, they offer sales on one item and try to get you to buy more things once you're there, they give better prices to loyal customers.  Think of those examples in a modern shopping mall, then think of them in an ancient city square. It's not hard to imagine examples in both places.


A Moment for the HTC EVO 3D

I've been a Sprint user for over 10 years, at least according to Amanda, who cheerfully explained to me why my cell phone bill never makes sense but that they appreciate my loyalty anyway as she sold me my new phone a couple of weeks ago.

My new phone is the HTC EVO 3D, but enough about that for now. First, it's important to talk about my PREVIOUS phone, which was the very underrated Samsung Moment.

The Moment was a very early generation Android phone which managed to hit just about all the design elements I wanted. Despite being a bit too slow (Angry Birds never played quite right) and missing out on simple things like multi-touch which Samsung apparently left out to get it to market quickly and inexpensively, it was one of my favorite phones. My Moment was a replacement for a Palm Treo which dutifully kept by my side for several years (forever in smartphone land).

Samsung MomentThe Moment was a wide slide-out keyboard styled phone. If anyone is reading this that designs phone keyboards, go pick this up and play with it -- it's the best. The keys are clearly separated and slightly raised so that touch-typing, such that is is on a teensy-weensy keyboard is actually possible. I wasn't quite able to bang out entire novellas without looking, but I could get pretty far into a decent text message with minimal mistakes while watching Netflix. The keyboard rocked.

Moreover, the Moment set aside the typical 4-button Android interface (Home, Menu, Back, Search) that seems prolific, instead opting for the three required buttons (Home, Menu, Back), and two buttons dedicated to phone operation (Pickup and Hangup, where the Hangup button also acted as a power button for the overall phone). Most importantly, though, the phone had a tiny touchpad that depressed as a select button. I haven't seen better cursor control on any smart phone, although the Palm and the Blackberry dedicated rollerballs and rockers are fairly close.

The HTC EVO 3D with which I now entertain myself boasts none of this coolness. The more-than-4-inch screen is gorgeous, responsive (the phone is wicked fast), and I've whittled down the on-screen keyboard options to a few that I like (I'm currently using SwiftKey X which has a curious habit of predicting words when nothing has been typed -- it currently assumes that I want to say "I am a beautiful person." if I don't give it any other starting letters). But it's not as cute or cuddly as the Samsung Moment.

But, and this is very important:



Marbled Rye and Evolutionary Algorithms

Since no one has yet taken it upon themselves to write my unauthorized biography, it falls to me to make the following piece of information available to the public: I like to bake.

Breads and pies mostly -- I've got a couple of recipes posted here, including a pie crust that I'm pretty happy about, and a few things I've borrowed from other people. I've made a few rhubarbs lately that really turned out quite well.

One thing I recently attempted, was a Marbled Rye. This isn't a terribly difficult bread to make -- there are recipes everywhere. I was mostly pleased with it, though -- I didn't have any Caraway seeds, which add a lot of flavor, but the bread looked nice and better than a lot that I've made lately.
Chip's First Marbled Rye
One thing I experimented with, though, was yeast.

Yeast is one of those things I don't really understand. This is because the most I remember about the biological classification taxonomy was that everything was an "Animal", "Vegetable", or "Mineral" -- I have no idea which one a yeast would be. This was a problem for biologists as well, so in 1990 they changed the top three domains to be "Archaea," "Bacteria," and "Eukaryota," which has helped me in no way whatsoever because not only do I still not know which one yeast would be, but I no longer know which one I'm supposed to be, and I much preferred back when I was an Animal and the world made sense.

Anyway, yeast are largely responsible for the existence of Bourbon, which automatically qualifies them as A Good Thing™ no matter what biologists call them. Baker's yeast, which makes us happy, is "Saccharomyces cerevisiae" (note the interesting comment in Wikipedia about Crohn's and Colitis on that page -- I never knew that), and lives everywhere, so it's pretty easy to get hold of. You can leave potato-starch filled water out for a while and yeast will just show up. All it does, really, is convert sugar into bubbles and alcohol. In breads, the bubbles (Carbon Dioxide) make the breads rise... in alcohols, the alcohol well, makes the alcohol alcoholic. Yeast is glorious.


Computing on the cheap with Amazon EC2

We here at the happy technologist tend to host our own servers, because we like it, and because we can. (We also speak in the third person when there's just one of us, but there's no accounting for some people). Nothing fancy, mind you... for now, a handful of websites are running on an Ubuntu virtual machine through VirtualBox on a Windows 7 (or maybe Vista, I forget) box that otherwise serves as a Media Center. It's actually simpler than it sounds.

Lately, though, what with the Heritage Health Prize and a lot of hours spent learning and playing with data mining techniques, the poor little server has been called upon to do much more intensive work. It's routinely running simulations and calculations all night long and it's really not built for that. The fan has started humming heroically (i.e. loudly), which isn't always best for a media center.

Noone wants their media center to hate them, or to catch fire.

Enter Amazon EC2. That stands for Elastic Compute Cloud. See how clever that is -- what they did with that 2 there? Rather than go "ECC", they just counted the C twice and made it like a math or a chemistry equation. These Amazon guys are some serious funny. I'm actually very impressed with the setup they have. There's a wealth of options for configuring the virtual servers -- public AMIs (preconfigured images) are available for most major software vendor platforms, from the expected Oracle, Microsoft, and Linux offerings to MicroStrategy, R, Elastic Bamboo, Citrix, and even BitCoin configured software. Public data sets are available should you need them, advanced storage, database, failover, clustering, networking, identity management, queuing, notification, and probably a million other things at pennies-per-hour prices.

At the moment, I'm running a simulation on a 20-CPU 1.6 Terabyte beast of a machine for $0.228 per hour. This is the sort of thing that infuses me with glee. It's easily outperforming my media center by 30:1.


Unhappy Bits about Bitcoins

I wandered across bitcoin not too long ago, during some random web crawling, and downloaded it in May. I installed it, ran it, realized I was behind a firewall, killed it, uninstalled it and forgot about it for a couple of weeks until this Wired article came out and sent the whole world a'twitter about bitcoin again.

The Wired article, in short, talks about an underground website that sells illicit drugs and whose sole allowable currency is the Bitcoin. The website itself is shrouded in anonymity in the TOR network which itself is an excellent little piece of technology which I'm planning on running out of space to describe here just now, but you should look into it.

The Bitcoin spiked in popularity. You can buy and sell Bitcoins in open marketplaces such as Mt Gox (whatever that means) or Lillion Transfer if you're using some more international currencies, or you can use them directly on sites that take them, such as this Alpaca sock store. Prices quickly went from a few dollars to around $30, although they've now backed off a bit to around $20/BTC (Bitcoin).

Ok, so where are we? We can buy cocaine and alpaca socks with Bitcoins. Great. But what ARE they, again? How can you get some, and should you care?


Mosaics and More Algorithm Love

Photomosaic of Dad and Mom

My mom (whose website I should update) recently celebrated her birthday. My mom is an avid shutterbug, and abuses the digital camera we got for her a handful of Christmases ago to the tune of 10,000 photos a year, give or take. Our basement is piled with boxes of images just begging to be scanned, cataloged, sorted, and all that from back when mom was a film user (you all remember film, right?). I daydream of software that will actually be useful to that endeavor -- to say nothing of how fun it would be to digitize the piles of super-8mm movie film that goes along with it -- but so far we haven't made much of a dent.

In the mean time, though, I have a hard drive which contains about 150,000 photos from my mom's library... basically a full off-site backup in case horror happens to her computer. (ObNote: This is good practice, boys-and-girls, you should all go about giving hard drives away, with complete backups of your stuff in case of a disaster... you'll thank me when the revolution comes). Anyway, I was looking around for a way to leverage this wealth of digital media for a birthday present and decided to go with a photomosaic.


Dayton Technology Landscape Conference

Technology First is a local IT Trade Group, and their second annual "Technology Landscape Conference" was yesterday, so I dutifully (duty = I'm dating their intern) attended.

Ok, so there was some more duty... one of the companies presenting was ExpeData, a Dayton, Ohio (which is "local" for us folk) company who has a digital writing capture technology. We've been working with them for a few months to find some suitable applications and to discuss some security issues and requirements. It's a fairly interesting technology, although I have some trouble finding its killer-app.

Another interesting company whose presentation I attended was Persistent Surveillance Systems -- these guys have a 190+ MegaPixel camera array that they fly over the Cincinnati area (among others), taking pictures about once per second. When they hear about a crime, typically a murder, after the fact, they can go back and assign analysts to review the captured images to track people in the vicinity. Their software allows analysts to assign colored tracks and markers to people, vehicles, and anything else of interest -- they initially track suspects, then go back and track anyone they interacted with, anyone nearby (possible witnesses/accomplices), and whatnot. The large pixel view of the city and long video times allow them to watch people drive all the way to their destination -- a home, hideout, friends' house, or whatever -- where they can then work with police to get a warrant and follow up as appropriate. Their metadata is even good enough that they can apparently cross reference locations to find that, for example, the getaway driver from murder A may have lived next door to the suspect from murder B, which may help detectives tie together previously unrelated crimes.