Efficient Geographic Coverage in SQL Server

Part 0: The Problem…

A fairly common problem in the real world is making sure you have enough of something — hospitals, gas stations, water towers — to accommodate the need in a particular area.  In health care, where I finally got to tackle this problem in the real world, this is called “Network Adequacy”.  In most places, the metric is basically the percentage of insured people who are within a certain distance of medical providers.  The requirements change based on the provider type, too… it may be important to have 90% of the population within 60 miles of a hospital, but you’d want an urgent care or primary care facility within, say, 20 miles of that 90%.  The problem is compounded when you have to take the specialty of the providers into account.  Being within 20 miles of a dentist, but 100 miles of a pediatric doctor or an OB/GYN, is not sufficient.

It’s important to note that the distance is typically computed in actual travel time; it doesn’t help to be directly across the river from a hospital if you don’t have a bridge or a boat available.  That version of the problem is wildly complicated, but it can be solved by tapping a travel-time API such as Google Maps.  These are expensive calculations; determining the drive time between two points (a member and a doctor) in one second is probably doing pretty well.

But we have to do this for hundreds of thousands of members and tens of thousands of providers, for each provider type and specialty.  That’s millions upon millions of seconds… we don’t have that sort of time.  So we have to find a way to quickly pare down the results.  The first and most obvious way is to set aside the “drive time” requirement and focus first on the as-the-bird-flies distance.  Making some assumptions based on fastest interstate speeds, we know at least that driving can’t do BETTER than a straight line, so we can limit the number of driving paths we have to calculate, and this is something we can do in pure SQL.
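Since driving can only be longer than the straight line, a cheap great-circle distance makes a safe first-pass filter before any drive-time API call. Here’s a minimal sketch in Python using the haversine formula on a spherical earth (the function names, tuple layout, and cutoff are illustrative, not our production code):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean radius; spherical model, so ~0.25% error

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle ("as the bird flies") distance in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def candidates(member, providers, max_miles):
    """Keep only (lat, lon) providers whose straight-line distance passes
    the cutoff; only these pairs need an expensive drive-time lookup."""
    return [p for p in providers
            if haversine_miles(member[0], member[1], p[0], p[1]) <= max_miles]
```

Anything this filter drops can be skipped entirely, because the true drive distance is always at least the straight-line distance.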

Part I: The Naive but State of the Art Solution

Most modern relational databases have geospatial features that allow you to “quickly” (or at least simply) perform calculations such as the distance between two points without worrying much about the math, and the math is quite complex.  There are plenty of references for this sort of math, but even assuming the earth is a sphere (it’s not — oil companies and pilots that need less than a 0.25% error must treat the earth as an ellipsoid), there is some advanced trigonometry involved.

SQL Server (since that’s what we’re working with at the moment, but other databases have similar capabilities) has a specific datatype called “geography” to help handle this.  You can call

geography::Point(Latitude, Longitude, SRID)

to generate the point.  The SRID is the part that tells SQL Server whether it should use an elliptical earth or a spherical one, or one of any number of minor variations.  More than you ever wanted to know about this is available in a white paper here.  We use SRID 4326, which is pretty standard.

If you have an address, you can get the Latitude and Longitude through many means, but for bulk US addresses I must recommend the people at the open source Data Science Toolkit.

Anyway, once you have it set up it’s fairly trivial to geocode all the points and store them in a database.  At that point, you can create a spatial index on the geography column and life is good.  Then, for each member, we run a nearest neighbor algorithm on the providers based on these geography columns and we can relatively painlessly count people that are within so many miles of a voodoo priest, or whatever we’re going for.

The dataset we’re using now runs about as efficiently as SQL Server will do natively with this query:

select MEMBER_ID
from MemberTable m
cross apply (select top(1) PROVIDER_ID from ProviderTable p with(index(geopoint_index))
       where p.geopoint.STDistance(m.geopoint) <= (1609.34 * 10)
       and p.ProviderType = 'PCP') nearest

This returns a list of only the members that have a PCP provider within 10 miles.  This works OK, but not great.  A few things about the query:

  • It does not require a hint, but we put one in for clarity.  Sometimes it is faster WITHOUT the hint, when the ProviderType filter prunes enough results to limit the number of comparisons so that the STDistance calculation can work without the index.
  • The 1609.34 * 10 converts 10 miles to meters (1609.34 meters per mile), since STDistance works in meters.  We want Primary Care Physicians within 10 miles in this example.
  • We can’t pre-calculate this — i.e. store a nearest neighbor for each member — because the filters and distances change based on need, and we need at least 20 something combinations that are subject to frequent change.  It’s doable, but it takes database space and it’s not hugely elegant.
  • I say “not great” because the run time on any given query can be around 10 minutes with our data-set and hardware, even if we’re using the best indexes.  There’s much discussion about the black art of spatial index tuning, and it’s fascinating, but we’re pretty confident we eked out the last bit of speed with a naive query.

Part II:  The next best optimization on the web…

Someone has already published a very clever optimization that improves upon this, and I’d be remiss if I didn’t sing its praises somewhat.  The basic code looks like this:

DECLARE @start FLOAT = 1000;
WITH NearestPoints AS
(
   SELECT TOP(1) WITH TIES *, T.g.STDistance(@x) AS dist
   FROM Numbers JOIN T WITH(INDEX(spatial_index))
   ON T.g.STDistance(@x) < @start*POWER(2,Numbers.n)
   ORDER BY n
)
SELECT TOP(1) * FROM NearestPoints
ORDER BY n, dist;

It’s based on the idea that SQL Server’s indexes are better at limiting the search for nearby neighbors when the distances are small, so this optimized SQL starts with a small distance and basically loops over exponentially increasing distances until it finds a match.  This is all due to how SQL Server indexes these geographies… the full article is available from Microsoft, but the short version is that it splits the world into a grid, splits each section of that grid into grids, and so on, four times.  Each point is then indexed by which section of each grid it belongs to.  That way, if grids are, say, 1000 meters on a side, and you want to find points within 500 meters of a spot, you can narrow your search to only a few grids near the spot.  Here’s the visual from Microsoft:
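As a rough illustration of that nested-grid idea, here is a Python sketch that assigns a point to a cell at each of four levels.  (This is only the concept: SQL Server’s actual tessellation has its own configurable grid densities and object-to-cell rules, and the 16×16 split here is an assumption for illustration.)

```python
def grid_key(lat, lon, levels=4, cells=16):
    """Return one (row, col) cell per grid level, each level a cells x cells
    subdivision of the cell chosen at the previous level."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    key = []
    for _ in range(levels):
        lat_step = (lat_hi - lat_lo) / cells
        lon_step = (lon_hi - lon_lo) / cells
        row = min(int((lat - lat_lo) / lat_step), cells - 1)
        col = min(int((lon - lon_lo) / lon_step), cells - 1)
        key.append((row, col))
        # Zoom into the chosen cell and subdivide it again next iteration.
        lat_lo, lat_hi = lat_lo + row * lat_step, lat_lo + (row + 1) * lat_step
        lon_lo, lon_hi = lon_lo + col * lon_step, lon_lo + (col + 1) * lon_step
    return tuple(key)
```

Nearby points share a prefix of this key, which is why a small search radius only has to touch a few cells — and why the trick above works best when the radius starts small.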

This is an improvement, but it has a few issues for our application:

  • Our requirement that we filter on provider type and specialty means that the query still spins on mismatched results… this affects index selection: if we choose the spatial index, we lose the ability to ignore providers of unwanted types, which is very expensive; but if we choose an index based on type and specialty, we lose the performance improvement the geospatial index gives the STDistance search.
  • We still can’t pre-calculate and cache everything without adding a large number of new columns to account for the varied distances, types, and specialties.
  • The query doesn’t do well if it doesn’t find a match, and takes some tweaking not to go beyond a specific distance.  That is, we could encounter situations where someone isn’t within any reasonable distance of a doctor, or where there’s no doctor of a given specialty in our network at all.  This query will spin for a while (until it runs out of Numbers, I suppose) and eventually return NULL.  That could be tweaked easily enough, but it’s still not as efficient as we need.
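The doubling-radius idea itself is simple.  Here’s a Python sketch of it, with a maximum radius added to address that last bullet; the function names are made up, and a plain list scan stands in for the indexed table:

```python
def nearest_within(point, others, dist, start=0.62, max_radius=100.0):
    """Expanding-radius nearest neighbor: try a small radius first and
    double it until something matches, but give up past max_radius
    instead of spinning forever when no neighbor exists."""
    radius = start
    while radius <= max_radius:
        # Stand-in for the index probe: find everything inside the radius.
        hits = [(dist(point, o), o) for o in others if dist(point, o) < radius]
        if hits:
            return min(hits)[1]  # closest of the hits
        radius *= 2  # exponential growth, like @start*POWER(2, Numbers.n)
    return None  # nothing within max_radius
```

The cap is the tweak the published version needs for our case: a member with no in-network provider of a given specialty returns None quickly rather than exhausting the Numbers table.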

Part III:  Dreaming Up Something Better

The balance we finally struck is based on something I came up with while I was asleep a few days ago (or at least, either while I was going to bed or waking up that night in some twilight sleep).  I’m super excited about it, which is obviously why I made you wade through paragraphs of pre-material to get here.  Anyway, it comes down to triangulating and comparing distances the way we learned in Boy Scouts.

I selected three more-or-less arbitrary points: one at the southwesternmost point of Kentucky, one at the northernmost, and one at the easternmost.  For each member and provider I built three columns containing the distance, in miles, to each of these three points respectively.

Three points around Kentucky.

Now our query becomes:

select MEMBER_ID
from MemberTable m
cross apply (select top(1) PROVIDER_ID from ProviderTable p
       where abs(m.distance_1 - p.distance_1) < 10
       and abs(m.distance_2 - p.distance_2) < 10
       and abs(m.distance_3 - p.distance_3) < 10
       and p.geopoint.STDistance(m.geopoint) <= (1609.34 * 10)
       and p.ProviderType = 'PCP') nearest

What’s going on here?  Well, if you didn’t take an orienteering class in Boy Scouts, or if you’ve misplaced your geometry text, here’s the simple version: you can tell where any point is by its distance from three other points (as long as those three points aren’t in a straight line, which can cause a bit of confusion).  That is, if I know a member is 100 miles from point A, 150 miles from point B, and 200 miles from point C, I know exactly where they are.  This won’t work with one point, because they could be anywhere on a circle, and it normally won’t work with two points, because the circles around two points meet twice (unless you’re exactly halfway between them).

You can see this in the image below… this shows circles roughly 15 miles in width that measure the distance from each blue point to the red dot near the middle of KY.  The three circles only intersect in one point (indicating the person at that point).  Note that there are a few places outside Kentucky where two circles intersect, which is precisely why we need three.

Triangulating a person in Kentucky

While the triangulation of a single point is perfect, comparing distances is imprecise.  If you enlarge the image you can see that the pattern where all three circles cross is an odd curved hexagon.  If a provider were in one of the corners of that hexagon it would pass the three-distance test but still be more than 10 miles away.  This is why we need the STDistance function at the end of the query.  Still, the triangulation quickly eliminates most providers and it never eliminates a point that is actually within distance.  The actual run time here is an order of magnitude faster than the other approaches we tried.  Here’s the rundown for this version:

  • Pre-calculating the three distances is very fast (three distance calculations for each member and provider — much less than calculating each and every distance between the two groups), takes up a known amount of space in the database, and only needs to be changed for addresses that change or new records (compared to pre-calculating a nearest neighbor, where every record would need to be re-evaluated if a new provider were added).  This is manageable.
  • Geospatial tuning is not needed.  We added an index to the distance columns in each table, but those are traditional indexes; this query runs quickly with no geospatial index (because very few STDistance calls are actually made).  Some day, when you can create a geospatial index WITH other columns, this may be very different.
  • This works well for our problem… it may not be as fast for a single nearest-neighbor calculation, but for network adequacy, which is a somewhat different problem than nearest neighbor, this is an improvement.
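The reason the three-distance filter is safe is the triangle inequality: two points within r miles of each other can never differ by more than r miles in their distance to any third point.  A small Python sketch of the filter-then-verify logic, using plane geometry and made-up anchor coordinates as stand-ins for the geography columns:

```python
from math import hypot

# Made-up anchor points standing in for the three Kentucky corners.
ANCHORS = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]

def dist(a, b):
    return hypot(a[0] - b[0], a[1] - b[1])

def signature(p):
    """The three pre-computed distance columns for a point."""
    return [dist(p, anchor) for anchor in ANCHORS]

def within(member, provider, r):
    # Cheap filter: by the triangle inequality, two points within r of
    # each other can't differ by r or more in distance to any anchor,
    # so this step never throws away a true match...
    m_sig, p_sig = signature(member), signature(provider)
    if any(abs(dm - dp) >= r for dm, dp in zip(m_sig, p_sig)):
        return False
    # ...but it can let false positives through (the "curved hexagon"),
    # so finish with the exact distance check.
    return dist(member, provider) <= r
```

In the real query the three `abs()` comparisons hit ordinary indexes on plain numeric columns, so `STDistance` only runs on the few pairs that survive.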

I’m super pumped about this anyway.  It’s probably not as exciting to you, but this is the sort of thing that makes me feel good. 🙂


Adding a quick lunar calendar to the data mart

I ran across an article at Popular Science citing research that correlated Heart Surgery outcomes to the Lunar Calendar. That’s right… the claim is that you are more likely to survive and to spend less time in the hospital recovering from heart surgery if you get it during a waning full moon.  We know the moon can affect us in ways that make sense… for example an article popped up this week about the full moon disturbing sleep — it’s brighter, so it’s harder to sleep.  Ok.  And crime increases during full moons could make some sense too due to increased night visibility or such.  But medical outcomes?

Well, this gives us something new to play with, so let’s find out!  For context, I’m still on the data warehouse I’ve been working on for the last few months (wow, almost a year).  It has a well-behaved date dimension table, so we’re going to add a phase-of-the-moon calculation.  The basic logic for doing that is:

  • Lunar month is 29.530588853 days
  • Jan 1, 1900 at 13:52 GMT was a full moon
  • There are 9 indicators of the Lunar cycle, but it basically has 4 phases:
    • Waxing Crescent when Lunar month is in the first quarter (29.530588853/4)
    • Waxing Gibbous when Lunar month is in the second quarter
    • Waning Gibbous when Lunar month is in the third quarter
    • Waning Crescent when Lunar month is in the last quarter ( > 29.530588853*3/4)
  • The other five are specific indicators, which we will round to the nearest day:
    • New Moon is day 1
    • Dark Moon is day > 28.53
    • First Quarter is day between 6.88 and 7.88 (29.530588853 / 4 +- 0.5)
    • Full Moon is day between 14.265 and 15.265 (29.530588853/2 +- 0.5)
    • Last Quarter is day between 21.648 and 22.648 (29.530588853 *3 / 4 +- 0.5)
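Those rules translate fairly directly into code.  Here’s a sketch in Python using the constants above; the function names are mine, and the phase boundaries are my reading of the bullets rather than a verified ephemeris:

```python
from datetime import datetime

LUNAR_MONTH = 29.530588853
# Known full moon per the notes above: Jan 1, 1900 at 13:52 GMT.
FULL_MOON_EPOCH = datetime(1900, 1, 1, 13, 52)

def lunar_age(dt):
    """Days into the lunar month, where 0 is a new moon."""
    days_since_full = (dt - FULL_MOON_EPOCH).total_seconds() / 86400.0
    # A full moon sits half a cycle after a new moon, so shift by half.
    return (days_since_full + LUNAR_MONTH / 2) % LUNAR_MONTH

def phase(dt):
    """The four broad phases; the five specific-day indicators layer on top."""
    age = lunar_age(dt)
    quarter = LUNAR_MONTH / 4
    if age < quarter:
        return 'Waxing Crescent'
    elif age < 2 * quarter:
        return 'Waxing Gibbous'
    elif age < 3 * quarter:
        return 'Waning Gibbous'
    else:
        return 'Waning Crescent'
```

In the date dimension this would just be two more columns, computed once per date row.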


R Slopegraphs and the HHP Leaderboard

I’m still working on my visualization-fu, so when the Heritage Health Prize finally got announced, the final scores provided a simple source of data that I wanted to investigate.

I’ve written about the HHP before.  After spending three years with the competition, the winners were announced at Health Datapalooza just a few days ago.  Prior to the announcement, the teams had been ranked based on a 30% sample of the final data, so it was of some interest to see what happened to the scores against the full 100%.  For one thing, I personally dropped from 80th place to 111th, and the winners of the $500,000 prize jumped from 4th place to take the prize… not an unheard-of jump, but given the apparent lead of the top 3 teams it was somewhat unexpected.  The results were published on the HHP site, but I scraped them manually into a .csv format for a little simpler manipulation.  An Excel file with the raw and manipulated data is attached here: HHP Final Standings, for convenience.

A decent visualization for this before-and-after style information is the slopegraph.    Here’s an example:

Top 50 Teams


Quick Python for ICD-10 XML Parsing

ICD-10 coding is a hot topic in medical data circles this year.  The short version is that, when you visit a doctor, they have a standard set of codes for both the Diagnoses and the Procedures relevant to your visit.  ICD, which stands for “International Classification of Diseases”, has been around since 1900… that’s right, 113 years of standard medical coding and we still have a mess of healthcare data.  Ugh.  But ICD-9, which was the first to formally include Procedure codes (as ICPM) and not just Diagnoses, started in 1979 and is due for a facelift.

ICD-10 is the facelift, and it’s a pretty large overhaul.  Where ICD-9 had over 14,000 diagnosis codes, ICD-10 has over 43,000.  Many U.S. laws (mostly those touched by HIPAA) require adherence to ICD-10 by October 2014, spawning a flurry of headless chickens, and a rich field for consulting and the spending of lots of money.

Enter my job.  I’m trying to graft the “official” ICD9/10 crosswalk and code data into a Data Warehouse, in preparation for the analysis that needs to follow.  Naturally, I go and download the official data from here:  (Broken link, see update) and set off in SSIS to get things moving, because that’s what we use here.

UPDATE (2016-08):  The cms.gov links change annually, and the old ones apparently die… the latest is here: https://www.cms.gov/Medicare/Coding/ICD10/2017-ICD-10-CM-and-GEMs.html but I’m not going to keep updating it.  Search Google for cms.gov ICD10.  Also, a very nice SEO person from zogmedia.com pointed this out, in a bit of a linkbaiting message, but, hey, they have a point, and they were cool about it… they wanted me to link to this site which may be better updated:  http://www.nuemd.com/icd-10/codes.

SSIS is plagued with issues.  I really must say that I don’t like it.  Having worked with everything from Informatica (obnote: I own some INFA stock) to mysqlimport via bash shell for ETL,  SSIS is low on my list.  In particular, for this project, when trying to load the XML files provided by CMS, SSIS complained that it can’t handle XML with mixed content in the XMLSource widget.  Once I tweaked the .xsd (which I shouldn’t have to do) to get around this, it complained of special characters in fields and got too frustrating to deal with.  Yes, there are alternatives in SSIS, but most involve coding in Visual Basic or C# and STILL using the SSIS tool.  This is a monolithic hammer to handle a very simple problem.

Look, all I really want is a list of codes and descriptions from the XML document.  There is a LOT of other useful metadata in there, but for now, it can wait.  Here’s a simple (not robust) solution in a handful of python lines:

import xml.etree.ElementTree as ET
import csv

csvwriter = csv.writer(open('diagnostics.csv', 'w', newline='', encoding='utf8'))

tree = ET.parse('ICD10CM_FY2013_Full_XML_Tabular.xml')
root = tree.getroot()

for diag in root.iter('diag'):            # Loop through every diagnostic node
    name = diag.find('name').text         # Extract the diag code
    desc = diag.find('desc').text         # Extract the description
    csvwriter.writerow((name, desc))      # Write a row to the .csv file

And there we have a .csv which is much easier to load with whatever tool we want.  This works well for the other XML files too, such as the DIndex and EIndex files, except that for some reason they use different, optional tags for their hierarchies… “mainTerm”s are the parent diagnostic codes and “term”s are the optional children.  I’ll leave that as an exercise, though; it’s not too bad. 😉