Here's a straightforward problem: I have a table with a lot of fields in it (in this case, several tables -- new Fact and Dimension tables in a star schema data warehouse, but, you know, any wide table will do).
I want to extract a few real world test records that exercise the entire table... a "covering set" of test cases... so if I have 100 columns, and record A has non-zero, non-null values in columns 1-50 and record B has good values in columns 51-100, then I only need to test those two records. How great is that?!
Ok, I should probably BUILD test cases, but I like using real data since there are always unseen business rules lurking about. Anyway, this is a pretty basic math problem: Select the minimal number of objects from the set of rows where the union of the viable (non-null, non-zero) columns across the subset covers all possible columns.
There's some code below. Note that it is very bad code. I use the wrong scope on global temporary tables, I don't do lots of checking of things, I generate SQL and execute it, I debug with print statements. It is also formatted poorly, but that's actually more of a wordpress/plugin issue than anything else.
But it's mine, and I love it...
Part 0: The Problem...
A fairly common problem in the real world is to make sure you have enough of something -- hospitals, gas stations, water towers -- to accommodate the need in a particular area. In Health care, where I finally got to tackle this problem in the real world, this is called "Network Adequacy". In most places, the metric is basically the percentage of insured people that are within a certain distance of medical providers. The requirements change based on the provider type too... it may be important to have 90% of the population within 60 miles of a hospital, but you'd want an urgent care or primary care facility within, say, 20 miles of that 90%. The problem is compounded when you have to take into account the specialty of the providers. Being within 20 miles of a dentist, but 100 miles of a pediatric doctor or an OB/GYN is not sufficient.
It's important to note that the distance is typically computed in actual travel time; it doesn't help to be directly across the river from a hospital if you don't have a bridge or a boat available. That version of the problem is wildly complicated but can be solved by tapping a travel time API such as google maps. These are complicated calculations; determining the distance between two points (a member and a doctor) in one second is probably doing pretty well.
But we have to do this for hundreds of thousands of members and tens of thousands of providers, for each provider type and specialty. That's millions upon millions of seconds... we don't have that sort of time. So we have to find a way to quickly pare down the results. The first and most obvious way is to eliminate the "drive time" requirement and focus first on the as-the-bird-flies distance... making some assumptions based on fastest interstate speeds, we know at least that we can't do BETTER than a straight line, so at least we can limit the number of driving paths we have to calculate, and this is something we can do in pure SQL.
Part I: The Naive but State of the Art Solution
Most modern relational databases have geospatial features that allow you to "quickly" (simply at least) perform calculations such as the distance between two points without worrying much about the math, and the math is quite complex. There are plenty of references for this sort of math, but even assuming the earth is a sphere (it's not -- oil companies and pilots that need less than a 0.25% error must treat the earth as an ellipsoid), there is some advanced trigonometry involved.
SQL Server (since that's what we're working with at the moment, but other databases have similar capabilities), has a specific datatype called "geography" to help handle this. You can call
geography::POINT(Latitude, Longitude, SRID)
to generate the point. The SRID is the part that tells SQL Server if it should use an elliptical earth or a spherical one, or one of any number of minor variations. More than you ever wanted to know about this is available on a white paper here. We use SRID 4326, which is pretty standard.
If you have an address, you can get the Latitude and Longitude through many means, but for bulk US addresses I must recommend the people at the open source Data Science Toolkit.
Anyway, once you have it set up it's fairly trivial to geocode all the points and store them in a database. At that point, you can create a spatial index on the geography column and life is good. Then, for each member, we run a nearest neighbor algorithm on the providers based on these geography columns and we can relatively painlessly count people that are within so many miles of a voodoo priest, or whatever we're going for.
The dataset we're using now runs about as efficiently as SQL Server will do natively with this query:
select MEMBER_ID from MemberTable m cross apply (select top(1) PROVIDER_ID from ProviderTable p with(index(geopoint_index)) where p.geopoint.STDistance(m.geopoint) <= (1609.34 * 10) and p.ProviderType = 'PCP'
This returns a list of only the members that have a PCP provider within 10 miles. This works OK, but not great. A few things about the query:
- It does not require a hint, but we put one in for clarity. Sometimes it is faster WITHOUT the hint, if the ProviderType filter prunes enough results to limit the number of comparisons enough that the STDistance calculation can work without the index
- The 1609.34 * 10 is converting meters to miles. We want Primary Care Physicians within 10 miles in this example.
- We can't pre-calculate this -- i.e. store a nearest neighbor for each member -- because the filters and distances change based on need, and we need at least 20 something combinations that are subject to frequent change. It's doable, but it takes database space and it's not hugely elegant.
- I say "not great" because the run time on any given query can be around 10 minutes with our data-set and hardware, even if we're using the best indexes. There's much discussion about the black art of spatial index tuning, and it's fascinating, but we're pretty confident we eked out the last bit of speed with a naive query.
Part II: The next best optimization on the web...
Someone has already published a very clever optimization that improves upon this, and I'd be remiss if I didn't sing its praises somewhat. The basic code looks like this:
DECLARE @start FLOAT = 1000; WITH NearestPoints AS ( SELECT TOP(1) WITH TIES *, T.g.STDistance(@x) AS dist FROM Numbers JOIN T WITH(INDEX(spatial_index)) ON T.g.STDistance(@x) < @start*POWER(2,Numbers.n) ORDER BY n ) SELECT TOP(1) * FROM NearestPoints ORDER BY n, dist
It's based on the idea that SQL Server's indexes are better at limiting the search for nearby neighbors if the distances are small, so this optimized SQL starts with a small distance and basically loops over increasing distances (exponentially) until it finds a match. This is all due to how SQL Server indexes these geographies... the full article is available from Microsoft, but the short version is that it splits the world into a grid and splits each section of that grid into grids and so on, four times. Each point is then indexed by which section of each grid it belongs. That way, if grids are, say, 1000 meters on a side, and you want to find points within 500 meters of a spot, you can narrow your search to only a few grids nearby the spot. Here's the visual from Microsoft:
This is an improvement, but it has a few issues for our application:
- Our requirement that we filter on provider type and specialty mean that the query still spins on mismatched results... this affects filter selection; if we choose the spatial index then we lose the ability to ignore providers of unwanted types, which is very expensive, but if we choose an index based on type and specialty, we lose the performance improvement of the STDistance algorithm due to the geospatial index.
- We still can't pre-calculate and cache everything without adding a large number of new columns to account for the varied distances, types, and specialties.
- The query doesn't do well if it doesn't find a match, and takes some tweaking to not go beyond a specific distance. That is, we could encounter situations where someone isn't within any reasonable distance of a doctor, or that there's even no doctor in our network of a given specialty. This query will spin for a while (until it runs out of numbers, I suppose), and eventually return NULL. This could be easily tweaked, but it's still not as efficient as we need.
Part III: Dreaming Up Something Better
The balance we finally struck is based on something I came up with while I was asleep a few days ago (or at least, either while I was going to bed or waking up that night in some twilight sleep). I'm super excited about it, which is obviously why I made you wade through paragraphs of pre-material to get here. Anyway, it comes down to triangulating and comparing distances the way we learned in Boy Scouts.
I selected three more-or-less-arbitrary points. One in the south-western most point of Kentucky, one in the northernmost, and one in the easternmost. For each member and provider I built three columns containing the distance, in miles, to each of these three points respectively.
Now our query becomes:
select MEMBER_ID from MemberTable m cross apply (select top(1) PROVIDER_ID from ProviderTable p with(index(geopoint_index)) where abs(m.distance_1 - p.distance_1) < 10 and abs(m.distance_2 - p.distance_2) < 10 and abs(m.distance_3 - p.distance_3) < 10 and p.geopoint.STDistance(m.geopoint) <= (1609.34 * 10) and p.ProviderType = 'PCP'
What's going on here? Well, if you didn't take an orienteering class in Boy Scouts, or if you've misplaced your geometry text, here's the simple version: you can tell where any point is by its distance from three other points (as long as those three points aren't in a straight line, that can cause a bit of confusion). That is, if I know a member is 100 miles from point A, 150 miles from point B, and 200 miles from point C, I know exactly where they are. This won't work with one point, because they could be anywhere in a circle, and it normally won't work with two points because the circles around two points meet twice (unless you're exactly half way between them).
You can see this in the image below... this shows circles roughly 15 miles in width that measure the distance from each blue point to the red dot near the middle of KY. The three circles only intersect in one point (indicating the person at that point). Note that there are a few places outside Kentucky where two circles intersect, which is precisely why we need three.
While the triangulation of a single point is perfect, comparing distances is imprecise. If you enlarge the image you can see that the pattern where all three circles cross is an odd curved hexagon. If a provider were in one of the corners of that hexagon it would pass the three-distance test but still be more than 10 miles away. This is why we need the STDistance function at the end of the query. Still, the triangulation quickly eliminates most providers and it never eliminates a point that is actually within distance. The actual run time here is an order of magnitude faster than the other approaches we tried. Here's the rundown for this version:
- Pre-calculating the three distances is very fast (three distance calculations for each member and provider -- much less than calculating each and every distance between the two groups), takes up a known amount of space in the database, and only needs to be changed for addresses that change or new records (compared to pre-calculating a nearest neighbor, where every record would need to be re-evaluated if a new provider were added). This is manageable.
- Geospatial tuning is not needed. We added an index to the distance columns in each table, but those are traditional indexes; this query runs quickly with no geospatial index (because very few STDistance calls are actually made). Some day, when you can create a geospatial index WITH other columns, this may be very different.
- This works well for our problem... it may going to be as fast for a single nearest neighbor calculation, but for network adequacy, which is somewhat different than nearest neighbor, this is an improvement.
I'm super pumped about this anyway. It's probably not as exciting to you, but this is the sort of thing that makes me feel good. 🙂
I ran across an article at Popular Science citing research that correlated Heart Surgery outcomes to the Lunar Calendar. That's right... the claim is that you are more likely to survive and to spend less time in the hospital recovering from heart surgery if you get it during a waning full moon. We know the moon can affect us in ways that make sense... for example an article popped up this week about the full moon disturbing sleep -- it's brighter, so it's harder to sleep. Ok. And crime increases during full moons could make some sense too due to increased night visibility or such. But medical outcomes?
Well, this gives us something new to play with, so let's find out! First, I'm working on the data warehouse I've been working on for the last few months (wow, almost a year). It has a well behaved date dimension table, so we're going to add a phase-of-the-moon calculation. The basic logic for doing that is:
- Lunar month is 29.530588853 days
- Jan 1, 1900 at 13:52 PM GMT was a full moon
- There are 9 indicators of the Lunar cycle, but it basically has 4 phases:
- Waxing Crescent when Lunar month is in the first quarter (29.530588853/4)
- Waxing Gibbous when Lunar month is in the second quarter
- Waning Gibbous when Lunar month is in the third quarter
- Waning Crescent when Lunar month is in the last quarter ( > 29.530588853*3/4)
- The other five are specific indicators, which we will round to the nearest day:
- New Moon is day 1
- Dark Moon is day > 28.53
- First Quarter is day between 6.88 and 7.88 (29.530588853 / 4 +- 0.5)
- Full Moon is day between 14.265 and 15.265 (29.530588853/2 +- 0.5)
- Last Quarter is day between 21.648 and 22.648 (29.530588853 *3 / 4 +- 0.5)
I'm still working on my visualization-fu, so when the Heritage Health Prize finally got announced, the final scores provided a simple source of data that I wanted to investigate.
I've written about the HHP before. After spending three years with the competition, the winners were announced at Health Datapalooza just a few days ago. Prior to the announcement, the teams had been ranked based on a 30% sample of the final data, so it was of some interest to see what happened to the scores against the full 100%. For one thing, I personally dropped from 80th place to 111th, and the winners of the $500,000 prize jumped from 4th place to take the prize... not an unheard of jump, but given the apparent lead of the top 3 teams it was somewhat unexpected. The results were published on the HHP site, but I scraped them manually into a .csv format for a little simpler manipulation. An Excel file with the raw and manipulated data is attached here: HHP Final Standings for convenience.
A decent visualization for this before-and-after style information is the slopegraph. Here's an example:
ICD-10 coding is a hot topic in medical data circles this year. The short version is that, when you visit a doctor, they have a standard set of codes for both the Diagnoses and the Procedures relevant to your visit. ICD, which stands for "International Classification for Diseases" has been around since 1900... that's right, 113 years of standard medical coding and we still have a mess of healthcare data. Ugh. But ICD-9, which was the first to formally include Procedure codes (as ICPM) and not just Diagnoses, started in 1979 and is due for a facelift.
ICD-10 is the facelift, and it's a pretty large overhaul. Where ICD-9 had over 14,000 diagnosis codes, ICD-10 has over 43,000. Many U.S. laws (mostly those that are touched by HIPAA) are requiring adherance to ICD-10 by October, 2014, spawning a flurry of headless-chickens, and a rich field for consulting and the spending of lots of money.
Enter my job. I'm trying to graft the "official" ICD9/10 crosswalk and code data into a Data Warehouse, in preparation for the analysis that needs to follow. Naturally, I go and download the official data from here: (Broken link, see update) and set of in SSIS to get things moving, because that's what we use here.
UPDATE (2016-08): The cms.gov links change annually, and the old ones die apparently... the latest is here: https://www.cms.gov/Medicare/Coding/ICD10/2017-ICD-10-CM-and-GEMs.html but I'm not going to keep udpating it. Search Google for cms.gov ICD10. Also, a very nice SEO person from zogmedia.com pointed this out, in a bit of a linkbaiting message, but, hey, they have a point, and they were cool about it... they wanted me to link to this site which may be better updated: http://www.nuemd.com/icd-10/codes.
SSIS is plagued with issues. I really must say that I don't like it. Having worked with everything from Informatica (obnote: I own some INFA stock) to mysqlimport via bash shell for ETL, SSIS is low on my list. In particular, for this project, when trying to load the XML files provided by CMS, SSIS complained that it can't handle XML with mixed content in the XMLSource widget. Once I tweaked the .xsd (which I shouldn't have to do) to get around this, it complained of special characters in fields and got too frustrating to deal with. Yes, there are alternatives in SSIS, but most involve coding in Visual Basic or C# and STILL using the SSIS tool. This is a monolithic hammer to handle a very simple problem.
Look, all I really want is a list of codes and descriptions from the XML document. There is a LOT of other useful metadata in there, but for now, it can wait. Here's a simple (not robust) solution in a handful of python lines:
import xml.etree.ElementTree as ET import csv csvwriter = csv.writer(open('diagnostics.csv', 'wb')) tree = ET.parse('ICD10CM_FY2013_Full_XML_Tabular.xml') root = tree.getroot() for diag in root.iter('diag'): # Loop through every diagnostic tree name = diag.find('name').text.encode('utf8') # Extract the diag code desc = diag.find('desc').text.encode('utf8') # Extract the description csvwriter.writerow((name,desc)) # write to a .csv file
And there we have a .csv which is much easier to load with whatever tool we want. This works well for the other XML files as well such as the DIndex and EIndex files, except for some reason they use different, optional, tags for their hierarchies... "mainTerm"s are the parent diagnostic codes and "term"s are the optional children. I'll leave that as an exercise, though, it's not too bad. 😉
ED: I spoke to a reporter yesterday for a half hour or so, discussing the final stretch of the Heritage Health Prize data mining competition I've been a competitor in for the past couple of years. Her article came out today and is posted here: 3-Million-Health-Puzzler-Draws-to-a-Close. I'm quoted as saying only: "They set the bar too high". I probably said that; I said a lot of things, and I don't want to accuse Cheryl of misquoting me (she was quite nice, and her article is helpful, well written, and correct), but I feel like a lot of context was missed on my comment, so I'm just going to write an article of my own that helps explain my perspective... I've been meaning to blog more anyway. 🙂
On April 4th 2011, a relatively unknown company called "Kaggle" opened a competition with a $3 Million bounty to the public. The competition was called the "Heritage Health Prize", and it was designed to help healthcare providers determine which patients would benefit most from preventive care, hopefully saving the patients from a visit to the hospital, and saving money at the same time. And not just a little money either ... the $3 Million in prize money pales in comparison to the billions of dollars that could be saved by improving preventive care. The Trust for America's Health estimates that spending $10 in preventive care per person could save $16 billion per year, which is still just the tip of the iceberg for soaring health care prices in the United States.
Many an article has been spent defining "Big Data"... everyone agrees that "Big Data" must be, well, large, and made up of data. There may be seemingly new ways of handling big data:tools such as Hadoop and R (my personal favorite) and concepts like No-SQL databases, and an explosion of data due to new collection tools: faster and more prolific sensors, higher quality video, and social websites. Large companies with the wherewithal to build petabyte and larger data centers are learning to collect and mine this data fairly effectively, and that's all very exciting -- there's a wealth of knowledge to be gleaned from all this data. But what about the rest of us?
The thing is, it's not really a matter of collecting and hoarding a large amount of data yourself. It's how you use and take advantage of the data that you do have available that is at the core of these new trends.
[I'm trying to write shorter blog posts these days -- let's see how that goes]
There was a lot of chatter recently around about how Target (the shopping chain) has used data mining to identify pregnant shoppers in an effort to woo them as loyal customers. This is a prime example of things that are of direct interest to me: data mining, privacy, and the ethics surrounding the vast amount of knowledge we can compile about everything today, so I thought I'd share my perspective.
First off, the NYT article should not have been a surprise to anyone familiar with data. I've worked very closely with data mining teams on large retailers, insurance companies, and government agencies, and they uncover correlations all the time that lead to spooky predictability. The classic example of this is correlated sales of diapers and beer (from govexec.com):
A number of convenience store clerks, the story goes, noticed that men often bought beer at the same time they bought diapers. The store mined its receipts and proved the clerks' observations correct. So, the store began stocking diapers next to the beer coolers, and sales skyrocketed.
One common interpretation was that a new father was sent out in the night to get much needed diapers, which put him in the mood to buy a six-pack. Of course, that last part is purely subjective, but that's the story.
The article goes on to call this a "myth," but even if the specific case isn't verifiable, the decades-old example is on point for what it describes: Everyone™ is trying to make money by learning about predictable patterns, then exploiting those patterns to achieve their goals. This has been going on for thousands of years at a very human level in sales: vendors put up shops in high traffic areas, they're careful what they put in plain view to attract customers, they offer sales on one item and try to get you to buy more things once you're there, they give better prices to loyal customers. Think of those examples in a modern shopping mall, then think of them in an ancient city square. It's not hard to imagine examples in both places.
Note: I ramble a lot in this post, and I'm not sure I agree with everything I said, but I'm starting back to work today so I don't have a lot of time to muck with it and I'm trying to get content out, so... you've been warned. If you want some interesting reading on the topic, here's a few links:
I just entered West Virginia. This is because I'm on a trip and driving to the beach, having spent the night and dropped off a couple of dogs in Kentucky (with people [well… family], not just anywhere you know), we’re now safely ensconced in a Jeep Liberty, the four of us (two real people, two seventeen year olds) enjoying the extra space the dogs left us. Traveling this way means that you see a lot of countryside and inevitably have random conversations with family members you never see about politics and economics.
In particular, since the world economy has taken a bit of a dive lately, I figure it’s time for my personal rant on the topic. Let me start by saying that I’m not fond of economics, at least not formally. This stems mostly from an unfortunate economics teacher in college and my background in not-being-stupid. In one of our early classes, the professor drew an x-y axis on the chalkboard, placed a single data point, and after only a few moments of discussion drew a very attractive wavy line through it and called it a “supply-and-demand” curve like this:
Anyone that is not very upset by that chart should stop reading now, so that I do not offend you, and immediately go unfriend me on Facebook or put me in your “icky” circle on Google+ or something.
Here’s the math 101 short course for anyone that ignored my previous paragraph: you can’t draw a curve through only one point of data, because you don’t know which way the curve should go. It takes two points just to make a straight line, and at least three points to make a curve (and normally lots more unless you’re sure what shape the curve has). Is it a one-humped camel, or a two-humped camel? Or a sea-serpent? You get the idea. My relationship with formal economics went downhill from there.
Now that’s just the taste in my mouth – I admit it’s hugely important to study and understand how economies work. I am currently undergoing a microeconomic experiment by having just given the aforementioned 17 year olds $100 apiece to buy their own stuff for this trip so they won’t bother me and will hopefully learn the value of a dollar. Already they have passed up $7 slices of pizza to save money, so we’re learning something.