I'm still working on my visualization-fu, so when the Heritage Health Prize finally got announced, the final scores provided a simple source of data that I wanted to investigate.
I've written about the HHP before. After three years of competition, the winners were announced at Health Datapalooza just a few days ago. Prior to the announcement, the teams had been ranked based on a 30% sample of the final data, so it was of some interest to see what happened to the scores against the full 100%. For one thing, I personally dropped from 80th place to 111th, and the winners of the $500,000 prize jumped from 4th place to take the prize... not an unheard-of jump, but given the apparent lead of the top 3 teams it was somewhat unexpected. The results were published on the HHP site, but I scraped them manually into a .csv format for a little simpler manipulation. An Excel file with the raw and manipulated data is attached here: HHP Final Standings for convenience.
A decent visualization for this before-and-after style information is the slopegraph. Here's an example:
ICD-10 coding is a hot topic in medical data circles this year. The short version is that, when you visit a doctor, they have a standard set of codes for both the Diagnoses and the Procedures relevant to your visit. ICD, which stands for "International Classification of Diseases," has been around since 1900... that's right, 113 years of standard medical coding and we still have a mess of healthcare data. Ugh. But ICD-9, which was the first to formally include Procedure codes (as ICPM) and not just Diagnoses, started in 1979 and is due for a facelift.
ICD-10 is the facelift, and it's a pretty large overhaul. Where ICD-9 had over 14,000 diagnosis codes, ICD-10 has over 43,000. Many U.S. laws (mostly those that are touched by HIPAA) are requiring adherence to ICD-10 by October, 2014, spawning a flurry of headless-chickens, and a rich field for consulting and the spending of lots of money.
Enter my job. I'm trying to graft the "official" ICD9/10 crosswalk and code data into a Data Warehouse, in preparation for the analysis that needs to follow. Naturally, I go and download the official data from here: http://www.cms.gov/Medicare/Coding/ICD10/2013-ICD-10-CM-and-GEMs.html and set off in SSIS to get things moving, because that's what we use here.
SSIS is plagued with issues. I really must say that I don't like it. Having worked with everything from Informatica (obnote: I own some INFA stock) to mysqlimport via bash shell for ETL, SSIS is low on my list. In particular, for this project, when trying to load the XML files provided by CMS, SSIS complained that it can't handle XML with mixed content in the XMLSource widget. Once I tweaked the .xsd (which I shouldn't have to do) to get around this, it complained of special characters in fields and got too frustrating to deal with. Yes, there are alternatives in SSIS, but most involve coding in Visual Basic or C# and STILL using the SSIS tool. This is a monolithic hammer to handle a very simple problem.
Look, all I really want is a list of codes and descriptions from the XML document. There is a LOT of other useful metadata in there, but for now, it can wait. Here's a simple (not robust) solution in a handful of python lines:
import csv
import xml.etree.ElementTree as ET

tree = ET.parse('ICD10CM_FY2013_Full_XML_Tabular.xml')
root = tree.getroot()

# Python 3: open the .csv in text mode and let the file handle the UTF-8 encoding
with open('diagnostics.csv', 'w', newline='', encoding='utf-8') as f:
    csvwriter = csv.writer(f)
    for diag in root.iter('diag'):          # Loop through every diagnostic tree
        name = diag.find('name').text       # Extract the diag code
        desc = diag.find('desc').text       # Extract the description
        csvwriter.writerow((name, desc))    # Write to a .csv file
And there we have a .csv which is much easier to load with whatever tool we want. This works well for the other XML files as well such as the DIndex and EIndex files, except for some reason they use different, optional, tags for their hierarchies... "mainTerm"s are the parent diagnostic codes and "term"s are the optional children. I'll leave that as an exercise, though, it's not too bad.
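For anyone who'd rather skip the exercise, here's a rough sketch of one way to walk the "mainTerm"/"term" hierarchy. I'm assuming each of those elements carries a "title" and an optional "code" child. Those two tag names, the flatten_terms and index_rows helpers, and the tiny sample document are all made up for illustration, so check them against the real DIndex and EIndex files before trusting any of it.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for an index file. The <title> and <code> child tags and all
# of the values are placeholders (assumptions), not real CMS data.
SAMPLE_INDEX = """
<index>
  <mainTerm>
    <title>Main A</title>
    <code>CODE1</code>
    <term>
      <title>child</title>
      <code>CODE2</code>
      <term>
        <title>grandchild</title>
        <code>CODE3</code>
      </term>
    </term>
  </mainTerm>
  <mainTerm>
    <title>Main B</title>
  </mainTerm>
</index>
"""


def flatten_terms(element, parents=()):
    """Yield (path, code) for a mainTerm/term element and every nested term."""
    title = element.findtext('title', default='')
    code = element.findtext('code', default='')   # code is optional
    path = parents + (title,)
    yield ' > '.join(path), code
    for child in element.findall('term'):         # direct children only
        yield from flatten_terms(child, path)


def index_rows(xml_text):
    """Return one (path, code) row per mainTerm and per nested term."""
    root = ET.fromstring(xml_text)
    rows = []
    for main in root.iter('mainTerm'):
        rows.extend(flatten_terms(main))
    return rows


rows = index_rows(SAMPLE_INDEX)

# Write the rows as .csv text (swap the StringIO for a real file as needed)
out = io.StringIO()
csv.writer(out).writerows(rows)
```

The recursion only looks at direct "term" children at each level, so a nested term keeps its full parent path instead of being picked up twice.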
ED: I spoke to a reporter yesterday for a half hour or so, discussing the final stretch of the Heritage Health Prize data mining competition I've been a competitor in for the past couple of years. Her article came out today and is posted here: 3-Million-Health-Puzzler-Draws-to-a-Close. I'm quoted as saying only: "They set the bar too high". I probably said that; I said a lot of things, and I don't want to accuse Cheryl of misquoting me (she was quite nice, and her article is helpful, well written, and correct), but I feel like a lot of context was missed on my comment, so I'm just going to write an article of my own that helps explain my perspective... I've been meaning to blog more anyway.
On April 4th 2011, a relatively unknown company called "Kaggle" opened a competition with a $3 Million bounty to the public. The competition was called the "Heritage Health Prize", and it was designed to help healthcare providers determine which patients would benefit most from preventive care, hopefully saving the patients from a visit to the hospital, and saving money at the same time. And not just a little money either ... the $3 Million in prize money pales in comparison to the billions of dollars that could be saved by improving preventive care. The Trust for America's Health estimates that spending $10 in preventive care per person could save $16 billion per year, which is still just the tip of the iceberg for soaring health care prices in the United States.
Many an article has been spent defining "Big Data"... everyone agrees that "Big Data" must be, well, large, and made up of data. There may be seemingly new ways of handling big data: tools such as Hadoop and R (my personal favorite) and concepts like NoSQL databases, and an explosion of data due to new collection tools: faster and more prolific sensors, higher quality video, and social websites. Large companies with the wherewithal to build petabyte and larger data centers are learning to collect and mine this data fairly effectively, and that's all very exciting -- there's a wealth of knowledge to be gleaned from all this data. But what about the rest of us?
The thing is, it's not really a matter of collecting and hoarding a large amount of data yourself. It's how you use and take advantage of the data that you do have available that is at the core of these new trends.
[I'm trying to write shorter blog posts these days -- let's see how that goes]
There was a lot of chatter recently around about how Target (the shopping chain) has used data mining to identify pregnant shoppers in an effort to woo them as loyal customers. This is a prime example of things that are of direct interest to me: data mining, privacy, and the ethics surrounding the vast amount of knowledge we can compile about everything today, so I thought I'd share my perspective.
First off, the NYT article should not have been a surprise to anyone familiar with data. I've worked very closely with data mining teams on large retailers, insurance companies, and government agencies, and they uncover correlations all the time that lead to spooky predictability. The classic example of this is correlated sales of diapers and beer (from govexec.com):
A number of convenience store clerks, the story goes, noticed that men often bought beer at the same time they bought diapers. The store mined its receipts and proved the clerks' observations correct. So, the store began stocking diapers next to the beer coolers, and sales skyrocketed.
One common interpretation was that a new father was sent out in the night to get much needed diapers, which put him in the mood to buy a six-pack. Of course, that last part is purely subjective, but that's the story.
The article goes on to call this a "myth," but even if the specific case isn't verifiable, the decades-old example is on point for what it describes: Everyone™ is trying to make money by learning about predictable patterns, then exploiting those patterns to achieve their goals. This has been going on for thousands of years at a very human level in sales: vendors put up shops in high traffic areas, they're careful what they put in plain view to attract customers, they offer sales on one item and try to get you to buy more things once you're there, they give better prices to loyal customers. Think of those examples in a modern shopping mall, then think of them in an ancient city square. It's not hard to imagine examples in both places.
Note: I ramble a lot in this post, and I'm not sure I agree with everything I said, but I'm starting back to work today so I don't have a lot of time to muck with it and I'm trying to get content out, so... you've been warned. If you want some interesting reading on the topic, here's a few links:
I just entered West Virginia. This is because I'm on a trip and driving to the beach. Having spent the night and dropped off a couple of dogs in Kentucky (with people [well… family], not just anywhere you know), we’re now safely ensconced in a Jeep Liberty, the four of us (two real people, two seventeen-year-olds) enjoying the extra space the dogs left us. Traveling this way means that you see a lot of countryside and inevitably have random conversations with family members you never see about politics and economics.
In particular, since the world economy has taken a bit of a dive lately, I figure it’s time for my personal rant on the topic. Let me start by saying that I’m not fond of economics, at least not formally. This stems mostly from an unfortunate economics teacher in college and my background in not-being-stupid. In one of our early classes, the professor drew an x-y axis on the chalkboard, placed a single data point, and after only a few moments of discussion drew a very attractive wavy line through it and called it a “supply-and-demand” curve like this:
Anyone that is not very upset by that chart should stop reading now, so that I do not offend you, and immediately go unfriend me on Facebook or put me in your “icky” circle on Google+ or something.
Here’s the math 101 short course for anyone that ignored my previous paragraph: you can’t draw a curve through only one point of data, because you don’t know which way the curve should go. It takes two points just to make a straight line, and at least three points to make a curve (and normally lots more unless you’re sure what shape the curve has). Is it a one-humped camel, or a two-humped camel? Or a sea-serpent? You get the idea. My relationship with formal economics went downhill from there.
Now that’s just the taste in my mouth – I admit it’s hugely important to study and understand how economies work. I am currently undergoing a microeconomic experiment by having just given the aforementioned 17-year-olds $100 apiece to buy their own stuff for this trip so they won’t bother me and will hopefully learn the value of a dollar. Already they have passed up $7 slices of pizza to save money, so we’re learning something.
I've been a Sprint user for over 10 years, at least according to Amanda, who cheerfully explained to me why my cell phone bill never makes sense but that they appreciate my loyalty anyway as she sold me my new phone a couple of weeks ago.
My new phone is the HTC EVO 3D, but enough about that for now. First, it's important to talk about my PREVIOUS phone, which was the very underrated Samsung Moment.
The Moment was a very early generation Android phone which managed to hit just about all the design elements I wanted. Despite being a bit too slow (Angry Birds never played quite right) and missing out on simple things like multi-touch, which Samsung apparently left out to get it to market quickly and inexpensively, it was one of my favorite phones. My Moment was a replacement for a Palm Treo which I dutifully kept by my side for several years (forever in smartphone land).
The Moment was a wide slide-out keyboard styled phone. If anyone is reading this that designs phone keyboards, go pick this up and play with it -- it's the best. The keys are clearly separated and slightly raised so that touch-typing, such as it is on a teensy-weensy keyboard, is actually possible. I wasn't quite able to bang out entire novellas without looking, but I could get pretty far into a decent text message with minimal mistakes while watching Netflix. The keyboard rocked.
Moreover, the Moment set aside the typical 4-button Android interface (Home, Menu, Back, Search) that seems prolific, instead opting for the three required buttons (Home, Menu, Back), and two buttons dedicated to phone operation (Pickup and Hangup, where the Hangup button also acted as a power button for the overall phone). Most importantly, though, the phone had a tiny touchpad that depressed as a select button. I haven't seen better cursor control on any smart phone, although the Palm and the Blackberry dedicated rollerballs and rockers are fairly close.
The HTC EVO 3D with which I now entertain myself boasts none of this coolness. The more-than-4-inch screen is gorgeous, responsive (the phone is wicked fast), and I've whittled down the on-screen keyboard options to a few that I like (I'm currently using SwiftKey X which has a curious habit of predicting words when nothing has been typed -- it currently assumes that I want to say "I am a beautiful person." if I don't give it any other starting letters). But it's not as cute or cuddly as the Samsung Moment.
But, and this is very important:
IT TAKES 3D PHOTOS!
Since no one has yet taken it upon themselves to write my unauthorized biography, it falls to me to make the following piece of information available to the public: I like to bake.
Breads and pies mostly -- I've got a couple of recipes posted here, including a pie crust that I'm pretty happy about, and a few things I've borrowed from other people. I've made a few rhubarbs lately that really turned out quite well.
One thing I recently attempted was a Marbled Rye. This isn't a terribly difficult bread to make -- there are recipes everywhere. I was mostly pleased with it -- I didn't have any Caraway seeds, which add a lot of flavor, but the bread looked nice and better than a lot that I've made lately.
One thing I experimented with, though, was yeast.
Yeast is one of those things I don't really understand. This is because the most I remember about the biological classification taxonomy was that everything was an "Animal", "Vegetable", or "Mineral" -- I have no idea which one a yeast would be. This was a problem for biologists as well, so in 1990 they changed the top three domains to be "Archaea," "Bacteria," and "Eukaryota," which has helped me in no way whatsoever because not only do I still not know which one yeast would be, but I no longer know which one I'm supposed to be, and I much preferred back when I was an Animal and the world made sense.
Anyway, yeast are largely responsible for the existence of Bourbon, which automatically qualifies them as A Good Thing™ no matter what biologists call them. Baker's yeast, which makes us happy, is "Saccharomyces cerevisiae" (note the interesting comment in Wikipedia about Crohn's and Colitis on that page -- I never knew that), and lives everywhere, so it's pretty easy to get hold of. You can leave potato-starch filled water out for a while and yeast will just show up. All it does, really, is convert sugar into bubbles and alcohol. In breads, the bubbles (Carbon Dioxide) make the breads rise... in alcohols, the alcohol, well, makes the alcohol alcoholic. Yeast is glorious.
We here at the happy technologist tend to host our own servers, because we like it, and because we can. (We also speak in the third person when there's just one of us, but there's no accounting for some people). Nothing fancy, mind you... for now, a handful of websites are running on an Ubuntu virtual machine through VirtualBox on a Windows 7 (or maybe Vista, I forget) box that otherwise serves as a Media Center. It's actually simpler than it sounds.
Lately, though, what with the Heritage Health Prize and a lot of hours spent learning and playing with data mining techniques, the poor little server has been called upon to do much more intensive work. It's routinely running simulations and calculations all night long and it's really not built for that. The fan has started humming heroically (i.e. loudly), which isn't always best for a media center.
No one wants their media center to hate them, or to catch fire.
Enter Amazon EC2. That stands for Elastic Compute Cloud. See how clever that is -- what they did with that 2 there? Rather than go "ECC", they just counted the C twice and made it like a math or a chemistry equation. These Amazon guys are some serious funny. I'm actually very impressed with the setup they have. There's a wealth of options for configuring the virtual servers -- public AMIs (preconfigured images) are available for most major software vendor platforms, from the expected Oracle, Microsoft, and Linux offerings to MicroStrategy, R, Elastic Bamboo, Citrix, and even BitCoin configured software. Public data sets are available should you need them, as are advanced storage, database, failover, clustering, networking, identity management, queuing, notification, and probably a million other things at pennies-per-hour prices.
At the moment, I'm running a simulation on a 20-CPU 1.6 Terabyte beast of a machine for $0.228 per hour. This is the sort of thing that infuses me with glee. It's easily outperforming my media center by 30:1.
I wandered across bitcoin not too long ago, during some random web crawling, and downloaded it in May. I installed it, ran it, realized I was behind a firewall, killed it, uninstalled it and forgot about it for a couple of weeks until this Wired article came out and sent the whole world a'twitter about bitcoin again.
The Wired article, in short, talks about an underground website that sells illicit drugs and whose sole allowable currency is the Bitcoin. The website itself is shrouded in anonymity in the TOR network which itself is an excellent little piece of technology which I'm planning on running out of space to describe here just now, but you should look into it.
The Bitcoin spiked in popularity. You can buy and sell Bitcoins in open marketplaces such as Mt Gox (whatever that means) or Lillion Transfer if you're using some more international currencies, or you can use them directly on sites that take them, such as this Alpaca sock store. Prices quickly went from a few dollars to around $30, although they've now backed off a bit to around $20/BTC (Bitcoin).
Ok, so where are we? We can buy cocaine and alpaca socks with Bitcoins. Great. But what ARE they, again? How can you get some, and should you care?