The Happy Technologist Interesting Geekdom

11Jun/134

Quick Python for ICD-10 XML Parsing

ICD-10 coding is a hot topic in medical data circles this year.  The short version is that, when you visit a doctor, they have a standard set of codes for both the Diagnoses and the Procedures relevant to your visit.  ICD, which stands for "International Classification for Diseases" has been around since 1900... that's right, 113 years of standard medical coding and we still have a mess of healthcare data.  Ugh.  But ICD-9, which was the first to formally include Procedure codes (as ICPM) and not just Diagnoses, started in 1979 and is due for a facelift.

ICD-10 is the facelift, and it's a pretty large overhaul.  Where ICD-9 had over 14,000 diagnosis codes, ICD-10 has over 43,000.  Many U.S. laws (mostly those that are touched by HIPAA) are requiring adherance to ICD-10 by October, 2014,  spawning a flurry of headless-chickens, and a rich field for consulting and the spending of lots of money.

Enter my job.  I'm trying to graft the "official" ICD9/10 crosswalk and code data into a Data Warehouse, in preparation for the analysis that needs to follow.  Naturally, I go and download the official data from here:  (Broken link, see update) and set of in SSIS to get things moving, because that's what we use here.

UPDATE (2016-08):  The cms.gov links change annually, and the old ones die apparently... the latest is here: https://www.cms.gov/Medicare/Coding/ICD10/2017-ICD-10-CM-and-GEMs.html but I'm not going to keep udpating it.  Search Google for cms.gov ICD10.  Also, a very nice SEO person from zogmedia.com pointed this out, in a bit of a linkbaiting message, but, hey, they have a point, and they were cool about it... they wanted me to link to this site which may be better updated:  http://www.nuemd.com/icd-10/codes.

SSIS is plagued with issues.  I really must say that I don't like it.  Having worked with everything from Informatica (obnote: I own some INFA stock) to mysqlimport via bash shell for ETL,  SSIS is low on my list.  In particular, for this project, when trying to load the XML files provided by CMS, SSIS complained that it can't handle XML with mixed content in the XMLSource widget.  Once I tweaked the .xsd (which I shouldn't have to do) to get around this, it complained of special characters in fields and got too frustrating to deal with.  Yes, there are alternatives in SSIS, but most involve coding in Visual Basic or C# and STILL using the SSIS tool.  This is a monolithic hammer to handle a very simple problem.

Look, all I really want is a list of codes and descriptions from the XML document.  There is a LOT of other useful metadata in there, but for now, it can wait.  Here's a simple (not robust) solution in a handful of python lines:


import xml.etree.ElementTree as ET
import csv

csvwriter = csv.writer(open('diagnostics.csv', 'wb'))

tree = ET.parse('ICD10CM_FY2013_Full_XML_Tabular.xml')
root = tree.getroot()

for diag in root.iter('diag'):           # Loop through every diagnostic tree
   name = diag.find('name').text.encode('utf8')  # Extract the diag code
   desc = diag.find('desc').text.encode('utf8')  # Extract the description
   csvwriter.writerow((name,desc))       # write to a .csv file

And there we have a .csv which is much easier to load with whatever tool we want. This works well for the other XML files as well such as the DIndex and EIndex files, except for some reason they use different, optional, tags for their hierarchies... "mainTerm"s are the parent diagnostic codes and "term"s are the optional children. I'll leave that as an exercise, though, it's not too bad. 😉

Comments (4) Trackbacks (0)
  1. Thank you so much . Saved my day today

    • Apologies . Seem to have trouble . It only works well to extract name and description . When I attempt to extract inclusionTerm notes , excludes1 notes etc it has a challenge . Suppose if a diagnosis has subsets [ diagnosis codes within a diagnosis ] then it picks up the notes for inclusion term from all the subccodes instead of the intended one.

      • Hi Rajesh,

        Sorry to hear that… I suppose it makes sense that the XML parser would iterate over the entire subtree. I hope you were able to fix this; if you want me to take a look at it with a specific example let me know and I’ll try to improve the code. It was really just a quick one-off starter snippet to get us going on a project, and we didn’t take it much further. Thanks for taking a look!

  2. Very informative post.Thanks


Leave a comment

No trackbacks yet.