MIMIC Data

Posted on Tue 22 September 2015 in ml, data-exploration

MIMIC II Demo DATA

The MIMIC II database demo is a subset of 4,000 (of over 32,000) patients from the MIMIC II database. These data are located here: http://physionet.org/mimic2/demo/.

No living patients are included in the demo subset (although many of these patients lived for up to several years followingi their ICU admissions documented in this data set). Although these data are exempt from HIPAA requirements for protecting health information of living individuals, the data have been very carefully deidentified, and we have removed free-text notes and reports as a further measure to reduce the possibility of disclosing information that might be used to identify these patients.

The data can be obtained as a disk image that includes Ubuntu 10.04, PostgreSQL 8.4.8, pgadmin3 1.10.2, and MIMIC II v2.6 with data for 4000 patients. This disk image can then be loaded via Virtual Box. Note: Make sure to select the install option when booting the first time in virtualbox. And the next time you boot, boot from disk.

PhysioNet/Computing in Cardiology Challenge 2012 Data

There is also data available for predicting mortality of ICU patients. It was used as a data challenge in 2012 and is publicly available here. It comes in text files and is already somewhat cleaned up, so it should be easy to get started with. I will not discuss these data in this post, but wanted to make a mention of it.

Moving the data onto your machine

If you are anything like me, then you prefer to work on your own machine as opposed to a VM. So lets get the data out of the VM and onto our machines.

  1. Open up the VM terminal and type "pg_dump --no-owner MIMIC2 > psql_dump_mimic2.sql" After this runs, you will have a file called "psql_dump_mimic2.sql"
  2. Follow this video on how to set up shared folders. Once you have a shared folder, copy psql_dump_mimic2.sql into that folder. Now you have access to this file on your actual machine.
  3. Install PostgreSQL onto your machine. If you are on OS X, I would recommend Postgres.app
  4. Create a new postgres database by typing "psql" in the terminal and then "create database mimic2;"
  5. Restore the dump file. In the terminal type "psql mimic2 < psql_dump_mimic2.sql" This will restore the dump file; it takes some time.
  6. Test that everything worked by running "psql mimic2" and then "select tablename from pg_catalog.pg_tables where schemaname='mimic2v26'" you should see the list of tables.

Reading the data from PostgreSQL

If you are new to PostgreSQL you can check out this tutorial. The fastest way to query the data is by typing "psql mimic2" in the terminal. You can then query the data. For example you can type "select * from mimic2v26.icustay_detail;" But I like doing my data work in Python, so I will show you how to use python to extract data from a PostgreSQL database.

PostgreSQL and Python

If you are new to Python I would recommend installing the Anaconda distribution. It comes with almost all the libraries you will need for scientific computing. Once installed, type "pip install psycopg2" to get the library to read the data from PostgreSQL. If you run into an "image not found" error when loading the library into Python, check out this post.

Python has a Pandas library that is great for working with data. This library also has an SQL interface, so you can query our PostgreSQL data and it will load into a Pandas data frame. This is a great format to start working with our data in Python.

In the next few steps, I am going to load in some data and do some quick investigations. I will be roughly following the MIMIC Data Science Primer.

Imports and Settings

In [4]:
import psycopg2
import pandas.io.sql as psql
import pandas as pd
from matplotlib import pyplot as plt
from __future__ import division #now division always returns a floating point number
import statsmodels.api as sm
import numpy as np
from sklearn.linear_model import LogisticRegression
import brewer2mpl
from sklearn import metrics
%matplotlib inline

Get Data

As you can see below, we first start by connecting to our database.

In [5]:
db = psycopg2.connect("dbname='mimic2' host='localhost'")

We can then get a list of all the tables in our database

In [6]:
tables = psql.read_sql("""select tablename from pg_catalog.pg_tables
                        where schemaname='mimic2v26';""", con=db)
tables.head()
Out[6]:
tablename
0 chartevents
1 comorbidity_scores
2 d_chartitems_detail
3 d_codeditems
4 d_parammap_items

There are 38 tables

In [7]:
tables.shape
Out[7]:
(38, 1)

We can also create a table that shows us which tables have which variables. You can see that we have a total of 255 variables.

In [8]:
#lets get information on the tables and the columns
col_info = psql.read_sql("""SELECT table_name, column_name
                            FROM information_schema.columns
                            WHERE table_schema = 'mimic2v26';""", con = db)
col_info['value'] = "YES"
col_info.head()
col_info = col_info.pivot(index='table_name', columns='column_name', values = 'value')
col_info.shape
Out[8]:
(38, 255)

This is useful because now we can see which tables have certain columns. For example, which tables have a subject_id?

In [9]:
col_info.ix[col_info.subject_id == 'YES']['subject_id']
Out[9]:
table_name
a_chartdurations      YES
a_iodurations         YES
a_meddurations        YES
additives             YES
admissions            YES
censusevents          YES
chartevents           YES
comorbidity_scores    YES
d_patients            YES
deliveries            YES
demographic_detail    YES
demographicevents     YES
drgevents             YES
icd9                  YES
icustay_days          YES
icustay_detail        YES
icustayevents         YES
ioevents              YES
labevents             YES
medevents             YES
microbiologyevents    YES
noteevents            YES
poe_order             YES
procedureevents       YES
totalbalevents        YES
Name: subject_id, dtype: object

Or what variables are in a certain table? For example, the icustay_detail table.

In [154]:
col_info.ix['icustay_detail',np.where(col_info.ix['icustay_detail',:] == "YES")]
Out[154]:
column_name
dob                          YES
dod                          YES
expire_flg                   YES
gender                       YES
hadm_id                      YES
height                       YES
hospital_admit_dt            YES
hospital_disch_dt            YES
hospital_expire_flg          YES
hospital_first_flg           YES
hospital_last_flg            YES
hospital_los                 YES
hospital_seq                 YES
hospital_total_num           YES
icustay_admit_age            YES
icustay_age_group            YES
icustay_expire_flg           YES
icustay_first_careunit       YES
icustay_first_flg            YES
icustay_first_service        YES
icustay_id                   YES
icustay_intime               YES
icustay_last_careunit        YES
icustay_last_flg             YES
icustay_last_service         YES
icustay_los                  YES
icustay_outtime              YES
icustay_seq                  YES
icustay_total_num            YES
matched_waveforms_num        YES
sapsi_first                  YES
sapsi_max                    YES
sapsi_min                    YES
sofa_first                   YES
sofa_max                     YES
sofa_min                     YES
subject_icustay_seq          YES
subject_icustay_total_num    YES
subject_id                   YES
weight_first                 YES
weight_max                   YES
weight_min                   YES
Name: icustay_detail, dtype: object

Now that we have connected to the database, we can query it just as you would in the command line. The great thing, though, is that now our data is returned in a Pandas data frame that we can work with in Python.

In [126]:
data = psql.read_sql("""select subject_id, hadm_id, icustay_admit_age,
                            weight_min, weight_max, gender,
                            icustay_first_careunit,
                            icustay_last_careunit,
                            hospital_los, expire_flg,
                            sapsi_max, sapsi_min,
                            date_part('day',dod - icustay_intime) as survival_days
                        from mimic2v26.icustay_detail
                        where icustay_age_group = 'adult'
                            and icustay_los > 24*60
                            and subject_icustay_seq = 1;""", con=db)

Here is a look at the first 3 rows of our data. Cool :) And the dimensions

In [127]:
data.head(n=3)
Out[127]:
subject_id hadm_id icustay_admit_age weight_min weight_max gender icustay_first_careunit icustay_last_careunit hospital_los expire_flg sapsi_max sapsi_min survival_days
0 26 15067 72.00637 NaN NaN M CCU CCU 10080 Y 6 5 658
1 37 18052 68.93134 NaN NaN M MICU MICU 7200 Y 11 11 501
2 78 15161 48.62681 72.4 72.4 M MICU MICU 4320 Y 14 11 1082
In [13]:
data.shape
Out[13]:
(3206, 12)

Clean Data

Remove rows with any missing data

In [14]:
data = data.dropna(axis=0)
data.shape
Out[14]:
(2698, 12)

Explore Data

Plots

Data visualization is a big part of data science, so I try and make my plots look good. The code below modifies some of the matplotlib defaults to make the plots look a bit cleaner. Source: Harvard CS 109 course.

In [15]:
# Set up some better defaults for matplotlib
from matplotlib import rcParams

#colorbrewer2 Dark2 qualitative color table
dark2_colors = brewer2mpl.get_map('Dark2', 'Qualitative', 7).mpl_colors

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
rcParams['font.family'] = 'StixGeneral'


def remove_border(axes=None, top=False, right=False, left=True, bottom=True, ax = None):
    """
    Minimize chartjunk by stripping out unnecesasry plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    if ax == None:
        ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()

Lets look at the differences between males and females for age and weight

In [35]:
p = data[['gender','icustay_admit_age']].groupby('gender').boxplot(return_type='axes')
In [17]:
p = data[['gender','weight_max']].groupby('gender').boxplot()

Tables

Lets look at some different counts across gender

In [18]:
#number of men and women
data.gender.value_counts()
Out[18]:
M    1453
F    1245
dtype: int64
In [19]:
#counts by ICU
pd.crosstab(data.icustay_first_careunit, data.gender)
Out[19]:
gender F M
icustay_first_careunit
CCU 288 296
CSRU 309 455
FICU 132 177
MICU 478 479
SICU 38 46
In [20]:
#percentage by ICU
perct_icu = pd.crosstab(data.icustay_first_careunit, 
            data.gender).apply(lambda r:
                               r/r.sum(), axis=0)
perct_icu['diff'] = perct_icu.M - perct_icu.F
perct_icu
Out[20]:
gender F M diff
icustay_first_careunit
CCU 0.231325 0.203716 -0.027609
CSRU 0.248193 0.313145 0.064952
FICU 0.106024 0.121817 0.015793
MICU 0.383936 0.329663 -0.054273
SICU 0.030522 0.031659 0.001137
In [223]:
#max sapsi score difference
data.groupby('gender').sapsi_max.mean()
Out[223]:
gender
F         17.419561
M         16.942875
Name: sapsi_max, dtype: float64
In [224]:
#average admit age
data.groupby('gender').icustay_admit_age.mean()
Out[224]:
gender
F         73.220369
M         69.947374
Name: icustay_admit_age, dtype: float64
In [225]:
#average max weight
data.groupby('gender').weight_max.mean()
Out[225]:
gender
F         74.993463
M         87.143591
Name: weight_max, dtype: float64

MIMIC Cookbook

There exists a MIMIC cookbook which provides SQL code to calculate various measures such as the Elixhauser scores. Fortunately, they have a github repository so we can easily use this code. If you are new to git and github check out these tutorials or do some google searching. Lets go ahead and try out the PostgreSQL code for Elixhauser scores. Note: The code on github doesn't seem to always work out of the box. I had to make a few changes to the substring and to_number functions.

In [44]:
elixhauser = psql.read_sql("""
WITH icd9list AS (
SELECT adm.subject_id,
       adm.hadm_id,
       code,
       sequence,
       substring(code from '^\D') as icd9_alpha,
       to_number(substring(code from '\d+$|\d+\.\d+$'), '999999999') as icd9_numeric
  FROM mimic2v26.admissions adm,
       mimic2v26.icd9 icd
 WHERE adm.hadm_id = icd.hadm_id
   --AND adm.hadm_id < 100
  )
--SELECT * FROM icd9list;
, drglist AS (
SELECT adm.subject_id,
       adm.hadm_id,
       to_number(ci.code, '999999999') AS codenum,
       ci.description
  FROM mimic2v26.admissions adm,
       mimic2v26.drgevents drg,
       mimic2v26.d_codeditems ci
 WHERE adm.hadm_id = drg.hadm_id
   AND drg.itemid = ci.itemid
   AND ci.type='HFCA_DRG'
  )
--SELECT * FROM drglist;
, drg_category AS (
SELECT subject_id,
       hadm_id,
    CASE
      WHEN (drglist.codenum >= 103 AND drglist.codenum <= 108)
      OR (drglist.codenum >= 110 AND drglist.codenum <= 112)
      OR (drglist.codenum >= 115 AND drglist.codenum <= 118)
      OR (drglist.codenum >= 120 AND drglist.codenum <= 127)
      OR drglist.codenum = 129
      OR (drglist.codenum >= 132 AND drglist.codenum <= 133)
      OR (drglist.codenum >= 135 AND drglist.codenum <= 143)
      THEN 1
      ELSE 0
    END AS cardiac,
    CASE
      WHEN (drglist.codenum >= 302 AND drglist.codenum <= 305)
      OR (drglist.codenum >= 315 AND drglist.codenum <= 333)
      THEN 1
      ELSE 0
    END AS renal,
    CASE
      WHEN (drglist.codenum >= 199 AND drglist.codenum <= 202)
      OR (drglist.codenum >= 205 AND drglist.codenum <= 208)
      THEN 1
      ELSE 0
    END AS liver,
    CASE
      WHEN (drglist.codenum >= 400 AND drglist.codenum <= 414)
      OR drglist.codenum = 473
      OR drglist.codenum = 492
      THEN 1
      ELSE 0
    END AS leukemia_lymphoma,
    CASE
      WHEN drglist.codenum = 10
      OR drglist.codenum = 11
      OR drglist.codenum = 64
      OR drglist.codenum = 82
      OR drglist.codenum = 172
      OR drglist.codenum = 173
      OR drglist.codenum = 199
      OR drglist.codenum = 203
      OR drglist.codenum = 239
      OR (drglist.codenum >= 257 AND drglist.codenum <= 260)
      OR drglist.codenum = 274
      OR drglist.codenum = 275
      OR drglist.codenum = 303
      OR drglist.codenum = 318
      OR drglist.codenum = 319
      OR drglist.codenum = 338
      OR drglist.codenum = 344
      OR drglist.codenum = 346
      OR drglist.codenum = 347
      OR drglist.codenum = 354
      OR drglist.codenum = 355
      OR drglist.codenum = 357
      OR drglist.codenum = 363
      OR drglist.codenum = 366
      OR drglist.codenum = 367
      OR (drglist.codenum >= 406 AND drglist.codenum <= 414)
      THEN 1
      ELSE 0
    END AS cancer,
    CASE
      WHEN drglist.codenum = 88
      THEN 1
      ELSE 0
    END AS copd,
    CASE
      WHEN (drglist.codenum >= 130 AND drglist.codenum <= 131)
      THEN 1
      ELSE 0
    END AS peripheral_vascular,
    CASE
      WHEN drglist.codenum = 134
      THEN 1
      ELSE 0
    END AS hypertension,
    CASE
      WHEN (drglist.codenum >= 14 AND drglist.codenum <= 17)
      OR drglist.codenum=5
      THEN 1
      ELSE 0
    END AS cerebrovascular,
    CASE
      WHEN (drglist.codenum >= 1 AND drglist.codenum <= 35)
      THEN 1
      ELSE 0
    END AS nervous_system,
    CASE
      WHEN (drglist.codenum >= 96 AND drglist.codenum <= 98)
      THEN 1
      ELSE 0
    END AS asthma,
    CASE
      WHEN (drglist.codenum >= 294 AND drglist.codenum <= 295)
      THEN 1
      ELSE 0
    END AS diabetes,
    CASE
      WHEN drglist.codenum = 290
      THEN 1
      ELSE 0
    END AS thyroid,
    CASE
      WHEN (drglist.codenum >= 300 AND drglist.codenum <= 301)
      THEN 1
      ELSE 0
    END AS endocrine,
    CASE
      WHEN drglist.codenum = 302
      THEN 1
      ELSE 0
    END AS kidney_transplant,
    CASE
      WHEN (drglist.codenum >= 316 AND drglist.codenum <= 317)
      THEN 1
      ELSE 0
    END AS renal_failure_dialysis,
    CASE
      WHEN (drglist.codenum >= 174 AND drglist.codenum <= 178)
      THEN 1
      ELSE 0
    END AS gi_hemorrhage_ulcer,
    CASE
      WHEN (drglist.codenum >= 488 AND drglist.codenum <= 490)
      THEN 1
      ELSE 0
    END AS hiv,
    CASE
      WHEN (drglist.codenum >= 240 AND drglist.codenum <= 241)
      THEN 1
      ELSE 0
    END AS connective_tissue,
    CASE
      WHEN drglist.codenum = 397
      THEN 1
      ELSE 0
    END AS coagulation,
    CASE
      WHEN drglist.codenum = 288
      THEN 1
      ELSE 0
    END AS obesity_procedure,
    CASE
      WHEN (drglist.codenum >= 396 AND drglist.codenum <= 298)
      THEN 1
      ELSE 0
    END AS nutrition_metabolic,
    CASE
      WHEN (drglist.codenum >= 395 AND drglist.codenum <= 396)
      THEN 1
      ELSE 0
    END AS anemia,
    CASE
      WHEN (drglist.codenum >= 433 AND drglist.codenum <= 437)
      THEN 1
      ELSE 0
    END AS alcohol_drug,
    CASE
      WHEN drglist.codenum = 430
      THEN 1
      ELSE 0
    END AS psychoses,
    CASE
      WHEN drglist.codenum = 426
      THEN 1
      ELSE 0
    END AS depression
  FROM drglist
  )
--SELECT * FROM drg_category;  
, elixhauser AS (
SELECT icd.subject_id,
       icd.hadm_id,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric = 398.91
        OR icd.icd9_numeric = 402.11
        OR icd.icd9_numeric = 402.91
        OR icd.icd9_numeric = 404.11
        OR icd.icd9_numeric = 404.13
        OR icd.icd9_numeric = 404.91
        OR icd.icd9_numeric = 404.93
        OR icd.icd9_numeric BETWEEN 428 AND 428.9)
        AND drg.cardiac = 0
        THEN 1
        ELSE 0
      END
      ) AS congestive_heart_failure,
      MAX(
      CASE
        WHEN ((icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric = 426.1
        OR icd.icd9_numeric = 426.11
        OR icd.icd9_numeric = 426.13
        OR icd.icd9_numeric BETWEEN 426.2 AND 426.53
        OR icd.icd9_numeric BETWEEN 426.6 AND 426.89
        OR icd.icd9_numeric = 427
        OR icd.icd9_numeric = 427.2
        OR icd.icd9_numeric = 427.31
        OR icd.icd9_numeric = 427.6
        OR icd.icd9_numeric = 427.9
        OR icd.icd9_numeric = 785))
        OR (icd.icd9_alpha = 'V'
        AND (icd.icd9_numeric = 45
        OR icd.icd9_numeric = 53.3)))
        AND drg.cardiac = 0
        THEN 1
        ELSE 0
      END
      ) AS cardiac_arrhythmias,
      MAX(
      CASE
        WHEN ((icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric BETWEEN 93.2 AND 93.24
        OR icd.icd9_numeric BETWEEN 394 AND 397.1
        OR icd.icd9_numeric BETWEEN 424 AND 424.91
        OR icd.icd9_numeric BETWEEN 746.3 AND 746.6))
        OR (icd.icd9_alpha = 'V'
        AND (icd.icd9_numeric = 42.2
        OR icd.icd9_numeric = 43.3)))
        AND drg.cardiac = 0
        THEN 1
        ELSE 0
      END
      ) AS valvular_disease,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric BETWEEN 416 AND 416.9
        OR icd.icd9_numeric = 417.9)
        AND (drg.cardiac = 0 AND drg.copd = 0)
        THEN 1
        ELSE 0
      END
      ) AS pulmonary_circulation,
      MAX(
      CASE
        WHEN ((icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric BETWEEN 440 AND 440.9
        OR icd.icd9_numeric = 441.2
        OR icd.icd9_numeric = 441.4
        OR icd.icd9_numeric = 441.7
        OR icd.icd9_numeric = 441.9
        OR icd.icd9_numeric BETWEEN 443.1 AND 443.9
        OR icd.icd9_numeric = 447.1
        OR icd.icd9_numeric = 557.1
        OR icd.icd9_numeric = 557.9))
        OR (icd.icd9_alpha = 'V'
        AND icd.icd9_numeric = 43.4))
        AND drg.peripheral_vascular = 0
        THEN 1
        ELSE 0
      END
      ) AS peripheral_vascular,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric = 401.1
        OR icd.icd9_numeric = 401.9
        OR icd.icd9_numeric = 402.1
        OR icd.icd9_numeric = 402.9
        OR icd.icd9_numeric = 404.1
        OR icd.icd9_numeric = 404.9
        OR icd.icd9_numeric = 405.11
        OR icd.icd9_numeric = 405.19
        OR icd.icd9_numeric = 405.91
        OR icd.icd9_numeric = 405.99)
        AND (drg.hypertension = 0 AND drg.cardiac = 0 AND drg.renal = 0)
        THEN 1
        ELSE 0
      END
      ) AS hypertension,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric BETWEEN 342 AND 342.12
        OR icd.icd9_numeric BETWEEN 342.9 AND 344.9)
        AND drg.cerebrovascular = 0
        THEN 1
        ELSE 0
      END
      ) AS paralysis,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric = 331.9
        OR icd.icd9_numeric = 332
        OR icd.icd9_numeric = 333.4
        OR icd.icd9_numeric = 333.5
        OR icd.icd9_numeric BETWEEN 334 AND 335.9
        OR icd.icd9_numeric = 340
        OR icd.icd9_numeric BETWEEN 341.1 AND 341.9
        OR icd.icd9_numeric BETWEEN 345 AND 345.11
        OR icd.icd9_numeric BETWEEN 345.4 AND 345.51
        OR icd.icd9_numeric BETWEEN 345.8 AND 345.91
        OR icd.icd9_numeric = 348.1
        OR icd.icd9_numeric = 348.3
        OR icd.icd9_numeric = 780.3
        OR icd.icd9_numeric = 784.3)
        AND drg.nervous_system = 0
        THEN 1
        ELSE 0
      END
      ) AS other_neurological,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric BETWEEN 490 AND 492.8
        OR icd.icd9_numeric BETWEEN 493 AND 493.91
        OR icd.icd9_numeric = 494
        OR icd.icd9_numeric BETWEEN 495 AND 505
        OR icd.icd9_numeric = 506.4)
        AND (drg.copd = 0 AND drg.asthma = 0)
        THEN 1
        ELSE 0
      END
      ) AS chronic_pulmonary,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND icd.icd9_numeric BETWEEN 250 AND 250.33        
        AND drg.diabetes = 0
        THEN 1
        ELSE 0
      END
      ) AS diabetes_uncomplicated,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric BETWEEN 250.4 AND 250.73
        OR icd.icd9_numeric BETWEEN 250.9 AND 250.93)
        AND drg.diabetes = 0
        THEN 1
        ELSE 0
      END
      ) AS diabetes_complicated,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric BETWEEN 243 AND 244.2
        OR icd.icd9_numeric = 244.8
        OR icd.icd9_numeric = 244.9)
        AND (drg.thyroid = 0 AND drg.endocrine = 0)
        THEN 1
        ELSE 0
      END
      ) AS hypothyroidism,
      MAX(
      CASE
        WHEN ((icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric = 403.11 
        OR icd.icd9_numeric = 403.91
        OR icd.icd9_numeric = 404.12
        OR icd.icd9_numeric = 404.92
        OR icd.icd9_numeric = 585
        OR icd.icd9_numeric = 586))
        OR (icd.icd9_alpha = 'V'
        AND (icd.icd9_numeric = 42
        OR icd.icd9_numeric = 45.1
        OR icd.icd9_numeric = 56
        OR icd.icd9_numeric = 56.8)))
        AND (drg.kidney_transplant = 0 AND renal_failure_dialysis = 0)
        THEN 1
        ELSE 0
      END
      ) AS renal_failure,
      MAX(
      CASE
        WHEN ((icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric = 70.32
        OR icd.icd9_numeric = 70.33
        OR icd.icd9_numeric = 70.54
        OR icd.icd9_numeric = 456
        OR icd.icd9_numeric = 456.1
        OR icd.icd9_numeric = 456.2
        OR icd.icd9_numeric = 456.21
        OR icd.icd9_numeric = 571
        OR icd.icd9_numeric = 571.2
        OR icd.icd9_numeric = 571.3
        OR icd.icd9_numeric BETWEEN 571.4 AND 571.49
        OR icd.icd9_numeric = 571.5
        OR icd.icd9_numeric = 571.6
        OR icd.icd9_numeric = 571.8
        OR icd.icd9_numeric = 571.9
        OR icd.icd9_numeric = 572.3
        OR icd.icd9_numeric = 572.8))
        OR (icd.icd9_alpha = 'V'
        AND icd.icd9_numeric = 42.7))
        AND drg.liver = 0
        THEN 1
        ELSE 0
      END
      ) AS liver_disease,
      MAX(
      CASE
        WHEN ((icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric = 531.7
        OR icd.icd9_numeric = 531.9
        OR icd.icd9_numeric = 532.7
        OR icd.icd9_numeric = 532.9
        OR icd.icd9_numeric = 533.7
        OR icd.icd9_numeric = 533.9
        OR icd.icd9_numeric = 534.7
        OR icd.icd9_numeric = 534.9))
        OR (icd.icd9_alpha = 'V'
        AND icd.icd9_numeric = 12.71))
        AND drg.gi_hemorrhage_ulcer = 0
        THEN 1
        ELSE 0
      END
      ) AS peptic_ulcer,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND icd.icd9_numeric BETWEEN 42 AND 44.9        
        AND drg.hiv = 0
        THEN 1
        ELSE 0
      END
      ) AS aids,
      MAX(
      CASE
        WHEN ((icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric BETWEEN 200 AND 202.38 
        OR icd.icd9_numeric BETWEEN 202.5 AND 203.01
        OR icd.icd9_numeric BETWEEN 203.8 AND 203.81
        OR icd.icd9_numeric = 238.6
        OR icd.icd9_numeric = 273.3))
        OR (icd.icd9_alpha = 'V'
        AND (icd.icd9_numeric = 10.71
        OR icd.icd9_numeric = 10.72
        OR icd.icd9_numeric = 10.79)))
        AND drg.leukemia_lymphoma = 0
        THEN 1
        ELSE 0
      END
      ) AS lymphoma,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND icd.icd9_numeric BETWEEN 196 AND 199.1        
        AND drg.cancer = 0
        THEN 1
        ELSE 0
      END
      ) AS metastatic_cancer,
      MAX(
      CASE
        WHEN ((icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric BETWEEN 140 AND 172.9
        OR icd.icd9_numeric BETWEEN 174 AND 175.9
        OR icd.icd9_numeric BETWEEN 179 AND 195.8))
        OR (icd.icd9_alpha = 'V'
        AND icd.icd9_numeric BETWEEN 10 AND 10.9))
        AND drg.cancer = 0
        THEN 1
        ELSE 0
      END
      ) AS solid_tumor,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric = 701
        OR icd.icd9_numeric BETWEEN 710 AND 710.9
        OR icd.icd9_numeric BETWEEN 714 AND 714.9
        OR icd.icd9_numeric BETWEEN 720 AND 720.9
        OR icd.icd9_numeric = 725)
        AND drg.connective_tissue = 0
        THEN 1
        ELSE 0
      END
      ) AS rheumatoid_arthritis,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric BETWEEN 2860 AND 2869
        OR icd.icd9_numeric = 287.1
        OR icd.icd9_numeric BETWEEN 287.3 AND 287.5)
        AND drg.coagulation = 0
        THEN 1
        ELSE 0
      END
      ) AS coagulopathy,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND icd.icd9_numeric = 278        
        AND (drg.obesity_procedure = 0 AND drg.nutrition_metabolic = 0)
        THEN 1
        ELSE 0
      END
      ) AS obesity,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND icd.icd9_numeric BETWEEN 260 AND 263.9        
        AND drg.nutrition_metabolic = 0
        THEN 1
        ELSE 0
      END
      ) AS weight_loss,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND icd.icd9_numeric BETWEEN 276 AND 276.9        
        AND drg.nutrition_metabolic = 0
        THEN 1
        ELSE 0
      END
      ) AS fluid_electrolyte,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND icd.icd9_numeric = 2800        
        AND drg.anemia = 0
        THEN 1
        ELSE 0
      END
      ) AS blood_loss_anemia,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric BETWEEN 280.1 AND 281.9
        OR icd.icd9_numeric = 285.9)
        AND drg.anemia = 0
        THEN 1
        ELSE 0
      END
      ) AS deficiency_anemias,
      MAX(
      CASE
        WHEN ((icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric = 291.1
        OR icd.icd9_numeric = 291.2
        OR icd.icd9_numeric = 291.5
        OR icd.icd9_numeric = 291.8
        OR icd.icd9_numeric = 291.9
        OR icd.icd9_numeric BETWEEN 303.9 AND 303.93
        OR icd.icd9_numeric BETWEEN 305 AND 305.03))
        OR (icd.icd9_alpha = 'V'
        AND icd.icd9_numeric = 113))
        AND drg.alcohol_drug = 0
        THEN 1
        ELSE 0
      END
      ) AS alcohol_abuse,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric = 292
        OR icd.icd9_numeric BETWEEN 292.82 AND 292.89
        OR icd.icd9_numeric = 292.9
        OR icd.icd9_numeric BETWEEN 304 AND 304.93
        OR icd.icd9_numeric BETWEEN 305.2 AND 305.93)
        AND drg.alcohol_drug = 0
        THEN 1
        ELSE 0
      END
      ) AS drug_abuse,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric BETWEEN 295 AND 298.9
        OR icd.icd9_numeric BETWEEN 299.1 AND 299.11)
        AND drg.psychoses = 0
        THEN 1
        ELSE 0
      END
      ) AS psychoses,
      MAX(
      CASE
        WHEN icd.icd9_alpha IS NULL
        AND (icd.icd9_numeric = 300.4
        OR icd.icd9_numeric = 301.12
        OR icd.icd9_numeric = 309
        OR icd.icd9_numeric = 309.1
        OR icd.icd9_numeric = 311)
        AND drg.depression = 0
        THEN 1
        ELSE 0
      END
      ) AS depression
    FROM icd9list icd, drg_category drg
    WHERE icd.hadm_id = drg.hadm_id
    GROUP BY icd.subject_id, icd.hadm_id
    )
SELECT 
  SUBJECT_ID,
  HADM_ID,
  'ELIXHAUSER',
  CONGESTIVE_HEART_FAILURE,
  CARDIAC_ARRHYTHMIAS,
  VALVULAR_DISEASE,
  PULMONARY_CIRCULATION,
  PERIPHERAL_VASCULAR,
  HYPERTENSION,
  PARALYSIS,
  OTHER_NEUROLOGICAL,
  CHRONIC_PULMONARY,
  DIABETES_UNCOMPLICATED,
  DIABETES_COMPLICATED,
  HYPOTHYROIDISM,
  RENAL_FAILURE,
  LIVER_DISEASE,
  PEPTIC_ULCER,
  AIDS,
  LYMPHOMA,
  METASTATIC_CANCER,
  SOLID_TUMOR,
  RHEUMATOID_ARTHRITIS,
  COAGULOPATHY,
  OBESITY,
  WEIGHT_LOSS,
  FLUID_ELECTROLYTE,
  BLOOD_LOSS_ANEMIA,
  DEFICIENCY_ANEMIAS,
  ALCOHOL_ABUSE,
  DRUG_ABUSE,
  PSYCHOSES,
  DEPRESSION
FROM elixhauser ;""", con=db)

Lets create a column for the total number of elixhauser points and merge it onto our data

In [144]:
elixhauser['total_elix'] = elixhauser.drop(['subject_id', '?column?'], 1).sum(1)
elixhauser = elixhauser.drop_duplicates(subset=['subject_id', 'hadm_id'])
data2 = pd.merge(data, elixhauser, how='left', on=['subject_id', 'hadm_id'])

Demographic Information

Lets also pull in some demograpic information for our patients

In [160]:
demo_info = psql.read_sql("""
    select subject_id, hadm_id, marital_status_descr,
        case
            when ethnicity_descr like '%WHITE%' then 'WHITE'
            when ethnicity_descr like '%BLACK%' then 'BLACK'
            when ethnicity_descr like '%HISPANIC%' then 'HISPANIC'
            when ethnicity_descr like '%ASIAN%' then 'ASIAN'
            else 'OTHER' end as ethnicity,
        admission_type_descr
    from mimic2v26.demographic_detail;""", con=db)
demo_info.head()
Out[160]:
subject_id hadm_id marital_status_descr ethnicity admission_type_descr
0 26 15067 SINGLE OTHER EMERGENCY
1 37 18052 MARRIED WHITE EMERGENCY
2 78 15161 SEPARATED BLACK EMERGENCY
3 67 35878 SINGLE WHITE EMERGENCY
4 56 28766 None WHITE EMERGENCY

And merge it onto our data

In [226]:
data_2v2 = pd.merge(data2, demo_info, how='left', on=['subject_id', 'hadm_id'])

Logisitc Regression Year Survival Rate

Now that we have some data together, we can do a quick logistic regression to try and classify whether a person will die before a year after admission.

We first need to create some dummy variables

In [212]:
data_2v2 = data_2v2.dropna(axis=0)
dummy_list = ['icustay_first_careunit', 'gender', 'marital_status_descr', 'ethnicity', 'admission_type_descr']
dummy_frames = [pd.get_dummies(data_2v2[x]) for x in dummy_list]
dummy_frames = pd.concat(dummy_frames, axis=1)
In [216]:
data_2v2['year_die'] = (data_2v2.survival_days < 365).astype(int)
data3 = pd.concat([data_2v2, dummy_frames], axis=1)
data3['intercept'] = 1.0
#note: FICU is the same as MICU, so combining
data3['MICU'] = np.where(data3.FICU == 1, 1, data3.MICU)
train_cols = ['intercept','icustay_admit_age', 'weight_max','M', 'CSRU', 'MICU', 'SICU',
              'sapsi_max', 'sapsi_min', 'total_elix', 'DIVORCED', 'MARRIED', 'SEPARATED', 'WIDOWED',
              'ASIAN', 'BLACK', 'HISPANIC', 'EMERGENCY', 'URGENT']

Use statsmodels to do the estimation becuase we get P values.

In [217]:
logit = sm.Logit(data3.year_die, data3[train_cols])
result = logit.fit()
result.summary()
Optimization terminated successfully.
         Current function value: 0.533862
         Iterations 7
Out[217]:
Logit Regression Results
Dep. Variable: year_die No. Observations: 2208
Model: Logit Df Residuals: 2189
Method: MLE Df Model: 18
Date: Fri, 02 Jan 2015 Pseudo R-squ.: 0.1225
Time: 21:22:35 Log-Likelihood: -1178.8
converged: True LL-Null: -1343.3
LLR p-value: 4.799e-59
coef std err z P>|z| [95.0% Conf. Int.]
intercept -2.0407 0.431 -4.731 0.000 -2.886 -1.195
icustay_admit_age -0.0066 0.004 -1.605 0.109 -0.015 0.001
weight_max -0.0038 0.002 -1.727 0.084 -0.008 0.001
M -0.0920 0.111 -0.832 0.406 -0.309 0.125
CSRU -0.1025 0.141 -0.726 0.468 -0.379 0.174
MICU 0.4103 0.130 3.162 0.002 0.156 0.665
SICU 1.3303 0.462 2.878 0.004 0.424 2.236
sapsi_max 0.0786 0.012 6.759 0.000 0.056 0.101
sapsi_min 0.1032 0.016 6.533 0.000 0.072 0.134
total_elix 4.213e-07 7.14e-08 5.903 0.000 2.81e-07 5.61e-07
DIVORCED -0.0580 0.236 -0.245 0.806 -0.521 0.405
MARRIED 0.0081 0.141 0.057 0.954 -0.269 0.285
SEPARATED 0.4553 0.592 0.770 0.442 -0.704 1.615
WIDOWED -0.2520 0.172 -1.467 0.143 -0.589 0.085
ASIAN 0.5206 0.503 1.034 0.301 -0.466 1.507
BLACK -0.2541 0.229 -1.112 0.266 -0.702 0.194
HISPANIC -0.4725 0.404 -1.168 0.243 -1.265 0.320
EMERGENCY 0.7361 0.164 4.490 0.000 0.415 1.057
URGENT -0.0675 0.266 -0.254 0.800 -0.589 0.454

Lets take a look at the SICU, year_die table. You can see that 91% of people in the SICU die before a year as opposed to 70% not in the SICU. Thus, it makes sense that it has a large, significant coefficient.

In [228]:
sicu = pd.crosstab(data3.SICU, data3.year_die)
sicu_pct = sicu.div(sicu.sum(axis=1), axis=0)
sicu_pct
Out[228]:
year_die 0 1
SICU
0 0.303172 0.696828
1 0.093750 0.906250

Additional Resoureces

Here are some additional links to help you better understand the MIMIC data