High School Offensive Tackles

Posted on Mon 14 March 2016 in data-exploration • Tagged with nfl

Offensive Tackles in Football

So I have a brother who is currently 16 and plays high school football as an offensive tackle. He is 6 feet 2 inches and weighs 250 pounds. Now that is a big kid. His dream is to play professional football one day.

The first step to playing professional football is to play college football for hopefully a good team. Not knowing much about what it takes to get recruited by a top college football team, I thought I would look for some data. Fortunately, ESPN has height and weight data on the top 100 offensive tackles
Continue reading

Bayes Primer

Posted on Sat 17 October 2015 in ml • Tagged with tutorial, bayesian

What is Bayes' Theorem?

Bayes' theorem is what allows us to go from a sampling (or likelihood) distribution and a prior distribution to a posterior distribution.
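
As a compact reference, the relationship can be written out as follows. The symbols are the conventional ones and are assumed here: $X$ is the data, $\theta$ the parameters, $p(\theta)$ the prior, $p(X|\theta)$ the sampling (likelihood) distribution, $p(\theta|X)$ the posterior, and $p(X)$ the marginal probability of the data.

$$
p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)}
$$

Since $p(X)$ does not depend on $\theta$, the posterior is proportional to the likelihood times the prior: $p(\theta|X) \propto p(X|\theta)\,p(\theta)$.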

What is a Sampling Distribution?

A sampling distribution is the probability of seeing our data (X) given our parameters ($\theta$). This is written as $p(X|\theta)$.

For example, we might have data on 1,000 coin flips, where 1 indicates a head. This can be represented in Python as
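
A minimal sketch of one such representation, using only the standard library. The fair coin (`true_theta = 0.5`), the fixed seed, and the helper name `log_likelihood` are assumptions made for illustration and are not taken from the original post.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed setup: 1,000 flips of a fair coin, where 1 = head and 0 = tail.
true_theta = 0.5
flips = [1 if random.random() < true_theta else 0 for _ in range(1000)]


def log_likelihood(data, theta):
    """Log of p(X | theta) for independent Bernoulli coin flips."""
    heads = sum(data)
    tails = len(data) - heads
    return heads * math.log(theta) + tails * math.log(1 - theta)


# Evaluate the sampling distribution on a grid of candidate theta values
# and keep the value under which the observed flips are most probable.
candidates = [i / 100 for i in range(1, 100)]
best_theta = max(candidates, key=lambda t: log_likelihood(flips, t))

print("observed share of heads:", sum(flips) / len(flips))
print("most likely theta on the grid:", best_theta)
```

The grid value with the highest likelihood lands next to the observed share of heads, which is what we would expect when the sampling distribution is evaluated on its own, before any prior is involved.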

Continue reading

Predicting Fantasy Football Points

Posted on Wed 07 October 2015 in applied-ml • Tagged with nfl

If you read my last post you will know that I recently started fantasy football and my team isn't doing so great. Currently 0 and 4. Ha!

What seemed strange to me, though, is that my team kept underperforming relative to the ESPN projections. The consistent underperformance led me to try to develop my own prediction model to see if I could do a better job.

Continue reading

Fantasy Football

Posted on Wed 30 September 2015 in data-exploration • Tagged with nfl

So - this is my first year participating in a fantasy football league. I enjoy football, but I typically only keep up with a few teams, so drafting an actual team was a bit daunting. So, like most things, I relied on data to help me out. I spent some time researching strategy, looking at projections, and even simulating some drafts. Draft day came and I felt pretty good about my team...but now I am currently 0-3 for the season. Ha.

Continue reading


Posted on Tue 22 September 2015 in ml, data-exploration • Tagged with health


The MIMIC II database demo is a subset of 4,000 (of over 32,000) patients from the MIMIC II database. These data are located here: http://physionet.org/mimic2/demo/.

No living patients are included in the demo subset (although many of these patients lived for up to several years following their ICU admissions documented in this data set). Although these data are exempt from HIPAA requirements for protecting health information of living individuals, the data have been very carefully deidentified, and we have removed free-text notes and reports as a further measure to reduce the possibility of disclosing information that might be used to identify these patients.

Continue reading

Spark DataFrames

Posted on Sat 01 August 2015 in big-data • Tagged with spark


Spark is a really awesome tool for running distributed computations over large-scale data. To be honest, most people probably don't need Spark for their own side projects: most of those datasets will fit in memory or work well in a traditional database like PostgreSQL. That being said, there is a good chance you might need Spark if you are doing data science work for your job. A lot of companies have a tremendous amount of data, and Spark is a great tool for processing it effectively.

Continue reading

Talk Pay

Posted on Fri 01 May 2015 in data-exploration • Tagged with twitter


Logistic Regression and Optimization

Posted on Wed 29 April 2015 in ml • Tagged with tutorial, logistic-regression

Logistic Regression and Gradient Descent

Logistic regression is an excellent tool to know for classification problems. Classification problems are problems where you are trying to classify observations into groups. To make our examples more concrete, we will consider the Iris dataset. The Iris dataset contains 4 attributes for 3 types of iris plants. The purpose is to classify which plant you have based only on the attributes. To simplify things, we will only consider 2 attributes and 2 classes. Here are the data visually:
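
To make the setup concrete, here is a minimal sketch of logistic regression fitted by batch gradient descent on 2 attributes and 2 classes. The eight data points are made-up stand-ins shaped like iris measurements, not rows from the Iris dataset, and the function names, learning rate, and iteration count are assumptions chosen for illustration.

```python
import math

# Hypothetical toy data: two attributes per observation, two classes.
# The values are invented for illustration, not taken from the Iris dataset.
X = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (1.6, 0.2),
     (4.7, 1.4), (4.5, 1.5), (4.9, 1.5), (4.0, 1.3)]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, point):
    """Probability that `point` belongs to class 1."""
    z = bias + sum(w * x for w, x in zip(weights, point))
    return sigmoid(z)


def fit(X, y, learning_rate=0.1, n_iters=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(n_iters):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for point, label in zip(X, y):
            # Gradient of the log loss w.r.t. the score is (p - label).
            error = predict_proba(weights, bias, point) - label
            for j, x in enumerate(point):
                grad_w[j] += error * x / n
            grad_b += error / n
        # Step against the averaged gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


weights, bias = fit(X, y)
predictions = [1 if predict_proba(weights, bias, p) >= 0.5 else 0 for p in X]
print("weights:", weights, "bias:", bias)
print("predictions:", predictions)
```

Because the two toy groups are well separated, a plain fixed-step gradient descent is enough here; on overlapping classes you would typically also want to monitor the loss and consider regularization.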

Continue reading

Bayes With Continuous Prior

Posted on Fri 03 April 2015 in ml • Tagged with bayesian, tutorial

Continuous Prior

In my introduction to Bayes post, I went over a simple application of Bayes' theorem to Bernoulli-distributed data. In this post, I want to extend our example to use a continuous prior.

In my last post, I ended with this code:

Python For Data Mining

Posted on Sat 17 January 2015 in ml, data-exploration • Tagged with tutorial


Python is a great language for data mining. It has a lot of great libraries for exploring, modeling, and visualizing data. To get started, I would recommend downloading the Anaconda package. It comes with most of the libraries you will need and provides an IDE and a package manager.

Continue reading