Offensive Tackles in Football¶
So I have a brother who is currently 16 and plays high school football as an offensive tackle. He is 6 feet 2 inches tall and weighs 250 pounds. Now that is a big kid. His dream is to play professional football one day.
The first step to playing professional football is to play college football, hopefully for a good team. Not knowing much about what it takes to get recruited by a top college football program, I thought I would look for some data. Fortunately, ESPN has height and weight data on the top 100 offensive tackles.
What is Bayes Theorem?¶
Bayes theorem is what allows us to go from a sampling (or likelihood) distribution and a prior distribution to a posterior distribution.
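In symbols, with the sampling distribution $p(X|\theta)$ and the prior $p(\theta)$:

$$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)}$$

Here $p(\theta|X)$ is the posterior and $p(X)$ is the marginal probability of the data, which acts as a normalizing constant.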
What is a Sampling Distribution?¶
A sampling distribution is the probability of seeing our data (X) given our parameters ($\theta$). This is written as $p(X|\theta)$.
For example, we might have data on 1,000 coin flips, where 1 indicates heads. This can be represented in Python as
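a NumPy array of 0s and 1s. A minimal sketch (simulating a fair coin, so $\theta = 0.5$ here, which is an assumption just for illustration):

```python
import numpy as np

# Simulate 1,000 flips of a coin; 1 = heads, 0 = tails.
# p=0.5 assumes a fair coin -- in practice theta is the unknown parameter.
rng = np.random.default_rng(42)
data = rng.binomial(n=1, p=0.5, size=1000)

print(data[:10])   # first ten flips
print(data.sum())  # total number of heads observed
```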
Predicting Fantasy Football Points¶
If you read my last post you will know that I recently started fantasy football and my team isn't doing so great. Currently 0 and 4. Ha!
What seemed strange to me, though, is that my team kept underperforming relative to the ESPN projections. The consistent underperformance led me to try to develop my own prediction model to see if I could do a better job.
So - this is my first year participating in a fantasy football league. I enjoy football, but I typically only keep up with a few teams, so drafting an actual team was a bit daunting. So, as with most things, I relied on data to help me out. I spent some time researching strategy, looking at projections, and even simulating some drafts. Draft day came and I felt pretty good about my team...but now I am currently 0-3 for the season. Ha.
The MIMIC II database demo is a subset of 4,000 (of over 32,000) patients from the MIMIC II database. These data are located here: http://physionet.org/mimic2/demo/.
No living patients are included in the demo subset (although many of these patients lived for up to several years following their ICU admissions documented in this data set). Although these data are exempt from HIPAA requirements for protecting health information of living individuals, the data have been very carefully deidentified, and we have removed free-text notes and reports as a further measure to reduce the possibility of disclosing information that might be used to identify these patients.
Spark is a really awesome tool for easily doing distributed computations to process large-scale data. To be honest, most people probably don't need Spark for their own side projects - most of those data will fit in memory or work well in a traditional database like PostgreSQL. That being said, there is a good chance you might need Spark if you are doing data science work for your job. A lot of companies have a tremendous amount of data, and Spark is a great tool for processing it effectively.
Logistic Regression and Gradient Descent¶
Logistic regression is an excellent tool to know for classification problems, which are problems where you are trying to classify observations into groups. To make our examples more concrete, we will consider the Iris dataset. The Iris dataset contains 4 attributes for 3 types of iris plants. The goal is to classify which plant you have based only on those attributes. To simplify things, we will consider only 2 attributes and 2 classes. Here are the data visually:
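The simplification above can be sketched with scikit-learn. Which 2 of the 4 attributes to keep is not specified here, so sepal length and petal length are an assumption for illustration; the 2 classes kept are setosa and versicolor:

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the full Iris dataset: 150 samples, 4 attributes, 3 classes.
iris = load_iris()

# Keep only 2 of the 3 classes (setosa = 0, versicolor = 1).
mask = iris.target < 2
# Keep only 2 of the 4 attributes (columns 0 and 2:
# sepal length and petal length -- an illustrative choice).
X = iris.data[mask][:, [0, 2]]
y = iris.target[mask]

print(X.shape)       # (100, 2)
print(np.unique(y))  # [0 1]
```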
INTRODUCTION TO PYTHON FOR DATA MINING¶
Python is a great language for data mining. It has a lot of great libraries for exploring, modeling, and visualizing data. To get started, I would recommend downloading the Anaconda distribution. It comes with most of the libraries you will need and provides an IDE and a package manager.