I remember my first data science project. It was fun and I learned a ton, but in the end it was an absolute disaster. Looking back, I realized that how I spent the first day of that project had an enormous impact on its failure. In my excitement, I skipped any serious exploratory data analysis and went straight for the most complex model I could think of. While fun, this meant that I spent the following weeks trying to untangle what was happening in my data from what was happening in my model. Since then, I have spent a lot of time developing a system that lets me use the first day to set any project up for success.
Exploratory Data Analysis (Yes — even for deep learning)
It is so easy to convince yourself to skip exploratory data analysis and come back to it later. That is a tremendous mistake. This is the process which will inform all of your decisions later. In your first 8 hours, I would plan on spending 2–4 hours just on this step. Here are some key points not to miss:
- Look at many examples of your data. Take input and output pairs at random and just look at them. Pay attention to how you process this information and take notes on potential features or model architectures that would capture these patterns. Also, pay attention to noise and corruption in order to appropriately filter it out.
- Use Pandas and Seaborn to calculate summary statistics and correlations and to plot distributions. If you need some inspiration, check out Wes McKinney’s book Python for Data Analysis.
- Compile your findings in a Jupyter notebook and share what you learned with a colleague. This is an amazing way to get additional thoughts and ideas. Don’t worry about the cleanliness of the notebook. Take 15 minutes to quickly organize it and another 15 minutes to share it. You will be amazed by how helpful another set of eyes can be.
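As a rough illustration of the points above, here is a minimal Pandas sketch. The DataFrame and its column names (age, visits, spend) are synthetic stand-ins invented for this example, not data from a real project; swap in your own table.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical columns); replace with your own DataFrame.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "visits": rng.poisson(3, size=n),
})
df["spend"] = 2.0 * df["age"] + 10.0 * df["visits"] + rng.normal(0, 5, size=n)

# Look at random rows rather than only the top of the file.
print(df.sample(5, random_state=0))

# Summary statistics for every numeric column.
summary = df.describe()
print(summary)

# Share of missing values per column, a quick check for corrupted records.
missing = df.isna().mean()
print(missing)

# Pairwise Pearson correlations between numeric columns.
corr = df.corr()
print(corr.round(2))

# With Seaborn installed, sns.heatmap(corr, annot=True) plots the same matrix.
```

None of this replaces reading individual examples by hand, but it is a quick way to spot obviously broken columns before you start modeling.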
Define A Single Metric for Optimization
This step sounds easy. Every Kaggle challenge and academic data set has a single metric that is already provided. So, if it’s a classification problem, F1 sounds reasonable, and for regression, mean squared error. Done. Right? Not so fast.
The metric you choose here has enormous implications. It will be the value you use to determine whether one model is better than another, and if chosen poorly, it can lead you astray. Take the time to consider what you actually want from your predictions. For example, I recently worked on a project for which we decided we needed extremely high precision. Now, it is easy to maximize precision at the expense of recall, so we also had to define a sufficient metric (often called a satisficing metric). A sufficient metric is something that has to be met, but not maximized. You might, for instance, decide to maximize precision given at least 50 percent recall.
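One way to make "maximize precision given at least 50 percent recall" concrete is to pick a decision threshold from the precision-recall curve. The sketch below uses scikit-learn's precision_recall_curve; the helper name best_threshold and the toy labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def best_threshold(y_true, scores, min_recall=0.5):
    """Return (threshold, precision, recall) with the highest precision
    among thresholds whose recall is at least min_recall.
    Hypothetical helper, not part of scikit-learn."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The final (precision=1, recall=0) point has no threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    ok = recall >= min_recall
    if not ok.any():
        raise ValueError("no threshold reaches the required recall")
    idx = np.flatnonzero(ok)[np.argmax(precision[ok])]
    return thresholds[idx], precision[idx], recall[idx]


# Toy labels and model scores, invented for this example.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

threshold, p, r = best_threshold(y_true, scores, min_recall=0.5)
print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Here precision is the single number being optimized, and recall only acts as a constraint that must be satisfied.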
As you are thinking about your optimization metric, it can be very easy to justify having multiple metrics. Avoid this at all costs. Later on, you may, in fact, discover that additional metrics are needed, but on your first day, multiple metrics will add too much subjectivity to your process. You want one metric and perhaps a few sufficient metrics. That is all.
It is also important to note that you are not stuck with the metric you choose forever. You can always make changes later as you learn more about your problem. Do your best to make a good decision now, and move to the next step. This process usually takes about 30 minutes of your first 8 hours.
Split Your Data Intelligently
Wait — don’t we just use scikit-learn’s train_test_split function? While that might be sufficient, please take the time to consider the following:
- Do you want to randomly sample your data? Or is there a temporal component that needs to be considered? For example, if you want to predict stock market prices, you had better split your data based on time and not randomly. Otherwise, you will be leaking future information into your training set.
- Do you want to stratify your samples? This is usually important for classification problems.
- Make sure you have a train, dev, and test set. Your test set should be held out as long as possible and is the real test of your model. The dev set can be used for error analysis and other optimizations. If you only have a train and test set, you will end up using your test set for error analysis and get an overly optimistic estimate of performance.
- Once split, take some time to look at the separate samples and run some statistics to make sure they all seem representative of the problem. If your dev and training sets, for example, are very different, you will have a hard time learning from dev predictions.
- Consider how representative your data are of the actual problem you want to solve. Assuming at some point your model will be customer-facing, think about whether the data customers will be sending your service are similar to the data you have. For example, if they could accidentally send you an upside-down photo, does your training data have upside-down photos? Or do you use a process to align all the images?
This process might take you up to an hour. Again — an hour isn’t enough time to get this perfect but is usually enough to have a good foundation.
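The splitting options above can be sketched as follows. This is a minimal example on synthetic data; the column names (timestamp, feature, label) and the 70/15/15 proportions are assumptions chosen for illustration, not a recommendation for every problem.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a time column and an imbalanced label.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=n, freq="D"),
    "feature": rng.normal(size=n),
    "label": rng.choice([0, 1], size=n, p=[0.8, 0.2]),
})

# Option 1: stratified random split into 70% train, 15% dev, 15% test.
train, rest = train_test_split(
    df, test_size=0.30, stratify=df["label"], random_state=0
)
dev, test = train_test_split(
    rest, test_size=0.50, stratify=rest["label"], random_state=0
)

# Option 2: temporal split, so dev and test are strictly later than train.
df_sorted = df.sort_values("timestamp")
train_end, dev_end = n * 70 // 100, n * 85 // 100
train_t = df_sorted.iloc[:train_end]
dev_t = df_sorted.iloc[train_end:dev_end]
test_t = df_sorted.iloc[dev_end:]

# Sanity check: the positive rate should look similar across the splits.
rates = {
    name: part["label"].mean()
    for name, part in [("train", train), ("dev", dev), ("test", test)]
}
print(rates)
```

Use one option or the other depending on whether time matters for your problem; the final check is a small instance of comparing statistics across splits.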
Build Your First Model!
Assuming you took the full 4 hours for exploratory data analysis, you have spent 5.5 hours up to this point. In our 8 hours, that leaves 2.5 hours. We are going to use 2 hours to get your first model up and running.
Define a non-machine learning baseline
Take 15 minutes to quickly put together a baseline that doesn’t involve machine learning. For classification, you can use the most frequent class as the prediction. For regression, the average. Or if you have obvious groups, then calculate average/frequencies by the group as your prediction. For example, if you know a customer’s age, you can define age buckets and calculate the average spend for each bucket as your prediction for a new user’s spend in that age bucket.
Having this baseline gives you a reference point: if your machine learning model cannot beat it, you know something is wrong.
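The three baselines described above might look like this in Pandas. The tiny table, the churned and spend columns, the bucket edges, and the predict_spend helper are all invented for illustration.

```python
import pandas as pd

# Tiny made-up training table.
train = pd.DataFrame({
    "age": [22, 25, 34, 38, 47, 52, 63, 68],
    "spend": [10.0, 14.0, 30.0, 34.0, 50.0, 54.0, 20.0, 24.0],
    "churned": [0, 0, 0, 1, 0, 1, 0, 0],
})

# Classification baseline: always predict the most frequent class.
majority_class = train["churned"].mode()[0]

# Regression baseline: always predict the training mean.
global_mean = train["spend"].mean()

# Grouped baseline: average spend per age bucket (bucket edges are arbitrary).
bins = [17, 30, 45, 60, 120]
labels = ["18-30", "31-45", "46-60", "61+"]
train["age_bucket"] = pd.cut(train["age"], bins=bins, labels=labels)
bucket_mean = (
    train.groupby("age_bucket", observed=True)["spend"].mean().to_dict()
)


def predict_spend(age):
    """Predict spend for a new user from their age bucket,
    falling back to the global mean for ages outside every bucket."""
    bucket = pd.cut([age], bins=bins, labels=labels)[0]
    return bucket_mean.get(bucket, global_mean)


print(majority_class, global_mean, predict_spend(27), predict_spend(50))
```

None of these involve any learning beyond counting and averaging, which is exactly what makes them useful as a reference point.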
Start incredibly simple
Pick the simplest machine learning model you can in order to start. The easier to understand, the better. For example, use linear or logistic regression for structured data, or a simple convolutional or sequence model for unstructured data. Your goal here is just to get a model learning and get results on your dev set in 1 hour. The fact that this must be done in about an hour should force you to stay very simple.
If your data are very large and training takes a long time, sample your training set to get some results quickly. Again — you want something working very quickly.
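For structured data, a first model can be just a few lines. The sketch below fits a scikit-learn LogisticRegression on a random subsample of synthetic training data; the data, the subsample size of 500, and the use of accuracy as the score are assumptions for illustration, so substitute the metric you chose earlier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for a real data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
noise = rng.normal(scale=0.5, size=2000)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# If training is slow, fit on a random subsample of the training set first.
sample_size = 500
idx = rng.choice(len(X_train), size=sample_size, replace=False)

model = LogisticRegression(max_iter=1000)
model.fit(X_train[idx], y_train[idx])

# score() reports mean accuracy for classifiers.
dev_accuracy = model.score(X_dev, y_dev)
print(f"dev accuracy: {dev_accuracy:.3f}")
```

The coefficients in model.coef_ are easy to inspect, which is part of the appeal of starting this simple.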
Randomly sample some of your dev set, and make your own predictions. This will be your approximation of a human baseline. Spend about 30 minutes on this task.
Calculate how well your baseline, machine learning model, and human predictions did on your chosen metric. I have also found it useful to create scatter plots of actual values vs. predictions to see the correlation. You should be able to do this in about 15 minutes.
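A minimal sketch of this comparison for a regression problem, assuming mean squared error is the chosen metric. All of the numbers are made up: y_dev stands for the handful of dev examples you predicted by hand, so that the three sets of predictions are scored on the same rows.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Actual values for the hand-labeled dev subset (made-up numbers).
y_dev = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Predictions from each source on those same rows (also made up).
predictions = {
    "baseline": np.full(5, 30.0),  # e.g. the training mean
    "model": np.array([12.0, 18.0, 33.0, 39.0, 47.0]),
    "human": np.array([10.0, 25.0, 30.0, 35.0, 50.0]),
}

# Score every source with the one chosen metric.
results = {
    name: mean_squared_error(y_dev, y_pred)
    for name, y_pred in predictions.items()
}

# Print from best (lowest error) to worst.
for name, mse in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:>8}: MSE = {mse:.1f}")

# With Matplotlib installed, plt.scatter(y_dev, predictions["model"])
# gives the actual-vs-predicted plot mentioned above.
```

Seeing all three numbers side by side tells you how much room there is between the naive baseline, your first model, and a person.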
Write What You Learned
You have 30 minutes left in your day. Take that time to write down what you learned during this process and the next steps you want to take. You will be amazed at how helpful this will be when you come back to work on it the next day. If you have a shared wiki such as Confluence, make sure to upload your research notes for others to access in case they work on this problem at some point.
There you have it: 8 hours and a very effective first day for your next machine learning project. In just one day you have built a good understanding of your data, defined a solid metric, appropriately split your data, and evaluated multiple types of models (non-machine learning, machine learning, and human). You are now extremely well prepared to start fine-tuning and building a truly great data science project.