Recently, I wrote an article with advice for breaking into the field of data science. If you are interested, you can check it out here: Standing Out in a Sea of Data Scientists (https://towardsdatascience.com/standing-out-in-a-sea-of-data-scientists-c82e42a1e62b)
One of the pieces of advice was to “gain experience defining and solving a problem with machine learning from end-to-end.” I have received some questions on how to do this effectively, so I would love to dig a bit deeper into how I would begin building a data science portfolio.
It is unlikely you are sitting on ample spare time, so building your data science portfolio will take commitment and sacrifice. In my experience, you will only be successful if you pursue a project for which you have a lot of passion. The passion doesn’t necessarily have to be obvious either. Personally, I am not passionate about writing at all, but I am passionate about sharing ideas, and writing is a good medium for doing that. Maybe you would really like to do a project with deep learning but are lacking the drive. If you are passionate about music, you might start your portfolio by using deep learning to create music. Focusing your effort on your passions can greatly help you push through when it is tempting to give up.
Define Your Own Problem
It is incredibly tempting to build side projects around predefined problems on platforms such as Kaggle. While this certainly makes the process easier, it effectively removes one of the most important parts of the data science process: defining the problem. In industry, often one of the biggest challenges is converting a business problem to a data science problem. Before writing any code, think through the following:
- What problem do I want to solve?
- How do I think I could use data science to solve this problem?
- If I could solve this problem, what value would that create?
Your answers could be as simple as, I want to generate music which sounds like my favorite band; I have done some research and it looks like deep learning has shown some success at solving this problem; if I could solve it, I would have unlimited music that sounds like my favorite artist!
This step is essential because it sets the stage for the story of your project. It will help you better explain to others why you chose this project and show your strategic thinking when tackling a problem.
Gather Your Own Data
If you defined your own problem, this step will almost certainly be mandatory. Your problem will probably be unique and thus you will need to spend some time gathering data. This is great! In my experience, I have found that I spend a significant amount of time thinking about the best way to gather data to help solve my problem. You can now showcase that skill in your project. For our music example, it might include examining the Free Music Archive which includes high-quality, legal audio downloads. I promise that by exploring how to acquire and gather your own data, you will learn a crucial step of the data science process and one that is not often taught in school.
Showcase Data Exploration
As Andrej Karpathy said: “Become one with the data.”
One of the first steps of any machine learning project is to spend time inspecting and analyzing your data. Don’t skip this step. Not only is it important, but it also allows you to create some really great visualizations. Dig into your data and look at the following:
- Are there any outliers?
- What are the distributions of your features?
- Plot correlations between features and the target
- Look at actual examples of your data
There is a lot more you can do during this step, but those are a good place to start. Use seaborn to make your plots prettier or, if you are feeling ambitious, try to make the visualizations interactive with something like Plotly. The goal here is to show others how you analyze data to uncover insights others might have missed, which will make your models even better.
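To make the checklist above concrete, here is a minimal sketch of a first exploration pass. It deliberately uses only the Python standard library so it runs anywhere. The data is a made-up toy set (track tempo and an invented target score), and the helper names `iqr_outliers`, `summarize`, and `pearson` are my own illustrations, not part of any library. The IQR rule is just one common way to flag outliers.

```python
import random
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def summarize(values):
    """Basic distribution summary for one numeric feature."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        "max": max(values),
        "stdev": statistics.stdev(values),
    }


def pearson(xs, ys):
    """Pearson correlation between a feature and the target."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: tempo (BPM) per track plus an invented target score.
rng = random.Random(0)
tempo = [rng.gauss(120, 10) for _ in range(200)] + [400.0]  # one planted outlier
score = [0.05 * t + rng.gauss(0, 1) for t in tempo]

print("outliers:", iqr_outliers(tempo))                      # any outliers?
print("summary:", summarize(tempo))                          # distribution
print("correlation with target:", round(pearson(tempo, score), 3))
print("actual examples:", list(zip(tempo, score))[:3])       # look at raw rows
```

In a real project you would likely do the same checks with a dataframe library and then turn the interesting findings into plots for your write-up.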
Build Multiple Models
Too often, I see projects only show the best model. A great portfolio project allows people to understand your thought process, so please show us! To do this effectively, I would recommend the following process:
- First, create a non-machine learning baseline. This baseline should be something reasonable like a historical average. This is a crucial step for evaluating your first machine learning-based model.
- Second, create your first machine learning model. Describe why you chose to start with that model and compare it to your baseline.
- Third, build your second machine learning model. The crucial part of this step is to clearly explain why this was the next best step to take. Was your model overfitting and thus you needed to use a less complex model or add regularization? Maybe you used the same model, but developed additional features based on error analysis.
- Fourth, repeat step three until you feel comfortable with the results.
At the end of this, you will have not only multiple models in your project but also a logical story explaining how you think about developing a great machine learning model.
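As a rough illustration of the first two steps, the sketch below compares a historical-average baseline against a very simple first model on held-out data. Everything here is an assumption for the sake of the example: the data is synthetic, the single-feature least-squares line stands in for whatever first model you would actually pick, and mean absolute error is just one reasonable metric.

```python
import random
import statistics


def mae(actual, predicted):
    """Mean absolute error: lower is better."""
    return statistics.fmean([abs(a - p) for a, p in zip(actual, predicted)])


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    slope = cov / var_x
    return slope, my - slope * mx


# Hypothetical synthetic data: one feature with a noisy linear relationship.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(120)]
ys = [3.0 * x + 5.0 + rng.gauss(0, 2) for x in xs]

# Hold out the last 30 points so both approaches are scored on unseen data.
train_x, test_x = xs[:90], xs[90:]
train_y, test_y = ys[:90], ys[90:]

# Step 1: non-machine learning baseline (predict the historical average).
baseline = statistics.fmean(train_y)
baseline_mae = mae(test_y, [baseline] * len(test_y))

# Step 2: first model, evaluated against the same held-out data.
slope, intercept = fit_line(train_x, train_y)
model_mae = mae(test_y, [slope * x + intercept for x in test_x])

print(f"baseline MAE: {baseline_mae:.2f}")
print(f"model MAE:    {model_mae:.2f}")
```

Keeping the baseline number next to every later model makes it easy to show readers how much each step actually improved things.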
Tell A Story
At this point, you have a lot of the key components in place and you might feel like you are almost done. Not so fast! You now need to go back and connect all your work and tell a great story.
Great data scientists are great storytellers
This is the most important step in creating an amazing project for your portfolio. If you skip this step, you probably just have a bunch of code on GitHub. That is not a portfolio. Use a blogging platform such as Medium, or even create your own blog, and explain your journey. Write about the goal of the project, highlight the key exploratory analyses, include your modeling results and thought process, and tell us how your project created value.
Think of this as how you would present your project to executives. You don’t need to include any code (but definitely link to your code on GitHub).
You now have one beautiful project in your portfolio. All you have to do now is repeat the process. :) This does take a lot of work and time, but with consistent, dedicated effort you will find yourself with a few well-told stories using machine learning to create value in an area about which you are passionate (make sure you highlight your portfolio on LinkedIn and your resume). That is an amazing data science portfolio which will definitely help you stand out. Looking for some inspiration? Check out Tim Dettmers’ data science portfolio.