An Often Overlooked Data Science Skill
Sep 28, 2019 15:43 · 962 words · 5 minute read
You have just started your first job as a data scientist, and you are excited to start using your random forest skills to actually make a difference. You get all set up, primed to launch your Jupyter Notebook, only to realize you first need to “SSH” into a different machine to run your models. The company uses cloud computing to run machine learning at scale. You have surely heard of AWS, Google Cloud, and Microsoft Azure, but you have little to no experience running your models on remote machines.
This scenario may be surprising to some, but I see it all the time. School projects tend to focus on problems that can reasonably be run on a laptop, and even experienced data scientists from domains with smaller datasets have gotten by running models on their laptops.
As models become more and more data-hungry, though, it is becoming increasingly likely that data scientists will need to be comfortable and efficient when working on remote machines (cloud or on-premise). Here is how I would get started developing these skills.
Learn the Basics of the Terminal
Whether you are on macOS, Linux, or Windows, you can now access a Linux/Unix-based terminal (if you are on Windows, see here). On a Mac, press Command+Space to open Spotlight search, type “Terminal”, and hit Enter. You should now feel like you are in the Matrix: a real coder. The terminal lets you type commands directly to the computer without any graphical support, which is extremely valuable when working with a remote machine. You will use the terminal to connect to the machine, send it commands, and navigate your files.
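If you want to try it right away, here is a minimal sketch of the kind of navigation you will do constantly on a remote machine (the folder and file names below are just placeholders):

    pwd                  # print the directory you are currently in
    ls -lh               # list files with human-readable sizes
    cd projects/churn    # move into a project folder
    head -n 5 train.csv  # peek at the first five lines of a data file
    mkdir results        # create a folder for model output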
There is an excellent book called Data Science at the Command Line and I would highly recommend reading the Getting Started chapter. Just this chapter alone will get you comfortable navigating your machine directly from the terminal.
If you are feeling really adventurous, take some time to customize your terminal with zsh and iTerm2.
Discover Vim
Vim is a text editor that runs directly in the terminal. It comes pre-installed on most Linux systems and is very easy to install where it is not.
Vim has a decent learning curve, but even just learning the basics will allow you to quickly make changes to files from the terminal. This can save you a ton of time when working from a remote machine.
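To give you a feel for the workflow, here is a minimal sketch of a quick edit session (the file name is hypothetical):

    vim train_config.yaml   # open (or create) the file in Vim
    # press i to enter insert mode, then type your changes
    # press Esc to return to normal mode
    # type :wq and press Enter to save and quit (:q! quits without saving)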
To get started, check out this interactive guide.
You could also learn emacs — if you hate your pinky.
SSH and SCP
SSH is a command you can run from your terminal to connect to a remote machine. SCP is a command that lets you copy files between your local machine (such as your laptop) and a remote machine. For data scientists, an easy way to move data between machines is very valuable.
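Here is a rough sketch of what these commands look like in practice; the username, host name, and file paths are made up for illustration:

    ssh ada@gpu-box.example.com                         # open a shell on the remote machine
    scp train.csv ada@gpu-box.example.com:~/data/       # copy a local file up to the remote machine
    scp ada@gpu-box.example.com:~/results/model.pkl .   # copy a remote file back down to your laptop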
There are a ton of great articles to introduce you to these commands. Here is a tutorial I like.
Git
I am finding that more and more data scientists are starting their first job with at least some knowledge of Git, which is amazing. If you have not worked much with Git, stop now and create a GitHub account.
Source control for your code becomes even more important when you work with remote machines, because where you run your code could change daily (especially on the cloud), and Git allows you to easily track, clone, and merge changes across any machine you are working on.
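As a sketch of the day-to-day loop (the repository URL, branch, and file names are placeholders):

    git clone https://github.com/your-username/churn-model.git   # pull your project onto the remote machine
    cd churn-model
    # ...edit code, run experiments...
    git add train.py
    git commit -m "Tune random forest hyperparameters"
    git push origin main    # the changes are now retrievable from any machine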
This GitHub guide is a great place to start. If you feel comfortable with that guide, you are in a pretty good place with the basics. And in my experience, the basics of Git can get you a long way.
Screen
Screen is a simple yet very useful tool. Once you learn about SSH and connect to your first remote machine, you might learn the hard way that if your connection breaks, anything running in that session will die. This is not ideal for long-running data science jobs.
Screen is the answer. (Note: there are other similar and more advanced tools such as tmux, but I have found screen to be the easiest for beginners.)
Screen allows you to run your commands in a process that won’t die if you are disconnected.
It is easy to get started and you can learn how here.
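To make this concrete, here is a minimal sketch of a screen session; the session name and training script are hypothetical:

    screen -S training        # start a named screen session
    python train_model.py     # kick off a long-running job inside it
    # press Ctrl-a then d to detach; the job keeps running on the machine
    # after reconnecting over SSH later:
    screen -r training        # reattach and check on the job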
Putting it all Together
Up to this point, we have discussed the essential tools for working on a remote machine. Experience is the best teacher, so the most important thing I would recommend is actually using these tools. To do so, I would start a free account with Google Cloud. I recommend Google because its free account will allow you to access GPU compute, which is really useful if you enjoy deep learning.
Here is a great article to walk you through this process:
Setting Up a Google Cloud Instance GPU for fast.ai for Free
It will take time to walk through all these steps, and you will surely hit points of confusion. If you persevere, though, you will have created a free-to-use remote machine with a GPU. You can then practice accessing it from the terminal with SSH, editing files with Vim, getting data onto it with SCP, moving and tracking code with Git, and using screen so that you don't lose your processes if you disconnect.
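Pulled together, a typical session on that new instance might look something like the sketch below; the host name, username, repository, and file names are all placeholders:

    # from your laptop
    ssh ada@my-gcp-instance                                       # connect to the instance

    # on the remote machine
    git clone https://github.com/your-username/churn-model.git   # grab your code
    vim churn-model/train.py                                      # tweak a parameter in place
    screen -S experiment                                          # start a session that survives disconnects
    python churn-model/train.py                                   # launch the training run
    # press Ctrl-a then d to detach, close your laptop, and reattach later with: screen -r experiment

    # back on your laptop, copy the results down when the run finishes
    scp ada@my-gcp-instance:~/churn-model/model.pkl .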
Getting comfortable with this type of data science workflow will prepare you well for the age of big data and large-scale compute, and it will help you become effective very quickly when starting your first job.