Docker for Data Science

A Data Scientist’s Guide to Using Docker

Anyone who’s ever set up a data science work environment knows one thing to be true: it’s a painful process. If you’ve ever set up TensorFlow or CNTK with GPU support locally, you know exactly what I mean: the installation instructions alone are overwhelming, and the required components can be combined in countless ways. Worst of all, mistakes typically won’t rear their ugly heads until after you’ve finished the last step of the install.

Now, let’s assume you overcome these installation hurdles and get your work environment set up correctly. You’ll likely find that sharing your work is still a hassle, because anyone who wants to run your code needs to replicate your exact working environment on their own machine before the code will function correctly.

However, thanks to an underutilized free tool called Docker, the days of tedious environment setups are coming to an end. From installation to sharing work, Docker makes a data scientist’s life much easier, so they can focus on doing what they do best—data science. Here’s how:

The Docker Hierarchy

At the highest level, Docker allows you to package up not only your code, but also the environment used to run it.  It does this via a hierarchy of elements:

  • Dockerfile: a script that describes the environment (packages, files, etc.) and is used to build a Docker image
  • Docker image: the built result of a Dockerfile, a snapshot of the described environment
  • Docker container: a running instance of a Docker image, similar to a light-weight virtual machine
  • Docker Hub: an online registry of pre-built images ready for use
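
To make the first level of this hierarchy concrete, here is a minimal sketch of a Dockerfile, written out from a shell script. The base image ‘python:3.11-slim’, the package list, the directory name ‘ds-env’, the image name ‘my-ds-env’, and the script name ‘analysis.py’ are all assumptions chosen for illustration; your own environment will differ.

```shell
# Create a build directory (hypothetical name) and write a small Dockerfile into it.
mkdir -p ds-env

cat > ds-env/Dockerfile <<'EOF'
# Start from a published base image (assumed: the official slim Python image).
FROM python:3.11-slim

# Install the packages the analysis needs (illustrative list).
RUN pip install --no-cache-dir numpy pandas scikit-learn

# Set the working directory and copy the project files into the image.
WORKDIR /work
COPY . /work

# Default command when a container starts (analysis.py is a hypothetical script).
CMD ["python", "analysis.py"]
EOF

# Building an image from this Dockerfile would then look like:
#   docker build -t my-ds-env ds-env
# and starting a container from that image:
#   docker run --rm my-ds-env
```

The Dockerfile is the only piece you write by hand; the image and the container are produced from it by ‘docker build’ and ‘docker run’.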

Docker’s Painless Environment Setup

Whether you’re using a base install of Anaconda or building your own data science super machine from Ubuntu, Docker is the tool you need for a quick, easy start.  Getting your environment set up and primed for data science can be done with two simple lines of code:

  1. docker pull
  2. docker run

The first line pulls a published Docker image of whichever environment you want to use from Docker Hub onto your local machine. The second line runs this image, producing a container (a pseudo-virtual machine) to conduct your experiments in. Once the image is running in a container, you can open your favorite IDE inside of it and get to work immediately. You can also build your own images, but keep this in mind: for most things you want to do, there’s likely a published image out there already, including everything from TensorFlow to Tesseract OCR.
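
As a rough sketch of those two lines in practice, the function below pulls and runs one published image. The image name ‘tensorflow/tensorflow:latest-jupyter’ and the port 8888 are assumptions picked for illustration; substitute whichever image and port suit your work. The function name ‘start_env’ is hypothetical.

```shell
# Assumed image: a Jupyter-enabled TensorFlow image published on Docker Hub.
IMAGE="tensorflow/tensorflow:latest-jupyter"

start_env() {
  # Line 1: download the image from Docker Hub to the local machine.
  docker pull "$IMAGE"
  # Line 2: start a container from the image.
  #   --rm  removes the container when it exits
  #   -it   attaches an interactive terminal
  #   -p    publishes container port 8888 on the same host port
  docker run --rm -it -p 8888:8888 "$IMAGE"
}
```

Calling ‘start_env’ runs both steps; with a notebook image like the one assumed here, you would then open the address the container prints in your browser.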

Docker’s Hassle-Free Work Sharing

Data science is rarely done (well) by a lone genius. To solve tough problems, it often takes a team of people contributing ideas, insights, and ingenuity to produce an elegant solution. And with Docker, this collaboration is easier than ever.

There are three main ways to share your Dockerized work:

  1. You can publish your image to Docker Hub, where anyone can grab it with a ‘docker pull’ (Note: repositories on Docker Hub are public by default).
  2. You can write your image to a .tar file with ‘docker save’ and share it; when your colleague gets the file, they can recreate the image locally with ‘docker load’.
  3. If you built your own image, or have access to its build files (many Docker Hub images publish theirs), you can zip up and share those files. With them, your colleagues can replicate your image with a ‘docker build’.
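
The three options above can be sketched as small shell functions. The image name ‘myuser/my-ds-env:1.0’, the archive name, the build directory, and the function names are all hypothetical placeholders; only the ‘docker push’, ‘docker save’, ‘docker load’, and ‘docker build’ commands themselves come from Docker.

```shell
# Hypothetical names: replace with your own Docker Hub user, image, and paths.
IMAGE="myuser/my-ds-env:1.0"
ARCHIVE="my-ds-env.tar"
BUILD_DIR="ds-env"

# Option 1: publish the image to Docker Hub (run `docker login` first).
publish_image() {
  docker push "$IMAGE"
}

# Option 2, sender: write the image to a .tar file that can be passed along.
export_image() {
  docker save -o "$ARCHIVE" "$IMAGE"
}

# Option 2, receiver: recreate the image locally from the .tar file.
import_image() {
  docker load -i "$ARCHIVE"
}

# Option 3, receiver: rebuild the image from the shared build files.
rebuild_image() {
  docker build -t "$IMAGE" "$BUILD_DIR"
}
```

Which option fits best depends on the situation: pushing is the least effort when the image can be public, a .tar file works without any registry, and sharing build files keeps the transfer small but requires your colleague to rebuild.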

How to Get Started With Docker

To get started, you’ll need to install Docker (anyone you share experiments with will also need to install it). Docker provides built-in examples to try out, but I recommend grabbing a Dockerfile from your favorite repo, or Docker Hub, and trying that out instead; I personally test-drove Docker with a tool I happened to use at the time named glyphminer. Some quick commands you should know:

  • To pull an existing image from Docker Hub: docker pull …
  • To build an image from a Dockerfile: docker build …
  • To display all images: docker images
  • To run a container from a built image: docker run …
  • To display all containers: docker ps -a

You can also read Docker’s tutorial guides for more detailed, step-by-step instructions on getting started. Trust me—once you realize how many headaches Docker can spare you, you’ll be glad you did.
