Anyone who’s ever set up a data science work environment knows one thing to be true: it’s a painful process. If you’ve ever set up TensorFlow or CNTK with GPU support locally, you know exactly what I mean—the installation instructions alone are overwhelming, and there are countless combinations and permutations of the required components. And, worst of all, mistakes typically won’t rear their ugly heads until after you’ve finished the last step of the install.
Now, let’s assume you overcome these installation hurdles and get your work environment set up correctly: you’ll likely find that sharing your work is still a hassle, because anyone who wants to run your code needs to replicate your exact working environment on their own machine before the code will run correctly.
However, thanks to an underutilized free tool called Docker, the days of tedious environment setups are coming to an end. From installation to sharing work, Docker makes a data scientist’s life much easier, so they can focus on doing what they do best—data science. Here’s how:
The Docker Hierarchy
At the highest level, Docker allows you to package up not only your code, but also the environment used to run it. It does this via a hierarchy of elements:
- Dockerfile: a script that describes the environment (packages, files, etc.) and is used to create a Docker image
- Docker image: a compiled version of the Dockerfile, a snapshot of the described environment
- Docker container: an instantiated Docker image, similar to a light-weight virtual machine
- Docker Hub: a repository of pre-built images ready for use
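To make the hierarchy concrete, here is a minimal sketch of a Dockerfile for a Python data science environment. The base image, package list, and filenames are illustrative choices, not a prescription:

```dockerfile
# Start from an official Python base image
FROM python:3.10-slim

# Describe the environment: install the packages the project needs
RUN pip install --no-cache-dir numpy pandas scikit-learn

# Copy the project code into the image and set the working directory
COPY . /app
WORKDIR /app

# Default command to run when a container starts from this image
CMD ["python", "main.py"]
```

Running ‘docker build’ on this file compiles it into an image, and ‘docker run’ on that image starts a container—the same three levels of the hierarchy above.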
Docker’s Painless Environment Setup
Whether you’re using a base install of Anaconda or building your own data science super machine from Ubuntu, Docker is the tool you need for a quick, easy start. Getting your environment set up and primed for data science can be done with two simple lines of code:
- ‘docker pull’
- ‘docker run’
The first command pulls a published Docker image of whichever environment you want to use from Docker Hub onto your local machine. The second runs that image, producing a pseudo-virtual machine to conduct your experiments in. Once the image is running in a container, you can open your favorite IDE inside it and get to work immediately. You can also build your own images, but keep this in mind: for most things you want to do, there’s likely a published image out there already—including everything from TensorFlow to Tesseract OCR.
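As a sketch, those two lines for the TensorFlow image published on Docker Hub look like this (the image name is the official ‘tensorflow/tensorflow’ repository; adjust the tag for GPU or versioned builds). These commands assume Docker is already installed and you have a network connection:

```shell
# Pull the published TensorFlow image from Docker Hub
docker pull tensorflow/tensorflow

# Run it as a container with an interactive shell inside
docker run -it tensorflow/tensorflow bash
```

The ‘-it’ flags attach an interactive terminal, so you land at a prompt inside the container’s environment rather than on your host machine.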
Docker’s Hassle-Free Work Sharing
Data science is rarely done (well) by a lone genius. To solve tough problems, it often takes a team of people contributing ideas, insights, and ingenuity to produce an elegant solution. And with Docker, this collaboration is easier than ever.
There are three main ways to share your Dockerized work:
- You can publish an image of your machine to Docker Hub where anyone can grab it with a ‘docker pull’ (Note: Docker Hub is a public repository).
- You can save your image to a .tar file and share that; when your colleague receives the file, they can recreate the image locally from it.
- If you built your own image, or have access to the build files (many Docker Hub images publish theirs), you can zip up and share those files. With them, your colleagues can replicate your image with a ‘docker build.’
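In practice, the three options above might look like the following (the image name ‘my-experiment’ and the account name ‘yourname’ are placeholders for your own):

```shell
# Option 1: publish to Docker Hub (a public repository) under your account
docker tag my-experiment yourname/my-experiment
docker push yourname/my-experiment

# Option 2: write the image to a .tar file and share the file directly
docker save -o my-experiment.tar my-experiment
# ...your colleague then recreates the image locally with:
docker load -i my-experiment.tar

# Option 3: share the build files; your colleague rebuilds the image
docker build -t my-experiment .
```

Option 1 is the most convenient but public; options 2 and 3 keep the image private to whoever you send the files to.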
How to Get Started With Docker
To get started, you’ll need to install Docker (anyone you share experiments with will also need to install Docker). It provides built-in examples to try out, but I recommend grabbing a Dockerfile from your favorite repo, or Docker Hub, and trying that out instead; I personally test drove Docker with a tool I happened to use at the time named glyphminer. Some quick commands you should know:
- To pull an existing image from Docker Hub: docker pull …
- To build an image from a Dockerfile: docker build …
- To display all images: docker images
- To run a container from a built image: docker run …
- To display all containers: docker ps -a
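Putting those commands together, a first test-drive session might look like this (the ‘ubuntu’ image is just a small, widely available example):

```shell
# Fetch an image from Docker Hub and confirm it arrived
docker pull ubuntu
docker images

# Start a container from the image with an interactive shell
docker run -it ubuntu bash

# After exiting, list all containers, including stopped ones
docker ps -a
```

Note that ‘docker ps’ alone shows only running containers; the ‘-a’ flag is what reveals the stopped ones left over from earlier runs.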
You can also read Docker’s tutorial guides for more detailed, step-by-step instructions on getting started. Trust me—once you realize how many headaches Docker can spare you, you’ll be glad you did.