Reproducible Research: Introduction to Containers

Session Overview

  • The Building Blocks of Reproducible Research
  • Why use Containers?
  • Intro to How Containers Work
  • Intro to Building Containers
  • Containers and Package Management

The Building Blocks of Reproducible Research

To build a reproducible scientific research project, we use:

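  • Git: tracks every change to our code, so any result can be tied to an exact version.
  • Package/Environment Managers: record the scientific computing libraries we depend on, and their versions, so they can be reinstalled elsewhere.
  • Containers: capture the rest of our computing environment (operating system libraries, system tools, configuration) so it can be recreated too.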

Why Use Containers?

Containers:

  • Allow you to “Write Code Once, Deploy Anywhere”
  • Get rid of the “It Works on My Computer” Problem
  • Run On-Prem, In the Cloud, and on Microservice Platforms

Specifically, containers do two things:

  1. Package your work into reproducible, portable units.
  2. Allow you to deploy your work as a microservice.

Why Use Containers in the Cloud?

To scale in the cloud, we need code that is:

  • Portable
  • Modular
  • Version Controlled

How containers work

Where Containers Run

Building Containers

Building Containers: Defining Dockerfiles

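FROM python:3.9-slim

# Work in a dedicated directory rather than the image root
WORKDIR /app

# Get Rust (the slim base image does not ship curl, so install it first)
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*
RUN curl https://sh.rustup.rs -sSf | bash -s -- -y

ENV PATH="/root/.cargo/bin:${PATH}"

# Install Python dependencies
# (the 2020 resolver has been pip's default since pip 20.3, so no extra flag is needed)
COPY requirements.txt ./requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Add app to image
COPY . ./
# Serve the app with four gunicorn workers on port 8050
CMD ["gunicorn", "-b", "0.0.0.0:8050", "app:server", "-w", "4"]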

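Key Instructions:

  • FROM: sets the base image that the build starts from
  • RUN: executes a command at build time and saves the result as a new image layer
  • ENV: sets an environment variable for later build steps and for the running container
  • COPY: copies files from the build context into the image
  • CMD: sets the default command that runs when a container starts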
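
The CMD line points gunicorn at app:server, that is, an object named server inside a module app.py. The slides do not show that module, so the sketch below is only a hypothetical stand-in: it assumes a plain WSGI callable written with the standard library, whereas the real application is probably a web-framework object.

```python
# app.py: hypothetical stand-in for the application the Dockerfile serves.
# gunicorn's "app:server" argument means: import module `app`, use object `server`.


def server(environ, start_response):
    """Minimal WSGI application: answer every request with a plain-text greeting."""
    body = b"Hello from inside the container!\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    # Tell the WSGI server the status and headers, then return the body chunks.
    start_response("200 OK", headers)
    return [body]
```

Inside the container, the gunicorn command from the Dockerfile would import this module and serve the server callable on port 8050.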

Building Containers: Defining Dockerfiles, continued

Some projects require lots of software installations. For instance, a single machine learning project may require:

  • C, C++, and Fortran compilers;
  • Relevant C, C++, and Fortran libraries;
  • Hardware drivers and CUDA for interfacing with GPUs during model training;
  • Python libraries for executing machine learning code and analyzing results;
  • Some final, extremely obscure scientific computing library that has that one tool that you really need.

That’s a lot for one researcher to install. This is where the magic of Docker Images comes in.

Building Containers: using Docker Images

Recall that Dockerfiles tell Docker what to put inside a Docker Image, which is then saved until it needs to be spun up into a container.
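
The Dockerfile-to-image-to-container flow corresponds to two Docker CLI calls, docker build and docker run. The sketch below assembles those commands in Python so they can be scripted. The image name my-app is a hypothetical placeholder, and the port follows the 8050 used in the earlier Dockerfile.

```python
import shlex


def build_command(tag, context="."):
    """Command that builds an image from the Dockerfile found in `context`."""
    return ["docker", "build", "-t", tag, context]


def run_command(tag, port=8050):
    """Command that starts a throwaway container and publishes `port` to the host."""
    return ["docker", "run", "--rm", "-p", f"{port}:{port}", tag]


if __name__ == "__main__":
    # "my-app" is a hypothetical image name used only for illustration.
    for cmd in (build_command("my-app"), run_command("my-app")):
        # Print rather than execute; pass `cmd` to subprocess.run(cmd, check=True)
        # on a machine with Docker installed to actually run it.
        print(shlex.join(cmd))
    # prints:
    #   docker build -t my-app .
    #   docker run --rm -p 8050:8050 my-app
```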

Chances are that if you need to get a piece of software into an image, someone else has needed it, too.

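Pre-built Docker Images are available from reputable registries. Starting from one speeds up building the image that you need for your specific application. We use the FROM instruction to pull the base image. The example below uses the latest tag for brevity; pinning a specific tag makes builds more reproducible, because latest can point to a different image over time.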

FROM pangeo/ml-notebook:latest

Docker Registry

Building containers: using Python packages and environments

You may have noticed that our Dockerfile included a call to pip. Why is that?

It’s because Docker containers, at their core, behave like extremely lightweight operating systems with nothing extra installed: an image contains only what its base image and Dockerfile put there.

Thus, you need to install anything that you need, including Python libraries.

To install what we need, we call pip install and pass it a list of our desired Python libraries via requirements.txt, which we usually store in the same directory as the Dockerfile.

Frequently, pip is sufficient for our needs when it comes to creating Docker Images and running containers. Sometimes, though, we may need additional capabilities…
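
For an image to rebuild the same way later, it helps if requirements.txt pins exact versions. The sketch below flags entries that do not; the package names and versions in the sample file are hypothetical, and the parsing is deliberately simple (it does not handle URLs or environment markers).

```python
def unpinned(requirement_lines):
    """Return the requirement lines that do not pin an exact version with `==`."""
    loose = []
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r / -e
            continue
        if "==" not in line:
            loose.append(line)
    return loose


# Hypothetical requirements.txt contents, for illustration only.
example = """\
# web app dependencies
dash==2.14.2
gunicorn==21.2.0
numpy>=1.20
pandas
"""

print(unpinned(example.splitlines()))  # prints ['numpy>=1.20', 'pandas']
```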

Package & Environment Management

Package and environment managers:

  • Provide isolated environments that handle dependencies, conflicts, and package versions.

  • Are portable and version controlled (their specifications are plain text files).
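
As a small illustration of an isolated environment, Python's standard library can create one through the venv module. This is a minimal sketch: the directory name .venv is an assumption, and pip is left out of the environment to keep the example fast and offline.

```python
import tempfile
import venv
from pathlib import Path


def make_env(parent):
    """Create an isolated Python environment under `parent` and return its path."""
    env_dir = Path(parent) / ".venv"  # conventional name, chosen here as an assumption
    # with_pip=False avoids bootstrapping pip; pass with_pip=True to install into the env.
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env_dir = make_env(tmp)
    # pyvenv.cfg is the marker file that identifies a virtual environment.
    print((env_dir / "pyvenv.cfg").exists())  # prints True
```

Inside a container the image itself already provides isolation, so a virtual environment is optional there; it is more useful on a shared workstation.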

Python Package and Environment Managers

  • Conda
  • Pyenv
  • Virtual Environments
    • Primer
    • Guide
  • Poetry

Docker Resources

  • Docker Docs
  • Ultimate Cheat Sheet
  • Docker Cheat Sheet
  • Docker Cheat Sheet 2
  • RStudio Guide to Docker
  • Docker 101
  • Getting Started with Docker
  • Docker for Data Scientists

Docker Resources (Continued)

  • Course Reference Guide
  • Reproducible Research Cloud Chat

Earth System Data Science in the Cloud
