I/O and Data Formats

Welcome Back

Yesterday:

  • Coder
  • Intro to AWS
  • Team Git Repo

Welcome Back

Today:

  • I/O
  • Containers
  • Parallel Computing Intro
  • Programmatic Cloud Access
  • Team Projects

Session Overview

  • Input/Output
  • Data Formats
    • Parquet
    • Zarr/Icechunk
    • Visualization Formats

Input/Output

Effective, efficient input and output (I/O) will make or break a project.


Avoid if possible:

  • Reading many files into one large object with a sequential loop.
  • Reading an entire monolithic file only to extract summary statistics for one variable (contrast sketched below).
  • Reading a large file, then copying it into a new in-memory object at each processing step.
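
A minimal sketch of the contrast, assuming a hypothetical measurements.parquet file with a temperature column (Polars shown; DuckDB and Dask offer similar lazy APIs):

    import polars as pl

    # Anti-pattern: load the entire file into memory just to summarize one column.
    df = pl.read_parquet("measurements.parquet")   # hypothetical file
    mean_temp = df["temperature"].mean()

    # Better: scan lazily, select only what is needed, and let the query engine
    # push the column selection and aggregation down to the file scan.
    mean_temp = (
        pl.scan_parquet("measurements.parquet")
        .select(pl.col("temperature").mean())
        .collect()
        .item()
    )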

Data Formats

The Foundation for Performant Data Science

Performant Data Formats

  • Push-Down Filtering
  • Parallel Access
    • Subfiles (partitions or chunks) make up the complete dataset; see the sketch after this list
  • Lazy Loading
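
A minimal sketch of subfile-based parallel access, assuming a hypothetical hive-partitioned Parquet directory measurements/ with a year partition column (pyarrow shown):

    import pyarrow.dataset as ds

    # Hypothetical layout: measurements/year=2023/part-0.parquet,
    #                      measurements/year=2024/part-0.parquet, ...
    dataset = ds.dataset("measurements/", format="parquet", partitioning="hive")

    # Push-down filtering: only the year=2024 subfiles are opened, only the
    # requested columns are read, and fragments can be scanned in parallel.
    table = dataset.to_table(
        columns=["site", "temperature"],
        filter=ds.field("year") == 2024,
    )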

Components of Performance

  • Size
  • Speed
  • Zero (or close to zero) Copy

Components of Performance

  • Size → Push-Down Filtering (read what you need)
  • Speed → Parallel Access
  • Zero (or close to zero) Copy → Lazy Loading

Performance Takeaways

Seconds matter…


Big Data is just a lot of Small Data


The corollary: To solve big data problems, make them small data problems.

Session Overview

  • Input/Output
  • Data Formats
    • Parquet
    • Zarr/Icechunk
    • Visualization Formats

Data Formats

Cloud Native Formats

Parquet

  • Apache Parquet
  • Apache Iceberg (a table format built on top of Parquet)
  • Columnar Storage Format
  • Tabular Data
  • The Better CSV
  • Orders of Magnitude Improvements in Size and Speed
  • Lazy Loading in R (arrow), Python (Polars, DuckDB, Dask), and JavaScript (sketched below)
  • Stable, well supported, with an active community
  • Underlying foundation for many AWS Serverless Offerings
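
A short sketch of push-down SQL over Parquet with DuckDB, assuming hypothetical files under measurements/ with site, temperature, and year columns:

    import duckdb

    # DuckDB reads only the columns and row groups needed to answer the query
    # (projection and predicate push-down), across all matching subfiles.
    result = duckdb.sql("""
        SELECT site, avg(temperature) AS mean_temp
        FROM 'measurements/*.parquet'
        WHERE year = 2024
        GROUP BY site
    """).df()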

Zarr/Icechunk

Data Formats

Cloud Native Formats

Data Conversion

In Action (sketch below, then in the notebook):

  • Parquet
  • Zarr
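
A minimal conversion sketch, assuming hypothetical file names and illustrative random array data (pandas for CSV to Parquet, xarray for Zarr):

    import numpy as np
    import pandas as pd
    import xarray as xr

    # Tabular data: convert a CSV to Parquet (hypothetical file names).
    pd.read_csv("measurements.csv").to_parquet("measurements.parquet")

    # Array data: write a chunked dataset to Zarr, then reopen it lazily.
    temps = xr.DataArray(
        np.random.rand(365, 180, 360),
        dims=("time", "lat", "lon"),
        name="temperature",
    )
    temps.to_dataset().chunk({"time": 30}).to_zarr("temperature.zarr", mode="w")

    lazy = xr.open_zarr("temperature.zarr")   # metadata only; chunks load on demand
    first_month_mean = lazy["temperature"].isel(time=slice(0, 30)).mean().compute()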

Coder:

  1. Spin up the Module 2 workspace from yesterday.
  2. Clone the Module 2 Practice repo (make sure you are in your home directory).
  3. Move into the Module 2 Practice directory.
  4. Run uv sync.
  5. Activate your virtual environment.
  6. Run:
python -m ipykernel install --user --name=coder --display-name="Python (UV Environment)"

Parquet and Zarr

  1. Open up the intro-to-io/intro-io.ipynb notebook.
  2. Play