I/O and Data Formats

Welcome Back

Yesterday:

  • Cloud9
  • Intro to AWS
  • Team Git Repo

Welcome Back

Today:

  • I/O
  • Containers
  • Guest Speaker (Thanks Douglas!)
  • Programmatic Cloud Access
  • Team Projects

Session Overview

  • Input/Output
  • Data Formats
    • Parquet
    • Zarr
    • COG

Input/Output

Effective and efficient input and output (I/O) will make or break a project.


Avoid if possible:

  • Iteratively reading files into a large object with a sequential loop.
  • Reading in a monolithic large file only to extract summary statistics on one variable.
  • Reading in a large file, then saving it to a new in-memory object with each processing step.
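The first anti-pattern has a simple fix worth sketching, assuming pandas and a handful of CSV inputs (the in-memory "files" here are illustrative): collect the pieces first, then concatenate exactly once.

```python
import io
import pandas as pd

# Three small CSV "files" standing in for a directory of inputs.
files = [
    io.StringIO("x,y\n1,2\n3,4\n"),
    io.StringIO("x,y\n5,6\n"),
    io.StringIO("x,y\n7,8\n"),
]

# Anti-pattern: grow one large object inside a sequential loop.
# Each pd.concat copies everything read so far.
slow = pd.DataFrame()
for f in files:
    slow = pd.concat([slow, pd.read_csv(f)])

# Better: read the pieces into a list, then concatenate once.
for f in files:
    f.seek(0)  # rewind the in-memory files for the second pass
frames = [pd.read_csv(f) for f in files]
fast = pd.concat(frames, ignore_index=True)
```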

Data Formats

The Foundation for Performant Data Science:

Performant Data Formats

  • Push-down filtering
  • Parallel Access
    • Subfiles make up complete dataset
  • Lazy-loading

Components of Performance

  • Size
  • Speed
  • Zero (or close to zero) Copy

Components of Performance

  • Size
    • Push-Down Filtering (read what you need)
  • Speed
    • Parallel Access
  • Zero (or close to zero) Copy
    • Lazy Loading

Performance Takeaways

Seconds matter…


Big Data is just a lot of Small Data


The corollary: To solve big data problems, make them small data problems.
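One standard-library-only way to read that corollary: stream a file in small chunks and keep only a running summary, so memory never holds more than one small piece at a time.

```python
import csv
import io

def mean_of_column(lines, column, chunk_size=2):
    """Stream rows in small chunks, keeping only a running sum and count."""
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    chunk = []
    for row in reader:
        chunk.append(float(row[column]))
        if len(chunk) == chunk_size:
            total += sum(chunk)
            count += len(chunk)
            chunk = []  # drop the processed piece
    total += sum(chunk)  # flush any final partial chunk
    count += len(chunk)
    return total / count

data = io.StringIO("temp\n10\n20\n30\n40\n")
print(mean_of_column(data, "temp"))  # -> 25.0
```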

Session Overview

  • Input/Output
  • Data Formats
    • Parquet
    • Zarr
    • COG

Parquet

  • Apache Parquet
  • Columnar Data Store
  • Tabular Data
  • The Better CSV
  • Orders of Magnitude Improvements in Size and Speed
  • Lazy loading in R (arrow) and Python (Polars, Dask)
  • Stable, well supported, community
  • Underlying foundation for many AWS Serverless Offerings

Zarr

  • Zarr Project
  • Chunked, compressed, N-dimensional arrays (think NumPy)
  • Python
  • Push-down Filtering, Parallel, and Lazy Loading
  • Dask
  • Standard for Gridded Data

Cloud Optimized GeoTIFF (COG)

  • COG Project
  • Images (GeoTIFF)
  • Python and R
  • Push-down Filtering, Parallel, and Lazy Loading
  • Stable, embraced heavily by all three major cloud service providers (CSPs)

Data Conversion

  • Do it yourself: read data in one format, export it to another:
    • AWS Lambda is very good at this.
  • Use conversion tools: