I/O and Data Formats

Welcome Back

Yesterday:

  • Cloud9
  • Intro to AWS
  • Team Git Repo

Welcome Back

Today:

  • I/O
  • Containers
  • Guest Speaker (Thanks Douglas!)
  • Programmatic Cloud Access
  • Team Projects

Session Overview

  • Input/Output
  • Data Formats
    • Parquet
    • Zarr
    • COG

Input/Output

Effective and efficient input and output (I/O) will make or break a project.


Avoid if possible:

  • Iteratively reading files into a large object with a sequential loop.
  • Reading in a monolithic large file only to extract summary statistics on one variable.
  • Reading in a large file, then saving it to a new in-memory object with each processing step.
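The first anti-pattern has a simple fix worth sketching, assuming pandas and a handful of CSV inputs (the in-memory "files" here are illustrative): collect the pieces first, then concatenate exactly once.

```python
import io
import pandas as pd

# Three small CSV "files" standing in for a directory of inputs.
files = [
    io.StringIO("x,y\n1,2\n3,4\n"),
    io.StringIO("x,y\n5,6\n"),
    io.StringIO("x,y\n7,8\n"),
]

# Anti-pattern: grow one large object inside a sequential loop.
# Each pd.concat copies everything read so far.
slow = pd.DataFrame()
for f in files:
    slow = pd.concat([slow, pd.read_csv(f)])

# Better: read the pieces into a list, then concatenate once.
for f in files:
    f.seek(0)  # rewind the in-memory files for the second pass
frames = [pd.read_csv(f) for f in files]
fast = pd.concat(frames, ignore_index=True)
```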

Data Formats

The Foundation for Performant Data Science:

Performant Data Formats

  • Push-down filtering
  • Parallel Access
    • Subfiles make up complete dataset
  • Lazy-loading

Components of Performance

  • Size
  • Speed
  • Zero (or close to zero) Copy

Components of Performance

  • Size
    • Push-Down Filtering (read what you need)
  • Speed
    • Parallel Access
  • Zero (or close to zero) Copy
    • Lazy Loading

Performance Takeaways

Seconds matter…


Big Data is just a lot of Small Data


The corollary: To solve big data problems, make them small data problems.
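One standard-library-only way to read that corollary: stream a file in small chunks and keep only a running summary, so memory never holds more than one small piece at a time.

```python
import csv
import io

def mean_of_column(lines, column, chunk_size=2):
    """Stream rows in small chunks, keeping only a running sum and count."""
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    chunk = []
    for row in reader:
        chunk.append(float(row[column]))
        if len(chunk) == chunk_size:
            total += sum(chunk)
            count += len(chunk)
            chunk = []  # drop the processed piece
    total += sum(chunk)  # flush any final partial chunk
    count += len(chunk)
    return total / count

data = io.StringIO("temp\n10\n20\n30\n40\n")
print(mean_of_column(data, "temp"))  # -> 25.0
```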

Session Overview

  • Input/Output
  • Data Formats
    • Parquet
    • Zarr
    • COG

Parquet

  • Apache Parquet
  • Columnar Data Store
  • Tabular Data
  • The Better CSV
  • Orders of Magnitude Improvements in Size and Speed
  • Lazy loading in R (arrow) and Python (Polars, Dask)
  • Stable, well supported, community
  • Underlying foundation for many AWS Serverless Offerings

Zarr

  • Zarr Project
  • Chunked, compressed, N-dimensional arrays (think NumPy)
  • Python
  • Push-down Filtering, Parallel, and Lazy Loading
  • Dask
  • Standard for Gridded Data

Cloud Optimized GeoTIFF (COG)

  • COG Project
  • Images (GeoTIFF)
  • Python and R
  • Push-down Filtering, Parallel, and Lazy Loading
  • Stable, embraced heavily by all three major cloud service providers (CSPs)

Data Conversion

  • Do it yourself: read data in one format, export it to another:
    • AWS Lambda is very good at this.
  • Use conversion tools: