Welcome Back
Yesterday:
- Cloud9
- Intro to AWS
- Team Git Repo
Today:
- I/O
- Containers
- Guest Speaker (Thanks Douglas!)
- Programmatic Cloud Access
- Team Projects
Session Overview
- Input/Output
- Data Formats
Parquet
- Apache Parquet
- Columnar Data Store
- Tabular Data
- The Better CSV
- Often orders-of-magnitude improvements in file size and read speed over CSV
- Lazy loading in R (arrow) and Python (Polars, Dask)
- Stable, well supported, active community
- Underlying foundation for many AWS Serverless Offerings
Zarr
- Zarr Project
- Chunked, compressed, N-dimensional arrays (think NumPy)
- Python
- Push-down Filtering, Parallel, and Lazy Loading
- Dask
- Standard for Gridded Data
Cloud Optimized GeoTIFF (COG)
- COG Project
- Images (GeoTIFF)
- Python and R
- Push-down Filtering, Parallel, and Lazy Loading
- Stable, heavily adopted by all three major cloud service providers (CSPs)
Data Conversion
- Do it yourself: read the data in one format, write it out in another:
- AWS Lambda is very good at this.
- Use conversion tools: