Beginning A Project

Session Overview

  • Defining a project
  • Introduction to EDA
  • Choosing a language
  • Finding Data
  • Accessing Data
  • Patterns of Analysis
  • Introduction to Data Formats

Defining A Project

Ingredients for a successful project:

  • Research Question!
  • Data
  • Analysis, AI/ML, etc.
  • Products

Thinking about Products:

  • Publication
  • Dataset
  • Pipeline

Exploratory Data Analysis (EDA)

  • Quantity
  • Quality

EDA

The first step in any project.

  • Initial investigation into the data

  • Check assumptions

  • Spot patterns

  • Find anomalies

  • Identify problems before investing time
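
A minimal sketch of that first look, in Python with pandas (the file name data.csv and its columns are hypothetical):

    import pandas as pd

    # Load the data (hypothetical file)
    df = pd.read_csv("data.csv")

    # Initial investigation: how big is it, and what types are in it?
    print(df.shape)
    df.info()

    # Check assumptions and spot patterns: summary statistics per column
    print(df.describe(include="all"))

    # Find anomalies: count missing values per column
    print(df.isna().sum())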

EDA Questions

EDA should answer:

  • Will these data help us answer our research question?

Are there gaps? Are there assumptions in the data that prevent us from answering our question? Do we need more data?

  • What challenges might we encounter using these data?

  • What are our data boundary conditions?

If, for example, we wanted to use these data for ML, where might we have to be careful applying/extending these models? (If we want to predict something in Alaska, but all our data are from the continental US, that could be a problem.)

EDA Principles

Assess:

  • Quantity

  • Quality

Quantity

  • Volume (How Much?)

  • Velocity (How fast/frequently?)

  • Variety (Formats, data types?)
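
A rough way to size up volume and variety for file-based data, sketched with the standard library (the data directory is hypothetical; velocity usually comes from the provider's update cadence rather than the files themselves):

    from collections import Counter
    from pathlib import Path

    data_dir = Path("data")  # hypothetical local copy of the dataset
    files = [p for p in data_dir.rglob("*") if p.is_file()]

    # Volume: how much?
    total_gb = sum(p.stat().st_size for p in files) / 1e9
    print(f"{len(files)} files, {total_gb:.2f} GB")

    # Variety: which formats?
    print(Counter(p.suffix.lower() for p in files))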

Quality

  • Veracity (Do we trust these data?)

  • Completeness (Missing data?)

  • Corruption (Are the files intact?)

  • Data Patterns

    • Distributions
    • Outliers
    • Drift
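
A sketch of simple checks for distributions, outliers, and drift with pandas (the file name and the value/time columns are hypothetical):

    import pandas as pd

    # Hypothetical file and column names
    df = pd.read_csv("data.csv", parse_dates=["time"])

    # Distributions: summary statistics and quantiles
    print(df["value"].describe())

    # Outliers: flag values more than three standard deviations from the mean
    z = (df["value"] - df["value"].mean()) / df["value"].std()
    print(df[z.abs() > 3])

    # Drift: do yearly means move over time?
    print(df.groupby(df["time"].dt.year)["value"].mean())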

Missing Data Tools

R: naniar

Python: missingno
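
For example, missingno renders missingness in a pandas DataFrame directly (a minimal sketch; data.csv is hypothetical). naniar fills the same role on the R side.

    import missingno as msno
    import pandas as pd

    df = pd.read_csv("data.csv")  # hypothetical file

    # Matrix view: where are the gaps, row by row?
    msno.matrix(df)

    # Bar chart: how complete is each column?
    msno.bar(df)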

Session Overview

  • Defining a project
  • Introduction to EDA
  • Choosing a language
  • Finding Data
  • Accessing Data
  • Patterns of Analysis
  • Introduction to Data Formats

Data Science Languages

R

  • Statistical Programming Language (from S)
  • Tidyverse ("*verse") ecosystem (tabular data + models)
  • Slower adoption, but very well done
  • RStudio drives development
  • Top-of-class ML (but not deep learning)
  • Really nice interface for Bayesian work

Python

  • General Programming Language (thank you, Guido)
  • Awesome glue language
  • Very rapid development, but sometimes has issues
  • Multifaceted development; watch Pangeo, Anaconda, and many others
  • Everywhere
  • Best for deep learning

Both languages share characteristics that make them successful and the languages of choice for data science:

  • Open Source
  • Scripted
  • Really good REPL
  • Literate Programming Support
  • Flexible extension/package/library ecosystems

Session Overview

  • Defining a project
  • Introduction to EDA
  • Choosing a language
  • Finding Data
  • Accessing Data
  • Patterns of Analysis
  • Introduction to Data Formats

Finding Data

NOAA Open Data Dissemination (NODD)

Other Sources

  • USGS, DOC, Census, USDA, NASA, data.gov, etc.

Accessing Data

What to look for in your data?

  1. Has the information you need.
  • Measures what you want.
  • Is complete.
  • Is spatiotemporally appropriate.

Accessing Data

What to look for in your data?

  2. Is accessible.
  • Cloud Object Storage
  • Format.
  • No throttling.
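
As a sketch of what "accessible" looks like in practice: many public datasets, including NOAA's, live in cloud object storage that can be browsed anonymously (the noaa-goes16 bucket is one example; substitute your own dataset):

    import s3fs

    # Anonymous access to public object storage (no credentials needed)
    fs = s3fs.S3FileSystem(anon=True)

    # Browse the bucket like a file system
    print(fs.ls("noaa-goes16")[:10])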

Patterns of Analysis

On-Premise

  • Download Data
  • Load into memory
  • Conduct Analysis
    • Single Threaded
    • Multi Threaded
    • Scheduling on HPC
  • Write new data/figure to disk
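
A minimal sketch of the on-premise pattern (the URL and the month/value columns are placeholders):

    import urllib.request

    import pandas as pd

    # Download data to local disk (placeholder URL)
    urllib.request.urlretrieve("https://example.org/data.csv", "data.csv")

    # Load into memory
    df = pd.read_csv("data.csv")

    # Conduct analysis (single threaded here; hypothetical columns)
    monthly = df.groupby("month")["value"].mean()

    # Write new data/figure to disk
    monthly.to_csv("monthly_means.csv")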

Cloud

  • Connect to data
  • Write/Map Analysis
  • Run Analysis
    • Lazy Loading
    • Natively and Massively Parallel
    • Portable Pipeline
  • New data/figure stays on cloud
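
And a sketch of the cloud pattern with xarray, Zarr, and Dask (the store URL and the variable name "value" are placeholders; assumes zarr, s3fs, and dask are installed and the store allows anonymous access):

    import xarray as xr

    # Connect to data: open a Zarr store in object storage lazily
    ds = xr.open_zarr(
        "s3://some-bucket/dataset.zarr",   # placeholder store
        storage_options={"anon": True},
    )

    # Write/map the analysis: nothing is computed yet (lazy loading)
    monthly = ds["value"].resample(time="1MS").mean()

    # Run the analysis: computation happens here, in parallel via Dask
    result = monthly.compute()

Point the same code at a distributed Dask cluster and it scales out without changes, which is what makes the pipeline portable.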

On-Premise

Advantages

  • Easy (Low complexity)
  • Faster for small datasets
  • Not dependent upon internet

Disadvantages

  • Does not scale
  • Hard to collaborate
  • Not production ready

Cloud

Advantages

  • Eminently scalable
  • Cheaper at scale (only pay for what you use)
  • Low barriers to entry
  • Collaborative
  • Easier path to production
  • Diversity of tooling and toolsets

Disadvantages

  • Need internet
  • Harder (more complexity)
  • Slower for small datasets

Session Overview

  • Defining a project
  • Introduction to EDA
  • Choosing a language
  • Finding Data
  • Accessing Data
  • Patterns of Analysis
  • Introduction to Data Formats

Introduction to Data Formats

Desirable Attributes in a Format:

  • Parallel Access
    • Subfiles or chunks make up the complete dataset
  • Push-down filtering
  • Lazy-loading
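
For example, Parquet supports both: the reader can be asked for only the columns and row groups it needs, so unneeded bytes are never pulled from storage (a sketch with pyarrow; the file, columns, and filter are hypothetical):

    import pyarrow.parquet as pq

    # Read two columns, and only rows where year == 2023.
    # The filter is pushed down, so row groups that cannot match
    # are skipped rather than read.
    table = pq.read_table(
        "observations.parquet",
        columns=["station_id", "temperature"],
        filters=[("year", "==", 2023)],
    )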

Introduction to Data Formats

Formats (Cloud Optimized)

  • Parquet - tabular! The better CSV.
    • GeoParquet - one to watch; not yet stable.
  • Zarr - Arrays! Awesome arrays! The better NetCDF?
  • Cloud Optimized GeoTIFF (COG) - Images!
  • Kerchunk - Not stable, but almost there.

Introduction to Data Formats

Let’s take a look…
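
For instance, a Cloud Optimized GeoTIFF can be read over HTTP one window at a time, so only the bytes for the requested tile are fetched (a sketch with rasterio; the URL is a placeholder):

    import rasterio
    from rasterio.windows import Window

    # Open a COG directly from a URL (placeholder); range reads
    # fetch only the header plus the requested window
    with rasterio.open("https://example.org/image_cog.tif") as src:
        print(src.profile)
        block = src.read(1, window=Window(0, 0, 512, 512))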

Introduction to Data Formats

Key Takeaways:

  • If you can find data in these formats, use them!
  • If not, it will likely be more effective and efficient to convert to these formats.

Object Storage & File Formats form the foundation for performant cloud computing.

Session Overview

  • Defining a project
  • Introduction to EDA
  • Choosing a language
  • Finding Data
  • Accessing Data
  • Patterns of Analysis
  • Introduction to Data Formats