Introduction to Data Cleaning

Welcome Back

Yesterday:

Team Project Work
Managing Containers
DataViz

Today:

Data Cleaning
Parallel Computing
Team Project Work

Introduction to Data Cleaning

AKA Grammar School

Session Overview

Clean Data
Grammar of Data Recap
Order of Operations
Time Complexity (Big O Notation)
Functional Programming (Mapping)
Building a Pipeline
Troubleshooting
Performance

Clean Data

The first step in Analysis
Get your data ready to be analyzed
Specific to your project
Remove/Rename/Reformat
Connect

Clean data is the same as raw data, just in a different presentation.

Why?

If you can think through what you want to do, you can code much better.
You can use building blocks of grammar to build sentences, then paragraphs…
Order of Operations (Efficacy & Efficiency)
Communication…

Grammar of Data Manipulation

Perspectives:

Tabular Data
Gridded Data
Image Data

Verbs

SELECT
SORT
FILTER (WHERE)
RENAME (ALIAS)
UNIQUE (DISTINCT)
HEAD/FIRST (LIMIT)
GROUP BY
AGGREGATE (SUMMARIZE)
PIVOT/TRANSPOSE
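
A rough sketch of how most of these verbs look on tabular data, using plain Python lists of dictionaries. The `rows` table and its field names are invented for illustration, and PIVOT is left out to keep the example short. Dataframe libraries expose the same verbs under their own method names.

```python
from collections import defaultdict

# Hypothetical table: field names and values are invented for illustration.
rows = [
    {"county": "Adams", "year": 2020, "temp": 12.5},
    {"county": "Adams", "year": 2020, "temp": 14.5},
    {"county": "Boulder", "year": 2020, "temp": 9.0},
    {"county": "Boulder", "year": 2019, "temp": 8.0},
]

# SELECT: keep only the columns you need.
selected = [{"county": r["county"], "temp": r["temp"]} for r in rows]

# FILTER (WHERE): keep only the rows you need.
filtered = [r for r in rows if r["year"] == 2020]

# SORT: order rows by a column.
ordered = sorted(rows, key=lambda r: r["temp"])

# RENAME (ALIAS): same values under a new column name.
renamed = [{"county": r["county"], "temp_c": r["temp"]} for r in selected]

# UNIQUE (DISTINCT): one entry per distinct value.
counties = sorted({r["county"] for r in rows})

# HEAD/FIRST (LIMIT): the first few rows.
first_two = ordered[:2]

# GROUP BY, then AGGREGATE (SUMMARIZE): one summary value per group.
grouped = defaultdict(list)
for r in filtered:
    grouped[r["county"]].append(r["temp"])
mean_temp = {county: sum(temps) / len(temps) for county, temps in grouped.items()}
```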

Order of Operations

Reduce the size of your data before the more computationally intensive steps.

Do These:

SELECT
FILTER
UNIQUE
HEAD/FIRST/LIMIT

Before These:

JOIN
GROUPBY/AGGREGATE
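
To make the ordering concrete, here is a minimal sketch that counts the comparisons a naive join performs with and without an early FILTER and SELECT. The `stations` and `readings` tables and the `join` helper are hypothetical; real engines join more cleverly, but the idea of shrinking inputs first still applies.

```python
# Hypothetical tables: field names and values are invented for illustration.
stations = [
    {"station_id": 1, "county": "Adams"},
    {"station_id": 2, "county": "Boulder"},
    {"station_id": 3, "county": "Boulder"},
]
readings = [
    {"station_id": 1, "year": 2019, "temp": 11.0},
    {"station_id": 1, "year": 2020, "temp": 12.5},
    {"station_id": 2, "year": 2020, "temp": 9.0},
    {"station_id": 3, "year": 2020, "temp": 10.0},
    {"station_id": 3, "year": 2018, "temp": 8.5},
]

def join(left, right, key):
    """Naive nested-loop JOIN; returns the joined rows and the comparison count."""
    comparisons = 0
    joined = []
    for left_row in left:
        for right_row in right:
            comparisons += 1
            if left_row[key] == right_row[key]:
                joined.append({**left_row, **right_row})
    return joined, comparisons

# JOIN first, FILTER afterwards: every reading is compared with every station.
joined_all, cost_join_first = join(readings, stations, "station_id")
late_filter = [row for row in joined_all if row["year"] == 2020]

# FILTER and SELECT first, then JOIN: fewer, narrower rows reach the join.
recent = [
    {"station_id": r["station_id"], "temp": r["temp"]}
    for r in readings
    if r["year"] == 2020
]
early_filter, cost_filter_first = join(recent, stations, "station_id")

print(cost_join_first, cost_filter_first)
```

Both orders keep the same temperatures, but the filtered version does less work in the join.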

Time Complexity

How an algorithm's running time scales as the number of inputs grows.

Constant Time | O(1)

Accessing an indexed value:

First, Push, Pop, Actual index reference

Linear Time | O(n)

Counting
Moving through an array or dataframe

Quadratic+ Time | O(n^2) and beyond

Nested for loops

for i in range(n):
  for j in range(n):
    run_your_code()

How many times will your operation run?
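
One way to answer that question is to count the steps directly. The sketch below is illustrative only: the three function names are made up, and each returns how many times its body ran for an input of a given size.

```python
def constant_steps(values):
    """O(1): a single indexed access, no matter how long the list is."""
    _ = values[0]
    return 1

def linear_steps(values):
    """O(n): one step per element."""
    steps = 0
    for _ in values:
        steps += 1
    return steps

def quadratic_steps(values):
    """O(n^2): the inner loop runs in full for every pass of the outer loop."""
    steps = 0
    for _ in values:
        for _ in values:
            steps += 1
    return steps

for n in (10, 100, 1000):
    data = list(range(n))
    print(n, constant_steps(data), linear_steps(data), quadratic_steps(data))
```

Multiplying the input by 10 leaves the constant count unchanged, multiplies the linear count by 10, and multiplies the nested-loop count by 100.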

Takeaway

Use Dataframe/Array operations whenever possible.
Map Reduce
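
A small sketch of the map and reduce idea using only the Python standard library. The temperature values are invented; in a dataframe or array library the same step would usually be a single column-wise operation.

```python
from functools import reduce

# Hypothetical temperatures in Celsius.
temps_c = [11.0, 12.5, 9.0, 10.0]

# Map: apply one function to every element independently.
temps_f = list(map(lambda c: c * 9 / 5 + 32, temps_c))

# Reduce: fold the mapped values into a single result.
total_f = reduce(lambda a, b: a + b, temps_f)
mean_f = total_f / len(temps_f)
```

Because each mapped element is computed independently, the map step is the part that can later be spread across workers.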

Functional Programming

Object Oriented Programming

Object Based
Data are stored in object attributes
Data are updated in place inside objects
Many things, few operations

Functional Programming

Function Based
Functions are mapped onto immutable data
Many operations, few things

Blending Programming

Yes, And!

Using the best of both worlds: Mapping functions to inputs.
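
A possible illustration of the blend, assuming a hypothetical `Reading` class: the data live in objects, the objects are immutable, and a plain function is mapped over them to produce new objects.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)          # frozen: attributes cannot be reassigned
class Reading:
    county: str
    temp: float
    unit: str = "C"

def to_fahrenheit(reading: Reading) -> Reading:
    """Return a new Reading; the input object is left untouched."""
    return replace(reading, temp=reading.temp * 9 / 5 + 32, unit="F")

# Hypothetical inputs.
readings = [Reading("Adams", 10.0), Reading("Boulder", 20.0)]

# Map the function onto the inputs.
converted = list(map(to_fahrenheit, readings))
```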

FP Case Study

Select the values within the IQR for each atomic unit of parallelization:

county_data
  .groupby(county)
  .map(fun = getIQR, data = Temp)

Or:

groups = {}   # county label -> list of Temp values

for row in rows(county_data):
  if row.county is in groups:
      Add row.Temp to that group
  else:
      Add the group, then add row.Temp to it

for group in groups:
  Q1 = first_quartile(group)
  Q3 = third_quartile(group)
  for element in group:
      if element is greater than Q1
      and element is less than Q3:
          KEEP
      else:
          Discard
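
For comparison, a runnable sketch of the mapped version using only the Python standard library. The `county_data` records are invented, and `get_iqr` here is an assumed reading of `getIQR`: it keeps the values strictly between the first and third quartiles, as the loop version does. Quartiles can be defined in several ways; this sketch uses the default method of `statistics.quantiles`.

```python
from itertools import groupby
from operator import itemgetter
from statistics import quantiles

# Hypothetical (county, Temp) records; the values are invented for illustration.
county_data = [
    ("Adams", 1.0), ("Boulder", 10.0), ("Adams", 2.0), ("Adams", 3.0),
    ("Boulder", 20.0), ("Adams", 4.0), ("Adams", 5.0), ("Boulder", 30.0),
    ("Adams", 6.0), ("Adams", 7.0), ("Boulder", 40.0), ("Adams", 8.0),
    ("Boulder", 50.0),
]

def get_iqr(temps):
    """Return the values strictly between the first and third quartiles."""
    q1, _median, q3 = quantiles(temps, n=4)
    return [t for t in temps if q1 < t < q3]

# itertools.groupby only groups adjacent rows, so sort by county first.
by_county = sorted(county_data, key=itemgetter(0))

# Map get_iqr onto each group (one group per county).
iqr_by_county = {
    county: get_iqr([temp for _, temp in rows])
    for county, rows in groupby(by_county, key=itemgetter(0))
}
```

Each county is handled independently of the others, which is what makes it a natural unit of parallelization.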

Building a Pipeline

# Set parallel environment
workers = 12                     # number of cores/workers
backend_connection = 'local'

our_pipeline_map = (
  our_data
    .select data
    .unique observations
    .filter spatio-temporally
    .drop missing values         # if needed/wanted
    .groupby atomic unit of parallelization
    .aggregate
        apply functions: median, max, linear models...
)

our_result = compute(our_pipeline_map)
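
The outline above could be sketched as follows, with a standard-library thread pool standing in for a real parallel backend. Everything named here (`raw`, `build_groups`, `summarize`, `WORKERS`) is hypothetical, and unlike the outline this version runs eagerly rather than building a lazy map and computing it at the end.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median

WORKERS = 4  # assumption: size this to the machine or cluster

# Hypothetical raw records: (county, year, temp); None marks a missing value.
raw = [
    ("Adams", 2020, 12.5), ("Adams", 2020, 12.5), ("Adams", 2019, 11.0),
    ("Adams", 2020, 14.5),
    ("Boulder", 2020, 9.0), ("Boulder", 2020, None), ("Boulder", 2020, 10.0),
]

def build_groups(records, year):
    """UNIQUE, FILTER, drop missing, SELECT, then GROUP BY county."""
    unique = set(records)                          # UNIQUE observations
    kept = [
        (county, temp)                             # SELECT county and temp
        for county, rec_year, temp in unique
        if rec_year == year and temp is not None   # FILTER, drop missing
    ]
    groups = {}
    for county, temp in kept:                      # GROUP BY atomic unit
        groups.setdefault(county, []).append(temp)
    return groups

def summarize(item):
    """AGGREGATE one atomic unit (one county)."""
    county, temps = item
    return county, {"median": median(temps), "max": max(temps)}

groups = build_groups(raw, year=2020)

# The with-block starts the workers and shuts them down on exit.
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    our_result = dict(pool.map(summarize, groups.items()))
```

Threads keep the sketch simple; for CPU-heavy functions a process pool or a distributed backend is usually the better fit.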

Troubleshooting

Remember to set up and close the backend
RAM requirements
Environment (packages) on the workers
Monitoring: GTOP, Dask UI, Lambda UI

Start Small

Split troubleshooting between the process and the parallelization.
Run one single atomic unit of parallelization.
Scale in sequence.
Scale in parallel, starting small: a small number of units and a small number of workers.
Benchmark.
Estimate pricing.
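
A rough sketch of the benchmark and pricing steps. `process_unit` is a stand-in for the real work on one atomic unit, the price and unit counts are placeholder numbers, and `estimate_cost` assumes perfect scaling across workers, which real jobs rarely achieve.

```python
import time

def process_unit(size):
    """Stand-in for the work done on one atomic unit."""
    return sum(x * x for x in range(size))

def benchmark(sizes):
    """Run the units in sequence and return the elapsed seconds."""
    start = time.perf_counter()
    for size in sizes:
        process_unit(size)
    return time.perf_counter() - start

def estimate_cost(seconds_per_unit, n_units, n_workers, price_per_worker_hour):
    """Projected cost, assuming the work splits evenly across workers."""
    wall_hours = seconds_per_unit * n_units / n_workers / 3600
    return wall_hours * n_workers * price_per_worker_hour

# 1. Time a single unit.
one_unit = benchmark([50_000])

# 2. Scale in sequence and check that time grows roughly as expected.
ten_units = benchmark([50_000] * 10)

# 3. Project the full job (placeholder numbers, not real prices).
projected = estimate_cost(
    one_unit, n_units=3000, n_workers=12, price_per_worker_hour=0.10
)
print(f"1 unit: {one_unit:.4f}s, 10 units: {ten_units:.4f}s, est. cost: {projected:.4f}")
```

If the ten-unit run takes far more than ten times the single-unit run, that is worth investigating before adding workers.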
