Introduction to Data Cleaning

Welcome Back

Yesterday:

  • Team Project Work
  • Managing Containers
  • DataViz

Today:

  • Data Cleaning
  • Parallel Computing
  • Team Project Work

Introduction to Data Cleaning

AKA Grammar School

Session Overview

  • Clean Data
  • Grammar of Data Recap
  • Order of Operations
  • Time Complexity (Big O Notation)
  • Functional Programming (Mapping)
  • Building a Pipeline
  • Troubleshooting
  • Performance

Clean Data

  • The first step in Analysis
  • Get your data ready to be analyzed
  • Specific to your project
  • Remove/Rename/Reformat
  • Connect


Clean data is the same as raw data, just in a different presentation.

Why?

  • If you can think through what you want to do, you can code much better.

  • You can use building blocks of grammar to build sentences, then paragraphs…

  • Order of Operations (Efficacy & Efficiency)

  • Communication…

Grammar of Data Manipulation

Perspectives:

  • Tabular Data

  • Gridded Data

  • Image Data

Verbs

  • SELECT
  • SORT
  • FILTER (WHERE)
  • RENAME (ALIAS)
  • UNIQUE (DISTINCT)
  • HEAD/FIRST (LIMIT)
  • GROUP BY
  • AGGREGATE (SUMMARIZE)
  • PIVOT/TRANSPOSE
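
As a rough illustration, most of the verbs above can be expressed on a small table held as a list of dictionaries, using only the Python standard library. The table, its column names (county, year, temp), and its values are made up for this sketch; a dataframe library would typically expose the same verbs as methods.

```python
from itertools import groupby
from operator import itemgetter
from statistics import mean

# Hypothetical toy table: one dict per row.
rows = [
    {"county": "A", "year": 2020, "temp": 10.0},
    {"county": "B", "year": 2020, "temp": 14.0},
    {"county": "A", "year": 2021, "temp": 12.0},
    {"county": "B", "year": 2021, "temp": 16.0},
    {"county": "A", "year": 2021, "temp": 12.0},  # duplicate observation
]

# SELECT + RENAME: keep two columns, alias "temp" as "temp_c".
selected = [{"county": r["county"], "temp_c": r["temp"]} for r in rows]

# FILTER (WHERE): keep rows from 2021 only.
filtered = [r for r in rows if r["year"] == 2021]

# UNIQUE (DISTINCT): drop exact duplicate rows, preserving order.
unique_rows = [dict(t) for t in dict.fromkeys(tuple(r.items()) for r in rows)]

# SORT, then HEAD (LIMIT): the two warmest rows.
warmest = sorted(unique_rows, key=itemgetter("temp"), reverse=True)[:2]

# GROUP BY + AGGREGATE: mean temperature per county.
by_county = sorted(unique_rows, key=itemgetter("county"))
mean_temp = {
    county: mean(r["temp"] for r in group)
    for county, group in groupby(by_county, key=itemgetter("county"))
}
```

Note that itertools.groupby only groups adjacent rows, which is why the table is sorted by county first.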

Order of Operations

Reduce the size of your data before the more computationally intensive steps.


Do These:

  • SELECT
  • FILTER
  • UNIQUE
  • HEAD/FIRST/LIMIT

Before These:

  • JOIN
  • GROUPBY/AGGREGATE
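
The ordering above can be sketched with a toy join. This is a minimal illustration, assuming two made-up tables (readings and stations) and a hypothetical join helper that counts how many rows it has to process; both orderings give the same answer, but filtering first sends far fewer rows through the join.

```python
# Hypothetical tables: 1,000 readings across 10 stations, plus a station lookup.
readings = [{"station": i % 10, "temp": float(i)} for i in range(1000)]
stations = {s: {"county": f"county_{s % 3}"} for s in range(10)}

def join(rows, lookup):
    """Attach station attributes to each row; return joined rows and rows processed."""
    joined = [{**r, **lookup[r["station"]]} for r in rows]
    return joined, len(rows)  # one lookup per input row

# JOIN first, FILTER second: every row goes through the join.
joined_all, work_join_first = join(readings, stations)
late = [r for r in joined_all if r["station"] == 3]

# FILTER first, JOIN second: only the rows we need are joined.
needed = [r for r in readings if r["station"] == 3]
early, work_filter_first = join(needed, stations)
```

Here work_join_first is 1000 and work_filter_first is 100, while late and early hold identical rows.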

Time Complexity

How an algorithm's running time scales as the size of its input grows.

Constant Time | O(1)

Accessing an indexed value:

  • First, push, pop, or a direct index reference

Linear Time | O(n)

  • Counting
  • Moving through an array or dataframe

Quadratic+ Time | O(n²) and higher

  • Nested for loops

for i in range(n):
    for j in range(n):
        run_your_code(i, j)


How many times will your operation run?
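
One way to answer that question is simply to count. The sketch below uses three hypothetical helper functions (not from the slides) that return how many steps each pattern takes for an input of size n.

```python
def constant_time(values):
    """O(1): one indexed access, regardless of input size."""
    _ = values[0]
    return 1

def linear_time(values):
    """O(n): one step per element."""
    steps = 0
    for _ in values:
        steps += 1
    return steps

def quadratic_time(values):
    """O(n^2): nested loops, one step per pair of elements."""
    steps = 0
    for _ in values:
        for _ in values:
            steps += 1
    return steps

# Step counts for growing input sizes: (constant, linear, quadratic).
counts = {
    n: (constant_time(range(n)), linear_time(range(n)), quadratic_time(range(n)))
    for n in (10, 100, 1000)
}
```

Growing the input 100-fold (from 10 to 1000) leaves the constant count at 1, multiplies the linear count by 100, and multiplies the quadratic count by 10,000.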

Takeaway


Use dataframe/array operations whenever possible.

Map Reduce
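
A minimal sketch of the map/reduce idea using the Python standard library, with made-up temperature values. Dataframe and array libraries generally apply the same pattern to whole columns at once, so this plain-Python version is only meant to show the shape of the operation.

```python
from functools import reduce
from operator import add

# Hypothetical readings in degrees Celsius.
temps_c = [10.0, 12.5, 15.0, 17.5]

# Map: apply one function to every element, with no explicit index bookkeeping.
temps_f = list(map(lambda c: c * 9 / 5 + 32, temps_c))

# Reduce: fold the mapped values down to a single summary.
total_f = reduce(add, temps_f)
mean_f = total_f / len(temps_f)
```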

Functional Programming

Object Oriented Programming

  • Object Based
  • Data are stored in objects, as attributes
  • Data are updated in place, inside the objects
  • Many things, few operations

Functional Programming

  • Function Based
  • Functions are Mapped to immutable data
  • Many operations, few things

Blending Programming

Yes, And!


Using the best of both worlds: Mapping functions to inputs.
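
A small sketch of what blending can look like in Python: an immutable object holds the data, and a pure function is mapped over a list of those objects. The Reading class and to_fahrenheit function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: instances cannot be modified after creation
class Reading:
    county: str
    temp: float
    unit: str = "C"

def to_fahrenheit(reading: Reading) -> Reading:
    """Pure function: returns a new Reading instead of updating the old one."""
    return Reading(reading.county, reading.temp * 9 / 5 + 32, "F")

readings = [Reading("A", 10.0), Reading("B", 15.0)]

# Map the function over the inputs; the original readings are left untouched.
converted = list(map(to_fahrenheit, readings))
```

Because the inputs are never changed, the same function can be applied to each element independently, which is what makes this style easy to parallelize.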

FP Case Study

Select the values within the interquartile range (IQR) for each atomic unit of parallelization (here, each county):

county_data
  .groupby(county)
  .map(fun = getIQR, data = Temp)

Or:

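from collections import defaultdict
from statistics import quantiles

# Pass 1: build the groups by hand, one row at a time.
groups = defaultdict(list)                 # county -> list of Temp values
for row in county_data:                    # assumes each row is a dict
    groups[row["county"]].append(row["Temp"])

# Pass 2: for each group, keep only the values between Q1 and Q3.
kept = {}
for county, values in groups.items():
    q1, _, q3 = quantiles(values, n=4)     # first and third quartiles
    kept[county] = [v for v in values if q1 < v < q3]   # the rest are discarded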
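
For comparison, the grouped-map version at the top of this case study can be sketched as runnable Python. This is an illustration under assumptions: the sample county_data is made up, get_iqr_values is a hypothetical stand-in for getIQR, and "within the IQR" is taken to mean strictly between Q1 and Q3, as in the loop version.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical input: one dict per row, with a county label and a Temp value.
temps = {"A": [1.0, 2.0, 3.0, 4.0, 5.0], "B": [10.0, 20.0, 30.0, 40.0, 50.0]}
county_data = [{"county": c, "Temp": t} for c, values in temps.items() for t in values]

def group_by_county(rows):
    """Group Temp values by county (the atomic unit of parallelization)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["county"]].append(row["Temp"])
    return dict(groups)

def get_iqr_values(values):
    """Keep the values strictly between the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    return [v for v in values if q1 < v < q3]

groups = group_by_county(county_data)

# Map one function over every group; each group is processed independently.
iqr_by_county = dict(zip(groups, map(get_iqr_values, groups.values())))
```

Since each call to get_iqr_values touches only its own group, the map step is the part that could be handed to parallel workers.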

Building a Pipeline

# Set parallel environment
cores/workers = 12
backend_connection = 'local'

our_pipeline_map = (
  our_data
    .select data
    .unique observations
    .filter spatio-temporally
    .drop missing values # If needed/wanted
    .groupby atomic unit of parallelization
    .aggregate
        apply functions: median, max, linear models...
    )
    
our_result = compute(our_pipeline_map)
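
The pipeline outline above can be sketched with the Python standard library. Everything here is an assumption made for illustration: the sample data, the helper names (prepare, group, summarize), and the use of a thread pool as a stand-in backend. A process pool or a distributed backend would be the more typical choice for CPU-heavy work.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median

# Set parallel environment (placeholder value for this sketch).
N_WORKERS = 4

# Hypothetical data, already reduced to the selected columns:
# (county, year, temp); None marks a missing value.
our_data = [
    ("A", 2020, 10.0), ("A", 2021, 12.0), ("A", 2021, 12.0), ("A", 2021, None),
    ("B", 2020, 14.0), ("B", 2021, 16.0), ("B", 2021, 18.0), ("C", 1999, 5.0),
]

def prepare(rows):
    """Cheap, size-reducing steps first: unique, filter, drop missing."""
    rows = list(dict.fromkeys(rows))               # unique observations
    rows = [r for r in rows if r[1] >= 2020]       # filter in time
    return [r for r in rows if r[2] is not None]   # drop missing values

def group(rows):
    """Group by the atomic unit of parallelization (here, county)."""
    groups = {}
    for county, _, temp in rows:
        groups.setdefault(county, []).append(temp)
    return groups

def summarize(item):
    """Aggregate one group; this is the function mapped across workers."""
    county, temps = item
    return county, {"median": median(temps), "max": max(temps)}

groups = group(prepare(our_data))

# The context manager shuts the worker pool down when the block exits.
with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    our_result = dict(pool.map(summarize, groups.items()))
```

Using the pool as a context manager is one way to avoid forgetting to close the backend.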

Troubleshooting

  • Remembering to set up/close backend
  • RAM requirements
  • Environment (packages) on the workers
  • Monitoring: GTOP, Dask UI, Lambda UI


Start Small


  • Split Troubleshooting between process and parallelization.
  • Run one single unit of atomic parallelization.
  • Scale in sequence.
  • Scale in parallel, but start small: a small number of units and a small number of workers.
  • Benchmark
  • Estimate pricing
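
The steps above can be sketched as follows. This is a minimal illustration, assuming a hypothetical process_unit function standing in for the real work and a thread pool standing in for the real backend; the timing gives only a rough first estimate, since real units rarely take equal time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def process_unit(unit):
    """Hypothetical stand-in for the work done on one atomic unit."""
    return sum(x * x for x in range(unit * 1000))

units = list(range(1, 9))

# 1. Run one single unit and time it.
start = time.perf_counter()
one_result = process_unit(units[0])
seconds_per_unit = time.perf_counter() - start

# 2. Scale in sequence over a few units.
sequential = [process_unit(u) for u in units[:4]]

# 3. Scale in parallel, small: few units, few workers.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(process_unit, units[:4]))

# 4. Rough estimate for the full job from the single-unit benchmark.
estimated_total_seconds = seconds_per_unit * len(units)
```

Checking that the parallel results match the sequential ones separates bugs in the process from bugs in the parallelization. Multiplying the estimated worker time by a provider's rate then gives a first pricing estimate.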

Performance