Yesterday:
Today:
AKA Grammar School
Clean data holds the same information as raw data, just in a different presentation.
If you can think through what you want to do, you can code much better.
You can use building blocks of grammar to build sentences, then paragraphs…
Order of Operations (Efficacy & Efficiency)
Communication…
Perspectives:
Tabular Data
Gridded Data
Image Data
Do These:
Before These:
How an algorithm scales with more inputs.
Accessing an indexed value:
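A minimal sketch of the scaling idea, assuming plain Python lists. Looking a value up by its index takes one step no matter how long the list is, while searching for a value inspects more elements as the list grows. The helper names `get_by_index` and `linear_search_steps` are hypothetical and exist only for this illustration.

```python
def get_by_index(values, i):
    """Constant time, O(1): one lookup regardless of list length."""
    return values[i]


def linear_search_steps(values, target):
    """Linear time, O(n): count the elements inspected until target is found."""
    steps = 0
    for value in values:
        steps += 1
        if value == target:
            break
    return steps


small = list(range(10))
large = list(range(10_000))

# Worst case for the search: the target is the last element.
small_steps = linear_search_steps(small, small[-1])
large_steps = linear_search_steps(large, large[-1])

print(small_steps, large_steps)        # the step count grows with the input size
print(get_by_index(large, 9_999))      # still a single indexed access
```

Counting steps instead of timing keeps the comparison deterministic; wall-clock timings would show the same trend but vary from machine to machine.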
Use Dataframe/Array Operations whenever possible.
Map Reduce
Object Oriented Programming
Functional Programming
Yes, And!
Using the best of both worlds: Mapping functions to inputs.
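One way to picture mapping functions to inputs is the sketch below, which uses only the Python standard library. The `temps_by_county` data and the variable names are made up for illustration; the map step applies the same function to each group independently, and the reduce step combines the per-group results into one value.

```python
from functools import reduce
from statistics import median

# Hypothetical example data: temperature readings keyed by county.
temps_by_county = {
    "A": [10.0, 12.0, 14.0],
    "B": [20.0, 22.0, 24.0, 26.0],
}

# Map: apply one function to every group, each independent of the others.
county_medians = dict(
    zip(temps_by_county, map(median, temps_by_county.values()))
)

# Reduce: fold the per-group results into a single summary value.
warmest_median = reduce(max, county_medians.values())

print(county_medians)
print(warmest_median)
```

Because each group is processed independently in the map step, that step is the part that can be handed to separate workers.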
Select IQR from each Atomic Unit of Parallelization:
Or:
groups = {county: values}
for row in rows(county_wide):
    if row.label is in groups:
        Add Temp value to group
    else:
        Add group and Temp value
for group in groups:
    Q1 = first_quartile(group)
    Q3 = third_quartile(group)
    for element in group:
        if element is greater than Q1
        and element is less than Q3:
            KEEP
        else:
            Discard
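The loop above can be sketched in runnable Python, assuming the rows are simple `(county, temperature)` pairs. The sample data and the helper names `group_values` and `keep_within_iqr` are hypothetical. Quartiles come from `statistics.quantiles` with its default method, which is an assumption; other quartile conventions can give slightly different cut points. The strict inequalities follow the pseudocode.

```python
from collections import defaultdict
from statistics import quantiles


def group_values(rows):
    """Collect temperature values under their county label."""
    groups = defaultdict(list)
    for label, temp in rows:
        groups[label].append(temp)
    return groups


def keep_within_iqr(values):
    """Keep only values strictly between the first and third quartile."""
    if len(values) < 2:
        return list(values)  # assumption: too few points to compute quartiles
    q1, _, q3 = quantiles(values, n=4)
    return [v for v in values if q1 < v < q3]


# Hypothetical input: nine readings for county A, five for county B.
county_wide = [("A", v) for v in range(1, 10)]
county_wide += [("B", v) for v in (10, 20, 30, 40, 100)]

groups = group_values(county_wide)
iqr_by_county = {
    label: keep_within_iqr(values) for label, values in groups.items()
}

print(iqr_by_county)
```

Each county is filtered on its own, so the per-group step is a natural atomic unit of parallelization.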
# Set parallel environment
cores/workers = 12
backend_connection = 'local'
our_pipeline_map = (
    our_data
    .select data
    .unique observations
    .filter spatio-temporally
    .drop missing values  # If needed/wanted
    .groupby atomic unit of parallelization
    .aggregate
        apply functions: median, max, linear models...
)
our_result = compute(our_pipeline_map)
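The pipeline above might look like the following in standard-library Python. This is a sketch under stated assumptions, not the framework the slide has in mind: a local thread pool stands in for the parallel backend, the records are hypothetical `(county, year, temperature)` tuples, and `summarize` is an invented helper. Only a temporal filter is shown, and only median and max are aggregated.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median

WORKERS = 4  # assumption for a small local run; the pseudocode uses 12

# Hypothetical records: (county, year, temperature); None marks a missing value.
our_data = [
    ("A", 2020, 10.0),
    ("A", 2020, 10.0),  # duplicate observation
    ("A", 2021, 14.0),
    ("A", 2021, None),  # missing value
    ("B", 2019, 99.0),  # outside the time window
    ("B", 2020, 20.0),
    ("B", 2021, 30.0),
]


def summarize(item):
    """Aggregate one group: the atomic unit of parallelization."""
    county, temps = item
    return county, {"median": median(temps), "max": max(temps)}


unique_rows = list(dict.fromkeys(our_data))            # unique observations
in_window = [r for r in unique_rows if r[1] >= 2020]   # filter temporally
complete = [r for r in in_window if r[2] is not None]  # drop missing values

# Group by county.
groups = {}
for county, _, temp in complete:
    groups.setdefault(county, []).append(temp)

# Map the aggregation over the groups using a pool of workers.
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    our_result = dict(pool.map(summarize, groups.items()))

print(our_result)
```

Threads keep the example simple to run anywhere; for CPU-heavy aggregations, a process pool or a dataframe library with a parallel backend would be the more likely choice.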
Earth System Data Science in the Cloud