Ingredients for a successful project:
Thinking about Products:
Exploratory Data Analysis (EDA)
The first step in any project.
Initial investigation into the data
Check assumptions
Spot patterns
Find anomalies
Identify problems before investing time
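A minimal sketch of what these first checks can look like in practice, using only the Python standard library. The readings are hypothetical sample values invented for illustration, and the 1.5 x IQR rule is just one common convention for flagging anomalies, not a prescribed method.

```python
import statistics

# Hypothetical sample: daily temperature readings in deg C.
# None marks a missing value.
readings = [12.1, 11.8, None, 12.4, 55.0, 12.0, 11.9, None, 12.3]


def summarize(values):
    """Return basic EDA facts: counts, range, median, and IQR-based outliers."""
    present = [v for v in values if v is not None]
    # Quartiles of the non-missing values (default 'exclusive' method).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n_total": len(values),
        "n_missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "outliers": [v for v in present if v < low or v > high],
    }


summary = summarize(readings)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Even a summary this small surfaces two things worth resolving before investing more time: two missing readings and one value far outside the rest.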
Answer:
Are there gaps? Are there assumptions in the data that prevent me from answering my question? Do I need more data?
What challenges might we encounter using these data?
What are our data boundary conditions?
If, for example, we wanted to use these data for ML, where would we need to be careful applying or extending the resulting models? (If we want to predict something in Alaska, but all our data are from the continental US, that could be a problem.)
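The Alaska example can be turned into a simple boundary check. The sketch below assumes the training data carry latitude/longitude pairs and uses a plain bounding box as the boundary; the coordinates and the function names `bounding_box` and `inside` are hypothetical, chosen only for illustration. A real project might need a tighter test (for example a convex hull or a check per variable).

```python
# Hypothetical training locations (latitude, longitude) in the continental US.
train_points = [
    (40.0, -105.3),
    (34.1, -118.2),
    (41.9, -87.6),
    (29.8, -95.4),
    (47.6, -122.3),
]


def bounding_box(points):
    """Return (lat_min, lat_max, lon_min, lon_max) covering all points."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return min(lats), max(lats), min(lons), max(lons)


def inside(point, box):
    """True if the point falls within the bounding box of the data."""
    lat_min, lat_max, lon_min, lon_max = box
    lat, lon = point
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


box = bounding_box(train_points)
anchorage = (61.2, -149.9)  # outside the sampled region
denver = (39.7, -105.0)     # inside the sampled region

print("Anchorage within data bounds:", inside(anchorage, box))
print("Denver within data bounds:", inside(denver, box))
```

A prediction requested for a point outside the box is an extrapolation, and is a signal to collect more data or to report lower confidence.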
Assess:
Quantity
Quality
Volume (How Much?)
Velocity (How fast/frequently?)
Variety (Formats, data types?)
Veracity (Do we trust these data?)
Complete (Missing data?)
Corrupt (Files ok?)
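The last two items, Complete and Corrupt, lend themselves to automated checks. This is a rough sketch under stated assumptions: the file contents and rows are made up, the "expected" checksum stands in for one published by a data provider, and `sha256_of` and `missing_fraction` are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_fraction(rows, field):
    """Fraction of rows where `field` is absent or empty."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) in (None, ""))
    return missing / len(rows)


# Corrupt? Compare a file's checksum with the expected value.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.csv"
    path.write_bytes(b"station,temp_c\nA,12.1\nB,\n")
    expected = sha256_of(path)  # stand-in for a provider-published checksum
    intact = sha256_of(path) == expected
    path.write_bytes(b"station,temp_c\nA,12.1\n")  # simulate a truncated file
    corrupted = sha256_of(path) != expected

# Complete? Measure how much of a field is missing.
rows = [
    {"station": "A", "temp_c": "12.1"},
    {"station": "B", "temp_c": ""},
    {"station": "C", "temp_c": "11.8"},
    {"station": "D", "temp_c": "12.4"},
]
print("File intact before truncation:", intact)
print("Truncation detected:", corrupted)
print("Missing temp_c fraction:", missing_fraction(rows, "temp_c"))
```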
Data Patterns
R
Python
Both languages share characteristics that have made them the languages of choice for data science:
On-Premise
Cloud
Advantages
Disadvantages
Advantages
Disadvantages
Desirable Attributes in a Format:
Formats (Cloud Optimized)
Let’s take a look…
Key Takeaways:
Earth System Data Science in the Cloud