Foundations of Parallel Computing

Session Overview

  • What is parallel computing
  • Units of parallelization
  • How it builds on Module 1 Foundation
  • MapReduce
  • Why MapReduce changed the world
  • The two key types of Parallel Computing

Parallel Computing

In performant computing, seconds matter…


Big Data is just a lot of Small Data


The corollary: To solve big data problems, make them small data problems.

Culture


“I don’t care about performance, I just want to get the job done…”




“If you don’t care about performance, you won’t get your job done.”

Performance:

  • Is Time.
  • Is Money.
  • Is CO2.
  • Makes the impossible possible.

Parallel Computing

  • Simultaneous use of multiple processing (compute) resources on a single computational problem.

  • Many hands make light work.

Parallel Terminology

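  • Parallel Computing: several computations execute at the same instant.
  • Concurrent Computing: several tasks are in progress over overlapping periods, though not necessarily at the same instant.
  • Distributed Computing: the work is spread across multiple machines that communicate over a network.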

Parallel Ingredients

  1. Compute
    • Central Processing Unit (CPU)
    • Graphics Processing Unit (GPU)
    • Cores, Processors.
  2. Memory
    • For data in current use
    • Random Access Memory (RAM)
  3. Storage
    • Where raw data are stored before and after use
    • e.g. Amazon S3 and EBS
  4. Networking
    • Data transfer between components
    • Faster is better


One of these will be a limiting factor on your performance.

Parallel Tuning

  • You can adjust Compute, Memory, Storage, and Networking almost independently.

  • Right-size resources for the work.

Units of Parallelization

Atomic Unit of Parallelization:

  • Your experimental Unit
  • Generally independent
  • Small
  • Mappable


The INNER loop of your nested for loops.
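
As a minimal sketch of that idea in Python: the body of the inner loop is pulled out into a function of one unit, after which the nested loops can be replaced by a map over the units. The station names, readings, and the `daily_total` helper are hypothetical, invented only for illustration.

```python
# Hypothetical data: rainfall readings (mm) keyed by (station, day).
readings = {
    ("station_a", "2024-01-01"): [0.0, 1.5, 2.5],
    ("station_a", "2024-01-02"): [0.0, 0.0, 0.5],
    ("station_b", "2024-01-01"): [3.0, 1.0, 0.0],
    ("station_b", "2024-01-02"): [0.0, 0.0, 0.0],
}
stations = ["station_a", "station_b"]
days = ["2024-01-01", "2024-01-02"]


def daily_total(unit):
    """Work for one atomic unit: one station on one day."""
    station, day = unit
    return station, day, sum(readings[(station, day)])


# Nested-loop version: the inner loop body is the atomic unit of work.
loop_results = []
for station in stations:
    for day in days:
        loop_results.append(daily_total((station, day)))

# Mapped version: enumerate the units, then map the same function over them.
units = [(station, day) for station in stations for day in days]
map_results = list(map(daily_total, units))

print(map_results == loop_results)
```

Because each unit is independent, the built-in `map` here could in principle be swapped for a parallel map (for example `concurrent.futures.ProcessPoolExecutor.map`) without changing `daily_total`.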

Atomic Units of Parallelization


Average daily precipitation?


7 day rolling average of temperature?


County mortality index?
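
One possible reading of the rolling-average question, sketched in Python under stated assumptions: a 7 day window needs seven consecutive days, so a single day cannot be the independent unit, whereas one station's whole series can. The station names, temperatures, and the `rolling_mean` helper are hypothetical.

```python
from collections import deque


def rolling_mean(values, window=7):
    """Trailing rolling mean; produces one value per full window."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        if len(buf) == window:
            out.append(sum(buf) / window)
    return out


# Hypothetical daily temperatures (deg C) per station.
temps = {
    "station_a": [1, 2, 3, 4, 5, 6, 7, 8],
    "station_b": [10, 10, 10, 10, 10, 10, 10, 17],
}

# Each station is an independent unit, so the work maps over stations.
result = dict(zip(temps, map(rolling_mean, temps.values())))
print(result)
```

The choice of unit depends on the question: the same data might be split by day for a daily average but by station (or county) for a windowed or grouped statistic.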

A bit of history…

  • Shared memory multiprocessors
  • Clusters
  • Your Computer
  • Your Cloud


… the challenge: programming in parallel.

MapReduce

Communications of the Association for Computing Machinery

Why MapReduce changed the world…

It gave us a common mental model and framework for big data tasks. MapReduce:


  1. Splits a big data problem into many small data problems.
  2. Is general and flexible enough to adapt to new problems.
  3. Allows for larger-than-memory processing.
  4. Scales incredibly well across
    • Domains
    • Size
  5. Moves compute to the data, not data to the compute.
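
The split, map, and reduce steps can be sketched with a word count, a common way to illustrate the model. This is a single-process illustration using only the Python standard library, not Hadoop's API; `read_chunks`, `map_chunk`, `merge_counts`, and the input lines are hypothetical names and data chosen for the example.

```python
from collections import Counter
from functools import reduce


def read_chunks(lines, chunk_size):
    """Split: yield small lists of lines, one chunk at a time."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def map_chunk(chunk):
    """Map: count words within one chunk, independently of other chunks."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce: combine two partial results into one."""
    return left + right


# Hypothetical input; in practice this could be a file object read lazily.
lines = [
    "big data is just a lot of small data",
    "make big problems small problems",
    "small data is easy",
]

partials = map(map_chunk, read_chunks(lines, chunk_size=2))
total = reduce(merge_counts, partials, Counter())
print(total["data"], total["small"], total["big"])
```

Since `map` and `read_chunks` are lazy, only one chunk and the running total are held in memory at a time, which is the larger-than-memory property from the list above in miniature.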


Much of the large-scale data technology you use today builds on the MapReduce paradigm.

… And how it is going to change yours.

MapReduce originated at Google and was popularized by Hadoop, an open-source implementation written in Java


… but now it is everywhere.

  • Split-Apply-Combine
  • Groupby-Aggregate
  • Chunk + Map

… including in Python.
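
A minimal split-apply-combine sketch using only the Python standard library; the county names and values are hypothetical. In pandas, roughly the same pattern is written as `df.groupby("county")["value"].mean()`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (county, observed value).
records = [
    ("adams", 2.0),
    ("baker", 5.0),
    ("adams", 4.0),
    ("baker", 7.0),
    ("clark", 1.0),
]

# Split: group the rows by key.
groups = defaultdict(list)
for county, value in records:
    groups[county].append(value)

# Apply: run the same function on every group independently.
# Combine: collect the per-group results into one structure.
summary = {county: mean(values) for county, values in groups.items()}
print(summary)
```

Each group is processed without reference to any other, so the apply step is the part that could be handed to parallel workers.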

The two key types of Parallel Computing

  • Single Machine: many cores sharing one machine's memory
  • Cluster: many machines connected by a network

Note: For each of these options, you can scale Compute, Memory, Storage, and Networking more or less independently.
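
For the single-machine case, a minimal sketch using the standard library's `concurrent.futures.ProcessPoolExecutor`, which maps independent units across worker processes on one computer. The `sum_of_squares` workload and the unit sizes are hypothetical stand-ins for real work; the cluster case needs a distributed framework and is not shown here.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(n):
    """CPU-bound work for one independent unit."""
    return sum(i * i for i in range(n))


def run_serial(units):
    """Process the units one after another in this process."""
    return [sum_of_squares(n) for n in units]


def run_parallel(units, workers=None):
    """Map the units across worker processes on this machine."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sum_of_squares, units))


if __name__ == "__main__":
    units = [200_000] * 8  # hypothetical workload sizes
    print("cores available:", os.cpu_count())
    print(run_parallel(units) == run_serial(units))
```

The `if __name__ == "__main__":` guard matters because worker processes may re-import the script on some platforms. Whether the parallel version is actually faster depends on how large each unit is relative to the overhead of starting processes and moving data between them.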


… And we will be exploring both of these in the next module…