Serverless Multi-Worker Parallelization

Session Overview

Serverless Introduction
Lambda
Lithops
Modal

Serverless

Wait, you mean there are no servers?

Server-based

Serverless

Managed Services
Cloud Service Provider manages:
- Updating
- Provisioning
- Scaling (From Zero on Up)

Focus on the application, not the logistics

Serverless Offerings

Function as a Service (AWS Lambda)
Container Orchestration as a Service (AWS Fargate)
Database as a Service (Amazon Aurora Serverless, Amazon DynamoDB)
Object Storage as a Service (Amazon S3)

Serverless Advantages

Allows individuals or small teams to run infrastructure at scale
Focus on domain logic, not infrastructure
Transfers undifferentiated heavy lifting to the cloud provider
Massive scalability (beginning at zero)
Amazing (and cost-effective) for spiky workloads

Serverless Disadvantages

Must fit the problem to the constraints of the service (e.g. runtime and memory limits)
Can be extremely costly for constantly high workloads
Paradigm shift (a big one)

All Things Distributed

Lambda

Serverless Function as a Service from AWS

Write a function in Python* and AWS runs it for you.

* or JavaScript, Java, Go, Rust, …

Lambda

Runs functions for up to 15 minutes, with up to 10 GB (10,240 MB) of memory
Eminently scalable: default limit of 1,000 concurrent executions per Region, which can be raised on request
Built to be event driven
Natively integrates with almost all AWS services
Very, very cheap

We are going to use Lambda to scale format conversion (CSV to Parquet).

Lambda

Start small, then scale

Initialize Lambda Function
Add Lambda Layer
Write Lambda Code
Configure Lambda
- Timeout
- Memory
- Permissions
Test
Set up URL Endpoint
Test
Scale
Turn off Endpoint
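
The Configure step above could be scripted rather than clicked through in the console. A minimal sketch, assuming boto3 is installed and AWS credentials are set up; the function name `csv-to-parquet` and the timeout and memory values are hypothetical. The helper checks the values against Lambda's limits (timeout up to 900 seconds, memory from 128 MB to 10,240 MB) before anything is sent to AWS.

```python
MAX_TIMEOUT_S = 900      # Lambda's upper limit: 15 minutes
MIN_MEMORY_MB = 128
MAX_MEMORY_MB = 10_240   # 10 GB


def build_config(function_name, timeout_s, memory_mb):
    """Validate the settings and return kwargs for update_function_configuration."""
    if not 1 <= timeout_s <= MAX_TIMEOUT_S:
        raise ValueError(f"timeout must be 1-{MAX_TIMEOUT_S} s, got {timeout_s}")
    if not MIN_MEMORY_MB <= memory_mb <= MAX_MEMORY_MB:
        raise ValueError(
            f"memory must be {MIN_MEMORY_MB}-{MAX_MEMORY_MB} MB, got {memory_mb}"
        )
    return {
        "FunctionName": function_name,
        "Timeout": timeout_s,
        "MemorySize": memory_mb,
    }


def apply_config(config):
    """Push the settings to AWS (needs boto3 and permission to update the function)."""
    import boto3  # imported lazily so build_config works without AWS access

    return boto3.client("lambda").update_function_configuration(**config)


# Hypothetical function name and values, chosen only for illustration.
config = build_config("csv-to-parquet", timeout_s=300, memory_mb=2048)
```

Calling `apply_config(config)` would then update the deployed function. Permissions are not part of this call: they live on the function's IAM execution role, which for the conversion in this session would need read access to the input bucket and write access to the output bucket.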

Lambda Code

Also: GitLab Repo

import json
import logging
import os
import sys

import awswrangler as wr

# Set up logging: send INFO and above to stdout, which Lambda forwards to CloudWatch Logs
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)
logger.addHandler(handler)

def lambda_handler(event, context):
    """
    Called by AWS Lambda each time the function is triggered.

    Expects `event` to contain:
      input_path:  S3 URI of the CSV file to read
      output_path: S3 prefix to write the Parquet dataset to
    """

    logger.info("******************Start AWS Lambda Handler************")
    logger.info("## ENVIRONMENT VARIABLES")
    # Note: inside Lambda this includes the function's temporary AWS
    # credentials, so drop this line outside of a demo.
    logger.info(os.environ)
    logger.info("## EVENT")
    logger.info(event)

    input_path = event['input_path']
    output_path = event['output_path']

    # Read the whole CSV file into a pandas DataFrame
    logger.info(f"Reading in {input_path}")
    df = wr.s3.read_csv(input_path)

    # Write a Parquet dataset, split into one folder per ELEMENT value
    wr.s3.to_parquet(
        df=df,
        path=output_path,
        dataset=True,
        partition_cols=['ELEMENT']
    )
    logger.info(f"Wrote {output_path}")

    return {
        'statusCode': 200,
        'body': json.dumps(f'Wrote {output_path}')
    }

Lambda Test Event

{
    "input_path": "s3://noaa-ghcn-pds/csv/by_year/1889.csv",
    "output_path": "s3://esds-lambda-example/dsw/year=1889/"
}
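
Once a single test event works, the Scale step amounts to sending one event per input file. A minimal sketch, assuming boto3 is installed and the function is deployed under the hypothetical name `csv-to-parquet`; the path templates simply generalize the test event above to other years. `InvocationType="Event"` asks Lambda to queue the call and return immediately, so a small thread pool on the client is enough to start many workers.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Path templates generalized from the test event above.
INPUT_TEMPLATE = "s3://noaa-ghcn-pds/csv/by_year/{year}.csv"
OUTPUT_TEMPLATE = "s3://esds-lambda-example/dsw/year={year}/"


def build_events(first_year, last_year):
    """Return one Lambda event per year, first_year to last_year inclusive."""
    return [
        {
            "input_path": INPUT_TEMPLATE.format(year=year),
            "output_path": OUTPUT_TEMPLATE.format(year=year),
        }
        for year in range(first_year, last_year + 1)
    ]


def invoke_async(event, function_name="csv-to-parquet"):
    """Queue one asynchronous invocation; the function name is hypothetical."""
    import boto3  # imported lazily so build_events works without AWS access

    response = boto3.client("lambda").invoke(
        FunctionName=function_name,
        InvocationType="Event",  # asynchronous: do not wait for the result
        Payload=json.dumps(event),
    )
    return response["StatusCode"]


def fan_out(events, invoke=invoke_async, max_threads=16):
    """Send all events using a small client-side thread pool."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(invoke, events))
```

In keeping with "start small, then scale", a first run might be `fan_out(build_events(1889, 1891))` before widening the year range. Because the calls are asynchronous, success or failure of each conversion shows up in the function's logs rather than in the return value.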

Lithops

A framework for scientific serverless computing.

Automates configuration and deployment of Lambda functions
Python package
Also runs on other AWS services
Works on OpenShift

Lithops

Format Conversion: Kerchunk

Kerchunk maps gridded array files (e.g. NetCDF) into a Zarr-like format.
Uses "sidecar" JSON/Parquet files to hold the byte ranges of each chunk.
Enables lazy loading of complete, massive NetCDF datasets.

JSON: JavaScript Object Notation. Essentially a Python dict in file form, with some stricter formatting rules.
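
To make the "Python dict in file form" comparison concrete, here is a small sketch using only the standard library. The reference fragment is hand-written for illustration (the bucket, variable name, and byte values are hypothetical); it follows the general shape of a Kerchunk reference file, where each chunk key maps to a file location, a byte offset, and a byte length.

```python
import json

# Hypothetical, hand-written fragment in the shape of a Kerchunk sidecar file:
# chunk key -> [file URL, byte offset, byte length]
refs = {
    "version": 1,
    "refs": {
        "temperature/0.0": ["s3://example-bucket/data.nc", 4096, 1024],
    },
}

text = json.dumps(refs)      # dict -> JSON text, as it would be stored on disk
restored = json.loads(text)  # JSON text -> dict, identical to the original

# The stricter rules show up when a dict uses Python-only features:
print(json.dumps({1: "a"}))                 # {"1": "a"}  keys must be strings
print(json.dumps({"t": (1, 2)}))            # {"t": [1, 2]}  tuples become arrays
print(json.dumps({"x": None, "ok": True}))  # {"x": null, "ok": true}
```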

Lithops

Configure Environment
- Install Lithops
- Set up credentials
- Connect to AWS Backends
Create a Kerchunk sidecar file for each input file
- Start small, scale up
Combine into a single Kerchunk JSON file
Test reading the result in as Zarr
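
The steps above might look roughly like the following. This is a sketch under several assumptions, not the session's actual notebook: Lithops is already configured with an AWS backend, the worker runtime has `kerchunk`, `fsspec`, `s3fs`, and `h5py` installed, the input files are NetCDF4/HDF5 in a public bucket (hence `anon=True`), and the files are concatenated along a dimension named `time`. The worker's parameter is deliberately called `file_path`, because Lithops gives special meaning to map-function parameters named `obj` or `url`.

```python
def make_reference(file_path):
    """Worker: scan one NetCDF4/HDF5 file and return its Kerchunk references (a dict)."""
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    with fsspec.open(file_path, mode="rb", anon=True) as f:
        return SingleHdf5ToZarr(f, file_path).translate()


def map_files(func, file_paths, executor=None):
    """Run func once per file on serverless workers and collect the results in order."""
    if executor is None:
        import lithops  # the backend (e.g. AWS Lambda) comes from the Lithops config

        executor = lithops.FunctionExecutor()
    executor.map(func, file_paths)
    return executor.get_result()


def combine_references(refs, concat_dim="time"):
    """Merge the per-file references into one reference set (assumed dimension name)."""
    from kerchunk.combine import MultiZarrToZarr

    mzz = MultiZarrToZarr(
        refs,
        concat_dims=[concat_dim],
        remote_protocol="s3",
        remote_options={"anon": True},
    )
    return mzz.translate()


def open_as_zarr(combined):
    """Open the combined references lazily with xarray's Zarr engine."""
    import xarray as xr

    return xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            "consolidated": False,
            "storage_options": {
                "fo": combined,
                "remote_protocol": "s3",
                "remote_options": {"anon": True},
            },
        },
    )
```

A first run could pass only two or three paths to `map_files(make_reference, ...)`, check the returned dicts, and only then submit the full file list before calling `combine_references` and `open_as_zarr`. The exact keyword arguments for opening reference files have shifted between xarray, fsspec, and Zarr releases, so the last function in particular may need adjusting to the installed versions.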
