Serverless Multi-Worker Parallelization

Session Overview

  • Serverless Introduction
  • Lambda
  • Lithops
  • Modal

Serverless


Wait, you mean there are no servers?

Server-based

Serverless

Serverless

  • Managed Services
  • Cloud Service Provider manages:
    • Updating
    • Provisioning
    • Scaling (From Zero on Up)

Focus on the application, not the logistics

Serverless Offerings

  • Function as a Service (AWS Lambda)
  • Container Orchestration as a Service (AWS Fargate)
  • Database as a Service (AWS Aurora/DynamoDB)
  • Object Storage as a Service (AWS S3)

Serverless Advantages

  • Allows individuals or small teams to run infrastructure at scale.
  • Focus on domain logic, not infrastructure
  • Transfers undifferentiated heavy lifting
  • Massive scalability (beginning at zero)
  • Amazing (and cost-effective) for spiky workloads

Serverless Disadvantages

  • Must fit problem to constraints of service
  • Extremely costly for constantly high workloads
  • Paradigm shift (Big Paradigm Shift)


All Things Distributed


Lambda

  • Serverless Function as a Service from AWS


Write a function in Python* and AWS runs it for you.


  * or JavaScript, Java, Go, Rust, …

Lambda

  • Runs functions for up to 15 minutes, with up to 10 GB of memory
  • Eminently scalable: 1,000 concurrent executions by default, and the limit can be raised on request
  • Built to be event driven
  • Natively integrates with almost all AWS services
  • Very, very cheap
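To put a rough number on "very, very cheap", here is a small cost sketch. The rates are assumptions: they are the commonly published on-demand x86 prices for us-east-1, they ignore the free tier, and they should be checked against current AWS pricing. The workload sizes are hypothetical.

```python
# Assumed on-demand rates (x86, us-east-1); verify against current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.20 / 1_000_000


def lambda_cost(invocations, seconds_each, memory_gb):
    """Estimate the on-demand cost in USD, ignoring the free tier."""
    compute = invocations * seconds_each * memory_gb * PRICE_PER_GB_SECOND
    requests = invocations * PRICE_PER_REQUEST
    return compute + requests


# Hypothetical spiky job: 250 files, 60 s each, 2 GB of memory.
spiky = lambda_cost(invocations=250, seconds_each=60, memory_gb=2)

# Hypothetical constant load: one 2 GB worker kept busy for 30 days,
# expressed as back-to-back 15-minute invocations (the Lambda maximum).
constant = lambda_cost(invocations=30 * 24 * 4, seconds_each=900, memory_gb=2)

print(f"spiky job:     ${spiky:.2f}")
print(f"constant load: ${constant:.2f}")
```

The same per-second billing that makes short bursts cheap adds up when workers are busy around the clock, which is the trade-off listed under the serverless disadvantages.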


We are going to use Lambda to scale format conversion.

Lambda

Start small, then scale

  1. Initialize Lambda Function
  2. Add Lambda Layer
  3. Write Lambda Code
  4. Configure Lambda
    • Timeout
    • Memory
    • Permissions
  5. Test
  6. Set up URL Endpoint
  7. Test
  8. Scale
  9. Turn off Endpoint

Lambda Code

Also: GitLab Repo

import json
import awswrangler as wr
import logging
import sys
import os

# Set up logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)
logger.addHandler(handler)

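def lambda_handler(event, context):
    """
    Entry point that AWS Lambda calls on each invocation.

    Expects `event` to carry 'input_path' (a CSV file in S3) and
    'output_path' (the S3 prefix for the Parquet dataset).
    """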
    
    logger.info("******************Start AWS Lambda Handler************")
    logger.info("## ENVIRONMENT VARIABLES")
    # Log names only: the Lambda environment also holds temporary AWS credentials.
    logger.info(sorted(os.environ))
    logger.info("## EVENT")
    logger.info(event)
    
    input_path = event['input_path']
    output_path = event['output_path']
    
    logger.info(f"Reading in {input_path}")
    df = wr.s3.read_csv(input_path)
    
    wr.s3.to_parquet(
        df=df,
        path=output_path,
        dataset=True,
        partition_cols=['ELEMENT']
    )
    logger.info(f"Wrote {output_path}")

    # Report success back to the caller
    return {
        'statusCode': 200,
        'body': json.dumps(f'Wrote {output_path}')
    }

Lambda Test Event

{
    "input_path": "s3://noaa-ghcn-pds/csv/by_year/1889.csv",
    "output_path": "s3://esds-lambda-example/dsw/year=1889/"
}
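
For the "Scale" step, one event per input file is needed. The following is a minimal sketch, not the workshop's own code: it builds events in the same shape as the test event above and sends them through a caller-supplied function. The helper names (`build_event`, `fan_out`, `dry_run`) are hypothetical, and `dry_run` only reports what would be sent, so the sketch runs without AWS access.

```python
from concurrent.futures import ThreadPoolExecutor


def build_event(year, output_prefix="s3://esds-lambda-example/dsw"):
    """Build one event in the same shape as the test event."""
    return {
        "input_path": f"s3://noaa-ghcn-pds/csv/by_year/{year}.csv",
        "output_path": f"{output_prefix}/year={year}/",
    }


def fan_out(years, invoke, max_workers=16):
    """Send one event per year through `invoke`, several at a time."""
    events = [build_event(year) for year in years]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(invoke, events))


def dry_run(event):
    """Stand-in for a real invocation: report what would be sent."""
    return f"would convert {event['input_path']}"


results = fan_out(range(1889, 1892), dry_run)
for line in results:
    print(line)
```

To run for real, `dry_run` would be replaced by a callable that delivers the event to the deployed function; `max_workers` should stay below the account's concurrency limit.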

Lithops

A framework for scientific serverless computing.

  • Automates configuration and deployment of Lambda functions
  • Python package
  • Also runs on other AWS services
  • Works on OpenShift

Lithops

Format Conversion: Kerchunk

  • Kerchunk maps gridded array files into a Zarr-like format.
  • Uses ‘sidecar’ JSON/Parquet files to hold byte ranges.
  • Enables lazy loading of complete, massive NetCDF datasets.
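
The sidecar idea can be illustrated without Kerchunk itself. The sketch below is a simplified stand-in under stated assumptions: a small local binary file plays the role of a NetCDF file, and the sidecar maps each key to a file, byte offset, and byte length. The file names, key names, and the `read_chunk` helper are hypothetical.

```python
import json
import os
import tempfile

# A small binary file standing in for a NetCDF file (hypothetical content).
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "example.bin")
with open(data_path, "wb") as f:
    f.write(b"HEADERchunk-0chunk-1FOOTER")

# A sidecar in the spirit of a Kerchunk reference file: each key maps to
# [file, byte offset, byte length]. Key names here are illustrative.
sidecar = {
    "version": 1,
    "refs": {
        "temp/0": [data_path, 6, 7],
        "temp/1": [data_path, 13, 7],
    },
}
sidecar_path = os.path.join(workdir, "example.json")
with open(sidecar_path, "w") as f:
    json.dump(sidecar, f)


def read_chunk(sidecar_file, key):
    """Read only the bytes that the sidecar lists for `key`."""
    with open(sidecar_file) as f:
        path, offset, length = json.load(f)["refs"][key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


print(read_chunk(sidecar_path, "temp/1"))
```

Because the reader seeks straight to the listed range, the rest of the file is never loaded, which is what makes lazy access to very large datasets practical.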


JSON: JavaScript Object Notation. Essentially a Python dict in file form, with some stricter formatting rules.
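
A short round trip shows what "stricter formatting rules" means in practice. The record below is made up for illustration; only the standard-library `json` module is used.

```python
import json

# A Python dict with a non-string key, a tuple, None, and a bool.
record = {"name": "TMAX", 1889: (1.5, None), "valid": True}

# Serializing applies JSON's rules: keys become strings, tuples become
# arrays, None becomes null, and True becomes true.
text = json.dumps(record)
print(text)

# Reading it back gives a dict, but not an identical one.
restored = json.loads(text)
print(restored)
```

The restored dict has the string key "1889" and a list in place of the tuple, so a dict does not always survive the trip through JSON unchanged.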

Lithops

  1. Configure Environment
    • Install Lithops
    • Set up credentials
    • Connect to AWS Backends
  2. Create individual Kerchunk sidecar files
    • Start small, scale up
  3. Combine into a single Kerchunk JSON file
  4. Test reading in as Zarr
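
The map-then-combine shape of steps 2 and 3 can be sketched as follows. This is a simplified stand-in, not the workshop's code: `make_sidecar` and `combine` are hypothetical helpers that treat each file as a single chunk, where a real workflow would call Kerchunk's per-file and multi-file tools instead. `lithops_map` uses Lithops' `FunctionExecutor`, and assumes Lithops is installed and configured with a backend; it is defined but not called here, so the sketch runs locally.

```python
import os
import tempfile


def make_sidecar(path):
    """Step 2: build references for one file (here: one chunk per file)."""
    return {"path": path, "length": os.path.getsize(path)}


def combine(sidecars):
    """Step 3: merge per-file references into one reference set."""
    refs = {}
    for index, item in enumerate(sorted(sidecars, key=lambda s: s["path"])):
        refs[f"data/{index}"] = [item["path"], 0, item["length"]]
    return {"version": 1, "refs": refs}


def local_map(func, items):
    """Run the map step in-process; useful for starting small."""
    return list(map(func, items))


def lithops_map(func, items):
    """Run the same map step on Lithops (assumes Lithops is installed
    and configured with a backend)."""
    import lithops

    fexec = lithops.FunctionExecutor()
    fexec.map(func, items)
    return fexec.get_result()


# Hypothetical input files, created locally so the sketch can run anywhere.
workdir = tempfile.mkdtemp()
paths = []
for name, payload in [("a.bin", b"12345"), ("b.bin", b"123")]:
    path = os.path.join(workdir, name)
    with open(path, "wb") as f:
        f.write(payload)
    paths.append(path)

combined = combine(local_map(make_sidecar, paths))
print(combined["refs"])
```

Swapping `local_map` for `lithops_map` is the "scale up" step; on a remote backend the inputs would need to be object-store URLs rather than local paths, and `make_sidecar` would need to read them accordingly.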