Using Mamba to manage R dependencies on AWS Lambda

Using R in Lambda without a custom runtime and managing dependencies using Micromamba

In this post we explore a fast and reliable approach to running R on AWS Lambda when we need specific system dependencies that are not available on CentOS 7, which is the base for the AWS Lambda runtime when using container images.

This article is structured as a tutorial and builds gradually on increasingly complex lambda code. If you are here just for the really juicy bits, you can skip to Installing R Packages with Micromamba, which is the core of this article.

Requirements

First off, coming in you should have a basic knowledge of:

  1. AWS IAM, AWS Lambda and AWS ECR

  2. Docker build and push

  3. How to operate the AWS Console

  4. Python and R

Diving deeper into these topics is outside the scope of this article. We instead focus on the main problem of running R on AWS Lambda while managing dependencies with Mamba.

First things first

What we want to accomplish is running an R routine on AWS Lambda, not handling Lambda requests with an R script. That difference is fundamental.

The latter is a little more complex, since there is no official runtime or base container image and we would need to craft our own custom runtime for that. If that is absolutely necessary for you, there are a couple of projects I can refer you to:

  1. https://github.com/mdneuzerling/lambdr (container image)

  2. https://medium.com/bakdata/running-r-on-aws-lambda-9d40643551a6 (using lambda layers)

However, both require maintaining custom runtimes for R.

In this article, I opted for a thin lambda handler written in python, which just passes event data as arguments to an R script that parses those arguments using the r-argparse package.

The downside of this approach is obvious: we bundle an entire language interpreter with the final image that we don’t actually use for the core logic. However, the upside of not having to maintain our own runtime, benefiting from updates issued by AWS and using lambda in a standard, supported way far outweighs the downside.

Since lambda caps the final container image at 10GB, I would advise using those other approaches only if you actually hit this limit. Otherwise, stick to the supported way of running code on AWS.

Handling lambda the python way

The first thing we can do is to simply handle a lambda request using python and a container image. We will need:

  1. A lambda.py file with the handler function (the module name has to match the lambda.handler entry in the Dockerfile CMD)

  2. A Dockerfile describing the image

  3. An ECR repository to upload the image

  4. The lambda function

The code for this section can be found at https://github.com/gchamon/r-lambda-mamba-article/tree/simple-python-handler.

The simplest code you can deploy to AWS Lambda

The initial python code couldn’t be simpler:

def handler(event, context):
    print(event)

It’s just a single function that logs to console the contents of event.
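Before building any image, the handler can be exercised locally by calling it like an ordinary function. The sketch below assumes a made-up sample event and passes None as the context, which this handler never touches. Inside Lambda the runtime supplies both arguments.

```python
def handler(event, context):
    # Anything printed to stdout ends up in the function's log output.
    print(event)


if __name__ == "__main__":
    # Hypothetical sample event, for local experimentation only.
    sample_event = {"key1": "value1"}
    handler(sample_event, None)
```

Running the file with python prints the sample event, which is the same output we expect to see later in the lambda logs.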

To be able to run this on AWS, we need to create a container image using the official python 3.9 base image. For the sake of simplicity, we are going to add the entire project to ${LAMBDA_TASK_ROOT}:

FROM public.ecr.aws/lambda/python:3.9

# the lambda handler
COPY . ${LAMBDA_TASK_ROOT}

CMD ["lambda.handler"]

The AWS infrastructure

AWS IAM

First we need an AWS IAM user with programmatic access and with Lambda and ECR permissions. Its credentials are what the aws cli needs in order to log in to AWS ECR with docker.

Use the Access Key ID and Secret Access Key to configure your aws cli.

AWS ECR

We can then create the ECR repository, build the container image and push it to ECR. We have to make sure the ECR repository is populated before attempting to create the lambda function, otherwise it will fail.

We can then follow the push commands provided for the repository to log in to AWS ECR using docker, build the image and push it to the private repository.
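For reference, those push commands usually follow the pattern sketched below. The helper only assembles and prints the command strings, it does not run them. The account ID and region are placeholders, and ecr_push_commands is a hypothetical name used for illustration. The repository name r-lambda-mamba matches the image name used later in this article.

```python
def ecr_push_commands(account_id, region, repository, tag="latest"):
    """Return the shell commands to log in to ECR, then build, tag and push."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    image_uri = f"{registry}/{repository}:{tag}"
    return [
        # exchange the aws cli credentials for a temporary docker login
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        f"docker build -t {repository}:{tag} .",
        f"docker tag {repository}:{tag} {image_uri}",
        f"docker push {image_uri}",
    ]


if __name__ == "__main__":
    # Placeholder account ID and region, replace with your own.
    for command in ecr_push_commands("123456789012", "us-east-1", "r-lambda-mamba"):
        print(command)
```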

After following those steps you should have the image pushed to ECR.

AWS Lambda

Now we can create the actual lambda using the container image we pushed. Be sure to increase the function timeout to at least 10 seconds and the function memory to 1GB, so that it runs faster and with more memory headroom.

Testing the function with the default event should work, and the lambda should print the event to the log output.

We can now use this base skeleton to start developing our core R script.

Running a simple R Script

Let's now focus on running a simple R script without any package dependency.

For this we will need:

  1. The R Runtime installed in the container image

  2. A simple R script that will just print something to the console

  3. To call the R script from the lambda handler in python

The code for this section can be found at https://github.com/gchamon/r-lambda-mamba-article/tree/simple-r-script.

Installing the runtime

We can refer to https://github.com/mdneuzerling/lambdr for instructions on how to install the actual R Runtime, since the project also uses a base AWS Lambda image.

As of writing, the current R version is 4.2.1, so we will install that. We will also add two optimizations on top of what lambdr does to install R: squashing the installation of the system dependencies and of the R Runtime itself into a single RUN, and clearing the yum cache. Note that bzip2 has also been added, as we will need it later for micromamba:

FROM public.ecr.aws/lambda/python:3.9

ENV R_VERSION=4.2.1

RUN yum -y install \
      wget \
      git \
      tar \
      bzip2 \
  && yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm \
  && wget https://cdn.rstudio.com/r/centos-7/pkgs/R-${R_VERSION}-1-1.x86_64.rpm \
  && yum -y install R-${R_VERSION}-1-1.x86_64.rpm \
  && rm R-${R_VERSION}-1-1.x86_64.rpm \
  && yum clean all \
  && rm -rf /var/cache/yum


ENV PATH="${PATH}:/opt/R/${R_VERSION}/bin/"

# the lambda handler
COPY . ${LAMBDA_TASK_ROOT}

CMD ["lambda.handler"]

You can test that the R Runtime installation was successful by building the image and running Rscript --version in a container. Note that for this we need to clear out the default entrypoint for the lambda image:

$ docker run --rm --entrypoint='' r-lambda-mamba Rscript --version
Rscript (R) version 4.2.1 (2022-06-23)

Hello World!

Let's print some stuff to the console. Just to make things slightly more interesting, we will get rid of that annoying [1] that is added before any string printed with print:

#!/usr/bin/env Rscript

echo <- function(string) {
  cat(paste0(string, "\n"))
}

echo("Hello world")

Running the script from the lambda handler

To execute system binaries from python, with or without a shell, we can use subprocess.run:

import subprocess


def handler(event, context):
    subprocess.run(["./hello-world.R"], check=True)

Note that we are running hello-world.R directly. That is possible because the file has execution permissions (chmod +x hello-world.R) and a shebang to indicate the appropriate runtime.
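With check=True, subprocess.run raises CalledProcessError when the script exits with a non-zero code, which in turn makes the lambda invocation fail. If you would rather capture the script output and surface stderr in the error message, something along these lines could work. This is a sketch: run_and_capture is a hypothetical helper, and the Python interpreter stands in for ./hello-world.R so the example runs anywhere.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run a command and return its stdout, raising with stderr on failure."""
    try:
        completed = subprocess.run(
            command, check=True, capture_output=True, text=True
        )
    except subprocess.CalledProcessError as error:
        # check=True raises when the exit code is non-zero
        raise RuntimeError(
            f"{command[0]} exited with {error.returncode}: {error.stderr}"
        ) from error
    return completed.stdout


if __name__ == "__main__":
    # Stand-in for ["./hello-world.R"], which needs R installed.
    print(run_and_capture([sys.executable, "-c", "print('Hello world')"]))
```

Keep in mind that captured output no longer reaches the lambda logs on its own, so print the returned string if you still want to see it there.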

Experimenting with the code on AWS Lambda

Now it is time to pack it up and ship it. Following the push commands from the last section, we can update the AWS ECR repository with the new image containing the R Runtime.

The image is starting to get phat!

Now, let's publish a new version of the lambda and run it again so we can check if our world is properly greeted!

Installing R Packages with Micromamba

Now we can start adding package dependencies to our R Script skeleton.

We can illustrate this by installing and using r-argparse from Anaconda to handle input arguments in our R Script.

For this we will need:

  1. micromamba installed in our container image

  2. r-argparse installed with micromamba

  3. actual arguments to pass to the R Script

Source code for this section can be found here: https://github.com/gchamon/r-lambda-mamba-article/tree/add-micromamba.

Installing and using micromamba

Actually installing micromamba is pretty simple. We just have to download and unpack the self-contained binary. Micromamba is, according to its documentation, “a tiny version of mamba [written in] pure C++”.

Making micromamba AWS Lambda compatible

We will need to install it manually though, because we will need control over where it is deployed in the final image.

This is because the container image is not going to be run as is in AWS Lambda. The actual lambda runtime nukes certain folders from our final container image, like /home and /root, so during execution we can’t rely on a .bashrc being available in those locations. What we can do is make sure everything we need is in /opt/micromamba, which remains intact after being sent to AWS Lambda.

According to the docs, installing micromamba is as simple as downloading, unpacking and adding it to PATH. We will make two modifications to the documented installation procedure: the first is the custom install location described in the last paragraph, and the second is pinning the version of micromamba so we get reproducible container image builds.

Then we must initialize the environment for bash and copy the final .bashrc file to /opt/micromamba. We will use /opt/micromamba/env as the environment folder to ensure it remains intact during the lambda execution.
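If you want to see for yourself which locations are available at execution time, a throwaway handler that reports path existence is one option. The sketch below assumes the /opt/micromamba layout used in this article. report_paths is a hypothetical helper, not part of the article's repository.

```python
import os


def report_paths(paths):
    """Map each path to whether it exists in the running environment."""
    return {path: os.path.exists(path) for path in paths}


def handler(event, context):
    # Paths follow the layout used in this article; adjust if yours differs.
    status = report_paths([
        "/opt/micromamba/bin/micromamba",
        "/opt/micromamba/.bashrc",
        os.path.expanduser("~/.bashrc"),
    ])
    print(status)
    return status
```

Deploying this temporarily and comparing its output with a local docker run of the same image shows which files the lambda can actually rely on.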

Putting it together

To make sure we can actually use the installed dependencies during R execution, let's also write a separate, self-contained dependencies.R file that will attempt to require the necessary dependencies and break the image build in case it fails to do so:

#!/usr/bin/env Rscript

require_package_or_stop <- function(package_to_require) {
  result <- suppressMessages(suppressWarnings(
    require(package_to_require, character.only = TRUE, quietly = TRUE)
  ))
  if (!result) {
    stop(paste0("Failed to load ", package_to_require))
  }
}

packages_to_require <- c(
  "argparse"
)

for (package_to_require in packages_to_require) {
  require_package_or_stop(package_to_require)
}

Putting it all together, what we need to add to the Dockerfile is as follows:

# install micromamba
ENV MICROMAMBA_VERSION=0.25.0
ENV MICROMAMBA_INSTALL_FOLDER=/opt/micromamba
ENV PATH="$MICROMAMBA_INSTALL_FOLDER/bin:${PATH}"

RUN mkdir --parents $MICROMAMBA_INSTALL_FOLDER \
    && curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/$MICROMAMBA_VERSION | \
        tar -xvj --directory $MICROMAMBA_INSTALL_FOLDER bin/micromamba \
    && micromamba shell init -s bash -p $MICROMAMBA_INSTALL_FOLDER/env \
    && echo micromamba activate >> ~/.bashrc \
    && cp ~/.bashrc $MICROMAMBA_INSTALL_FOLDER

# install R dependencies with micromamba
COPY dependencies.R ${LAMBDA_TASK_ROOT}
RUN source $MICROMAMBA_INSTALL_FOLDER/.bashrc \
    && micromamba install --channel anaconda --channel conda-forge --channel r \
        r-argparse \
    && micromamba clean --all \
    && Rscript "${LAMBDA_TASK_ROOT}/dependencies.R"

Note that we explicitly set the channels anaconda, conda-forge and r when installing with micromamba. This is because packages can require other packages as dependencies that are not necessarily in their original channel. Some core packages can be found in anaconda and conda-forge, while many R packages are in the r channel. Explicitly declaring those channels tells micromamba to search for packages and dependencies on all of them.

It is also important to observe that after installing packages with micromamba we should purge its cache to reduce the final image size. That is what micromamba clean --all does.

Testing the package installed with micromamba

We can now use r-argparse in our R Script to parse incoming arguments. For this, let’s greet a variable number of subjects with a Hello:

#!/usr/bin/env Rscript

source("dependencies.R")

argument_parser <- ArgumentParser()
argument_parser$add_argument("--names-to-greet", type = "character", nargs = "+")
args <- argument_parser$parse_args()

echo <- function(string) {
  cat(paste0(string, "\n"))
}

for (name_to_greet in args$names_to_greet){
  echo(paste0("Hello ", name_to_greet, "!"))
}

And for the lambda handler part, let's agree to expect the following event pattern:

{
  "names_to_greet": ["World", "Reader", "Lambda"]
}

Then we can pass those names to the R Script. Note that we can't just pass -i to make the bash subprocess behave as an interactive shell, which would load .bashrc automatically, because .bashrc is not in the home folder. Since we also don't control the home folder and don't know it beforehand, we need to source .bashrc explicitly. Otherwise the mamba environment won't get activated and the packages will fail to load.

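import os
import shlex
import subprocess


def handler(event, context):
    # .bashrc lives in the micromamba folder, so we source it explicitly
    bashrc = os.path.join(os.environ["MICROMAMBA_INSTALL_FOLDER"], ".bashrc")
    # quote each name so spaces and shell metacharacters are passed literally
    names = " ".join(shlex.quote(name) for name in event["names_to_greet"])
    subprocess.run(
        ["bash", "-c",
         f"source {shlex.quote(bashrc)} && ./hello-world.R --names-to-greet {names}"],
        check=True
    )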
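The handler above trusts the shape of the event. Since --names-to-greet is declared with nargs = "+", the R script rejects an empty list, so it can help to validate the event before spawning a process. The function below is a sketch under that assumption. names_from_event is a hypothetical helper, not part of the article's repository.

```python
def names_from_event(event):
    """Return the names to greet, or raise ValueError if the event is malformed."""
    names = event.get("names_to_greet")
    if not isinstance(names, list) or not names:
        raise ValueError("'names_to_greet' must be a non-empty list")
    if not all(isinstance(name, str) and name for name in names):
        raise ValueError("every entry in 'names_to_greet' must be a non-empty string")
    return names


if __name__ == "__main__":
    print(names_from_event({"names_to_greet": ["World", "Reader", "Lambda"]}))
```

Calling it at the top of the handler turns a malformed event into a clear Python error instead of an argparse failure buried in the subprocess output.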

After building and deploying the new image, we can test it with the sample event JSON:

Conclusion

Using this article as a starting point, the reader can build more interesting examples, using raster for instance, or machine learning packages without having to worry about compatibility between R Packages and system dependencies.

Interestingly, after adding micromamba and installing a single package with it, the final image gets real chonky.

This is probably because mamba (and, incidentally, conda) installs not only the explicitly requested packages and their immediate dependencies, but also the adjacent system dependencies.

This is the main tradeoff of this approach: we sacrifice image size for image build speed and package compatibility. This is what makes it possible to run up-to-date packages like r-raster with their correct dependencies, like GDAL, GEOS and PROJ, all of which are only available in ancient versions directly in CentOS 7.

(cover photo by David Clode on Unsplash)
