Using Cromwell with MMBatch
This step-by-step guide deploys a Cromwell setup with AWS Batch as the compute resource.
Preparation
MMAB
The Memory Machine AWS Batch Engine (MMAB) allows for central configuration and visibility of distributed Memory Machine Engines.
You should have a small instance running with the `mmab` service installed and started.
To enable snapshots for all AWS Batch jobs, connect to the API and issue the following command.
curl -sk -X PUT http://localhost:8080/api/v1/ckptConfig \
  -H "Content-Type: application/json" \
  -d '{"ckptMode":"iterative","ckptImagePath":"/mmc-checkpoint","ckptInterval":120000000000,"rootFSDiff":true,"diagnosisMode":true,"ckptOnSigTerm":true}' | jq .
Cognito integration
If you use Amazon Cognito to log in to the management console, you need to pass (and refresh) a token to access the API. We are going to update the documentation to reflect this use case.
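A minimal sketch of calling the API with such a token, assuming the MMAB API accepts the Cognito token as a standard HTTP bearer token (the header name and token handling are assumptions, not confirmed behavior):

# Assumption: the MMAB API accepts the Cognito ID/access token as a Bearer token.
TOKEN="<your-cognito-token>"   # obtain and refresh this via your Cognito login flow
curl -sk -X PUT http://localhost:8080/api/v1/ckptConfig \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"ckptMode":"iterative","ckptImagePath":"/mmc-checkpoint","ckptInterval":120000000000,"rootFSDiff":true,"diagnosisMode":true,"ckptOnSigTerm":true}' | jq .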
Deploy CloudFormation stack
To get started we'll deploy a CloudFormation template (CFT) to set up:
- AWS Batch queue with two compute environments (spot, on-demand)
- cromwell-specific bucket
- necessary IAM roles to run AWS Batch jobs.
First, go to the CloudFormation Dashboard.
(Screenshots: CloudFormation home screen and the stack template selection.)
Hit **Create Stack** on the home screen and keep the defaults **Choose an existing template** as well as **Amazon S3 URL**.
Copy and paste the following URL on the next screen:
https://cromwell-aws-cloudformation-templates.s3.eu-west-1.amazonaws.com/root-templates/gwfcore-root.template.yaml
Next, enter the stack details:
- **S3 Bucket Name**: Paste the name of the bucket you want to use.
- **Existing Bucket?**: In case you did not prepare a bucket beforehand, choose **NO**.
(Screenshot: stack details, part 1.)
Network details
(Screenshot: stack details, part 2: network settings.)
Hit **Next** at the bottom of the page. We keep the defaults for the rest of the details.
EFS Filesystem
We need a file system that enables all instances to access the checkpoints made by the Memory Machine Engine. Thus, we'll enable EFS creation within the template.
It is going to be mounted under `/mnt/efs`.
Stack Options
Acknowledge the capabilities and hit **Next**.
Info
Make sure to select **Preserve successfully provisioned resources**, as we expect a failure to happen which we will need to mitigate later.
(Screenshots: the preserve-resources option and the capability acknowledgement.)
Review and Submit
After you have reviewed your stack setup, hit **Submit**.
On the next screen you can follow along in the **Timeline view**. Even though the `CodeStack` did not complete, we are good to go.
Set Parameter
Because of the failing task, we need to create a parameter ourselves. Please go to the Parameter Store and create a new parameter (a CLI alternative is sketched below the list):
- **Name**: Should correspond with your stack name: `/gwfcore/${StackName}/installed-artifacts/s3-root-url`
- **Value**: `s3://cromwell-aws-cloudformation-templates/artifacts`
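If you prefer the CLI, a minimal sketch of the same step (replace `<StackName>` with the name of your stack):

# Create the parameter that the failing task would normally have created.
aws ssm put-parameter \
  --name "/gwfcore/<StackName>/installed-artifacts/s3-root-url" \
  --value "s3://cromwell-aws-cloudformation-templates/artifacts" \
  --type String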
Security Group Adjustments
Your MMAB instance needs to be accessible via HTTPS from the worker nodes. Please add the newly created security group `<stackName>-BatchStack-XYZ` to the inbound rules of the `mmab` instance.

If restricting the rule to that security group is not enough, you may need to allow `Anywhere-IPv4` instead.
Launch Template Adjustments
To enable MMEngine we need to install the Memory Machine Engine on the instances that are started within the AWS Batch environment. This is done by adjusting the LaunchTemplates for the Compute Resources.
Head over to EC2 - LaunchTemplates, mark the (most likely) first one in the list, and create a new version.
Add Keypair
If you want to log into the worker node to have a look around, please add an SSH key pair.
Scroll down to **Advanced details** and expand it. Within the expanded section, scroll down to the very end.
Add the following install snippet just before the `# enable ecs, docker and autoscaling` section.
Please replace `$address` with the IP address of your `mmab` instance.
After you have saved the new version, click the `lt_ID` at the top of the confirmation page.
Set the default version to the latest one by using **Set default version**.
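Alternatively, the default version can be set via the AWS CLI; a sketch assuming your launch template ID and the new version number:

# Replace the template ID and the version number with your own values.
aws ec2 modify-launch-template \
  --launch-template-id <lt_ID> \
  --default-version <latest-version-number>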
Put Launch Template to work
Now that we have changed the LaunchTemplate, we need to apply it to the Compute Environment (CE) in AWS Batch.
The LaunchTemplate is applied (and copied into a CE LaunchTemplate) when a new CE is created.
Thus, head over to the Compute Environment Listing, select the spot queue, and hit **Clone**.
Change the name (append `-new`) and check the LaunchTemplate config on the next page. Since we set the default to the new version, it will pick up the modified LaunchTemplate.
(Screenshots: CE name and the $Default launch template setting.)
Continue with the defaults until you have finished the cloning process.
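To double-check which launch template version the cloned CE picked up, a sketch using the AWS CLI (the CE name is an assumption; use the name you chose while cloning):

# Shows the launch template ID and version the compute environment was created with.
aws batch describe-compute-environments \
  --compute-environments <ceName>-new \
  --query 'computeEnvironments[0].computeResources.launchTemplate'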
To finalise the modification, we need to change the job queue to use the new CE. Head over to the Job Queue Listing, select the default queue, and hit **Edit**.
Connect the `-new` Compute Environment and disconnect the old one.
Now we are done; copy the ARN of the job queue for the next step.
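The ARN can also be fetched via the AWS CLI; a sketch assuming the queue name (replace it with the name of your default queue):

# Prints the ARN of the given job queue.
aws batch describe-job-queues \
  --job-queues <queueName> \
  --query 'jobQueues[0].jobQueueArn' --output text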
Example Workflow
The example workflow is going to print 90 iterations with a 10-second delay in between.
workflow helloWorld {
  String name
  call sayHello { input: name=name }
}

task sayHello {
  String name
  command {
    for i in $(seq 1 90); do
      printf "[cromwell-say-hello] Iteration $i: hello to ${name} on $(date)\n"
      sleep 10
    done
  }
  output {
    String out = read_string(stdout())
  }
  runtime {
    docker: "archlinux:latest"
    maxRetries: 3
  }
}
The input file just holds a string:
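The exact file is not reproduced here; a minimal sketch that matches the log output further below (the name `Developer` is an assumption taken from those logs):

# hello.json provides the single workflow input.
cat > hello.json <<'EOF'
{
  "helloWorld.name": "Developer"
}
EOF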
The configuration file `cromwell-batch-engine.conf` has some variables you need to replace:
- `$BUCKETNAME` with your work directory bucket (2x in the file)
- `$queueArn` with the ARN of the queue to run the workflow
- `$region` with the region of your S3 bucket
include required(classpath("application"))

aws {
  application-name = "cromwell"
  auths = [
    {
      name = "default"
      scheme = "default"
    }
  ]
  region = "$region"
}

engine {
  filesystems {
    s3.auth = "default"
  }
}

backend {
  default = "AWSBatch"
  providers {
    AWSBatch {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        numSubmitAttempts = 3
        numCreateDefinitionAttempts = 3
        // A reference to an auth defined in the `aws` stanza at the top. This auth is used to create
        // Jobs and manipulate auth JSONs.
        auth = "default"
        // Base bucket for workflow executions
        root = "s3://$BUCKETNAME/cromwell-wd"
        default-runtime-attributes {
          queueArn: "$queueArn"
          scriptBucketName: "$BUCKETNAME"
        }
        filesystems {
          s3 {
            // A reference to a potentially different auth for manipulating files via engine functions.
            auth = "default"
          }
        }
      }
    }
  }
}
java -Dconfig.file=cromwell-batch-engine.conf -jar cromwell-87.jar run hello.wdl --inputs hello.json
After you have run the command above, you'll see a job appearing in the AWS Batch Dashboard.
A runnable job in a queue will trigger the AWS Batch Scheduler to act. Depending on the order of CEs, the Scheduler will create an Auto Scaling Group (ASG). You can see it happening by heading over to the EC2 - ASG Listing.
Eventually, an EC2 Spot instance will be started (EC2 Instance Listing).
Change the Job Definition
To pick up the checkpoints, we need to adjust the job definition by adding a retry count.
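A sketch of this step with the AWS CLI (the job definition name is an assumption, and all container properties must be copied from the current revision into container-props.json):

# Look up the current revision and copy its containerProperties into container-props.json.
aws batch describe-job-definitions --job-definition-name <jobDefName> --status ACTIVE
# Register a new revision with a retry strategy of 3 attempts.
aws batch register-job-definition \
  --job-definition-name <jobDefName> \
  --type container \
  --retry-strategy attempts=3 \
  --container-properties file://container-props.json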
After the new revision is created, you can submit the job again. Now it will pick up the changes and start checkpointing.
java -Dconfig.file=cromwell-batch-engine.conf -jar cromwell-87.jar run hello.wdl --inputs hello.json
Once the job runs, it will create checkpoints under the `BATCH_JOB_ID`:
$ ls /mmc-checkpoint/61756233-eef9-43c1-aa0c-f3b2b9e536d0/0/
dump.log inventory.img irmap-cache pagemap-1.img pagemap-61.img pagemap-62.img pagemap-63.img pagemap-88.img pages-1.img pages-2.img pages-3.img pages-4.img pages-5.img stats-dump
The job itself will produce for-loop output like this:
+ for i in $(seq 1 90)
++ date
+ printf '[cromwell-say-hello] Iteration 59: hello to Developer on Mon Jan 20 09:13:09 UTC 2025\n'
+ sleep 10
[cromwell-say-hello] Iteration 59: hello to Developer on Mon Jan 20 09:13:09 UTC 2025
Interrupt
Once the job is running and checkpoints are created, we can simulate a Spot interruption event to trigger a restore.
The MMEngine on the EC2 instance will catch the interruption message.
==> /var/log/memverge/pagent.log <==
time="2025-01-16T07:57:08.179Z" level=warning msg="the spot instance is going to be interrupted"
time="2025-01-16T07:57:08.179Z" level=warning msg="triggers final checkpoint for all containers"
Once the interruption has gone through, a new instance will be started to retry the job.
But instead of starting from scratch, MMAB will restore from the previously made checkpoint.
==> /var/log/memverge/mmrunc.log <==
{"level":"info","msg":"(JobID: 0aafd84f-c059-43d3-8a52-6eed4bd33313) Successfully restored container c743780e22cb3fee48f61bb303738d816f9f0d11110935764da4e151edc84731 from /mmc-checkpoint/0aafd84f-c059-43d3-8a52-6eed4bd33313","time":"2025-01-20T09:21:06Z"}
The log reflects that the container did not start from scratch.
$ docker logs c743780e22cb
+ for i in $(seq 1 90)
++ date
+ printf '[cromwell-say-hello] Iteration 60: hello to Developer on Mon Jan 20 09:21:08 UTC 2025\n'
+ sleep 10
[cromwell-say-hello] Iteration 60: hello to Developer on Mon Jan 20 09:21:08 UTC 2025
Teardown
To remove the stack, you first need to remove the Batch security group (added earlier) from the inbound rules of the `mmab-https` security group. Otherwise the deletion will get stuck.
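Afterwards the stack can be deleted from the CloudFormation console or via the CLI; a sketch assuming the stack name gwfcore-root:

# Delete the stack and wait for the deletion to finish.
aws cloudformation delete-stack --stack-name gwfcore-root
aws cloudformation wait stack-delete-complete --stack-name gwfcore-root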