Using Cromwell with MMBatch
This step-by-step guide deploys a Cromwell setup with AWS Batch as the compute resource.
Preparation
MMAB
The Memory Machine AWS Batch Engine (MMAB) allows for central configuration and visibility of distributed Memory Machine Engines.
You should have a small instance running with the `mmab` service installed and started.
To enable snapshots for all AWS Batch jobs, connect to the API and issue the following command.
curl -sk -X PUT http://localhost:8080/api/v1/ckptConfig \
  -H "Content-Type: application/json" \
  -d '{"ckptMode":"iterative","ckptImagePath":"/mmc-checkpoint","ckptInterval":120000000000,"rootFSDiff":true,"diagnosisMode":true,"ckptOnSigTerm":true}' | jq .
Cognito integration
If you use Amazon Cognito to log in to the management console, you need to pass (and refresh) a token to access the API. We are going to update the documentation to reflect this use case.
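A minimal sketch of calling the API with such a token, assuming the MMAB API accepts the Cognito token as a standard HTTP bearer token (the header name and token handling are assumptions, not confirmed behavior):

# Assumption: the MMAB API accepts the Cognito ID/access token as a Bearer token.
TOKEN="<your-cognito-token>"   # obtain and refresh this via your Cognito login flow
curl -sk -X PUT http://localhost:8080/api/v1/ckptConfig \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"ckptMode":"iterative","ckptImagePath":"/mmc-checkpoint","ckptInterval":120000000000,"rootFSDiff":true,"diagnosisMode":true,"ckptOnSigTerm":true}' | jq .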
Deploy CloudFormation stack
To get started we'll deploy a CloudFormation template (CFT) to set up:
- AWS Batch queue with two compute environments (spot, on-demand)
- cromwell-specific bucket
- necessary IAM roles to run AWS Batch jobs.
First, go to the CloudFormation Dashboard.
(Screenshots: CloudFormation home screen and the stack template selection.)
Hit **Create Stack** on the home screen and keep the defaults **Choose an existing template** as well as **Amazon S3 URL**.
Copy and paste the following URL on the next screen:
https://cromwell-aws-cloudformation-templates.s3.eu-west-1.amazonaws.com/root-templates/gwfcore-root.template.yaml
Next, enter the stack details:
- **S3 Bucket Name**: Paste the name of the bucket you want to use.
- **Existing Bucket?**: In case you did not prepare a bucket beforehand, choose **NO**.
(Screenshot: stack details, part 1.)
Network details
(Screenshot: stack details, part 2: network settings.)
Hit **Next** at the bottom of the page. We keep the defaults for the rest of the details.
EFS Filesystem
We need a file system that enables all instances to access the checkpoints made by the Memory Machine Engine. Thus, we'll enable EFS creation within the template.
It is going to be mounted under `/mnt/efs`.
Stack Options
Acknowledge the capabilities and hit **Next**.
Info
Make sure to select **Preserve successfully provisioned resources**, as we expect a failure to happen which we will need to mitigate later.
(Screenshots: the preserve-resources option and the capability acknowledgement.)
Review and Submit
After you have reviewed your stack setup, hit **Submit**.
On the next screen you can follow along in the **Timeline view**. Even though the `CodeStack` did not complete, we are good to go.
Set Parameter
Because of the failing task, we need to create a parameter ourselves. Please go to the Parameter Store and create a new parameter (a CLI alternative is sketched below the list):
- **Name**: Should correspond with your stack name: `/gwfcore/${StackName}/installed-artifacts/s3-root-url`
- **Value**: `s3://cromwell-aws-cloudformation-templates/artifacts`
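If you prefer the CLI, a minimal sketch of the same step (replace `<StackName>` with the name of your stack):

# Create the parameter that the failing task would normally have created.
aws ssm put-parameter \
  --name "/gwfcore/<StackName>/installed-artifacts/s3-root-url" \
  --value "s3://cromwell-aws-cloudformation-templates/artifacts" \
  --type String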
Security Group Adjustments
Your MMAB instance needs to be accessible via HTTPS from the worker nodes. Please add the newly created security group `<stackName>-BatchStack-XYZ` to the inbound rules of the `mmab` instance.

If restricting the rule to that security group is not enough, you may need to allow `Anywhere-IPv4` instead.
Launch Template Adjustments
To enable MMEngine we need to install the Memory Machine Engine on the instances that are started within the AWS Batch environment. This is done by adjusting the LaunchTemplates for the Compute Resources.
Head over to EC2 - LaunchTemplates, mark the (most likely) first one in the list, and create a new version.
Add Keypair
If you want to log into the worker node to have a look around, please add an SSH key pair.
Scroll down to **Advanced details** and expand it. Within the expanded section, scroll down to the very end.
Add the following install snippet just before the `# enable ecs, docker and autoscaling` section.
Please replace `$address` with the IP address of your `mmab` instance.
After you have saved the new version, click the `lt_ID` at the top of the confirmation page.
Set the default version to the latest one by using **Set default version**.
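Alternatively, the default version can be set via the AWS CLI; a sketch assuming your launch template ID and the new version number:

# Replace the template ID and the version number with your own values.
aws ec2 modify-launch-template \
  --launch-template-id <lt_ID> \
  --default-version <latest-version-number>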
Put Launch Template to work
Now that we have changed the LaunchTemplate, we need to apply it to the Compute Environment (CE) in AWS Batch.
The LaunchTemplate is applied (and copied into a CE LaunchTemplate) when a new CE is created.
Thus, head over to the Compute Environment Listing, select the spot queue, and hit **Clone**.
Change the name (append `-new`) and check the LaunchTemplate config on the next page. Since we set the default to the new version, it will pick up the modified LaunchTemplate.
(Screenshots: CE name and the $Default launch template setting.)
Continue with the defaults until you have finished the cloning process.
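To double-check which launch template version the cloned CE picked up, a sketch using the AWS CLI (the CE name is an assumption; use the name you chose while cloning):

# Shows the launch template ID and version the compute environment was created with.
aws batch describe-compute-environments \
  --compute-environments <ceName>-new \
  --query 'computeEnvironments[0].computeResources.launchTemplate'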
To finalise the modification, we need to change the job queue to use the new CE. Head over to the Job Queue Listing, select the default queue, and hit **Edit**.
Connect the `-new` Compute Environment and disconnect the old one.
Now we are done; copy the ARN of the job queue for the next step.
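The ARN can also be fetched via the AWS CLI; a sketch assuming the queue name (replace it with the name of your default queue):

# Prints the ARN of the given job queue.
aws batch describe-job-queues \
  --job-queues <queueName> \
  --query 'jobQueues[0].jobQueueArn' --output text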
Example Workflow
The example workflow is going to print 90 iterations with a 10-second delay in between.
workflow helloWorld {
  String name
  call sayHello { input: name=name }
}

task sayHello {
  String name
  command {
    for i in $(seq 1 90); do
      printf "[cromwell-say-hello] Iteration $i: hello to ${name} on $(date)\n"
      sleep 10
    done
  }
  output {
    String out = read_string(stdout())
  }
  runtime {
    docker: "archlinux:latest"
    maxRetries: 3
  }
}
The input file just holds a string:
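The exact file is not reproduced here; a minimal sketch that matches the log output further below (the name `Developer` is an assumption taken from those logs):

# hello.json provides the single workflow input.
cat > hello.json <<'EOF'
{
  "helloWorld.name": "Developer"
}
EOF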
The configuration file `cromwell-batch-engine.conf` has some variables you need to replace:
- `$BUCKETNAME` with your work directory bucket (2x in the file)
- `$queueArn` with the ARN of the queue to run the workflow
- `$region` with the region of your S3 bucket
include required(classpath("application"))

aws {
  application-name = "cromwell"
  auths = [
    {
      name = "default"
      scheme = "default"
    }
  ]
  region = "$region"
}

engine {
  filesystems {
    s3.auth = "default"
  }
}

backend {
  default = "AWSBatch"
  providers {
    AWSBatch {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        numSubmitAttempts = 3
        numCreateDefinitionAttempts = 3
        // A reference to an auth defined in the `aws` stanza at the top. This auth is used to create
        // Jobs and manipulate auth JSONs.
        auth = "default"
        // Base bucket for workflow executions
        root = "s3://$BUCKETNAME/cromwell-wd"
        default-runtime-attributes {
          queueArn: "$queueArn"
          scriptBucketName: "$BUCKETNAME"
        }
        filesystems {
          s3 {
            // A reference to a potentially different auth for manipulating files via engine functions.
            auth = "default"
          }
        }
      }
    }
  }
}
java -Dconfig.file=cromwell-batch-engine.conf -jar cromwell-87.jar run hello.wdl --inputs hello.json
After you have run the command above, you'll see a job appearing in the AWS Batch Dashboard.
A runnable job in a queue will trigger the AWS Batch Scheduler to act. Depending on the order of CEs, the Scheduler will create an Auto Scaling Group (ASG). You can see it happening by heading over to the EC2 - ASG Listing.
Eventually, an EC2 Spot instance will be started (EC2 Instance Listing).
Change the Job Definition
To pick up the checkpoints, we need to adjust the job definition by adding a retry count.
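A sketch of this step with the AWS CLI (the job definition name is an assumption, and all container properties must be copied from the current revision into container-props.json):

# Look up the current revision and copy its containerProperties into container-props.json.
aws batch describe-job-definitions --job-definition-name <jobDefName> --status ACTIVE
# Register a new revision with a retry strategy of 3 attempts.
aws batch register-job-definition \
  --job-definition-name <jobDefName> \
  --type container \
  --retry-strategy attempts=3 \
  --container-properties file://container-props.json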
After the new revision is created, you can submit the job again. Now it will pick up the changes and start checkpointing.
java -Dconfig.file=cromwell-batch-engine.conf -jar cromwell-87.jar run hello.wdl --inputs hello.json
Once the job runs, it will create checkpoints under the `BATCH_JOB_ID`:
$ ls /mmc-checkpoint/61756233-eef9-43c1-aa0c-f3b2b9e536d0/0/
dump.log inventory.img irmap-cache pagemap-1.img pagemap-61.img pagemap-62.img pagemap-63.img pagemap-88.img pages-1.img pages-2.img pages-3.img pages-4.img pages-5.img stats-dump
The job itself will produce for-loop output like this:
+ for i in $(seq 1 90)
++ date
+ printf '[cromwell-say-hello] Iteration 59: hello to Developer on Mon Jan 20 09:13:09 UTC 2025\n'
+ sleep 10
[cromwell-say-hello] Iteration 59: hello to Developer on Mon Jan 20 09:13:09 UTC 2025
Interrupt
Once the job is running and checkpoints are created, we can simulate a Spot interruption event to trigger a restore.
The MMEngine on the EC2 instance will catch the interruption message.
==> /var/log/memverge/pagent.log <==
time="2025-01-16T07:57:08.179Z" level=warning msg="the spot instance is going to be interrupted"
time="2025-01-16T07:57:08.179Z" level=warning msg="triggers final checkpoint for all containers"
Once the interruption has gone through, a new instance will be started to retry the job.
But instead of starting from scratch, MMAB will restore from the previously made checkpoint.
==> /var/log/memverge/mmrunc.log <==
{"level":"info","msg":"(JobID: 0aafd84f-c059-43d3-8a52-6eed4bd33313) Successfully restored container c743780e22cb3fee48f61bb303738d816f9f0d11110935764da4e151edc84731 from /mmc-checkpoint/0aafd84f-c059-43d3-8a52-6eed4bd33313","time":"2025-01-20T09:21:06Z"}
The log reflects that the container did not start from scratch.
$ docker logs c743780e22cb
+ for i in $(seq 1 90)
++ date
+ printf '[cromwell-say-hello] Iteration 60: hello to Developer on Mon Jan 20 09:21:08 UTC 2025\n'
+ sleep 10
[cromwell-say-hello] Iteration 60: hello to Developer on Mon Jan 20 09:21:08 UTC 2025
Teardown
To remove the stack, you first need to remove the Batch security group (added earlier) from the inbound rules of the `mmab-https` security group. Otherwise the deletion will get stuck.
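Afterwards the stack can be deleted from the CloudFormation console or via the CLI; a sketch assuming the stack name gwfcore-root:

# Delete the stack and wait for the deletion to finish.
aws cloudformation delete-stack --stack-name gwfcore-root
aws cloudformation wait stack-delete-complete --stack-name gwfcore-root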