Using miniWDL with MMBatch
Configure MMBatch Management Server
The MMBatch Management Server (MMS) should be configured like this:
"ckptInterval": 120000000000
: creates a checkpoint every 2 min (the value is a duration in nanoseconds). You might want to increase this to 15 min in production.
"ckptOnSigTerm": false
: this feature is used in Kubernetes (K8s) deployments.
For more details please refer to the Config Reference.
curl -sk -X PUT http://localhost:8080/api/v1/ckptConfig \
-H "Content-Type: application/json" \
-d '{"ckptMode":"iterative","ckptImagePath":"/mmc-checkpoint","ckptInterval":120000000000,"rootFSDiff":true,"diagnosisMode":true,"ckptOnSigTerm":false}'|jq .
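Since ckptInterval is a duration in nanoseconds, the values can be derived with plain shell arithmetic; a quick sketch for the 2 min default above and the suggested 15 min production setting:

```shell
# ckptInterval is expressed in nanoseconds: seconds * 10^9.
# 2 minutes (the value used in this tutorial):
echo $((2 * 60 * 1000000000))    # 120000000000
# 15 minutes (suggested for production):
echo $((15 * 60 * 1000000000))   # 900000000000
```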
Deploy miniWDL Environment
We'll use the miniwdl-aws-terraform stack to deploy an environment on AWS to run miniWDL in.
Install Terraform
Please find the install instructions here
Apply Terraform
After terraform is installed, we are going to clone the upstream repository and initialize Terraform.
$ git clone https://github.com/miniwdl-ext/miniwdl-aws-terraform.git
Cloning into 'miniwdl-aws-terraform'...
remote: Enumerating objects: 78, done.
remote: Counting objects: 100% (78/78), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 78 (delta 43), reused 54 (delta 24), pack-reused 0 (from 0)
Receiving objects: 100% (78/78), 20.31 KiB | 990.00 KiB/s, done.
Resolving deltas: 100% (43/43), done.
$ cd miniwdl-aws-terraform
$ terraform init
Initializing the backend...
Initializing provider plugins...
- Finding latest version of hashicorp/aws...
- Finding latest version of hashicorp/cloudinit...
- Installing hashicorp/aws v5.84.0...
- Installed hashicorp/aws v5.84.0 (signed by HashiCorp)
- Installing hashicorp/cloudinit v2.3.5...
- Installed hashicorp/cloudinit v2.3.5 (signed by HashiCorp)
Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
Apply
Apply the Terraform template with your owner_tag and s3upload_buckets. Note that we use miniwdl as the environment tag, which affects the names of the resources created by this template, such as compute environment names, queue names, and launch template names.
Pass the following variables.
environment_tag=miniwdl
: tag that will be used for all resources created
owner_tag=me@example.com
: tag to identify the owner of the resources
ssh_keyname=<keyname>
: (optional) if you would like to SSH into the worker nodes, please provide a key pair name that is available in the region
s3upload_buckets=["MY-BUCKET"]
: please use a bucket in the region you are deploying the stack in
mmab_server=http://WW.XX.YY.ZZ:8080
: IP address of the management server
terraform apply \
-var='environment_tag=miniwdl' \
-var='owner_tag=me@example.com' \
-var='ssh_keyname=my-keyname' \
-var='s3upload_buckets=["MY-BUCKET"]' \
-var='mmab_server=http://WW.XX.YY.ZZ:8080'
Once applied, you should see output like this.
Apply complete! Resources: 12 added, 0 changed, 0 destroyed.
Outputs:
fs = "fs-03XYZ1"
fsap = "fsap-0aXYZ1"
security_group = "sg-0XYZ1"
subnets = [
"subnet-0eXYZ1",
"subnet-0aXYZ2",
"subnet-00XYZ3",
]
workflow_queue = "miniwdl-workflow"
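These output values can be read back later with the standard terraform output command, which is handy for scripting against the stack (e.g. fetching the queue name for miniwdl-aws-submit):

```shell
# Print a single output value without quotes.
terraform output -raw workflow_queue
# Dump all outputs as JSON for scripting.
terraform output -json | jq .
```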
Run MiniWDL
Install MiniWDL Plugin
MiniWDL does not normally use AWS Batch job attempts. Instead, it creates a new AWS Batch job whenever a job fails. To enable checkpoint/restore with MMBatch, we created a miniWDL plugin that assigns each job an environment variable that persists across retries, so a retried job can be identified even though the AWS Batch job ID changes (because it is a new job).
To use the plugin, please use our public image, either via an environment variable or a flag to miniwdl-aws.
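A sketch of both options. Note that the flag name, the environment-variable name, and the image URI below are placeholders/assumptions, not confirmed API; check `miniwdl-aws-submit --help` and the MMBatch documentation for the exact spelling:

```shell
# PUBLIC-MMBATCH-IMAGE is a placeholder for the public plugin image.

# Option 1: pass the image on the command line (flag name assumed).
miniwdl-aws-submit hello.wdl --workflow-queue miniwdl-workflow \
  --image PUBLIC-MMBATCH-IMAGE name=world

# Option 2: export it as an environment variable (variable name assumed),
# so it does not have to be repeated on every submission.
export MINIWDL__AWS__WORKFLOW_IMAGE=PUBLIC-MMBATCH-IMAGE
miniwdl-aws-submit hello.wdl --workflow-queue miniwdl-workflow name=world
```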
Test Env
Let's create a simple WDL workflow.
workflow helloWorld {
    String name
    call sayHello { input: name=name }
}

task sayHello {
    String name
    command {
        for i in $(seq 1 30); do
            printf "# Iteration $i: hello to ${name} on $(date)\n"
            sleep 10
        done
    }
    output {
        String out = read_string(stdout())
    }
    runtime {
        docker: "archlinux:latest"
        maxRetries: 3
    }
}
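Before submitting, the workflow can be validated locally with miniwdl's built-in checker (part of the standard miniwdl CLI):

```shell
# Parse and statically check the WDL; reports errors and lint warnings.
miniwdl check hello.wdl
```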
Submit workflow
$ miniwdl-aws-submit hello.wdl --workflow-queue miniwdl-workflow name=world
2025-01-23 11:30:16.978 miniwdl-zip hello.wdl <= /Users/kniepbert/data/temp/memverge/miniwdl/hello.wdl
2025-01-23 11:30:16.979 miniwdl-zip Prepare archive /var/folders/tg/x8qd961x4xq98g35631w4t0r0000gn/T/miniwdl_zip_jz__13sj/hello.wdl.zip from directory /var/folders/tg/x8qd961x4xq98g35631w4t0r0000gn/T/miniwdl_zip_rdt43szb
2025-01-23 11:30:16.980 miniwdl-zip Move archive to destination /var/folders/tg/x8qd961x4xq98g35631w4t0r0000gn/T/tmp5wqw7sno/hello.wdl.zip
Interrupt Instance
Once an instance is running and checkpoints have been taken, head over to the Spot Request Console and initiate an interruption to create the interruption event.
Within the /var/log/memverge/pagent.log log file you'll see the event being captured.
time="2025-02-04T12:04:04.23Z" level=warning msg="the spot instance is going to be interrupted"
time="2025-02-04T12:04:04.23Z" level=warning msg="triggers final checkpoint for all containers"
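These events can be watched for with a simple filter over the agent log. The snippet below embeds the two log lines from above so it is self-contained; on a real worker node you would point grep (or tail -f) at /var/log/memverge/pagent.log instead:

```shell
# Self-contained sample: the two warning lines captured above.
cat > /tmp/pagent-sample.log <<'EOF'
time="2025-02-04T12:04:04.23Z" level=warning msg="the spot instance is going to be interrupted"
time="2025-02-04T12:04:04.23Z" level=warning msg="triggers final checkpoint for all containers"
EOF

# Show only warning-level events (on a node: tail -f the real log).
grep 'level=warning' /tmp/pagent-sample.log
```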
This will trigger the finalization of the checkpoint and freeze all processes within the container.
Result
You can observe what is going on by connecting to the worker instance, using docker exec to enter the container, and inspecting stdout.txt.
The 5 min gap within the resulting stdout represents the job restarting on another instance. Since the container is paused as soon as the 2 min spot warning is issued on the host, those first two minutes are lost in terms of walltime.
$ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/key -l ec2-user 52.221.253.165
Warning: Permanently added '52.221.253.165' (ED25519) to the list of known hosts.
, #_
~\_ ####_
~~ \_#####\
~~ \###|
~~ \#/ ___ Amazon Linux 2023 (ECS Optimized)
~~ V~' '->
~~~ /
~~._. _/
_/ _/
_/m/'
For documentation, visit http://aws.amazon.com/documentation/ecs
Last login: Tue Feb 4 11:46:32 2025 from 109.42.240.239
[ec2-user@ip-10-0-6-9 ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7666cb568fc7 archlinux:latest "/bin/bash -ec 'cd /…" 8 minutes ago Up 8 minutes ecs-sayHello-usbqfpz5-1-default-aac4f4a8e3ffeec8e501
4f967d214787 amazon/amazon-ecs-agent:latest "/agent" 9 minutes ago Up 9 minutes (healthy) ecs-agent
[ec2-user@ip-10-0-6-9 ~]$ docker exec -ti 7666cb568fc7 bash
[root@ip-10-0-14-149 /]# ls /mnt/efs/miniwdl_run/
20250204_113345_helloWorld/ _CACHE/ _LAST/
[root@ip-10-0-14-149 /]# ls /mnt/efs/miniwdl_run/20250204_113345_helloWorld/
call-sayHello/ inputs.json workflow.log
[root@ip-10-0-14-149 /]# ls /mnt/efs/miniwdl_run/20250204_113345_helloWorld/call-sayHello/
awsBatchJobDetail.11753a0b-58fc-44ad-96fe-2e5f6b402a64.json inputs.json task.log
awsBatchJobDetail.8526a0d1-a3e8-434a-a4e1-37dae3605b37.json stderr.txt work/
command stdout.txt
[root@ip-10-0-14-149 /]# cat /mnt/efs/miniwdl_run/20250204_113345_helloWorld/call-sayHello/stdout.txt
# Iteration 1: hello to world on Tue Feb 4 11:34:03 UTC 2025
*snip*
# Iteration 49: hello to world on Tue Feb 4 11:42:03 UTC 2025
# Iteration 50: hello to world on Tue Feb 4 11:42:13 UTC 2025 <== here's the break
# Iteration 51: hello to world on Tue Feb 4 11:47:45 UTC 2025
# Iteration 52: hello to world on Tue Feb 4 11:47:55 UTC 2025
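The break can also be found mechanically. The sketch below embeds the two stdout.txt lines surrounding the interruption and flags any pause longer than 60 seconds between iterations (GNU date is assumed for timestamp parsing):

```shell
# Two consecutive lines from stdout.txt around the interruption.
cat > /tmp/stdout-sample.txt <<'EOF'
# Iteration 50: hello to world on Tue Feb 4 11:42:13 UTC 2025
# Iteration 51: hello to world on Tue Feb 4 11:47:45 UTC 2025
EOF

# Split each line at " on ", convert the timestamp to epoch seconds
# (GNU date -d), and report gaps larger than 60 seconds.
awk -F' on ' '{
    cmd = "date -d \"" $2 "\" +%s"
    cmd | getline t
    close(cmd)
    if (prev && t - prev > 60)
        printf "%d second gap before: %s\n", t - prev, $1
    prev = t
}' /tmp/stdout-sample.txt
```

Here the reported gap is 332 seconds, i.e. the roughly 5 minute restart window minus nothing the task has to redo, since execution resumes from the last checkpoint.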