Using miniWDL with MMBatch
Configure MMBatch Management Server
The MMBatch Management Server (MMS) should be configured like this:
"ckptInterval": 120000000000
: creates a checkpoint every 2 min (the value is a duration in nanoseconds). You might want to increase this to 15 min in production.
"ckptOnSigTerm": false
: this feature is used in Kubernetes (K8s) deployments.
For more details please refer to the Config Reference.
curl -sk -X PUT http://localhost:8080/api/v1/ckptConfig \
-H "Content-Type: application/json" \
-d '{"ckptMode":"iterative","ckptImagePath":"/mmc-checkpoint","ckptInterval":120000000000,"rootFSDiff":true,"diagnosisMode":true,"ckptOnSigTerm":false}'|jq .
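Since ckptInterval is a duration in nanoseconds, the values can be derived with plain shell arithmetic; a quick sketch for the 2 min default above and the suggested 15 min production setting:

```shell
# ckptInterval is expressed in nanoseconds: seconds * 10^9.
# 2 minutes (the value used in this tutorial):
echo $((2 * 60 * 1000000000))    # 120000000000
# 15 minutes (suggested for production):
echo $((15 * 60 * 1000000000))   # 900000000000
```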
Deploy miniWDL Environment
We'll use the miniwdl-aws-terraform stack to deploy an environment on AWS to run miniWDL in.
Install Terraform
Please find the install instructions here
Apply Terraform
After terraform is installed, we are going to clone the upstream repository and initialize Terraform.
$ git clone https://github.com/miniwdl-ext/miniwdl-aws-terraform.git
Cloning into 'miniwdl-aws-terraform'...
remote: Enumerating objects: 78, done.
remote: Counting objects: 100% (78/78), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 78 (delta 43), reused 54 (delta 24), pack-reused 0 (from 0)
Receiving objects: 100% (78/78), 20.31 KiB | 990.00 KiB/s, done.
Resolving deltas: 100% (43/43), done.
$ cd miniwdl-aws-terraform
$ terraform init
Initializing the backend...
Initializing provider plugins...
- Finding latest version of hashicorp/aws...
- Finding latest version of hashicorp/cloudinit...
- Installing hashicorp/aws v5.84.0...
- Installed hashicorp/aws v5.84.0 (signed by HashiCorp)
- Installing hashicorp/cloudinit v2.3.5...
- Installed hashicorp/cloudinit v2.3.5 (signed by HashiCorp)
Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
Apply
Apply the Terraform template with your owner_tag and s3upload_buckets. Note that we use miniwdl as the environment tag, which affects the names of the resources created by this template, such as compute environment names, queue names, and launch template names.
Pass the following variables.
environment_tag=miniwdl
: tag that will be used for all resources created
owner_tag=me@example.com
: tag to identify the owner of the resources
ssh_keyname=<keyname>
: (optional) if you would like to SSH into the worker nodes, please provide a key pair name that is available in the region
s3upload_buckets=["MY-BUCKET"]
: please use a bucket in the region you are deploying the stack in
mmab_server=http://WW.XX.YY.ZZ:8080
: IP address of the management server
terraform apply \
-var='environment_tag=miniwdl' \
-var='owner_tag=me@example.com' \
-var='ssh_keyname=my-keyname' \
-var='s3upload_buckets=["MY-BUCKET"]' \
-var='mmab_server=http://WW.XX.YY.ZZ:8080'
Once applied, you should see output like this.
Apply complete! Resources: 12 added, 0 changed, 0 destroyed.
Outputs:
fs = "fs-03XYZ1"
fsap = "fsap-0aXYZ1"
security_group = "sg-0XYZ1"
subnets = [
"subnet-0eXYZ1",
"subnet-0aXYZ2",
"subnet-00XYZ3",
]
workflow_queue = "miniwdl-workflow"
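These output values can be read back later with the standard terraform output command, which is handy for scripting against the stack (e.g. fetching the queue name for miniwdl-aws-submit):

```shell
# Print a single output value without quotes.
terraform output -raw workflow_queue
# Dump all outputs as JSON for scripting.
terraform output -json | jq .
```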
Run MiniWDL
Install MiniWDL Plugin
MiniWDL does not normally use AWS Batch job attempts. Instead, it creates a new AWS Batch job whenever a job fails. To enable checkpoint/restore with MMBatch, we created a miniWDL plugin that assigns each job an environment variable that persists across retries, so a retried job can be identified even though the AWS Batch job ID changes (because it is a new job).
To use the plugin, please use our public image, either via an environment variable or a flag to miniwdl-aws.
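A sketch of both options. Note that the flag name, the environment-variable name, and the image URI below are placeholders/assumptions, not confirmed API; check `miniwdl-aws-submit --help` and the MMBatch documentation for the exact spelling:

```shell
# PUBLIC-MMBATCH-IMAGE is a placeholder for the public plugin image.

# Option 1: pass the image on the command line (flag name assumed).
miniwdl-aws-submit hello.wdl --workflow-queue miniwdl-workflow \
  --image PUBLIC-MMBATCH-IMAGE name=world

# Option 2: export it as an environment variable (variable name assumed),
# so it does not have to be repeated on every submission.
export MINIWDL__AWS__WORKFLOW_IMAGE=PUBLIC-MMBATCH-IMAGE
miniwdl-aws-submit hello.wdl --workflow-queue miniwdl-workflow name=world
```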
Test Env
Let's create a simple WDL workflow.
workflow helloWorld {
    String name
    call sayHello { input: name=name }
}

task sayHello {
    String name
    command {
        for i in $(seq 1 30); do
            printf "# Iteration $i: hello to ${name} on $(date)\n"
            sleep 10
        done
    }
    output {
        String out = read_string(stdout())
    }
    runtime {
        docker: "archlinux:latest"
        maxRetries: 3
    }
}
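Before submitting, the workflow can be validated locally with miniwdl's built-in checker (part of the standard miniwdl CLI):

```shell
# Parse and statically check the WDL; reports errors and lint warnings.
miniwdl check hello.wdl
```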
Submit workflow
$ miniwdl-aws-submit hello.wdl --workflow-queue miniwdl-workflow name=world
2025-01-23 11:30:16.978 miniwdl-zip hello.wdl <= /Users/kniepbert/data/temp/memverge/miniwdl/hello.wdl
2025-01-23 11:30:16.979 miniwdl-zip Prepare archive /var/folders/tg/x8qd961x4xq98g35631w4t0r0000gn/T/miniwdl_zip_jz__13sj/hello.wdl.zip from directory /var/folders/tg/x8qd961x4xq98g35631w4t0r0000gn/T/miniwdl_zip_rdt43szb
2025-01-23 11:30:16.980 miniwdl-zip Move archive to destination /var/folders/tg/x8qd961x4xq98g35631w4t0r0000gn/T/tmp5wqw7sno/hello.wdl.zip
Interrupt Instance
Once an instance is running and checkpoints have been taken, head over to the Spot Request Console and initiate an interruption to create the interruption event.
Within the /var/log/memverge/pagent.log log file you'll see the event being captured.
time="2025-02-04T12:04:04.23Z" level=warning msg="the spot instance is going to be interrupted"
time="2025-02-04T12:04:04.23Z" level=warning msg="triggers final checkpoint for all containers"
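These events can be watched for with a simple filter over the agent log. The snippet below embeds the two log lines from above so it is self-contained; on a real worker node you would point grep (or tail -f) at /var/log/memverge/pagent.log instead:

```shell
# Self-contained sample: the two warning lines captured above.
cat > /tmp/pagent-sample.log <<'EOF'
time="2025-02-04T12:04:04.23Z" level=warning msg="the spot instance is going to be interrupted"
time="2025-02-04T12:04:04.23Z" level=warning msg="triggers final checkpoint for all containers"
EOF

# Show only warning-level events (on a node: tail -f the real log).
grep 'level=warning' /tmp/pagent-sample.log
```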
This will trigger the finalization of the checkpoint and freeze all processes within the container.
Result
You can observe what is going on by connecting to the worker instance, using docker exec to enter the container, and inspecting stdout.txt.
The 5 min gap within the resulting stdout represents the job restarting on another instance. Since the container is paused as soon as the 2 min spot warning is issued on the host, those first two minutes are lost in terms of walltime.
$ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/key -l ec2-user 52.221.253.165
Warning: Permanently added '52.221.253.165' (ED25519) to the list of known hosts.
, #_
~\_ ####_
~~ \_#####\
~~ \###|
~~ \#/ ___ Amazon Linux 2023 (ECS Optimized)
~~ V~' '->
~~~ /
~~._. _/
_/ _/
_/m/'
For documentation, visit http://aws.amazon.com/documentation/ecs
Last login: Tue Feb 4 11:46:32 2025 from 109.42.240.239
[ec2-user@ip-10-0-6-9 ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7666cb568fc7 archlinux:latest "/bin/bash -ec 'cd /…" 8 minutes ago Up 8 minutes ecs-sayHello-usbqfpz5-1-default-aac4f4a8e3ffeec8e501
4f967d214787 amazon/amazon-ecs-agent:latest "/agent" 9 minutes ago Up 9 minutes (healthy) ecs-agent
[ec2-user@ip-10-0-6-9 ~]$ docker exec -ti 7666cb568fc7 bash
[root@ip-10-0-14-149 /]# ls /mnt/efs/miniwdl_run/
20250204_113345_helloWorld/ _CACHE/ _LAST/
[root@ip-10-0-14-149 /]# ls /mnt/efs/miniwdl_run/20250204_113345_helloWorld/
call-sayHello/ inputs.json workflow.log
[root@ip-10-0-14-149 /]# ls /mnt/efs/miniwdl_run/20250204_113345_helloWorld/call-sayHello/
awsBatchJobDetail.11753a0b-58fc-44ad-96fe-2e5f6b402a64.json inputs.json task.log
awsBatchJobDetail.8526a0d1-a3e8-434a-a4e1-37dae3605b37.json stderr.txt work/
command stdout.txt
[root@ip-10-0-14-149 /]# cat /mnt/efs/miniwdl_run/20250204_113345_helloWorld/call-sayHello/stdout.txt
# Iteration 1: hello to world on Tue Feb 4 11:34:03 UTC 2025
*snip*
# Iteration 49: hello to world on Tue Feb 4 11:42:03 UTC 2025
# Iteration 50: hello to world on Tue Feb 4 11:42:13 UTC 2025 <== here's the break
# Iteration 51: hello to world on Tue Feb 4 11:47:45 UTC 2025
# Iteration 52: hello to world on Tue Feb 4 11:47:55 UTC 2025
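The break can also be found mechanically. The sketch below embeds the two stdout.txt lines surrounding the interruption and flags any pause longer than 60 seconds between iterations (GNU date is assumed for timestamp parsing):

```shell
# Two consecutive lines from stdout.txt around the interruption.
cat > /tmp/stdout-sample.txt <<'EOF'
# Iteration 50: hello to world on Tue Feb 4 11:42:13 UTC 2025
# Iteration 51: hello to world on Tue Feb 4 11:47:45 UTC 2025
EOF

# Split each line at " on ", convert the timestamp to epoch seconds
# (GNU date -d), and report gaps larger than 60 seconds.
awk -F' on ' '{
    cmd = "date -d \"" $2 "\" +%s"
    cmd | getline t
    close(cmd)
    if (prev && t - prev > 60)
        printf "%d second gap before: %s\n", t - prev, $1
    prev = t
}' /tmp/stdout-sample.txt
```

Here the reported gap is 332 seconds, i.e. the roughly 5 minute restart window minus nothing the task has to redo, since execution resumes from the last checkpoint.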