Using miniWDL with MMBatch

Deploy miniWDL Environment

We'll use the miniwdl-aws Terraform stack to deploy an environment on AWS to run miniWDL in.

Install Terraform

brew install terraform

Please find the install instructions here
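
To verify the installation:

terraform -version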

Apply Terraform

After Terraform is installed, clone the upstream repository and initialize Terraform.

git clone https://github.com/miniwdl-ext/miniwdl-aws-terraform.git
cd miniwdl-aws-terraform
terraform init
$ git clone https://github.com/miniwdl-ext/miniwdl-aws-terraform.git
Cloning into 'miniwdl-aws-terraform'...
remote: Enumerating objects: 78, done.
remote: Counting objects: 100% (78/78), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 78 (delta 43), reused 54 (delta 24), pack-reused 0 (from 0)
Receiving objects: 100% (78/78), 20.31 KiB | 990.00 KiB/s, done.
Resolving deltas: 100% (43/43), done.
$ cd miniwdl-aws-terraform
$ terraform init

Initializing the backend...

Initializing provider plugins...
- Finding latest version of hashicorp/aws...
- Finding latest version of hashicorp/cloudinit...
- Installing hashicorp/aws v5.84.0...
- Installed hashicorp/aws v5.84.0 (signed by HashiCorp)
- Installing hashicorp/cloudinit v2.3.5...
- Installed hashicorp/cloudinit v2.3.5 (signed by HashiCorp)

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

Adjust Stack

Cloud Init script

Add a new variable to the file variables.tf.

variable "api_address" {
    description = "URl to the Management Server"
    default     = "http://<IP_ADDRESS>:<PORT>"
}
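
Before baking the address into the stack, it can be worth checking that the management server is reachable from your workstation. A minimal sketch, assuming the server already serves the install-pagent endpoint that the cloud-init script below will call (replace the placeholder address):

curl -k -s -o /dev/null -w "%{http_code}\n" "http://<IP_ADDRESS>:<PORT>/api/v1/scripts/install-pagent"
# A 200 response means the endpoint the worker nodes will download the agent from is reachable.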

main.tf has a section that imports the cloud-init script from a file.

data "cloudinit_config" "task" {
    gzip = false

    # enable EC2 Instance Connect for troubleshooting (if security group allows inbound SSH)
    part {
        content_type = "text/x-shellscript"
        content      = "yum install -y ec2-instance-connect"
    }
    part {
        content_type = "text/x-shellscript"
        content      = file("${path.module}/assets/init_docker_instance_storage.sh")
    }
}

To be able to access the DNS name of the EFS volume within the script, we inline the script instead of importing it from a file.

data "cloudinit_config" "task" {
    gzip = false

    # enable EC2 Instance Connect for troubleshooting (if security group allows inbound SSH)
    part {
        content_type = "text/x-shellscript"
        content      = "yum install -y ec2-instance-connect"
    }

    part {
        content_type = "text/x-shellscript"
        content      = <<-EOT
        #!/bin/bash
        # To run on first boot of an EC2 instance with NVMe instance storage volumes:
        # 1) Assembles them into a RAID0 array, formats with XFS, and mounts to /mnt/scratch
        # 2) Replaces /var/lib/docker with a symlink to /mnt/scratch/docker so that docker images and
        #    container file systems use this high-performance scratch space. (restarts docker)
        # The configuration persists through reboots (but not instance stop).
        # logs go to /var/log/cloud-init-output.log
        # refs:
        # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html
        # https://github.com/kislyuk/aegea/blob/master/aegea/rootfs.skel/usr/bin/aegea-format-ephemeral-storage
        set -euxo pipefail
        shopt -s nullglob
        mkdir -p /mnt/scratch/tmp
        systemctl stop docker || true
        if [ -d /var/lib/docker ] && [ ! -L /var/lib/docker ]; then
        mv /var/lib/docker /mnt/scratch
        fi
        mkdir -p /mnt/scratch/docker
        ln -s /mnt/scratch/docker /var/lib/docker
        # Create checkpoint dir
        mkdir -p /mmc-checkpoint
        # Mount EFS filesystem
        mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport ${aws_efs_file_system.efs.dns_name}:/ /mmc-checkpoint
        # Create a checkpoints subdirectory, then remount it at /mmc-checkpoint
        mkdir -p /mmc-checkpoint/checkpoints
        umount /mmc-checkpoint
        mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport ${aws_efs_file_system.efs.dns_name}:/checkpoints /mmc-checkpoint
        # Install MM batch engine
        curl -k ${var.api_address}/api/v1/scripts/install-pagent | bash
        systemctl restart docker || true
        systemctl restart --no-block ecs
        EOT
    }
}
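
To confirm that this part ran on a worker node, you can log into a running task instance (see the SSH key pair section below) and check the mounts and the cloud-init log. A minimal sketch, assuming an ECS-optimized worker instance:

findmnt /mmc-checkpoint              # should show the EFS :/checkpoints export mounted via nfs4
df -h /mnt/scratch /mmc-checkpoint
tail -n 50 /var/log/cloud-init-output.log
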
Add SSH KeyPair

In case you want to log into a worker node, you'll need to add key_name = "<key_pair_name>" to the resource "aws_launch_template" "task" within the main.tf file.

Please add this line to the resource block:

key_name = "KEY_PAIR_NAME"

Also look out for this comment within main.tf, which marks the sections to uncomment for SSH access.

# Uncomment to open SSH to task worker instances via EC2 Instance Connect (for troubleshooting)

resource "aws_launch_template" "task" {
name                   = "${var.environment_tag}-task"
update_default_version = true
iam_instance_profile {
    name = aws_iam_instance_profile.task.name
}
key_name ="KEY_PAIR_NAME"
block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
    volume_type = "gp3"
    volume_size = 40
    # ^ Large docker images may need more root EBS volume space on worker instances
    }
}
user_data = data.cloudinit_config.task.rendered
}

After uncommenting, the SSH ingress rule within the security group looks like this.

# Uncomment to open SSH to task worker instances via EC2 Instance Connect (for troubleshooting)
ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
}
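
With the ingress rule uncommented and ec2-instance-connect installed by the cloud-init script, you can push a temporary SSH key to a worker instance. A sketch, assuming a configured AWS CLI; <INSTANCE_ID> and <INSTANCE_IP> are placeholders for one of your task instances:

aws ec2-instance-connect send-ssh-public-key \
    --instance-id <INSTANCE_ID> \
    --instance-os-user ec2-user \
    --ssh-public-key file://~/.ssh/id_rsa.pub
ssh ec2-user@<INSTANCE_IP>   # the pushed key is only valid for about 60 seconds
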
Existing VPC

If you want to use an existing VPC, replace the network resources in main.tf with your existing VPC, subnets, and security groups. While doing this, please make sure the EFS port (2049) is open to the worker nodes.

How is that done?
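
One possible approach, sketched here with the AWS CLI: allow NFS traffic from the security group attached to the worker nodes to the security group attached to the EFS mount targets. $TASK_SG and $EFS_SG are placeholders for your existing security group IDs.

aws ec2 authorize-security-group-ingress \
    --group-id "$EFS_SG" \
    --protocol tcp \
    --port 2049 \
    --source-group "$TASK_SG"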

Service Role

Warning

The stack will deploy the role AWSServiceRoleForEC2Spot, which can only be created once globally.
If you already have this role, you need to set the following value to false in the variables.tf file.

variable "create_spot_service_roles" {
  description = "Create account-wide spot service roles (disable if they already exist)"
  type        = bool
  default     = false
}
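
Alternatively, leave variables.tf untouched and override the value at apply time, together with the variables listed in the Apply section below:

terraform apply -var='create_spot_service_roles=false' [...other -var flags...]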

Apply

Apply the Terraform template with your owner_tag and s3upload_buckets. Note that here we use miniwdl as the environment tag, which affects the names of the resources created by this template, such as compute environment names, queue names, and launch template names.

Pass the following variables.

  • environment_tag=miniwdl: tag that will be used for all resources created
  • owner_tag=me@example.com: tag to identify the owner of the resources
  • s3upload_buckets=["MY-BUCKET"]: Please use a bucket in the region you are deploying the stack in
  • api_address=http://WW.XX.YY.ZZ:8080: URL of the management server
terraform apply \
    -var='environment_tag=miniwdl' \
    -var='owner_tag=me@example.com' \
    -var='s3upload_buckets=["MY-BUCKET"]' \
    -var='api_address=http://WW.XX.YY.ZZ:8080'
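
If you want to review the changes before creating anything, run terraform plan with the same variables first:

terraform plan \
    -var='environment_tag=miniwdl' \
    -var='owner_tag=me@example.com' \
    -var='s3upload_buckets=["MY-BUCKET"]' \
    -var='api_address=http://WW.XX.YY.ZZ:8080'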

Once applied, you should see output like this.

Apply complete! Resources: 12 added, 0 changed, 0 destroyed.

Outputs:

fs = "fs-03XYZ1"
fsap = "fsap-0aXYZ1"
security_group = "sg-0XYZ1"
subnets = [
  "subnet-0eXYZ1",
  "subnet-0aXYZ2",
  "subnet-00XYZ3",
]
workflow_queue = "miniwdl-workflow"
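
The outputs can be read back later without re-applying; the workflow queue name in particular is needed for miniwdl-aws-submit below:

terraform output workflow_queue
terraform output -json subnets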

Create Management Server

Run MiniWDL

Install

pip3 install miniwdl-aws
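
A quick sanity check that the package and its CLI are available:

pip3 show miniwdl-aws
miniwdl-aws-submit --help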

MiniWDL Plugin

MiniWDL does not usually use AWS Batch job attempts. Instead, it creates a new AWS Batch job if a job fails. To enable checkpoint/restore using MMBatch, we created a miniWDL plugin that assigns each job a persistent environment variable, so that a job retry can be identified even though the AWS Batch job ID changed (because it is a new job).

To use the plugin, please use our public image, either via an environment variable or a flag to miniwdl-aws-submit.

export MINIWDL__AWS__WORKFLOW_IMAGE=memverge/miniwdl-mmab:0.0.2
--image=memverge/miniwdl-mmab:0.0.2

Test Env

Let's create a simple WDL workflow.

hello.wdl
workflow helloWorld {
    String name
    call sayHello { input: name=name }
}

task sayHello {
    String name
    command {
        for i in $(seq 1 30); do
            printf "# Iteration $i: hello to ${name} on $(date)\n"
            sleep 10
        done
    }
    output {
        String out = read_string(stdout())
    }
    runtime {
        docker: "archlinux:latest"
        maxRetries: 3
    }
}
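
Before submitting, you can validate the workflow locally, assuming the miniwdl CLI is available in your environment (it is pulled in with miniwdl-aws):

miniwdl check hello.wdl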

Submit workflow

export MINIWDL__AWS__WORKFLOW_IMAGE=memverge/miniwdl-mmab:0.0.2
miniwdl-aws-submit --verbose --follow --no-cache hello.wdl --workflow-queue miniwdl-workflow name=world
miniwdl-aws-submit --image=memverge/miniwdl-mmab:0.0.2 --verbose --follow --no-cache hello.wdl --workflow-queue miniwdl-workflow name=world 
$ miniwdl-aws-submit hello.wdl --workflow-queue miniwdl-workflow name=world
2025-01-23 11:30:16.978 miniwdl-zip hello.wdl <= /Users/kniepbert/data/temp/memverge/miniwdl/hello.wdl
2025-01-23 11:30:16.979 miniwdl-zip Prepare archive /var/folders/tg/x8qd961x4xq98g35631w4t0r0000gn/T/miniwdl_zip_jz__13sj/hello.wdl.zip from directory /var/folders/tg/x8qd961x4xq98g35631w4t0r0000gn/T/miniwdl_zip_rdt43szb
2025-01-23 11:30:16.980 miniwdl-zip Move archive to destination /var/folders/tg/x8qd961x4xq98g35631w4t0r0000gn/T/tmp5wqw7sno/hello.wdl.zip
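
While --follow streams the workflow log, you can also watch the underlying AWS Batch jobs directly, for example on the workflow queue created by the stack:

aws batch list-jobs --job-queue miniwdl-workflow --job-status RUNNING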