Nextflow on AWS with EFS workDir

Highlights

🌟 Elastic pay as you go handling of work directory data on Amazon EFS.
🌟 A good estimate of the maximum work directory size and throughput is not needed. However, for large number of samples, a well defined project timeframe, and where a good estimate of work directory size/throughput is known, Amazon FSx for Lustre is the preferred and more cost efficient choice.
Significant cost savings achieved by the use of Spot instances with snapshots placed on a s3 bucket using JuiceFS.
Time savings on input data movement by the use of s3fs to mount input bucket(s) as POSIX compatible storage(s).

Overview

This stack deploys the entire Nextflow workflow within a single availability zone which eliminates the zone-in/zone-out data transfer costs. The pipeline is launched from a head compute instance which then uses the nf-float plugin to orchestrate the workflow through MM™ Cloud's grid executor. The input data bucket is mounted as POSIX storage using s3fs-fuse which makes the bucket behave as a local file system on the head and worker compute instances. The Nextflow work directory is placed on an EFS. The snapshots for the spot compute instances are placed on a separate S3 bucket. OpCentre attaches this bucket to the spot instances using highly performant JuiceFS to store and restore instance state whenever an instance is reclaimed. The outputs are placed on a S3 bucket through Nextflow's built-in S3 API.

Nextflow on AWS with EFS

Nextflow deployed on MM™ Cloud with work directory placed on an EFS within a single availability zone.

Pre-requisites

Requirement 1: MM™ Cloud Deployment and User Credentials

Deployment: Before starting this guide, please make sure that the MM™ Cloud is already deployed in your account. If not, please see the guide for default VPC deployment or the guide for the non-default VPC deployment.
Credentials: Keep your MM™ Cloud OpCentre user credentials (username and password) at hand. This user must NOT be the default admin user that comes pre-loaded with the MM™ Cloud OpCentre. To avoid exposing the MM™ Cloud credentials in setup scripts and configuration files, it is a best-practice to store them in a secure location such as ~/float/.secrets. This file should look like,

~/float/.secrets

MMC_IP=<OpCentre IP address>
MMC_USER=<OpCentre username>
MMC_PASSWORD=<OpCentre password>

Nextflow access denied exceptions!

Using the admin account from MM™ Cloud OpCentre can cause Nextflow to fail with access denial exceptions. This is because when admin account is used, the worker instances run with root user and the Nextflow head job fails to read files which have restricted access, unless the head job itself is run with the root user. It is a best-practice to not use the admin account for launching pipelines.

Requirement 2: Input, Output and Snapshots Buckets

Input Bucket: The input data should be placed in an input data bucket in the same region where the MM™ Cloud is deployed. This bucket should be registered as a read-only S3FS storage in the MM™ Cloud OpCentre using the admin account so that the storage is available to all the users. To register the bucket, click on Storage from inside the MM™ Cloud OpCentre and then click Register Storage. Select the Storage Type as S3 and Access Mode as Read Only. For a bucket named s3://my-input-bucket, enter my-input-bucket as Name, /my-input-bucket as Mount Point and s3://my-input-bucket as Bucket. There is no need to enter Access Key and Secret. If the OpCentre is deployed in the us-east-1 region, enter s3.us-east-1.amazonaws.com under Endpoint. For all other regions, leave the Endpoint blank.
Output Bucket: A separate bucket should be used for pipeline outputs as both read and write access is needed to this bucket. There is no need to register it as a storage as Nextflow can publish the outputs through its built-in s3 API in an asynchronous manner.
Snapshots Bucket: A separate bucket should be used for snapshots. To configure the OpCentre to use it for storing snapshots, open System Settings and, then, Cloud. Under Snapshot Location, select the radio button for Cloud Storage and enter the S3 URI of the bucket. The URI is the bucket address which looks like, s3://my-snapshots-bucket.

Requirement 3: EFS

Deploy a EFS in the same region and availability zone as the MM™ Cloud OpCentre. To do so, search for EFS from the AWS management console. Click Create File System. Following properties of the file system need to be configured appropriately,

Name: An appropriate name. For example, nextflow-workdir-efs
VPC: The same VPC where the MM™ Cloud OpCentre is deployed.

Click Customize to configure additional properties,

File system type: Choose One Zone to restrict the file system to a single zone. This eliminates zone-in/zone-out data transfer costs.
Availability Zone: The same zone where the MM™ Cloud OpCentre is deployed.
Automatic backups: Disable backus as this device will only be used as a scratch space.
Encryption: Enable or disable in accordance with your institute's security policy.
Throughput mode: Select Enhanced/Elastic as it is recommended when the throughput requirements are unpredictable.
Subnet ID: A subnet in the same availability zone as the MM™ Cloud OpCentre.
Security groups: A security group which allows all outbound traffic and following inbound traffic,

Type	Protocol	Port range	Source
NFS	TCP	2049	Self

Requirement 4: EFS Registration

Similar to the input and output buckets, the EFS storage should also be registered in the MM™ Cloud OpCentre. To register the EFS, click on Storage from inside the MM™ Cloud OpCentre and then click Register Storage. Select the Storage Type as NFS and Mount Point as /mnt/efs (or your preferred mount point). Enter the name of the EFS in the Name field. Set the URL to nfs://<DNS name>/. Note that / after <DNS name> is mandatory.

Pipeline launch

To start a pipeline, login into MM™ Cloud with your user credentials.

float login \
    -a $(sed -n 's/MMC_IP=\(.*\)/\1/p' ~/float/.secrets) \
    -u $(sed -n 's/MMC_USER=\(.*\)/\1/p' ~/float/.secrets) \
    -p $(sed -n 's/MMC_PASSWORD=\(.*\)/\1/p' ~/float/.secrets)

$(sed -n 's/MMC_PASSWORD=$.*$/\1/p' ~/float/.secrets) extracts the MM™ Cloud OpCentre password from the secrets file ~/float/.secrets. This avoid exposing the secret and also eliminates the need to retype it later. Once the login is successful, make sure that the password is also stored as a secret in the float secrets with name OPCENTER_PASSWORD. This can be checked and configured as,

float secrets ls # To list the secrets

float secret \
    set OPCENTER_PASSWORD \
    $(sed -n 's/OPCENTER_PASSWORD=\(.*\)/\1/p' ~/float/.secrets) # To set the OPCENTER_PASSWORD if missing

Step 2: Nextflow Head Compute Instance

Launch a head compute instance from where the Nextflow pipeline will be launched,

float submit -n <job_name> \
    --template nextflow:jfs \
    -c 2 -m 4 \
    --containerInit https://raw.githubusercontent.com/MemVerge/nf-float/refs/tags/0.4.7/shell/container-init-nextflow-fsx-efs.sh \
    --subnet subnet-<efs_subnet_id> \
    --securityGroup sg-<efs_security_group> \
    --storage <opcentre_input_data_bucket_name> \
    --storage <opcentre_efs_name> \
    --env EFS_MOUNT_PATH=/mnt/efs

Once the head instance is launched, it will appear by its name in the MM™ Cloud OpCentre Jobs console. Click on it to find its Public IP. Once its status changes to Executing, it will try to initialize the storages and its runtime container. You can monitor the progress of the initialization process from the logs listed under the Attachments tab.

Once the head instance has initialized fully, its SSH key will be available as a secret in float secrets. The key can be identified by the Job ID. Store the SSH key in a secure place and use it to login into the head instance.

float secrets ls # To list the secrets

float secret get <Head job ID>_SSHKEY > /path/to/ssh/key.pem

chmod 400 /path/to/ssh/key.pem

ssh -i /path/to/ssh/key.pem <MM™ Cloud username>@<Head job IP>

cd /mnt/efs # Switch directory to EFS for Lustre
mkdir nextflow-test # Create a project specific directory
cd nextflow-test # Switch to the project specific directory

Step 3: Configuration

Following is a minimal configuration to successfully run Nextflow on MM™ Cloud. Create this file in the project work directory /mnt/efs/nextflow-test.

mmc.config

plugins {
    id 'nf-float'
}

process {
    executor        = 'float'
}

float {
    address         = '<MM™ Cloud IP address>'
    commonExtra     = [
        '--vmPolicy [spotOnly=true]',
        '--storage <Input data bucket name registered in OpCentre storage, e.g. my-input-bucket>',
        '--subnet subnet-<ID of subnet in which the EFS is deployed>',
        '--securityGroup sg-<ID of the security group attached with EFS>',
        '--storage nextflow-workdir-efs'
    ].join(' ')
}

The key components of this configuration are,

nf-float: This plugin handles job submission and status monitoring. Full documentation is available on its GitHub repo.
commonExtra: The commonExtra scope defines extra parameters for the float executor. See nf-float GitHub page for documentation on how to correctly apply configuration in different situations and float submit documentation for a comprehensive list of parameters. Here we have used,
- --vmPolicy: spotOnly=true tells float to only allow Spot instances for executing Nextflow processes.
- --storage: This parameter can be used to mount the storage(s) registered in the OpCentre so that input data bucket and work directory EFS are available to all the instances running the Nextflow processes.
- --subnet: This tells float to launch the instance in a specific subnet.
- --securityGroup: This tells float to attach a security group to the compute instances so that they can access the EFS.
float.address: This IP address tells float where to find the OpCentre for job submission.

Step 4: Launch

tmux should be used so that the Nextflow head process can continue even if the user logouts or the SSH connection drops.

tmux # To start a new tmux session

cd /mnt/efs/nextflow-test # Make sure to launch the pipeline from a directory on the EFS

# Set the nextflow assets directory to a location on the EFS so that
# teh pipeline bin/ scripts are available to all the worker nodes
export NXF_ASSETS="/mnt/efs/.nextflow/assets"

nextflow run nf-core/rnaseq \
    -r '3.18.0' \
    -profile test \
    -c mmc.config \
    -resume \
    --outdir s3://path/to/output/directory