Nextflow on AWS with FSx for Lustre workDir
Highlights
- 🌟 Highly performant handling of tens of TBs of work directory data on Amazon FSx for Lustre.
- Significant cost savings achieved by using Spot instances, with snapshots placed on an S3 bucket using JuiceFS.
- Time savings on input data movement by using s3fs to mount the input bucket(s) as POSIX-compatible storage.
- ⚠️ A good estimate of the maximum work directory size and throughput is needed in advance, as FSx for Lustre is not elastic. It can be scaled up, but scaling can take anywhere from a couple of hours to a few days. If a reliable estimate is not possible, consider using Amazon Elastic File System (EFS) instead of FSx for Lustre.
Overview
This stack deploys the entire Nextflow workflow within a single availability zone, which eliminates cross-zone data transfer costs. The pipeline is launched from a head compute instance, which then uses the nf-float plugin to orchestrate the workflow through MM™ Cloud's grid executor. The input data bucket is mounted as POSIX storage using s3fs-fuse, which makes the bucket behave like a local file system on the head and worker compute instances. The Nextflow work directory is placed on an FSx for Lustre file system. The snapshots for the Spot compute instances are placed on a separate S3 bucket; the OpCentre attaches this bucket to the Spot instances using the highly performant JuiceFS to store and restore instance state whenever an instance is reclaimed. The outputs are placed on an S3 bucket through Nextflow's built-in S3 support.
Nextflow deployed on MM™ Cloud with the work directory placed on an FSx for Lustre file system within a single availability zone.
Pre-requisites
Requirement 1: MM™ Cloud Deployment and User Credentials
Deployment:
Before starting this guide, please make sure that MM™ Cloud is already deployed in your account. If not, please see the guide for default VPC deployment or the guide for non-default VPC deployment.
Credentials:
Keep your MM™ Cloud OpCentre user credentials (username and password) at hand. This user must NOT be the default admin user that comes pre-loaded with the MM™ Cloud OpCentre. To avoid exposing the MM™ Cloud credentials in setup scripts and configuration files, it is best practice to store them in a secure location such as ~/float/.secrets. This file should look like,
MMC_IP=<OpCentre IP address>
MMC_USER=<OpCentre username>
MMC_PASSWORD=<OpCentre password>
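One way to create this file with restrictive permissions, so that only your user can read it (the values are placeholders to be replaced with your own),
mkdir -p ~/float
cat > ~/float/.secrets <<'EOF'
MMC_IP=<OpCentre IP address>
MMC_USER=<OpCentre username>
MMC_PASSWORD=<OpCentre password>
EOF
chmod 600 ~/float/.secrets  # readable and writable by the owner only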
Nextflow access denied exceptions!
Using the admin account from the MM™ Cloud OpCentre can cause Nextflow to fail with access denied exceptions. This is because when the admin account is used, the worker instances run as the root user, and the Nextflow head job fails to read files with restricted access unless the head job itself is also run as the root user. It is best practice not to use the admin account for launching pipelines.
Requirement 2: Input, Output and Snapshots Buckets
Input Bucket:
The input data should be placed in an input data bucket in the same region where MM™ Cloud is deployed. This bucket should be registered as a read-only S3FS storage in the MM™ Cloud OpCentre using the admin account so that the storage is available to all users. To register the bucket, click on Storage from inside the MM™ Cloud OpCentre and then click Register Storage. Select the Storage Type as S3 and the Access Mode as Read Only. For a bucket named s3://my-input-bucket, enter my-input-bucket as Name, /my-input-bucket as Mount Point and s3://my-input-bucket as Bucket. There is no need to enter Access Key and Secret. If the OpCentre is deployed in the us-east-1 region, enter s3.us-east-1.amazonaws.com under Endpoint. For all other regions, leave the Endpoint blank.
Output Bucket:
A separate bucket should be used for pipeline outputs, as both read and write access to this bucket is needed. There is no need to register it as a storage, because Nextflow can publish the outputs asynchronously through its built-in S3 support.
Snapshots Bucket:
A separate bucket should be used for snapshots. To configure the OpCentre to use it for storing snapshots, open System Settings and then Cloud. Under Snapshot Location, select the radio button for Cloud Storage and enter the S3 URI of the bucket, which looks like s3://my-snapshots-bucket.
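If the buckets do not exist yet, they can be created with the AWS CLI. The bucket names below are placeholders, and the region should match the region where the OpCentre is deployed,
# Create the input, output and snapshots buckets in the OpCentre region (example region: us-east-1)
aws s3 mb s3://my-input-bucket --region us-east-1
aws s3 mb s3://my-output-bucket --region us-east-1
aws s3 mb s3://my-snapshots-bucket --region us-east-1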
Requirement 3: FSx for Lustre
Deploy an FSx for Lustre file system in the same region and availability zone as the MM™ Cloud OpCentre. To do so, search for FSx in the AWS Management Console, click Create File System and select Amazon FSx for Lustre. The following properties of the file system need to be configured appropriately (an equivalent AWS CLI call is sketched after the list),
- Name: An appropriate name, for example nextflow-workdir-fsx.
- Storage class: As the Nextflow work directory only contains intermediate process files, the appropriate storage class is Scratch, SSD. This class offers the lowest cost per GB and satisfies all the technical requirements.
- Storage capacity: This is the most challenging aspect of this stack. A good estimate of the storage capacity must be made before deploying the FSx (see the next subsection).
- Data compression: LZ4 is recommended. Choose None only if the pipeline is well built and all the process files are compressed. In some cases, intermediate files are first written to disk and then compressed; in such cases, the data compression provided by FSx can significantly reduce the peak storage usage.
- VPC: The same VPC where the MM™ Cloud OpCentre is deployed.
- Security Group: A security group which allows all outbound traffic and the following inbound traffic,
| Type | Protocol | Port range | Source |
|---|---|---|---|
| Custom TCP | TCP | 1018 - 1023 | Self |
| Custom TCP | TCP | 988 | Self |
- Subnet: A subnet in the same availability zone as the MM™ Cloud OpCentre.
- Root Squash: The uid and gid of the root user on FSx for Lustre. By default, FSx for Lustre is mounted with file system permissions of 644, which means that the root user has read and write permissions whereas the root group and others have read-only permissions. These permissions do not allow non-root users to write to the FSx. Therefore, it is important to match these IDs with the MM™ Cloud user's uid and gid. MM™ Cloud user and group IDs can be found under Users and Groups in the OpCentre.
- Leave the remaining settings with their default values.
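For reference, the same file system can also be created with the AWS CLI. The sketch below assumes a 12 TiB Scratch, SSD file system with LZ4 compression and a MM™ Cloud user with uid and gid 1001; the subnet, security group and capacity values are placeholders and should be checked against the current AWS CLI documentation before use,
# Sketch of an equivalent AWS CLI call (placeholder subnet, security group, capacity and uid:gid)
aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-type SSD \
    --storage-capacity 12000 \
    --subnet-ids subnet-<fsx_subnet_id> \
    --security-group-ids sg-<fsx_security_group> \
    --lustre-configuration 'DeploymentType=SCRATCH_2,DataCompressionType=LZ4,RootSquashConfiguration={RootSquash=1001:1001}' \
    --tags Key=Name,Value=nextflow-workdir-fsx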
Determining FSx storage and throughput capacity
The determinants of storage capacity are,
- Peak total size of all the files in the work directory.
- The maximum number of Nextflow tasks that will run in parallel. A larger number of tasks means a higher I/O throughput requirement, which for FSx for Lustre is tied to the storage capacity.
Here is a method for determining the storage capacity (a worked example follows the list),
- Deploy an FSx of 1 TB or more, or an EFS.
- Run the pipeline with the largest possible sample.
- Monitor the storage and throughput usage in the FSx/EFS console.
- Take the maximum value of the storage usage and multiply it by the number of samples that will be run in parallel.
- Take the maximum value of the throughput usage and multiply it by the number of samples that will be run in parallel. Very short-lived (< 5 min) peaks in the throughput usage can be ignored at this step.
- Add a 20% safety margin to the estimated storage and throughput.
- Choose a storage capacity which satisfies both the storage and throughput estimates.
- Run half of the maximum number of samples that you plan to run in parallel, monitor the storage and throughput usage and make sure that these statistics are in line with your estimate. Otherwise, revise your estimate.
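As a worked example with hypothetical numbers, suppose the trial run shows a peak of 400 GB of work directory data and 150 MB/s of throughput per sample, and 25 samples will be run in parallel,
# Hypothetical per-sample peaks observed during the trial run
PEAK_STORAGE_GB=400
PEAK_THROUGHPUT_MBPS=150
SAMPLES=25
echo "Storage estimate:    $(( PEAK_STORAGE_GB * SAMPLES * 120 / 100 )) GB"        # 12000 GB
echo "Throughput estimate: $(( PEAK_THROUGHPUT_MBPS * SAMPLES * 120 / 100 )) MB/s" # 4500 MB/s
At the time of writing, Scratch, SSD file systems provide a baseline throughput of roughly 200 MB/s per TiB of provisioned storage, so in this example the throughput estimate (about 22.5 TiB worth of capacity) rather than the storage estimate (about 12 TB) determines the capacity to provision, rounded up to the nearest valid FSx capacity increment.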
Requirement 4: FSx for Lustre Registration
Similar to the input and output buckets, the FSx for Lustre storage should also be registered in the MM™ Cloud OpCentre. To register the FSx for Lustre, click on Storage from inside the MM™ Cloud OpCentre and then click Register Storage. Select the Storage Type as Lustre and the Mount Point as /mnt/fsx (or your preferred mount point). Enter the name of the FSx for Lustre in the Name field. Set the URL to lustre://<DNS name>/<Mount name>. Both the DNS name and the Mount name are available on the FSx console page once the file system has been created. Leave the Mount Options blank.
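The DNS name and Mount name can also be retrieved with the AWS CLI. The query below assumes the file system was tagged with the name nextflow-workdir-fsx used earlier,
# Print the DNS name and Lustre mount name of the tagged file system
aws fsx describe-file-systems \
    --query "FileSystems[?Tags[?Value=='nextflow-workdir-fsx']].[DNSName,LustreConfiguration.MountName]" \
    --output text
The resulting URL then has the form lustre://fs-<id>.fsx.<region>.amazonaws.com/<mount name>.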
Pipeline launch
Step 1: MM™ Cloud Login
To start a pipeline, log in to MM™ Cloud with your user credentials.
float login \
-a $(sed -n 's/MMC_IP=\(.*\)/\1/p' ~/float/.secrets) \
-u $(sed -n 's/MMC_USER=\(.*\)/\1/p' ~/float/.secrets) \
-p $(sed -n 's/MMC_PASSWORD=\(.*\)/\1/p' ~/float/.secrets)
The command $(sed -n 's/MMC_PASSWORD=\(.*\)/\1/p' ~/float/.secrets) extracts the MM™ Cloud OpCentre password from the secrets file ~/float/.secrets. This avoids exposing the secret and also eliminates the need to retype it later. Once the login is successful, make sure that the password is also stored as a float secret with the name OPCENTER_PASSWORD. This can be checked and configured as,
float secrets ls # To list the secrets
float secret \
set OPCENTER_PASSWORD \
$(sed -n 's/MMC_PASSWORD=\(.*\)/\1/p' ~/float/.secrets) # To set the OPCENTER_PASSWORD if missing
Step 2: Nextflow Head Compute Instance
Launch a head compute instance from which the Nextflow pipeline will be run,
float submit -n <job_name> \
--template nextflow:jfs \
-c 2 -m 4 \
--containerInit https://mmce-data.s3.amazonaws.com/nextflow/container-init-nextflow-fsx-efs.sh \
--subnet subnet-<fsx_subnet_id> \
--securityGroup sg-<fsx_security_group> \
--storage <opcentre_input_data_bucket_name> \
--storage <opcentre_fsx_name>
Once the head instance is launched, it will appear by its name in the MM™ Cloud OpCentre Jobs console. Click on it to find its Public IP. Once its status changes to Executing, it will initialize the storage mounts and its runtime container. You can monitor the progress of the initialization process from the logs listed under the Attachments tab.
Once the head instance has fully initialized, its SSH key will be available as a secret in float secrets. The key can be identified by the Job ID. Store the SSH key in a secure place and use it to log in to the head instance.
float secrets ls # To list the secrets
float secret get <Head job ID>_SSHKEY > /path/to/ssh/key.pem
chmod 600 /path/to/ssh/key.pem
ssh -i /path/to/ssh/key.pem <MM™ Cloud username>@<Head job IP>
cd /mnt/fsx # Switch directory to FSx for Lustre
mkdir nextflow-test # Create a project specific directory
cd nextflow-test # Switch to the project specific directory
Step 3: Configuration
The following is a minimal configuration for successfully running Nextflow on MM™ Cloud. Create this file in the project work directory /mnt/fsx/nextflow-test.
plugins {
id 'nf-float'
}
process {
executor = 'float'
}
float {
address = '<MM™ Cloud IP address>'
commonExtra = [
'--vmPolicy [spotOnly=true]',
'--storage <Input data bucket name registered in OpCentre storage, e.g. my-input-bucket>',
'--subnet subnet-<ID of subnet in which the FSx for Lustre is deployed>',
'--securityGroup sg-<ID of the security group attached with FSx for Lustre>',
'--storage <FSx for Lustre storage name registered in OpCentre, e.g. nextflow-workdir-fsx>'
].join(' ')
}
The key components of this configuration are,
nf-float:
This plugin handles job submission and status monitoring. Full documentation is available on its GitHub repo.
commonExtra:
The commonExtra scope defines extra parameters for the float executor. See the nf-float GitHub page for documentation on how to correctly apply the configuration in different situations, and the float submit documentation for a comprehensive list of parameters. Here we have used,
- --vmPolicy: spotOnly=true tells float to only allow Spot instances for executing Nextflow processes.
- --storage: This parameter mounts the storage(s) registered in the OpCentre so that the input data bucket and the work directory FSx for Lustre are available to all the instances running Nextflow processes.
- --subnet: This tells float to launch the instances in a specific subnet.
- --securityGroup: This tells float to attach a security group to the compute instances so that they can access the FSx for Lustre.
float.address:
This IP address tells float where to find the OpCentre for job submission.
Step 4: Launch
tmux should be used so that the Nextflow head process can continue even if the user logs out or the SSH connection drops.
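As an illustrative launch, the commands below assume that the configuration from Step 3 was saved as nextflow.config in the project directory (so Nextflow loads it automatically); the pipeline name, sample sheet path and bucket names are placeholders, and any profile or pipeline-specific options should be added as required,
tmux new -s nextflow                      # start a named tmux session (reattach later with: tmux attach -t nextflow)
cd /mnt/fsx/nextflow-test                 # project directory on FSx for Lustre
nextflow run nf-core/rnaseq \
    -work-dir /mnt/fsx/nextflow-test/work \
    --input /my-input-bucket/samplesheet.csv \
    --outdir s3://my-output-bucket/results
Note that the input is read through the s3fs mount point of the input bucket, the work directory sits on the FSx for Lustre, and the outputs are published directly to the output bucket through Nextflow's built-in S3 support.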