Nextflow on AWS with FSx for Lustre workDir
Highlights
- 🌟 Highly performant handling of tens of TBs of work directory data on Amazon FSx for Lustre.
- Significant cost savings achieved by using Spot instances, with snapshots placed on an S3 bucket using JuiceFS.
- Time savings on input data movement by using s3fs to mount the input bucket(s) as POSIX-compatible storage.
- ⚠️ A good estimate of the maximum work directory size and throughput is needed in advance, as FSx for Lustre is not elastic. It can be scaled up, but scaling can take anywhere from a couple of hours to a few days. If a reliable estimate is not possible, consider using Amazon Elastic File System (EFS) instead of FSx for Lustre.
Overview
This stack deploys the entire Nextflow workflow within a single availability zone, which eliminates cross-zone data transfer costs. The pipeline is launched from a head compute instance, which then uses the nf-float plugin to orchestrate the workflow through MM™ Cloud's grid executor. The input data bucket is mounted as POSIX storage using s3fs-fuse, which makes the bucket behave like a local file system on the head and worker compute instances. The Nextflow work directory is placed on an FSx for Lustre file system. The snapshots for the Spot compute instances are placed on a separate S3 bucket; the OpCentre attaches this bucket to the Spot instances using the highly performant JuiceFS to store and restore instance state whenever an instance is reclaimed. The outputs are placed on an S3 bucket through Nextflow's built-in S3 support.
Nextflow deployed on MM™ Cloud with the work directory placed on an FSx for Lustre file system within a single availability zone.
Pre-requisites
Requirement 1: MM™ Cloud Deployment and User Credentials
Deployment:
Before starting this guide, please make sure that MM™ Cloud is already deployed in your account. If not, please see the guide for default VPC deployment or the guide for non-default VPC deployment.
Credentials:
Keep your MM™ Cloud OpCentre user credentials (username and password) at hand. This user must NOT be the default admin user that comes pre-loaded with the MM™ Cloud OpCentre. To avoid exposing the MM™ Cloud credentials in setup scripts and configuration files, it is best practice to store them in a secure location such as ~/float/.secrets. This file should look like,
MMC_IP=<OpCentre IP address>
MMC_USER=<OpCentre username>
MMC_PASSWORD=<OpCentre password>
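One way to create this file with restrictive permissions, so that only your user can read it (the values are placeholders to be replaced with your own),
mkdir -p ~/float
cat > ~/float/.secrets <<'EOF'
MMC_IP=<OpCentre IP address>
MMC_USER=<OpCentre username>
MMC_PASSWORD=<OpCentre password>
EOF
chmod 600 ~/float/.secrets  # readable and writable by the owner only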
Nextflow access denied exceptions!
Using the admin account from the MM™ Cloud OpCentre can cause Nextflow to fail with access denied exceptions. This is because when the admin account is used, the worker instances run as the root user, and the Nextflow head job fails to read files with restricted access unless the head job itself is also run as the root user. It is best practice not to use the admin account for launching pipelines.
Requirement 2: Input, Output and Snapshots Buckets
Input Bucket:
The input data should be placed in an input data bucket in the same region where MM™ Cloud is deployed. This bucket should be registered as a read-only S3FS storage in the MM™ Cloud OpCentre using the admin account so that the storage is available to all users. To register the bucket, click on Storage from inside the MM™ Cloud OpCentre and then click Register Storage. Select the Storage Type as S3 and the Access Mode as Read Only. For a bucket named s3://my-input-bucket, enter my-input-bucket as Name, /my-input-bucket as Mount Point and s3://my-input-bucket as Bucket. There is no need to enter Access Key and Secret. If the OpCentre is deployed in the us-east-1 region, enter s3.us-east-1.amazonaws.com under Endpoint. For all other regions, leave the Endpoint blank.
Output Bucket:
A separate bucket should be used for pipeline outputs, as both read and write access to this bucket is needed. There is no need to register it as a storage, because Nextflow can publish the outputs asynchronously through its built-in S3 support.
Snapshots Bucket:
A separate bucket should be used for snapshots. To configure the OpCentre to use it for storing snapshots, open System Settings and then Cloud. Under Snapshot Location, select the radio button for Cloud Storage and enter the S3 URI of the bucket, which looks like s3://my-snapshots-bucket.
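If the buckets do not exist yet, they can be created with the AWS CLI. The bucket names below are placeholders, and the region should match the region where the OpCentre is deployed,
# Create the input, output and snapshots buckets in the OpCentre region (example region: us-east-1)
aws s3 mb s3://my-input-bucket --region us-east-1
aws s3 mb s3://my-output-bucket --region us-east-1
aws s3 mb s3://my-snapshots-bucket --region us-east-1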
Requirement 3: FSx for Lustre
Deploy an FSx for Lustre file system in the same region and availability zone as the MM™ Cloud OpCentre. To do so, search for FSx in the AWS Management Console, click Create File System and select Amazon FSx for Lustre. The following properties of the file system need to be configured appropriately (an equivalent AWS CLI call is sketched after the list),
- Name: An appropriate name, for example nextflow-workdir-fsx.
- Storage class: As the Nextflow work directory only contains intermediate process files, the appropriate storage class is Scratch, SSD. This class offers the lowest cost per GB and satisfies all the technical requirements.
- Storage capacity: This is the most challenging aspect of this stack. A good estimate of the storage capacity must be made before deploying the FSx (see the next subsection).
- Data compression: LZ4 is recommended. Choose None only if the pipeline is well built and all the process files are compressed. In some cases, intermediate files are first written to disk and then compressed; in such cases, the data compression provided by FSx can significantly reduce the peak storage usage.
- VPC: The same VPC where the MM™ Cloud OpCentre is deployed.
- Security Group: A security group which allows all outbound traffic and the following inbound traffic,
| Type | Protocol | Port range | Source |
|---|---|---|---|
| Custom TCP | TCP | 1018 - 1023 | Self |
| Custom TCP | TCP | 988 | Self |
- Subnet: A subnet in the same availability zone as the MM™ Cloud OpCentre.
- Root Squash: The uid and gid of the root user on FSx for Lustre. By default, FSx for Lustre is mounted with file system permissions of 644, which means that the root user has read and write permissions whereas the root group and others have read-only permissions. These permissions do not allow non-root users to write to the FSx. Therefore, it is important to match these IDs with the MM™ Cloud user's uid and gid. MM™ Cloud user and group IDs can be found under Users and Groups in the OpCentre.
- Leave the remaining settings with their default values.
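For reference, the same file system can also be created with the AWS CLI. The sketch below assumes a 12 TiB Scratch, SSD file system with LZ4 compression and a MM™ Cloud user with uid and gid 1001; the subnet, security group and capacity values are placeholders and should be checked against the current AWS CLI documentation before use,
# Sketch of an equivalent AWS CLI call (placeholder subnet, security group, capacity and uid:gid)
aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-type SSD \
    --storage-capacity 12000 \
    --subnet-ids subnet-<fsx_subnet_id> \
    --security-group-ids sg-<fsx_security_group> \
    --lustre-configuration 'DeploymentType=SCRATCH_2,DataCompressionType=LZ4,RootSquashConfiguration={RootSquash=1001:1001}' \
    --tags Key=Name,Value=nextflow-workdir-fsx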
Determining FSx storage and throughput capacity
The determinants of storage capacity are,
- Peak total size of all the files in the work directory.
- The maximum number of Nextflow tasks that will run in parallel. A larger number of tasks means a higher I/O throughput requirement, which for FSx for Lustre is tied to the storage capacity.
Here is a method for determining the storage capacity (a worked example follows the list),
- Deploy an FSx of 1 TB or more, or an EFS.
- Run the pipeline with the largest possible sample.
- Monitor the storage and throughput usage in the FSx/EFS console.
- Take the maximum value of the storage usage and multiply it by the number of samples that will be run in parallel.
- Take the maximum value of the throughput usage and multiply it by the number of samples that will be run in parallel. Very short-lived (< 5 min) peaks in the throughput usage can be ignored at this step.
- Add a 20% safety margin to the estimated storage and throughput.
- Choose a storage capacity which satisfies both the storage and throughput estimates.
- Run half of the maximum number of samples that you plan to run in parallel, monitor the storage and throughput usage and make sure that these statistics are in line with your estimate. Otherwise, revise your estimate.
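As a worked example with hypothetical numbers, suppose the trial run shows a peak of 400 GB of work directory data and 150 MB/s of throughput per sample, and 25 samples will be run in parallel,
# Hypothetical per-sample peaks observed during the trial run
PEAK_STORAGE_GB=400
PEAK_THROUGHPUT_MBPS=150
SAMPLES=25
echo "Storage estimate:    $(( PEAK_STORAGE_GB * SAMPLES * 120 / 100 )) GB"        # 12000 GB
echo "Throughput estimate: $(( PEAK_THROUGHPUT_MBPS * SAMPLES * 120 / 100 )) MB/s" # 4500 MB/s
At the time of writing, Scratch, SSD file systems provide a baseline throughput of roughly 200 MB/s per TiB of provisioned storage, so in this example the throughput estimate (about 22.5 TiB worth of capacity) rather than the storage estimate (about 12 TB) determines the capacity to provision, rounded up to the nearest valid FSx capacity increment.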
Requirement 4: FSx for Lustre Registration
Similar to the input and output buckets, the FSx for Lustre storage should also be registered in the MM™ Cloud OpCentre. To register the FSx for Lustre, click on Storage from inside the MM™ Cloud OpCentre and then click Register Storage. Select the Storage Type as Lustre and the Mount Point as /mnt/fsx (or your preferred mount point). Enter the name of the FSx for Lustre in the Name field. Set the URL to lustre://<DNS name>/<Mount name>. Both the DNS name and the Mount name are available on the FSx console page once the file system has been created. Leave the Mount Options blank.
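The DNS name and Mount name can also be retrieved with the AWS CLI. The query below assumes the file system was tagged with the name nextflow-workdir-fsx used earlier,
# Print the DNS name and Lustre mount name of the tagged file system
aws fsx describe-file-systems \
    --query "FileSystems[?Tags[?Value=='nextflow-workdir-fsx']].[DNSName,LustreConfiguration.MountName]" \
    --output text
The resulting URL then has the form lustre://fs-<id>.fsx.<region>.amazonaws.com/<mount name>.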
Pipeline launch
Step 1: MM™ Cloud Login
To start a pipeline, log in to MM™ Cloud with your user credentials.
float login \
-a $(sed -n 's/MMC_IP=\(.*\)/\1/p' ~/float/.secrets) \
-u $(sed -n 's/MMC_USER=\(.*\)/\1/p' ~/float/.secrets) \
-p $(sed -n 's/MMC_PASSWORD=\(.*\)/\1/p' ~/float/.secrets)
The command $(sed -n 's/MMC_PASSWORD=\(.*\)/\1/p' ~/float/.secrets) extracts the MM™ Cloud OpCentre password from the secrets file ~/float/.secrets. This avoids exposing the secret and also eliminates the need to retype it later. Once the login is successful, make sure that the password is also stored as a float secret with the name OPCENTER_PASSWORD. This can be checked and configured as,
float secrets ls # To list the secrets
float secret \
set OPCENTER_PASSWORD \
$(sed -n 's/MMC_PASSWORD=\(.*\)/\1/p' ~/float/.secrets) # To set the OPCENTER_PASSWORD if missing
Step 2: Nextflow Head Compute Instance
Launch a head compute instance from which the Nextflow pipeline will be run,
float submit -n <job_name> \
--template nextflow:jfs \
-c 2 -m 4 \
--containerInit https://mmce-data.s3.amazonaws.com/nextflow/container-init-nextflow-fsx-efs.sh \
--subnet subnet-<fsx_subnet_id> \
--securityGroup sg-<fsx_security_group> \
--storage <opcentre_input_data_bucket_name> \
--storage <opcentre_fsx_name>
Once the head instance is launched, it will appear by its name in the MM™ Cloud OpCentre Jobs console. Click on it to find its Public IP. Once its status changes to Executing, it will initialize the storage mounts and its runtime container. You can monitor the progress of the initialization process from the logs listed under the Attachments tab.
Once the head instance has fully initialized, its SSH key will be available as a secret in float secrets. The key can be identified by the Job ID. Store the SSH key in a secure place and use it to log in to the head instance.
float secrets ls # To list the secrets
float secret get <Head job ID>_SSHKEY > /path/to/ssh/key.pem
chmod 600 /path/to/ssh/key.pem
ssh -i /path/to/ssh/key.pem <MM™ Cloud username>@<Head job IP>
cd /mnt/fsx # Switch directory to FSx for Lustre
mkdir nextflow-test # Create a project specific directory
cd nextflow-test # Switch to the project specific directory
Step 3: Configuration
The following is a minimal configuration for successfully running Nextflow on MM™ Cloud. Create this file in the project work directory /mnt/fsx/nextflow-test.
plugins {
id 'nf-float'
}
process {
executor = 'float'
}
float {
address = '<MM™ Cloud IP address>'
commonExtra = [
'--vmPolicy [spotOnly=true]',
'--storage <Input data bucket name registered in OpCentre storage, e.g. my-input-bucket>',
'--subnet subnet-<ID of subnet in which the FSx for Lustre is deployed>',
'--securityGroup sg-<ID of the security group attached with FSx for Lustre>',
'--storage <FSx for Lustre storage name registered in OpCentre, e.g. nextflow-workdir-fsx>'
].join(' ')
}
The key components of this configuration are,
nf-float:
This plugin handles job submission and status monitoring. Full documentation is available on its GitHub repo.
commonExtra:
The commonExtra scope defines extra parameters for the float executor. See the nf-float GitHub page for documentation on how to correctly apply the configuration in different situations, and the float submit documentation for a comprehensive list of parameters. Here we have used,
- --vmPolicy: spotOnly=true tells float to only allow Spot instances for executing Nextflow processes.
- --storage: This parameter mounts the storage(s) registered in the OpCentre so that the input data bucket and the work directory FSx for Lustre are available to all the instances running Nextflow processes.
- --subnet: This tells float to launch the instances in a specific subnet.
- --securityGroup: This tells float to attach a security group to the compute instances so that they can access the FSx for Lustre.
float.address:
This IP address tells float where to find the OpCentre for job submission.
Step 4: Launch
tmux should be used so that the Nextflow head process can continue even if the user logs out or the SSH connection drops.
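As an illustrative launch, the commands below assume that the configuration from Step 3 was saved as nextflow.config in the project directory (so Nextflow loads it automatically); the pipeline name, sample sheet path and bucket names are placeholders, and any profile or pipeline-specific options should be added as required,
tmux new -s nextflow                      # start a named tmux session (reattach later with: tmux attach -t nextflow)
cd /mnt/fsx/nextflow-test                 # project directory on FSx for Lustre
nextflow run nf-core/rnaseq \
    -work-dir /mnt/fsx/nextflow-test/work \
    --input /my-input-bucket/samplesheet.csv \
    --outdir s3://my-output-bucket/results
Note that the input is read through the s3fs mount point of the input bucket, the work directory sits on the FSx for Lustre, and the outputs are published directly to the output bucket through Nextflow's built-in S3 support.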