Using Nextflow with MMCloud
Summary
Nextflow is a workflow manager used to run and manage computational pipelines such as those found in bioinformatics. Workflow managers make it easier to run complex analyses that consist of a series of tasks, sometimes interconnected, each of which may involve different software and dependencies.
Nextflow provides a framework for describing how a workflow must be executed and includes a CLI for issuing nextflow commands. The execution environment for each task is described using the Nextflow DSL (Domain-Specific Language). In Nextflow terminology, each task is assigned to an "executor," which is a complete environment for running that step in the analysis.
By attaching a MemVerge-developed plugin to a workflow, Nextflow can use MMCloud as an "executor." From the MMCloud point of view, each task that Nextflow assigns to it is an independent job, run just like any other batch job. The Nextflow user enjoys the benefits of using MMCloud, such as reduced costs and shorter execution times.
This document describes how to use Nextflow with MMCloud so that Nextflow can schedule one or more (or all) of the tasks in a workflow to run on MMCloud. Examples are used to demonstrate the principles; you can adapt and modify as needed to fit your workflow.
Configuration
A Nextflow workflow requires a working directory where temporary files are stored and where a process can access the output of an earlier step. When MMCloud is engaged as an executor, the OpCenter instantiates a Worker Node (a container running in its own virtual machine) for each step in the process pipeline. Every Worker Node and the Nextflow Host (the host where the nextflow binary is installed) must have access to the same working directory; for example, the working directory can be an NFS-mounted directory or an S3 bucket. The figure below shows a configuration where an NFS Server provides shared access to the working directory and also acts as a repository for input data and the final output.
The Nextflow configuration file describes the environment for each executor. To use MMCloud as an executor, the configuration file must contain definitions for:
- Nextflow plugin (source code and documentation are available here)
- Working directory (directory where the Nextflow Host and all the Worker Nodes have r/w access)
- IP address of the OpCenter
- Login credentials for the OpCenter (login credentials can also be provided using environment variables or the OpCenter secret manager)
- Location of the shared directory if using an NFS server (not needed if using S3)
The operation of Nextflow using an S3 bucket as the working directory is shown in the following figure.
Operation
The Nextflow job file (a file with extension .nf) describes the workflow and specifies the executor for each process. When the user submits a job using the nextflow run command (as shown in the figure), any process with executor defined as "float" is scheduled for the OpCenter. Combining information from the configuration file and the job file, the Nextflow plugin formulates a float submit command string and submits the job to the OpCenter. This procedure is repeated for every task in the workflow that has "float" as the executor. Every Worker Node mounts the same working directory so that the Nextflow Host and all the Worker Nodes read from, and write to, the same shared directory.
Note
CSPs impose limits on services instantiated by each account. In AWS, these limits are called "service quotas" and apply to every AWS service, generally on a region by region basis. Some Nextflow pipelines instantiate a large number of compute instances, a number large enough to exceed the AWS EC2 service quota. If this happens, increase your AWS EC2 service quota and rerun the pipeline.
Requirements
To use Nextflow with MMCloud, you need the following:
- MMCloud Carmel 2.0 release or later
- Running instance of OpCenter with valid license
- Linux virtual machine running in same VPC as OpCenter (call this the Nextflow Host)
- On the Nextflow Host:
- Java 11 or later release (the latest Long Term Support release is Java 17)
- MMCloud CLI binary. You can download it from the OpCenter.
- Nextflow
- Nextflow configuration file
- Nextflow job file
- (Optional) NFS Server to provide shared working directory. There are other possibilities; for example, the shared working directory can be hosted on the Nextflow Host or the shared working directory can be mounted directly from AWS S3.
Prepare the Nextflow Host
The Nextflow Host is a Linux virtual machine running in the same VPC as the OpCenter. If the Nextflow host is in a different VPC subnet, ensure that the Nextflow host can reach the OpCenter and that it can mount the file system from the NFS Server (if used).
All network communication among the OpCenter, the Nextflow Host, NFS Server (if used), and Worker Nodes must use private IP addresses. If the Nextflow Host uses an NFS-mounted file system as the working directory, ensure that any firewall rules allow access to port 2049 (the port used by NFS).
- Check the version of java installed on the Nextflow Host by entering:
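java -version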
- If needed, install Java 11 or later. Commercial users of Oracle Java need a subscription. Alternatively, you can install OpenJDK under an open-source license by entering (on a Red Hat-based Linux system):
sudo dnf install java-17-openjdk
- Install nextflow by entering:
sudo curl -s https://get.nextflow.io | bash
This installs nextflow in the current directory. The installation described here assumes that you install nextflow in your home directory. You can also create a directory for your nextflow installation, for example,
sudo mkdir ~/nextflow
- Check your nextflow installation by entering:
$ ./nextflow run hello
N E X T F L O W ~ version 23.04.2
Launching `https://github.com/nextflow-io/hello` [voluminous_liskov] DSL2 - revision: 1d71f857bb [master]
executor > local (4)
[13/1bb6ed] process > sayHello (3) [100%] 4 of 4 ✔
Bonjour world!
Ciao world!
Hola world!
Hello world!
If this job does not run, check the log called .nextflow.log.
- Upgrade to the latest version of Nextflow by entering ./nextflow self-update
- Download the OpCenter CLI binary for Linux hosts from the following URL:
https://<op_center_ip_address>/float
Replace <op_center_ip_address> with the public (if you are outside the VPC) or private (if you are inside the VPC) IP address for the OpCenter. You can click the link to download the CLI binary (called float) or you can enter the following.
wget https://<op_center_ip_address>/float --no-check-certificate
If you download the CLI binary to your local machine, move the file to the Nextflow Host.
- Make the CLI binary file executable and add the path to the CLI binary file to your PATH variable.
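For example, assuming the float binary was downloaded to your home directory:
chmod +x ~/float
export PATH=$HOME:$PATH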
Note
You can use the float submit --template nextflow:jfs command to create a Nextflow host with all the required software installed (including JuiceFS). Contact your MemVerge support team for additional details.
(Optional) Prepare the Working Directory Host
The Nextflow Host and the Worker Nodes must have access to a shared working directory. There are several ways to achieve this. In the example shown here, a separate Linux virtual machine (the NFS Server) is started in the same VPC as the OpCenter.
Alternatively, you can edit the Nextflow configuration file to automatically mount an S3 bucket as a filesystem. Instructions on how to do this are in the next section titled "Use S3 Bucket as Filesystem."
You can obtain instructions on turning a generic CentOS-based server into an NFS server from this link. NFS uses port 2049 for connections, so ensure that any firewall rules allow access to port 2049. If the Working Directory Host is in a different VPC subnet, ensure that it can reach the Nextflow host and Worker Nodes. Set the subnet mask in /etc/exports to allow the Nextflow Host and Worker Nodes to mount file systems from the Working Directory Host.
Example:
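A typical /etc/exports entry, assuming the shared directory is /mnt/memverge/shared and the VPC subnet is 10.0.0.0/16 (substitute your own path and subnet):
/mnt/memverge/shared 10.0.0.0/16(rw,sync,no_root_squash)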
- Log in to the NFS Server and create the shared working directory.
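For example, using the directory name used elsewhere in this document:
sudo mkdir -p /mnt/memverge/shared
sudo chmod 777 /mnt/memverge/shared    # permissive for illustration; tighten permissions as needed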
- Log in to the Nextflow Host and mount the shared working directory (use the NFS Server's private IP address). Use df to check that the volume mounted successfully.
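A minimal example, assuming the shared directory is /mnt/memverge/shared and the NFS Server's private IP address is 10.0.0.5 (hypothetical):
sudo mkdir -p /mnt/memverge/shared
sudo mount -t nfs 10.0.0.5:/mnt/memverge/shared /mnt/memverge/shared
df -h /mnt/memverge/shared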
(Optional) Use S3 Bucket as Filesystem
Some workflows initiate hundreds or even thousands of tasks simultaneously. If all these tasks access the NFS server at the same time, a bottleneck can occur. For these workflows, it can help to use an S3 bucket as the working directory.
Note
When used with the appropriate configuration file, the Nextflow Host and the Worker Nodes automatically mount the S3 bucket as a Linux file system.
Complete the following steps.
- Log in to your AWS Management Console.
- Open the Amazon S3 console.
- From the left-hand panel, select Buckets.
- On the right-hand side, click Create bucket and follow the instructions.
You must choose a bucket name (nfshareddir is used as a placeholder in this document) that is unique across all standard AWS regions (the China and AWS GovCloud partitions are separate). Buckets are accessible from any region.
- On the navigation bar, all the way to the right, click your username and go to Security credentials.
- Scroll down the page to the section called Access keys and click Create access key.
- Download the csv file.
The csv file has two entries, one called Access key ID and one called Secret access key. You enter these in the Nextflow configuration file.
(Optional) Use Distributed File System
While NFS and S3 are viable options for providing the shared working directory, unacceptable performance may occur when running pipelines that generate I/O at high volume or high throughput. For these pipelines, the use of a high-performance distributed file system is recommended. OpCenter supports two distributed file systems.
- Fusion
Fusion is a POSIX-compliant distributed file system optimized for Nextflow pipelines. Fusion requires the use of Wave containers. A description of how to use the Fusion file system with MMCloud is available here.
- JuiceFS
JuiceFS is an open-source distributed file system that provides an API to access a POSIX-compliant file system built on top of a range of cloud storage services. If you use the float submit --template nextflow:jfs option to create a Nextflow host, the JuiceFS environment is automatically created.
Prepare the Configuration File
Nextflow configuration files can be extensive — they can include profiles for many executors. Create a simple configuration for using MMCloud as the sole executor by following these steps.
In the directory where you installed nextflow, create a file called nextflownfs.config. When a parameter requires an IP address, use a private IP address. The following configuration file uses the NFS server as the shared working directory.
$ cat nextflownfs.config
plugins {
id 'nf-float'
}
workDir = '/mnt/memverge/shared'
podman.registry ='quay.io'
executor {
queueSize = 100
}
float {
address = 'OPCENTER_IP_ADDRESS'
username = 'USERNAME'
password = 'PASSWORD'
nfs = 'nfs://NFS_SERVER_IP_ADDRESS/mnt/memverge/shared'
}
process {
executor = 'float'
container = 'docker.io/cactus'
}
Replace the following (keep the quotation marks).
- USERNAME: username to log in to the OpCenter. If absent, the value of the environment variable MMC_USERNAME is used.
- PASSWORD: password to log in to the OpCenter. If absent, the value of the environment variable MMC_PASSWORD is used.
- OPCENTER_IP_ADDRESS: private IP address of the OpCenter. If absent, the value of the environment variable MMC_ADDRESS is used. If using multiple OpCenters, separate entries with a comma.
- NFS_SERVER_IP_ADDRESS: private IP address of the NFS server.
Nextflow secrets can supply values for USERNAME and PASSWORD as follows.
- Set the values.
- Insert them in the Nextflow configuration file.
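A sketch of both steps, assuming you name the secrets MMC_USERNAME and MMC_PASSWORD (any names work as long as the configuration file references them consistently):
./nextflow secrets set MMC_USERNAME 'admin'
./nextflow secrets set MMC_PASSWORD 'OPCENTER_PASSWORD'
Then reference the secrets in the float section of the configuration file:
float {
    address = 'OPCENTER_IP_ADDRESS'
    username = secrets.MMC_USERNAME
    password = secrets.MMC_PASSWORD
    ...
}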
Explanations of the parameters in the configuration file follow.
- plugins section
The MemVerge plugin called nf-float is included in the Nextflow Plugins index. This means that the reference to nf-float resolves to "nf-float version: latest" and Nextflow automatically downloads the latest version of the nf-float plugin. The nf-float plugin is updated frequently, and configuration parameters change or new ones are added. See the nf-float README on GitHub for the latest details.
- workDir specifies the path to the shared directory if the directory is NFS-mounted. If an S3 bucket is used, workDir specifies the bucket and the path to the folder in the format s3://bucket_name/folder
- podman.registry specifies the default container registry (the choices are usually quay.io or docker.io). If docker.io is specified, then /memverge/ is prepended to the container name. For example, 'cactus' becomes 'docker.io/memverge/cactus'.
- executor section
queueSize limits the maximum number of concurrent requests sent to the OpCenter (default 100).
- float section (all options are listed below)
- address: private IP address of the OpCenter. Specify multiple OpCenters using the format 'A.B.C.D', 'E.F.G.H' or 'A.B.C.D, E.F.G.H'.
- username, password: credentials for logging in to the OpCenter.
- nfs: specifies (only if using NFS) the location from which the shared directory is mounted. Do not use with S3.
- The string following commonExtra is appended to the float command that is submitted to the OpCenter; you can use any float command option here.
- migratePolicy: the migration policy used by WaveRider, specified as a map. Refer to the CLI command reference for the list of available options.
- vmPolicy: the VM creation policy, specified as a map. Refer to the CLI command reference for the list of available options.
- timeFactor: a number (default value is 1) that multiplies the time supplied by the Nextflow task. Use it to prevent task timeouts.
- maxCpuFactor: a number (default value is 4) used to calculate the maximum number of CPU cores for the instance, namely, maximum number of CPU cores is set to maxCpuFactor multiplied by the cpus specified for the task.
- maxMemoryFactor: a number (default value is 4) used to calculate the maximum memory for the instance, namely, maximum memory is set to maxMemoryFactor multiplied by the memory specified for the task.
- commonExtra: a string that allows the user to specify additional options to the float submit command. This string is appended to every float submit command.
- process section
The nextflow language defines process "directives," which are optional parameters that influence the execution environment for tasks in the nextflow job file. If the nextflow job file does not specify a value for a process directive, the default value is used. Use the configuration file to override the nextflow defaults.
If a value for container is not specified and the task does not specify a container value, the job fails.
If scratch is set to true, process execution occurs in a temporary folder that is local to the execution node. For MMCloud, the execution node is a container running in a virtual machine. The process executes in the container's root volume (default 6GB). Increase the root volume size (to create space for the temporary folder) by including the extra directive.
For example,
process {
    scratch = true
    extra = '--imageVolSize 120'
    ...
}
- aws section
Specifies the credentials for accessing the S3 buckets, if used.
The following configuration file uses the S3 bucket as the shared working directory. The Nextflow host and all worker nodes automatically mount the S3 bucket as a file system.
$ cat nextflows3.config
plugins {
id 'nf-float'
}
workDir = 's3://[S3BUCKET]'
podman.registry = 'quay.io'
executor {
queueSize = 100
}
float {
address = 'OPCENTER_IP_ADDRESS'
username = 'USERNAME'
password = 'PASSWORD'
}
process {
executor = 'float'
}
aws {
accessKey = 'ACCESS_KEY'
secretKey = 'SECRET_KEY'
region = 'REGION'
}
Replace the following (keep the quotation marks).
- S3BUCKET: name of the S3 bucket to use as the shared working directory (nfshareddir is used as a placeholder in this document)
- OPCENTER_IP_ADDRESS: as described previously
- USERNAME: as described previously
- PASSWORD: as described previously
- ACCESS_KEY: access key ID for your AWS account. See below for options for providing the access key ID.
- SECRET_KEY: secret access key for your AWS account. See below for options for providing the secret access key.
- REGION: region in which the S3 bucket is located.
Provide the access key ID and secret access key using one of the following methods.
- Enter the access key ID and secret access key as cleartext
- Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
- Set the environment variables AWS_ACCESS_KEY and AWS_SECRET_KEY
- Use the AWS CLI command aws configure to populate the default profile in the AWS credentials file located at ~/.aws/credentials
- Use the temporary AWS credentials provided by an IAM instance role. See the IAM Roles documentation for details.
Prepare the Nextflow Job File
The nextflow job file describes the workflow and how the workflow must be executed. To demonstrate how a simple workflow executes, follow these steps (example is adapted from nextflow.io.)
- Create a sample fasta file called sample.fa and place it in the shared working directory, for example, in /mnt/memverge/shared if you are using the NFS server, or in s3://nfshareddir if you are using an S3 bucket (you can use any S3 bucket where you have r/w access; it doesn't have to be the shared directory specified in the nextflow configuration file).
$ cat /mnt/memverge/shared/sample.fa
>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK
- Build a workflow that splits the fasta sequences into separate files and reverses the order by creating a nextflow job file called splitfanfs.nf.
If you are using the S3 bucket, replace params.in = "/mnt/memverge/shared/sample.fa" with params.in = "s3://nfshareddir/sample.fa" (use the name of the S3 bucket where you placed the sample fasta file if you did not place the file in nfshareddir).
$ cat splitfanfs.nf
#!/usr/bin/env nextflow

params.in = "/mnt/memverge/shared/sample.fa"

/*
 * Split a fasta file into multiple files
 */
process splitSequences {
    executor = 'float'
    container = 'docker.io/memverge/cactus'
    cpus = '4'
    memory = '8 GB'
    extra = '--vmPolicy [spotOnly=true]'

    input:
    path 'input.fa'

    output:
    path 'seq_*'

    """
    awk '/^>/{f="seq_"++d} {print > f}' < input.fa
    """
}

/*
 * Reverse the sequences
 */
process reverse {
    executor = 'float'
    container = 'docker.io/memverge/cactus'
    cpus = '4'
    memory = '8 GB'

    input:
    path x

    output:
    stdout

    """
    cat $x | rev
    """
}

/*
 * Define the workflow
 */
workflow {
    splitSequences(params.in) \
        | reverse \
        | view
}
The string following extra is combined with the string following commonExtra in the config file and appended to the float command submitted to the OpCenter. The string value shown here after extra is an example: use any float command option. The extra setting overrides the commonExtra setting if they are in conflict.
Run a Workflow using Nextflow
Run a simple workflow by entering:
$ ./nextflow run splitfanfs.nf -c nextflownfs.config -cache false
N E X T F L O W ~ version 23.04.2
Launching `splitfanfs.nf` [prickly_watson] DSL2 - revision: dadefd0d0b
executor > float (2)
[67/59f2c6] process > splitSequences [100%] 1 of 1 ✔
[f6/32fab1] process > reverse [100%] 1 of 1 ✔
0qes>
FKEIKKVDQAQDTRYVLCVLDDTVKICLNGDVHRYKLVVRVKMPDALYLKEAARSFEEWTQF
9qes>
KKLDPMNEARYQVCKVNDSMKLLIHGKSHVYKMTYRCNQPNAMYMNEAAIEFDEWNK
01qes>
KEMKKADQAQDTKFKLCEHNDTVKLVLKGECHRYKVVYRTTDPHNRFLEVSKSVFEDWSDF
1qes>
MLTFFINNLKEMKKAEQAQDTKFKLCEKNDTVKL EMLRMLQSHFKEIKKVDQAQDTRYLLCVVDDTVKICLNGDCHRYKLVVRVKMPDAQYLKEAARTFEEWTRYK
2qes>
KGHLKEVKKVDQAQDTKYQLCVADDTVKMCLNGDCHRYKLVVRVKMPDTLYLKEAARAFEEWTQYEE
3qes>
KVDQAQDTKYQLCVSNDTVKICLNGDCHRYKLVVRVKMPDTLYLKEVARSFEEWVQYM
4qes>
LLSHELAFNDKQVGFLRMEYSVVSNDTVKICLNGDCHRYKLVVRVKMPDTLYLKEVARSFEE
5qes>
RMLLTTLKEIKKVDQAMDTVYKLVTHNDTLKVVLKHDVHRYKTCMRCKMPDELYLVEAAKAFEEWS
6qes>
ISRLLTSSLKELKKVDQLQNTSYQLCVVDDTLKLVLEGKTHNYKTVFRCKEPNASHLREAAKAFEEWNTF
7qes>
FFINNLKEMKKADQAQDTKFKLCERDDTVKLVLKGECHRYKMVYRTANPDGRFLQVSREVFEEWS
8qes>
MLTFFINNLKEMKKAEQAQDTKFKLCEKNDTVKLVLKGDCHRYKMVYRTSEPDARFLQVSRDVFEDWS
Completed at: 02-Aug-2023 20:04:20
Duration : 9m 23s
CPU hours : (a few seconds)
Succeeded : 2
This nextflow job file defines two processes that use MMCloud as the executor. Using the float squeue command or the OpCenter GUI, you can view the two processes executed by the OpCenter.
$ float squeue -f image='cactus' -f status='Completed'
+--------+------------------+-------+-----------+-------------+----------+------------+
| ID | NAME | USER | STATUS |SUBMIT TIME | DURATION | COST |
+--------+------------------+-------+-----------+-------------+----------+------------+
| O7w... | cactus-m5.xlarge | admin | Completed | ...9:54:59Z | 2m54s | 0.0031 USD |
| L17... | cactus-m5.xlarge | admin | Completed | ...9:58:22Z | 5m21s | 0.0061 USD |
+--------+------------------+-------+-----------+-------------+----------+------------+
(output edited for brevity)
Run an RNA Sequencing Workflow
This example (adapted from nextflow.io) uses publicly available data. Get the data here. For simple configuration, place this data in the shared working directory in a folder called /mnt/memverge/shared/nextflowtest/data/ggal if you are following the NFS example or in s3://nfshareddir/ggal if you are following the S3 bucket example. In general, input data is not stored in the shared working directory; for example, input data is often located in a publicly accessible S3 bucket.
- Build a nextflow job file called rnaseqnfs.nf with the following content if you are following the NFS server example. If you are following the S3 example, create a folder called results in s3://nfshareddir and replace /mnt/memverge/shared/nextflowtest/data with s3://nfshareddir and replace /mnt/memverge/shared/results with s3://nfshareddir/results.
The rnaseq-nf image is not a "built-in" image in the OpCenter App Library. Specifying the URI for the rnaseq-nf image in the job file causes the OpCenter to pull the latest version of the image from the default registry.
$ cat rnaseqnfs.nf
#!/usr/bin/env nextflow

params.reads = "/mnt/memverge/shared/nextflowtest/data/ggal/ggal_gut_{1,2}.fq"
params.transcriptome = "/mnt/memverge/shared/nextflowtest/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
params.outdir = "/mnt/memverge/shared/results"

workflow {
    read_pairs_ch = channel.fromFilePairs( params.reads, checkIfExists: true )
    INDEX(params.transcriptome)
    FASTQC(read_pairs_ch)
    QUANT(INDEX.out, read_pairs_ch)
}

process INDEX {
    executor = 'float'
    container = 'nextflow/rnaseq-nf'
    cpus = '4'
    memory = '16 GB'
    tag "$transcriptome.simpleName"

    input:
    path transcriptome

    output:
    path 'index'

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}

process FASTQC {
    executor = 'float'
    container = 'nextflow/rnaseq-nf'
    cpus = '4'
    memory = '16 GB'
    tag "FASTQC on $sample_id"
    publishDir params.outdir

    input:
    tuple val(sample_id), path(reads)

    output:
    path "fastqc_${sample_id}_logs"

    script:
    """
    mkdir fastqc_${sample_id}_logs
    fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}
    """
}

process QUANT {
    executor = 'float'
    container = 'nextflow/rnaseq-nf'
    cpus = '4'
    memory = '16 GB'
    tag "$pair_id"
    publishDir params.outdir

    input:
    path index
    tuple val(pair_id), path(reads)

    output:
    path pair_id

    script:
    """
    salmon quant --threads $task.cpus --libType=U -i $index -1 ${reads[0]} -2 ${reads[1]} -o $pair_id
    """
}
- Run the workflow by entering:
$ ./nextflow run rnaseqnfs.nf -c nextflownfs.config -cache false
N E X T F L O W ~ version 23.04.2
Launching `rnaseqnfs.nf` [soggy_mccarthy] DSL2 - revision: fca2fdc7d3
executor > float (3)
[10/7c2cc3] process > INDEX (ggal_1_48850000_49020000) [100%] 1 of 1 ✔
[d6/8c9f6b] process > FASTQC (FASTQC on ggal_gut) [100%] 1 of 1 ✔
[2f/378946] process > QUANT (ggal_gut) [100%] 1 of 1 ✔
Completed at: 02-Aug-2023 21:38:56
Duration : 6m 16s
CPU hours : 0.2
Succeeded : 3
This nextflow job file defines three processes that use MMCloud as the executor. You can confirm that these processes execute on MMCloud.
$ float squeue -f image='rnaseq-nf-np6l7z' -f status='Completed'
+-----------+------------+-------+-----------+-----------+---------+----------+
| ID        | NAME       | USER  | STATUS    |SUBMIT TIME| DURATION| COST     |
+-----------+------------+-------+-----------+-----------+---------+----------+
| IvwWWY... |rnaseq-nf...| admin | Completed |21:52:33Z  | 2m38s   |0.0051 USD|
| lakXQg... |rnaseq-nf...| admin | Completed |21:52:35Z  | 2m39s   |0.0052 USD|
| SyVSmL... |rnaseq-nf...| admin | Completed |21:55:08Z  | 2m28s   |0.0048 USD|
+-----------+------------+-------+-----------+-----------+---------+----------+
(output edited for brevity)
- View the output at /mnt/memverge/shared/results (or s3://nfshareddir/results).
Some of the output is in HTML format, for example:
Integration with Seqera Platform
Seqera Platform is a product from Seqera Labs that is used to launch, monitor, and manage computational pipelines from a web interface. You can also launch a Nextflow pipeline using the CLI on the Nextflow Host and monitor the progress in a cloud-hosted Seqera Platform instance provided by Seqera Labs. Instructions are available here.
- Sign in to Seqera Platform. If you do not have an account, follow the instructions to register.
- Create an access token using the procedure described here. Copy the access token to your clipboard.
- From a terminal on the Nextflow Host, enter:
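For example, using the standard Tower access token environment variable:
export TOWER_ACCESS_TOKEN=eyxxxxxxxxxxxxxxxQ1ZTE=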
Replace eyxxxxxxxxxxxxxxxQ1ZTE= with the access token you copied to the clipboard.
- Launch your Nextflow pipeline with the -with-tower flag. For example:
$ nextflow run rnaseqs3.nf -c nextflows3.config -cache false -with-tower
N E X T F L O W ~ version 23.04.2
Launching `rnaseqs3.nf` [big_ramanujan] DSL2 - revision: 9c7f478123
Monitor the execution with Nextflow Tower using this URL: https://tower.nf/user/user_name/watch/1rjpVakrhQ3wAf
executor > float (3)
[31/94c152] process > INDEX (ggal_1_48850000_49020000) [100%] 1 of 1 ✔
[1e/3c939b] process > FASTQC (FASTQC on ggal_gut) [100%] 1 of 1 ✔
[e7/6ceef4] process > QUANT (ggal_gut) [100%] 1 of 1 ✔
Completed at: 02-Aug-2023 23:35:06
Duration : 5m 34s
CPU hours : (a few seconds)
Succeeded : 3
- Open a browser and go to the URL.
Running the Nextflow Host as an MMCloud Job
Nextflow requires a Nextflow host to run the Nextflow executable. There are multiple ways of creating a Nextflow host:
- Standalone Linux server that you instantiate
- A containerized application that runs as an MMCloud job until you cancel the job (call this the "persistent Nextflow host")
- A containerized application that runs as an MMCloud job for the duration of the Nextflow pipeline and is then automatically canceled (call this the "transient Nextflow host")
MMCloud includes a template for creating a persistent Nextflow host and a container image for creating a transient Nextflow host. Both these solutions automatically create a JuiceFS file system as the shared work space required by Nextflow. With a configuration change, Fusion can be used instead of JuiceFS.
Persistent Nextflow Host in AWS
To deploy a persistent Nextflow host with JuiceFS enabled in AWS, complete the following steps.
- Log in to your AWS Management console
- Create a security group to allow inbound access to port 6868 (the port used by JuiceFS). Copy the security group ID (it is a string that looks like sg-0054f1eaadec3bc76).
- Create an S3 bucket to support JuiceFS. Copy the URL for the S3 bucket (for example, https://juicyfsforcedric.s3.amazonaws.com).
Note
Do not include any folders in the S3 bucket URL.
- Start the Nextflow host by entering the following command.
float submit --template nextflow:jfs -n JOBNAME -e BUCKET=BUCKETURL --migratePolicy [disable=true] --securityGroup SG_ID
Replace:
- JOBNAME: name to associate with job
- BUCKETURL: URL to locate S3 bucket
- SG_ID: security group ID
Example:
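For example, using the hypothetical job name nfhost together with the sample bucket URL and security group ID shown earlier in this section:
float submit --template nextflow:jfs -n nfhost -e BUCKET=https://juicyfsforcedric.s3.amazonaws.com --migratePolicy [disable=true] --securityGroup sg-0054f1eaadec3bc76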
- If security credentials are required to access the S3 bucket, add the following options to the float submit command.
-e BUCKET_ACCESS_KEY={secret:KEY_NAME} -e BUCKET_SECRET_KEY={secret:SECRET_NAME}
and replace:
- KEY_NAME: name associated with access key ID
- SECRET_NAME: name associated with the secret access key
- Keep entering float list until the status of the job with the name JOBNAME changes to Executing. Copy the ID associated with this job.
- Retrieve the ssh key for this host by entering the following command.
Replace JOB_ID with the job ID associated with this job (prepend it to '_SSHKEY').
Example:
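Assuming the ssh key is stored as an OpCenter secret named JOB_ID_SSHKEY and you want to save it to a file called nextflow_host.key (both assumptions), a command of the following form retrieves it:
float secret get JOB_ID_SSHKEY > nextflow_host.key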
- Change the permissions on the ssh key file by entering the following.
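For example, assuming the key was saved to a file named nextflow_host.key:
chmod 600 nextflow_host.key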
- Establish an ssh session with the Nextflow host by entering the following (see the example after this list).
Replace:
- USER: username of the user who submitted the job to create the Nextflow host. If admin submitted the job, use nextflow as the username.
- NEXTFLOW_HOST_IP: public IP address of the Nextflow host.
- Check that you are in the correct working directory, that the environment variables are set, and that the configuration template is available.
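A minimal form of the ssh command from the step above, assuming the key file is named nextflow_host.key:
ssh -i nextflow_host.key USER@NEXTFLOW_HOST_IP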
- Make a copy of the template file by entering the following.
- Edit the config file as follows.
# cat mmcloud.config
plugins {
    id 'nf-float'
}
workDir = '/mnt/jfs/nextflow'
process {
    executor = 'float'
    errorStrategy = 'retry'
    extra = ' --dataVolume [opts=" --cache-dir /mnt/jfs_cache "]jfs://NEXTFLOW_HOST_PRIVATE_IP:6868/1:/mnt/jfs --dataVolume [size=120]:/mnt/jfs_cache'
}
podman.registry = 'quay.io'
float {
    address = 'OPCENTER_PRIVATE_IP:443'
    username = 'USERNAME'
    password = 'PASSWORD'
}
// AWS access info if needed
aws {
    client {
        maxConnections = 20
        connectionTimeout = 300000
    }
    /*
    accessKey = 'BUCKET_ACCESS_KEY'
    secretKey = 'BUCKET_SECRET_KEY'
    */
}
Replace:
- NEXTFLOW_HOST_PRIVATE_IP: private IP address of the Nextflow host.
- OPCENTER_PRIVATE_IP: private IP address of the OpCenter.
- USERNAME and PASSWORD: credentials to log in to the OpCenter.
- If needed, uncomment the block containing the S3 bucket credentials and insert values for BUCKET_ACCESS_KEY and BUCKET_SECRET_KEY.
You are now ready to submit a Nextflow pipeline following the usual procedure.
Note
Upon completion, each Nextflow pipeline leaves a working directory and other related files and directories in the JuiceFS file system, which maps to many small data chunks in the specified S3 bucket. When the Nextflow host is deleted, these data chunks remain in the S3 bucket, but are not readable. It is recommended that you periodically delete the working directory and related files and directories. Delete all files and directories before deleting the Nextflow host.
Example: Running an nf-core/sarek pipeline
- Sign in to Nextflow Tower. If you do not have an account, follow the instructions to register.
- Create an access token using the procedure described here. Copy the access token to your clipboard.
- Start a tmux session by entering the following.
Replace SESSION_NAME with the name to associate with the tmux session.
Example:
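For example, using a hypothetical session name:
tmux new -s sarek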
Note
If the ssh session disconnects, re-establish the connection and reattach to the tmux session by entering the following.
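For example, assuming the session was named sarek:
tmux attach -t sarek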
- At the terminal prompt, enter:
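For example, as in the Seqera Platform section earlier in this document:
export TOWER_ACCESS_TOKEN=eyxxxxxxxxxxxxxxxQ1ZTE=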
where eyxxxxxxxxxxxxxxxQ1ZTE= is the access token you copied to the clipboard.
- Run the pipeline by entering the following command.
nextflow run nf-core/sarek -c mmcloud.config -profile test_full --outdir 's3://OUTPUT_BUCKET' -cache false -with-tower
Replace OUTPUT_BUCKET with the S3 bucket (or S3 bucket/folder) to which output is written (you must have rw access to this bucket).
- Open a browser and go to the Tower monitoring console.
- Click the Runs tab and select your job.
- (Optional) When the pipeline completes, delete the working directory and related files.
Example:
Before deleting the Nextflow host, delete all files in the JuiceFS file system and then unmount the JuiceFS file system by entering the following commands.
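A minimal cleanup sketch, assuming the JuiceFS file system is mounted at /mnt/jfs and the Nextflow working directory is /mnt/jfs/nextflow (adjust the paths to match your deployment):
rm -rf /mnt/jfs/nextflow
juicefs umount /mnt/jfs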
Transient Nextflow Host in AWS
To deploy a transient Nextflow host with JuiceFS enabled in AWS, complete the following steps.
- Log in to your AWS Management console.
- Create a security group to allow inbound access to port 6868 (the port used by JuiceFS). Copy the security group ID (it is a string that looks like sg-0054f1eaadec3bc76).
- Create an S3 bucket to support JuiceFS.
- On your local machine, create a directory to act as the home directory for your Nextflow pipeline, and then cd to this directory.
- Download the host-init script from MemVerge's public repository by entering the following command (you don't need to edit this script, but you use it later).
wget https://mmce-data.s3.amazonaws.com/juiceflow/v1/aws/transient_JFS_AWS.sh
- Download the job submit script from MemVerge's public repository by entering the following command (you need to edit this script).
wget https://mmce-data.s3.amazonaws.com/juiceflow/v1/aws/job_submit_AWS.sh
- Edit job_submit_AWS.sh to customize it for your Nextflow pipeline. Here is an example that runs a simple "Hello World" pipeline.
Note
The transient Nextflow host runs inside a container in a worker node. The job_submit_AWS.sh script executes a nextflow run command with its associated Nextflow script and configuration files. The Nextflow script and configuration files must be accessible inside the container. One way to accomplish this is to copy the Nextflow script and configuration files from an S3 bucket to a local volume mounted by the container. In the example shown, the files are embedded in job_submit_AWS.sh as here documents
.#!/bin/bash # ---- User Configuration Section ---- # These configurations must be set by the user before running the script. # ---- Optional Configuration Section ---- # These configurations are optional and can be customized as needed. # JFS (JuiceFS) Private IP: Retrieved from the WORKER_ADDR environment variable. jfs_private_ip=$(echo $WORKER_ADDR) # Work Directory: Defines the root directory for working files. Optional suffix can be added. workDir_suffix='' workDir='/mnt/jfs/'$workDir_suffix mkdir -p $workDir # Ensures the working directory exists. cd $workDir # Changes to the working directory. export NXF_HOME=$workDir # Sets the NXF_HOME environment variable to the working directory. # ---- Nextflow Configuration File Creation ---- # This section creates a Nextflow configuration file with various settings for the pipeline execution. outbucket=$(echo $OUTBUCKET) # Use cat to create or overwrite the mmc.config file with the desired Nextflow configurations. # NOTE: S3 keys and OpCenter information appended to the end of the config file. No need to add them now # Modify the vmPolicy parameters as needed cat > mmc.config << EOF // enable nf-float plugin. plugins { id 'nf-float' } // Process settings: Executor, error strategy, and resource allocation specifics. process { executor = 'float' errorStrategy = 'retry' extra = '--dataVolume [opts=" --cache-dir /mnt/jfs_cache "]jfs://${jfs_private_ip}:6868/1:/mnt/jfs --dataVolume [size=120]:/mnt/jfs_cache --vmPolicy [spotOnly=true,retryLimit=10,retryInterval=300s]' } // Directories for Nextflow execution. workDir = '${workDir}' launchDir = '${workDir}' EOF cat > hw.nf << EOF #!/usr/bin/env nextflow process sayHello { container = 'docker.io/memverge/cactus' cpus = '4' memory = '8 GB' publishDir '${outbucket}/hwout', mode: 'copy', overwrite: true output: path 'sequences.txt' """ echo 'Hello World! This is a test of JuiceFlow using a transient head node.' """ } workflow { sayHello() } EOF # ---- Data Preparation ---- # Use this section to copy essential files from S3 to the working directory. # For example, copy the sample sheet and params.yml from S3 to the current working directory. # aws s3 cp s3://nextflow-input/samplesheet.csv . # aws s3 cp s3://nextflow-input/scripts/params.yml . # Copy your nextflow job file (with extension nf) into the container (for example, from S3) # The example shown uses a here file to create a simple hello world job # ---- Nextflow Command Setup ---- # Important: The -c option appends the mmc config file and soft overrides the nextflow configuration. # Assembles the Nextflow command with all necessary options and parameters. # This example uses a simple hello world job nextflow_command='nextflow run hw.nf \ --outdir $outbucket \ -c mmc.config ' # ------------------------------------- # ---- DO NOT EDIT BELOW THIS LINE ---- # ------------------------------------- # The following section contains functions and commands that should not be modified by the user. function install_float { # Install float local address=$(echo "$FLOAT_ADDR" | cut -d':' -f1) wget https://$address/float --no-check-certificate --quiet chmod +x float } function get_secret { input_string=$1 local address=$(echo "$FLOAT_ADDR" | cut -d':' -f1) secret_value=$(./float secret get $input_string -a $address) if [[ $? 
-eq 0 ]]; then # Have this secret, will use the secret value echo $secret_value return else # Don't have this secret, will still use the input string echo $1 fi } function remove_old_metadata () { echo $(date): "First finding and removing old metadata..." if [[ $BUCKET == *"amazonaws.com"* ]]; then # If default `amazonaws.com` endpoint url S3_MOUNT=s3://$(echo $BUCKET | sed 's:.*/::' | awk -F'[/.]s3.' '{print $1}') else # If no 'amazonaws.com,' the bucket is using a custom endpoint local bucket_name=$(echo $BUCKET | sed 's:.*/::' | awk -F'[/.]s3.' '{print $1}') S3_MOUNT="--endpoint-url $(echo "${BUCKET//$bucket_name.}") s3://$bucket_name" fi # If a previous job id was given, we use that as the old metadata if [[ ! -z $PREVIOUS_JOB_ID ]]; then echo $(date): "Previous job id $PREVIOUS_JOB_ID specified. Looking for metadata file in bucket..." FOUND_METADATA=$(aws s3 ls $S3_MOUNT | grep "$PREVIOUS_JOB_ID.meta.json.gz" | awk '{print $4}') fi if [[ -z "$FOUND_METADATA" ]]; then # If no previous job id was given, there is no old metadata to remove. echo $(date): "No previous metadata dump found. Continuing with dumping current JuiceFs" else echo $(date): "Previous metadata dump found! Removing $FOUND_METADATA" aws s3 rm $S3_MOUNT/$FOUND_METADATA echo $(date): "Previous metadata $FOUND_METADATA removed" fi } function dump_and_cp_metadata() { echo $(date): "Attempting to dump JuiceFS data" if [[ -z "$FOUND_METADATA" ]]; then # If no previous metadata was found, use the current job id juicefs dump redis://$(echo $WORKER_ADDR):6868/1 $(echo $FLOAT_JOB_ID).meta.json.gz --keep-secret-key echo $(date): "JuiceFS metadata $FLOAT_JOB_ID.meta.json.gz created. Copying to JuiceFS Bucket" aws s3 cp "$(echo $FLOAT_JOB_ID).meta.json.gz" $S3_MOUNT else # If previous metadata was found, use the id of the previous metadata # This means for all jobs that use the same mount, their id will always be their first job id metadata_name=$PREVIOUS_JOB_ID juicefs dump redis://$(echo $WORKER_ADDR):6868/1 $(echo $metadata_name).meta.json.gz --keep-secret-key echo $(date): "JuiceFS metadata $metadata_name.meta.json.gz created. Copying to JuiceFS Bucket" aws s3 cp "$(echo $metadata_name).meta.json.gz" $S3_MOUNT fi echo $(date): "Copying to JuiceFS Bucket complete!" } function copy_nextflow_log() { echo $(date): "Copying .nextflow.log to bucket.." if [[ ! -z $PREVIOUS_JOB_ID ]]; then aws s3 cp ".nextflow.log" $S3_MOUNT/$PREVIOUS_JOB_ID.nextflow.log echo $(date): "Copying .nextflow.log complete! You can find it with aws s3 ls $S3_MOUNT/$PREVIOUS_JOB_ID.nextflow.log" else aws s3 cp ".nextflow.log" $S3_MOUNT/$(echo $FLOAT_JOB_ID).nextflow.log echo $(date): "Copying .nextflow.log complete! You can find it with aws s3 ls $S3_MOUNT/$(echo $FLOAT_JOB_ID).nextflow.log" fi } # Variables S3_MOUNT="" FOUND_METADATA="" # Functions pre-Nextflow run # AWS S3 Access and Secret Keys: For accessing S3 buckets. install_float access_key=$(get_secret AWS_BUCKET_ACCESS_KEY) secret_key=$(get_secret AWS_BUCKET_SECRET_KEY) export AWS_ACCESS_KEY_ID=$access_key export AWS_SECRET_ACCESS_KEY=$secret_key opcenter_ip_address=$(get_secret OPCENTER_IP_ADDRESS) opcenter_username=$(get_secret OPCENTER_USERNAME) opcenter_password=$(get_secret OPCENTER_PASSWORD) # Append to config file cat <<EOT >> mmc.config // OpCenter connection settings. float { address = '${opcenter_ip_address}' username = '${opcenter_username}' password = '${opcenter_password}' } // AWS S3 Client configuration. 
aws { client { maxConnections = 20 connectionTimeout = 300000 } accessKey = '${access_key}' secretKey = '${secret_key}' } EOT # Create side script to tag head node - exits when properly tagged cat > tag_nextflow_head.sh << EOF #!/bin/bash runname="\$(cat .nextflow.log 2>/dev/null | grep nextflow-io-run-name | head -n 1 | grep -oP '(?<=nextflow-io-run-name:)[^ ]+')" workflowname="\$(cat .nextflow.log 2>/dev/null | grep nextflow-io-project-name | head -n 1 | grep -oP '(?<=nextflow-io-project-name:)[^ ]+')" while true; do # Runname and workflowname will be populated at the same time # If the variables are populated and not tagged it, tag the head node if [ ! -z \$runname ]; then ./float modify -j "$(echo $FLOAT_JOB_ID)" --addCustomTag run-name:\$runname 2>/dev/null ./float modify -j "$(echo $FLOAT_JOB_ID)" --addCustomTag workflow-name:\$workflowname 2>/dev/null break fi runname="\$(cat .nextflow.log 2>/dev/null | grep nextflow-io-run-name | head -n 1 | grep -oP '(?<=nextflow-io-run-name:)[^ ]+')" workflowname="\$(cat .nextflow.log 2>/dev/null | grep nextflow-io-project-name | head -n 1 | grep -oP '(?<=nextflow-io-project-name:)[^ ]+')" sleep 1s done EOF # Start tagging side-script chmod +x ./tag_nextflow_head.sh ./tag_nextflow_head.sh & # Start Nextflow run $nextflow_command if [[ $? -ne 0 ]]; then echo $(date): "Nextflow command failed." remove_old_metadata dump_and_cp_metadata copy_nextflow_log exit 1 else echo $(date): "Nextflow command succeeded." remove_old_metadata dump_and_cp_metadata copy_nextflow_log exit 0 fi
- Use the float secret command to store sensitive variables.
float secret set OPCENTER_IP_ADDRESS OC_PRIVATE_IP
float secret set OPCENTER_USERNAME NAME
float secret set OPCENTER_PASSWORD PASSWORD
float secret set AWS_BUCKET_ACCESS_KEY KEY
float secret set AWS_BUCKET_SECRET_KEY SECRET
Replace:
- OC_PRIVATE_IP: private IP address of the OpCenter
- NAME and PASSWORD: credentials to access the OpCenter
- KEY and SECRET: credentials to access the S3 bucket
- Submit the Nextflow pipeline as an MMCloud job. For simplicity, you can insert the float submit command (with options) into a shell script.
$ cat run_flow.sh
float submit --hostInit transient_JFS_AWS.sh \
    -i docker.io/memverge/juiceflow \
    --vmPolicy '[onDemand=true]' \
    --migratePolicy '[disable=true]' \
    --dataVolume '[size=60]:/mnt/jfs_cache' \
    --dirMap /mnt/jfs:/mnt/jfs \
    -c 2 -m 4 \
    -n JOB_NAME \
    --securityGroup SG_ID \
    --env BUCKET=https://BUCKET_NAME.s3.REGION.amazonaws.com \
    --env 'OUTBUCKET=s3://OUTBUCKET_NAME' \
    -j job_submit_AWS.sh
$ chmod +x run_flow.sh
$ ./run_flow.sh
Replace:
- JOB_NAME: name to associate with the transient Nextflow host
- SG_ID: security group ID that opens port 6868
- BUCKET_NAME: S3 bucket you created for this pipeline
- REGION: region where the S3 bucket is located
- OUTBUCKET_NAME: S3 bucket where results are written
- Check that the Nextflow pipeline completes successfully.
The Nextflow log is written to the S3 bucket you created for this pipeline. Find the name of the log file by viewing the contents of stdout.autosave for JOB_NAME, for example:
```
...
Wed Jul 24 21:30:52 UTC 2024: Copying .nextflow.log to bucket..
Completed 29.3 KiB/29.3 KiB (240.7 KiB/s) with 1 file(s) remaining
upload: ./.nextflow.log to s3://welcometojuicefs/21yjg22ze9qlt74ls69k5.nextflow.log
Wed Jul 24 21:30:56 UTC 2024: Copying .nextflow.log complete! You can find it with aws s3 ls s3://welcometojuicefs/21yjg22ze9qlt74ls69k5.nextflow.log
[edited]
```
For this simple "Hello World" pipeline, the Nextflow job creates one executor to run the "Hello World" script. Check the contents of hwout/sequences.txt in the S3 bucket you used for OUTBUCKET_NAME.
```
Hello World! This is a test of JuiceFlow using a transient head node.
```
Using Fusion with MMCloud
Fusion is used only with Nextflow pipelines. Using Nextflow with MMCloud requires nf-float, the Nextflow plugin for MMCloud. A description of nf-float, its use, and the configuration changes required to use Fusion is available here.
When Fusion is used with MMCloud, SpotSurfer and WaveRider are not supported. To turn off WaveRider and to specify On-demand Instances when using Fusion, use the following Nextflow configuration file.
plugins {
id 'nf-float'
}
workDir = 's3://S3_BUCKET'
// limit concurrent requests sent to MMCE
// by default, it's 100
executor {
queueSize = 20
}
podman.registry = 'quay.io'
process {
executor = 'float'
errorStrategy = 'retry'
cpus = 2
memory = '4 GB'
}
wave {
enabled = true
}
fusion {
enabled = true
exportStorageCredentials = true
exportAwsAccessKeys = true
}
float {
address = 'OPCENTER_PRIVATE_IP'
username = 'USERNAME'
password = 'PASSWORD'
vmPolicy = [
onDemand: true,
retryLimit: 3,
retryInterval: '10m'
]
migratePolicy = [
disable: true
]
}
aws {
accessKey = 'BUCKET_ACCESS_KEY'
secretKey = 'BUCKET_SECRET_KEY'
}
Troubleshooting
As the nextflow job runs, log messages are written to a log file called ".nextflow.log", created in the directory where the nextflow job is running.
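For example, to follow the log while a pipeline runs, enter:
tail -f .nextflow.log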