Configuring Data Volumes

A job that generates file system I/O requires a definition for one or more file systems.

Data Volumes

If you submit a job that generates file system I/O, that is, a job that reads from or writes to files, you must specify the data volumes (the data storage spaces) that support the file systems created by the worker node. The OpCenter supports a range of data volume types. The type that is appropriate for your job depends on your requirements, including the following.
  • Persistence after the job ends
  • I/O performance
  • Storage capacity
  • Cost
Important: For compliance (or other) reasons, some organizations require that the working directories generated by Nextflow or Cromwell workflows be archived securely. In such cases, a non-persistent data volume is not appropriate.

Architecture

After the user submits a job to the OpCenter (using the CLI or the web interface), the OpCenter selects and then instantiates a virtual machine (the host). After the host is up and running, it loads (and then starts) the container specified by the user. This host, running the user's container, is the worker node that executes the job.

Some existing volumes are automatically mounted by the worker node, and some new volumes are automatically created (and then mounted) by the worker node; the newly created volumes are deleted once the job completes. The data volumes specified by the user at job submission are mounted by the container running inside the worker node. The relationship of the different data volumes to the OpCenter and to the worker node is shown in the figure.

Figure 1. Volumes mounted by Worker Node
The Worker Node automatically mounts the following volumes.
  • Directory exported by the OpCenter to share binaries with the Worker Node and to provide an area for the Worker Node to store log files.
  • Root volume automatically created by the CSP when the host instance starts. The MMCloud user can specify the root volume capacity with the float submit --rootVolSize option or accept the default size of 40 GB.
  • Container image volume (/mnt/float-image) that acts as the root volume for the container. The MMCloud user can specify the image volume capacity with the float submit --imageVolSize option or accept the size automatically calculated by the OpCenter.
  • AppCapsule volume (/mnt/float-data) to store data and metadata needed to resume executing a job on a different host. The OpCenter automatically sizes the AppCapsule volume based on the memory capacity of the Worker Node.

The root, container image, and AppCapsule volumes are deleted after the job completes and the host is deleted.
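
For example, to override the default root volume and container image volume sizes, add the --rootVolSize and --imageVolSize options described above to the float submit command. The sketch below assumes the -i (container image) and -j (job script) options for a generic job, which are not covered in this section; the sizes shown are illustrative only.

float submit -i IMAGE -j JOB_SCRIPT --rootVolSize 60 --imageVolSize 20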

The worker node also mounts data volumes if they are specified when the job is submitted. Data volumes are specified by using float submit with the --dataVolume option. Specify multiple data volumes by using --dataVolume multiple times in a single float submit command.
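
For example, to mount a new 10-GB EBS volume and an S3 bucket in the same job, include both options in one float submit command (the mount points and bucket name below are illustrative).

--dataVolume [size=10]:/scratch --dataVolume [mode=rw]s3://nfshareddir:/data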

The OpCenter supports the following data volume types.
  • AWS Elastic Block Store (EBS) volume created by OpCenter for the duration of the job.
  • Existing EBS volume (for example, previously created by the AWS account holder) that persists after the job completes.
  • Directories that are exported by an NFS server and NFS-mounted by the worker node.
  • AWS Simple Storage Service (S3) bucket mounted as a file system by the worker node using a FUSE (Filesystem in USErspace) client.
  • Distributed file system that provides a POSIX-compliant API for accessing files whose underlying data is split into chunks and stored in one or more object storage systems (for example, S3). Files are accessible only through the file system client. The current release supports the following distributed file systems.
    • JuiceFS

      JuiceFS is a high-performance, open-source distributed file system that can use a range of cloud storage services as the underlying object store. In the current OpCenter release, JuiceFS is used with S3.

      JuiceFS separates raw data (for example, data chunks stored in S3) from the metadata used to locate and access the files. The metadata is stored in a database; the OpCenter instantiates a temporary Redis database for this purpose. When the Redis database is deleted, files stored in JuiceFS are no longer accessible, so JuiceFS is considered a non-persistent file system in MMCloud.

    • Fusion

      Fusion is a distributed file system developed by Seqera for Nextflow pipelines to use cloud storage services such as AWS S3 or Google Cloud Storage. Fusion is implemented as a FUSE client with an NVMe-based cache fronting a cloud storage service. Fusion requires the use of Wave, a container provisioning service integrated with Nextflow. SpotSurfer and WaveRider are not supported when Fusion is used with MMCloud.

The following table compares data volume types.

Table 1. Comparison of data volume types
Data Volume Type     | Persistence               | Performance | Capacity | Cost                                          | Ease of Use
New EBS volume       | No                        | High        | Low      | High                                          | High
Existing EBS volume  | Yes                       | High        | Low      | High                                          | High
NFS mount            | Yes                       | High        | Medium   | Medium                                        | High (Low if an NFS server must be created)
S3 bucket            | Yes                       | Low         | High     | Low                                           | Medium
JuiceFS              | No                        | High        | High     | Low                                           | Medium
Fusion               | Yes (if cache is flushed) | High        | High     | Low (open source); High (commercial version)  | Low

To Specify New EBS Volume

Use the float submit command with the -D or --dataVolume option as follows.

--dataVolume [size=SIZE,throughput=RATE]:/DATADIR

Replace:
  • SIZE: capacity of EBS volume in GB
  • RATE: I/O throughput in Mbps (the throughput parameter is optional)
  • DATADIR: mount point where the worker node mounts the EBS volume
Example:
--dataVolume [size=10]:/data
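
A complete submission might look like the following sketch. The -i (container image) and -j (job script) options stand in for the rest of the job definition and are not covered in this section; the throughput value is illustrative.

float submit -i IMAGE -j JOB_SCRIPT --dataVolume [size=10,throughput=250]:/data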

To Specify Existing EBS Volume

  • Log in to your AWS Management Console.
  • Go to the EC2 Dashboard.
  • If needed, create an EBS volume by completing the following steps.
    • From the left-hand panel, select Elastic Block Store > Volumes.
    • In the top, right-hand corner, click Create volume.
    • Fill in the form and then click Create volume at the bottom, right-hand side of the page. If the volume is created successfully, you are returned to the Volumes console.
      Note: AWS offers EBS volumes with different performance characteristics. Select gp3 for a general purpose volume. Select io2 for a high-performance volume.
  • From the Volumes console, select the volume to use.
  • Copy the Volume ID. It is a string that looks like vol-08eafe32dac03f9a0.
  • Use the float submit command with the -D or --dataVolume option as follows.

    --dataVolume VOLUME_ID:/DATADIR

  • Replace:
    • VOLUME_ID: volume ID identifying the existing EBS volume
    • DATADIR: mount point where the worker node mounts the EBS volume
Example:
--dataVolume vol-08eafe32dac03f9a0:/data
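
If you prefer the AWS CLI to the AWS Management Console, you can also list the available volume IDs with a command such as the following. This assumes the AWS CLI is installed and configured; it is not required by MMCloud.

aws ec2 describe-volumes --query 'Volumes[*].{ID:VolumeId,Size:Size,State:State}' --output table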

To Specify NFS Mount

  • If needed, create an NFS server in the same VPC as the OpCenter. To use the NFS server successfully, check the following items.
    • There is network connectivity between the private IP addresses of the NFS server and the OpCenter.
    • The NFS server allows inbound access to port 2049 (apply an appropriate security group if needed).
    • The subnet mask in /etc/exports allows the worker node to mount file systems from the NFS server (an example entry is shown after these steps).
  • Use the float submit command with the -D or --dataVolume option as follows.

    --dataVolume nfs://NFS_PRIVATE_IP/EXPORTED_DIR:/MOUNTED_DIR

  • Replace:
    • NFS_PRIVATE_IP: private IP address of the NFS server
    • EXPORTED_DIR: directory (or path to directory) exported by the NFS server
    • MOUNTED_DIR: mount point (or path to mount point) where worker node mounts the exported directory
Example:
--dataVolume nfs://172.31.53.99/mnt/memverge/shared:/data
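
As noted in the checklist above, the export definition on the NFS server must allow the worker node's subnet to mount the exported directory. A typical /etc/exports entry might look like the following; the CIDR block is illustrative and should match the VPC subnet used by the worker nodes.

/mnt/memverge/shared 172.31.0.0/16(rw,sync,no_root_squash)

After editing /etc/exports, re-export the file systems (for example, with exportfs -ra) for the change to take effect.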

To Specify S3 Bucket

  • If needed, create an S3 bucket by completing the following steps.
    • Log in to your AWS Management Console.
    • Open the Amazon S3 console.
    • From the left-hand panel, select Buckets.
    • On the right-hand side, click Create bucket and follow the instructions.
      Note: You must choose a bucket name that is globally unique across all AWS Regions (the China and AWS GovCloud Regions use separate namespaces). Buckets are accessible from any Region.
  • Use the float submit command with the -D or --dataVolume option as follows.

    --dataVolume [mode=rw]s3://BUCKETNAME:/MOUNTED_DIR

  • Replace:
    • BUCKETNAME: S3 bucket name (or bucket-name/subfolder)
    • MOUNTED_DIR: mount point (or path to mount point) where worker node mounts the S3 bucket
Example:
--dataVolume [mode=rw]s3://nfshareddir:/data
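
To mount only a folder within the bucket, include the folder in the bucket path. For example (the results folder is illustrative):

--dataVolume [mode=rw]s3://nfshareddir/results:/data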

Access Keys

Depending on how your AWS account is set up, you may have to provide security credentials to enable access to the S3 bucket you created. To generate security credentials for your AWS account, complete the following steps.

  • Log in to your AWS Management Console.
  • On the navigation bar, all the way to the right, click your username and go to Security credentials.
  • Scroll down the page to the section called Access keys and click Create access key.
  • Download the csv file. The csv file has two entries, one called Access key ID and one called Secret access key.

If the S3 bucket belongs to another AWS user, obtain the security credentials from the S3 bucket owner.

To use an S3 bucket that requires security credentials, complete the following steps.

  • Store the security credentials (in encrypted form) in the OpCenter secret manager by entering the following two commands.
    float secret set KEY_NAME ACCESS_KEY_ID
    float secret set SECRET_NAME SECRET_ACCESS_KEY
    Replace:
    • KEY_NAME: name to associate with the access key ID
    • ACCESS_KEY_ID: access key ID
    • SECRET_NAME: name to associate with secret access key
    • SECRET_ACCESS_KEY: secret access key
    Example:
    float secret set bucketaccesskeyid A***C
    Set bucketaccesskeyid successfully
    float secret set bucketaccesssecret X***S
    Set bucketaccesssecret successfully
  • Retrieve the security credentials by entering the following two commands.
    float secret get KEY_NAME
    A***C
    float secret get SECRET_NAME
    X***S
  • Use the float submit command with the -D or --dataVolume option as follows.

    --dataVolume [accesskey=A***C,secret=X***S,mode=rw]s3://BUCKETNAME:/MOUNTED_DIR

    Replace:
    • A***C: access key ID
    • X***S: secret access key
    • BUCKETNAME: S3 bucket name (or bucket-name/subfolder)
    • MOUNTED_DIR: mount point (or path to mount point) where worker node mounts the S3 bucket

To Specify JuiceFS

JuiceFS is a general-purpose, distributed file system that can be used with any application. In the current MMCloud release, JuiceFS can only be used in conjunction with a Nextflow host deployed using the OpCenter's built-in nextflow job template.

To deploy a Nextflow host with JuiceFS enabled in AWS, complete the following steps.

  • Log in to your AWS Management console
  • Create a security group to allow inbound access to port 6868 (the port used by JuiceFS). Copy the security group ID (it is a string that looks like sg-0054f1eaadec3bc76).
  • Create an S3 bucket to support JuiceFS. Copy the URL for the S3 bucket (for example, https://juicyfsforcedric.s3.amazonaws.com)
    Note: Do not include any folders in the S3 bucket URL.
  • Start the Nextflow host by entering the following command.

    float submit --template nextflow:jfs -n JOBNAME -e BUCKET=BUCKETURL --migratePolicy [disable=true] --securityGroup SG_ID

    Replace:
    • JOBNAME: name to associate with job
    • BUCKETURL: URL to locate S3 bucket
    • SG_ID: security group ID
    Example:
    float submit --template nextflow:jfs -n tjfs -e BUCKET=https://juicyfsforcedric.s3.amazonaws.com --migratePolicy [disable=true] --securityGroup sg-0054f1eaadec3bc76
  • If security credentials are required to access the S3 bucket, add the following options to the float submit command.

    -e BUCKET_ACCESS_KEY={secret:KEY_NAME} -e BUCKET_SECRET_KEY={secret:SECRET_NAME}

    and replace:
    • KEY_NAME: name associated with the access key ID
    • SECRET_NAME: name associated with the secret access key
  • Keep entering float list until the status of the job with the name JOBNAME changes to executing. Copy the ID associated with this job.
  • Retrieve the ssh key for this host by entering the following command.
    float secret get JOB_ID_SSHKEY > jfs_ssh.key
    Replace JOB_ID with the job ID associated with this job (the secret name is the job ID followed by '_SSHKEY').
    Example:
    float secret get S2zliQLp7NnNjFeUeVjOe_SSHKEY > jfs_ssh.key
  • Change the permissions on the ssh key file by entering the following.
    chmod 600 jfs_ssh.key
  • Establish an ssh session with the Nextflow host by entering the following.
    ssh -i jfs_ssh.key USER@NEXTFLOW_HOST_IP
    Replace:
    • USER: username of the user who submitted the job to create the Nextflow host. If admin submitted the job, use root.
    • NEXTFLOW_HOST_IP: public IP address of the Nextflow host.
  • Check that you are in the correct working directory, that the environment variables are set, and that the configuration template is available.
    # pwd
    /mnt/jfs/nextflow
    # env|grep HOME
    HOME=/mnt/jfs/nextflow
    NXF_HOME=/mnt/jfs/nextflow
    # ls
    mmcloud.config.template
  • Make a copy of the template file by entering the following.
    cp mmcloud.config.template mmcloud.config
  • Edit the config file as follows.
    # cat mmcloud.config
    plugins {
      id 'nf-float'
    }
    
    workDir = '/mnt/jfs/nextflow'
    
    process {
        executor = 'float'
        errorStrategy = 'retry'
        extra ='  --dataVolume [opts=" --cache-dir /mnt/jfs_cache "]jfs://NEXTFLOW_HOST_PRIVATE_IP:6868/1:/mnt/jfs --dataVolume [size=120]:/mnt/jfs_cache'
    }
    
    podman.registry = 'quay.io'
    
    float {
        address = 'OPCENTER_PRIVATE_IP:443'
        username = 'USERNAME'
        password = 'PASSWORD'
    }
    
    // AWS access info if needed
    aws {
      client {
        maxConnections = 20
        connectionTimeout = 300000
      }
    /*
      accessKey = 'BUCKET_ACCESS_KEY'
      secretKey = 'BUCKET_SECRET_KEY'
    */
    }
    Replace:
    • NEXTFLOW_HOST_PRIVATE_IP: private IP address of the Nextflow host.
    • OPCENTER_PRIVATE_IP: private IP address of the OpCenter.
    • USERNAME and PASSWORD: credentials to log in to the OpCenter.
    • If needed, uncomment the block containing the S3 bucket credentials and insert values for BUCKET_ACCESS_KEY and BUCKET_SECRET_KEY.

You are now ready to submit a Nextflow pipeline following the usual procedure.

Important: Upon completion, each Nextflow pipeline leaves a working directory and other related files and directories in the JuiceFS file system, which maps to many small data chunks in the specified S3 bucket. When the Nextflow host is deleted, these data chunks remain in the S3 bucket, but are not readable. It is recommended that you periodically delete the working directory and related files and directories. Delete all files and directories before deleting the Nextflow host.

Example: Running an nf-core/sarek pipeline

  • Sign in to Nextflow Tower. If you do not have an account, follow the instructions to register.
  • Create an access token using the procedure described here. Copy the access token to your clipboard.
  • Start a tmux session by entering the following.
    tmux new -s SESSION_NAME
    Replace SESSION_NAME with name to associate with tmux session.
    Example:
    tmux new -s nfjob
    Note: If the ssh session disconnects, re-establish the connection and reattach to the tmux session by entering the following.
    tmux attach -t SESSION_NAME
  • At the terminal prompt, enter:
    export TOWER_ACCESS_TOKEN=eyxxxxxxxxxxxxxxxQ1ZTE=
    where eyxxxxxxxxxxxxxxxQ1ZTE= is the access token you copied to the clipboard.
  • Run the pipeline by entering the following command.
    nextflow run nf-core/sarek -c mmcloud.config -profile test_full --outdir 's3://OUTPUT_BUCKET' -cache false -with-tower
    Replace OUTPUT_BUCKET with the S3 bucket (or S3 bucket/folder) where output is written (you must have rw access to this bucket).
  • Open a browser and go to the Tower monitoring console.
  • Click the Runs tab and select your job.
  • (Optional) When the pipeline completes, delete the working directory and related files.
    Example:
    rm -r 0*
    rm *.tsv
Before deleting the Nextflow host, delete all files in the JuiceFS file system and then unmount the JuiceFS file system by entering the following commands.
rm -rf /mnt/jfs/*
umount /mnt/jfs

To Specify Fusion

Fusion is only used with Nextflow pipelines. Using Nextflow with MMCloud requires nf-float, the Nextflow plugin for MMCloud. A description of nf-float, its use, and the configuration changes required to use Fusion is available here.

When Fusion is used with MMCloud, SpotSurfer and WaveRider are not supported. To turn off WaveRider and to specify On-demand Instances when using Fusion, use the following Nextflow configuration file.
plugins {
  id 'nf-float'
}

workDir = 's3://S3_BUCKET'

// limit concurrent requests sent to MMCE
// by default, it's 100
executor {
    queueSize = 20
}

podman.registry = 'quay.io'

process {
    executor = 'float'
    errorStrategy = 'retry'
    cpus = 2
    memory = '4 GB'
}

wave {
  enabled = true
}

fusion {
  enabled                  = true
  exportStorageCredentials = true
  exportAwsAccessKeys      = true
}

float {
    address = 'OPCENTER_PRIVATE_IP'
    username = 'USERNAME'
    password = 'PASSWORD'
    vmPolicy = [
        onDemand: true,
        retryLimit: 3,
        retryInterval: '10m'
    ]
    migratePolicy = [
        disable: true
    ]
}

aws {
    accessKey = 'BUCKET_ACCESS_KEY'
    secretKey = 'BUCKET_SECRET_KEY'
}
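
Replace:
  • S3_BUCKET: S3 bucket (or bucket/folder) used as the Nextflow work directory
  • OPCENTER_PRIVATE_IP: private IP address of the OpCenter
  • USERNAME and PASSWORD: credentials to log in to the OpCenter
  • BUCKET_ACCESS_KEY and BUCKET_SECRET_KEY: security credentials for the S3 bucket, if required

With this file saved as mmcloud.config, submit a Nextflow pipeline in the usual way. For example, mirroring the sarek example above:

nextflow run nf-core/sarek -c mmcloud.config -profile test_full --outdir 's3://OUTPUT_BUCKET'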