Configuring Data Volumes

A job that generates file system I/O requires a definition for one or more file systems.

Data Volumes

If you submit a job that generates file system I/O, that is, a job that reads from or writes to files, you must specify the data volumes (the data storage spaces) that support the file systems created by the worker node. The OpCenter supports a range of data volume types. The type that is appropriate for your job depends on your requirements, including the following.
  • Persistence after the job ends
  • I/O performance
  • Storage capacity
  • Cost
Important: For compliance (or other) reasons, some organizations require that the working directories generated by Nextflow or Cromwell workflows be archived securely. In such cases, a non-persistent data volume is not appropriate.

Architecture

After the user submits a job to the OpCenter (using the CLI or the web interface), the OpCenter selects and then instantiates a virtual machine (the host). After the host is up and running, it loads (and then starts) the container specified by the user. This host, running the user's container, is the worker node that executes the job.

Some existing volumes are automatically mounted by the worker node, and some new volumes are automatically created (and then mounted) by the worker node; the newly created volumes are deleted once the job completes. The data volumes specified by the user at job submission are mounted by the container running inside the worker node. The relationship of the different data volumes to the OpCenter and to the worker node is shown in the figure.

Figure 1. Volumes mounted by Worker Node
The Worker Node automatically mounts the following volumes.
  • Directory exported by the OpCenter to share binaries with the Worker Node and to provide an area for the Worker Node to store log files.
  • Root volume automatically created by the CSP when the host instance starts. The MMCloud user can specify the root volume capacity with the float submit --rootVolSize option or accept the default size of 40 GB.
  • Container image volume (/mnt/float-image) that acts as the root volume for the container. The MMCloud user can specify the image volume capacity with the float submit --imageVolSize option or accept the size automatically calculated by the OpCenter.
  • AppCapsule volume (/mnt/float-data) to store data and metadata needed to resume executing a job on a different host. The OpCenter automatically sizes the AppCapsule volume based on the memory capacity of the Worker Node.

The root, container image, and AppCapsule volumes are deleted after the job completes and the host is deleted.
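
For example, to override the default root volume and container image volume sizes, add the --rootVolSize and --imageVolSize options described above to the float submit command. The sketch below assumes the -i (container image) and -j (job script) options for a generic job, which are not covered in this section; the sizes shown are illustrative only.

float submit -i IMAGE -j JOB_SCRIPT --rootVolSize 60 --imageVolSize 20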

The worker node also mounts data volumes if they are specified when the job is submitted. Data volumes are specified by using float submit with the --dataVolume option. Specify multiple data volumes by using --dataVolume multiple times in a single float submit command.
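
For example, to mount a new 10-GB EBS volume and an S3 bucket in the same job, include both options in one float submit command (the mount points and bucket name below are illustrative).

--dataVolume [size=10]:/scratch --dataVolume [mode=rw]s3://nfshareddir:/data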

The OpCenter supports the following data volume types.
  • AWS Elastic Block Store (EBS) volume created by OpCenter for the duration of the job.
  • Existing EBS volume (for example, previously created by the AWS account holder) that persists after the job completes.
  • Directories that are exported by an NFS server and NFS-mounted by the worker node.
  • AWS Simple Storage Service (S3) bucket mounted as a file system by the worker node using a FUSE (Filesystem in USErspace) client.
  • Distributed file system that provides a POSIX-compliant API for accessing files whose underlying data is split into chunks and stored in one or more object storage systems (for example, S3). Files are accessible only through the file system client. The current release supports the following distributed file systems.
    • JuiceFS

      JuiceFS is a high-performance, open-source distributed file system that can use a range of cloud storage services as the underlying object store. In the current OpCenter release, JuiceFS is used with S3.

      JuiceFS separates raw data (for example, data chunks stored in S3) from the metadata used to locate and access the files. The metadata is stored in a database; the OpCenter instantiates a temporary Redis database for this purpose. When the Redis database is deleted, files stored in JuiceFS are no longer accessible, so JuiceFS is considered a non-persistent file system in MMCloud.

    • Fusion

      Fusion is a distributed file system developed by Seqera for Nextflow pipelines to use cloud storage services such as AWS S3 or Google Cloud Storage. Fusion is implemented as a FUSE client with an NVMe-based cache fronting a cloud storage service. Fusion requires the use of Wave, a container provisioning service integrated with Nextflow. SpotSurfer and WaveRider are not supported when Fusion is used with MMCloud.

The following table compares data volume types.

Table 1. Comparison of data volume types
Data Volume Type     | Persistence               | Performance | Capacity | Cost                                          | Ease of Use
New EBS volume       | No                        | High        | Low      | High                                          | High
Existing EBS volume  | Yes                       | High        | Low      | High                                          | High
NFS mount            | Yes                       | High        | Medium   | Medium                                        | High (Low if an NFS server must be created)
S3 bucket            | Yes                       | Low         | High     | Low                                           | Medium
JuiceFS              | No                        | High        | High     | Low                                           | Medium
Fusion               | Yes (if cache is flushed) | High        | High     | Low (open source); High (commercial version)  | Low

To Specify New EBS Volume

Use the float submit command with the -D or --dataVolume option as follows.

--dataVolume [size=SIZE,throughput=RATE]:/DATADIR

Replace:
  • SIZE: capacity of EBS volume in GB
  • RATE: I/O throughput in Mbps (the throughput parameter is optional)
  • DATADIR: mount point where the worker node mounts the EBS volume
Example:
--dataVolume [size=10]:/data
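
A complete submission might look like the following sketch. The -i (container image) and -j (job script) options stand in for the rest of the job definition and are not covered in this section; the throughput value is illustrative.

float submit -i IMAGE -j JOB_SCRIPT --dataVolume [size=10,throughput=250]:/data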

To Specify Existing EBS Volume

  • Log in to your AWS Management Console.
  • Go to the EC2 Dashboard.
  • If needed, create an EBS volume by completing the following steps.
    • From the left-hand panel, select Elastic Block Store > Volumes.
    • In the top, right-hand corner, click Create volume.
    • Fill in the form and then click Create volume at the bottom, right-hand side of the page. If the volume is created successfully, you are returned to the Volumes console.
      Note: AWS offers EBS volumes with different performance characteristics. Select gp3 for a general purpose volume. Select io2 for a high-performance volume.
  • From the Volumes console, select the volume to use.
  • Copy the Volume ID. It is a string that looks like vol-08eafe32dac03f9a0.
  • Use the float submit command with the -D or --dataVolume option as follows.

    --dataVolume VOLUME_ID:/DATADIR

  • Replace:
    • VOLUME_ID: volume ID identifying the existing EBS volume
    • DATADIR: mount point where the worker node mounts the EBS volume
Example:
--dataVolume vol-08eafe32dac03f9a0:/data
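
If you prefer the AWS CLI to the AWS Management Console, you can also list the available volume IDs with a command such as the following. This assumes the AWS CLI is installed and configured; it is not required by MMCloud.

aws ec2 describe-volumes --query 'Volumes[*].{ID:VolumeId,Size:Size,State:State}' --output table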

To Specify NFS Mount

  • If needed, create an NFS server in the same VPC as the OpCenter. To use the NFS server successfully, check the following items.
    • There is network connectivity between the private IP addresses of the NFS server and the OpCenter.
    • The NFS server allows inbound access to port 2049 (apply an appropriate security group if needed).
    • The subnet mask in /etc/exports allows the worker node to mount file systems from the NFS server (an example entry is shown after these steps).
  • Use the float submit command with the -D or --dataVolume option as follows.

    --dataVolume nfs://NFS_PRIVATE_IP/EXPORTED_DIR:/MOUNTED_DIR

  • Replace:
    • NFS_PRIVATE_IP: private IP address of the NFS server
    • EXPORTED_DIR: directory (or path to directory) exported by the NFS server
    • MOUNTED_DIR: mount point (or path to mount point) where worker node mounts the exported directory
Example:
--dataVolume nfs://172.31.53.99/mnt/memverge/shared:/data
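
As noted in the checklist above, the export definition on the NFS server must allow the worker node's subnet to mount the exported directory. A typical /etc/exports entry might look like the following; the CIDR block is illustrative and should match the VPC subnet used by the worker nodes.

/mnt/memverge/shared 172.31.0.0/16(rw,sync,no_root_squash)

After editing /etc/exports, re-export the file systems (for example, with exportfs -ra) for the change to take effect.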

To Specify S3 Bucket

  • If needed, create an S3 bucket by completing the following steps.
    • Log in to your AWS Management Console.
    • Open the Amazon S3 console.
    • From the left-hand panel, select Buckets.
    • On the right-hand side, click Create bucket and follow the instructions.
      Note: You must choose a bucket name that is globally unique across all AWS Regions (the China and AWS GovCloud Regions use separate namespaces). Buckets are accessible from any Region.
  • Use the float submit command with the -D or --dataVolume option as follows.

    --dataVolume [mode=rw]s3://BUCKETNAME:/MOUNTED_DIR

  • Replace:
    • BUCKETNAME: S3 bucket name (or bucket-name/subfolder)
    • MOUNTED_DIR: mount point (or path to mount point) where worker node mounts the S3 bucket
Example:
--dataVolume [mode=rw]s3://nfshareddir:/data
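
To mount only a folder within the bucket, include the folder in the bucket path. For example (the results folder is illustrative):

--dataVolume [mode=rw]s3://nfshareddir/results:/data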

Access Keys

Depending on how your AWS account is set up, you may have to provide security credentials to enable access to the S3 bucket you created. To generate security credentials for your AWS account, complete the following steps.

  • Log in to your AWS Management Console.
  • On the navigation bar, all the way to the right, click your username and go to Security credentials.
  • Scroll down the page to the section called Access keys and click Create access key.
  • Download the csv file. The csv file has two entries, one called Access key ID and one called Secret access key.

If the S3 bucket belongs to another AWS user, obtain the security credentials from the S3 bucket owner.

To use an S3 bucket that requires security credentials, complete the following steps.

  • Store the security credentials (in encrypted form) in the OpCenter secret manager by entering the following two commands.
    float secret set KEY_NAME ACCESS_KEY_ID
    float secret set SECRET_NAME SECRET_ACCESS_KEY
    Replace:
    • KEY_NAME: name to associate with the access key ID
    • ACCESS_KEY_ID: access key ID
    • SECRET_NAME: name to associate with secret access key
    • SECRET_ACCESS_KEY: secret access key
    Example:
    float secret set bucketaccesskeyid A***C
    Set bucketaccesskeyid successfully
    float secret set bucketaccesssecret X***S
    Set bucketaccesssecret successfully
  • Retrieve the security credentials by entering the following two commands.
    float secret get KEY_NAME
    A***C
    float secret get SECRET_NAME
    X***S
  • Use the float submit command with the -D or --dataVolume option as follows.

    --dataVolume [accesskey=A***C,secret=X***S,mode=rw]s3://BUCKETNAME:/MOUNTED_DIR

    Replace:
    • A***C: access key ID
    • X***S: secret access key
    • BUCKETNAME: S3 bucket name (or bucket-name/subfolder)
    • MOUNTED_DIR: mount point (or path to mount point) where worker node mounts the S3 bucket

To Specify JuiceFS

JuiceFS is a general-purpose, distributed file system that can be used with any application. In the current MMCloud release, JuiceFS can only be used in conjunction with a Nextflow host deployed using the OpCenter's built-in nextflow job template.

To deploy a Nextflow host with JuiceFS enabled in AWS, complete the following steps.

  • Log in to your AWS Management console
  • Create a security group to allow inbound access to port 6868 (the port used by JuiceFS). Copy the security group ID (it is a string that looks like sg-0054f1eaadec3bc76).
  • Create an S3 bucket to support JuiceFS. Copy the URL for the S3 bucket (for example, https://juicyfsforcedric.s3.amazonaws.com)
    Note: Do not include any folders in the S3 bucket URL.
  • Start the Nextflow host by entering the following command.

    float submit --template nextflow:jfs -n JOBNAME -e BUCKET=BUCKETURL --migratePolicy [disable=true] --securityGroup SG_ID

    Replace:
    • JOBNAME: name to associate with job
    • BUCKETURL: URL to locate S3 bucket
    • SG_ID: security group ID
    Example:
    float submit --template nextflow:jfs -n tjfs -e BUCKET=https://juicyfsforcedric.s3.amazonaws.com --migratePolicy [disable=true] --securityGroup sg-0054f1eaadec3bc76
  • If security credentials are required to access the S3 bucket, add the following options to the float submit command.

    -e BUCKET_ACCESS_KEY={secret:KEY_NAME} -e BUCKET_SECRET_KEY={secret:SECRET_NAME}

    and replace:
    • KEY_NAME: name associated with the access key ID
    • SECRET_NAME: name associated with the secret access key
  • Keep entering float list until the status of the job with the name JOBNAME changes to executing. Copy the ID associated with this job.
  • Retrieve the ssh key for this host by entering the following command.
    float secret get JOB_ID_SSHKEY > jfs_ssh.key
    Replace JOB_ID with the job ID associated with this job (the secret name is the job ID followed by '_SSHKEY').
    Example:
    float secret get S2zliQLp7NnNjFeUeVjOe_SSHKEY > jfs_ssh.key
  • Change the permissions on the ssh key file by entering the following.
    chmod 600 jfs_ssh.key
  • Establish an ssh session with the Nextflow host by entering the following.
    ssh -i jfs_ssh.key USER@NEXTFLOW_HOST_IP
    Replace:
    • USER: username of the user who submitted the job to create the Nextflow host. If admin submitted the job, use root.
    • NEXTFLOW_HOST_IP: public IP address of the Nextflow host.
  • Check that you are in the correct working directory, that the environment variables are set, and that the configuration template is available.
    # pwd
    /mnt/jfs/nextflow
    # env|grep HOME
    HOME=/mnt/jfs/nextflow
    NXF_HOME=/mnt/jfs/nextflow
    # ls
    mmcloud.config.template
  • Make a copy of the template file by entering the following.
    cp mmcloud.config.template mmcloud.config
  • Edit the config file as follows.
    # cat mmcloud.config
    plugins {
      id 'nf-float'
    }
    
    workDir = '/mnt/jfs/nextflow'
    
    process {
        executor = 'float'
        errorStrategy = 'retry'
        extra ='  --dataVolume [opts=" --cache-dir /mnt/jfs_cache "]jfs://NEXTFLOW_HOST_PRIVATE_IP:6868/1:/mnt/jfs --dataVolume [size=120]:/mnt/jfs_cache'
    }
    
    podman.registry = 'quay.io'
    
    float {
        address = 'OPCENTER_PRIVATE_IP:443'
        username = 'USERNAME'
        password = 'PASSWORD'
    }
    
    // AWS access info if needed
    aws {
      client {
        maxConnections = 20
        connectionTimeout = 300000
      }
    /*
      accessKey = 'BUCKET_ACCESS_KEY'
      secretKey = 'BUCKET_SECRET_KEY'
    */
    }
    Replace:
    • NEXTFLOW_HOST_PRIVATE_IP: private IP address of the Nextflow host.
    • OPCENTER_PRIVATE_IP: private IP address of the OpCenter.
    • USERNAME and PASSWORD: credentials to log in to the OpCenter.
    • If needed, uncomment the block containing the S3 bucket credentials and insert values for BUCKET_ACCESS_KEY and BUCKET_SECRET_KEY.

You are now ready to submit a Nextflow pipeline following the usual procedure.

Important: Upon completion, each Nextflow pipeline leaves a working directory and other related files and directories in the JuiceFS file system, which maps to many small data chunks in the specified S3 bucket. When the Nextflow host is deleted, these data chunks remain in the S3 bucket, but are not readable. It is recommended that you periodically delete the working directory and related files and directories. Delete all files and directories before deleting the Nextflow host.

Example: Running an nf-core/sarek pipeline

  • Sign in to Nextflow Tower. If you do not have an account, follow the instructions to register.
  • Create an access token using the procedure described here. Copy the access token to your clipboard.
  • Start a tmux session by entering the following.
    tmux new -s SESSION_NAME
    Replace SESSION_NAME with name to associate with tmux session.
    Example:
    tmux new -s nfjob
    Note: If the ssh session disconnects, re-establish the connection and reattach to the tmux session by entering the following.
    tmux attach -t SESSION_NAME
  • At the terminal prompt, enter:
    export TOWER_ACCESS_TOKEN=eyxxxxxxxxxxxxxxxQ1ZTE=
    where eyxxxxxxxxxxxxxxxQ1ZTE= is the access token you copied to the clipboard.
  • Run the pipeline by entering the following command.
    nextflow run nf-core/sarek -c mmcloud.config -profile test_full --outdir 's3://OUTPUT_BUCKET' -cache false -with-tower
    Replace OUTPUT_BUCKET with the S3 bucket (or S3 bucket/folder) where output is written (you must have rw access to this bucket).
  • Open a browser and go to the Tower monitoring console.
  • Click the Runs tab and select your job.
  • (Optional) When the pipeline completes, delete the working directory and related files.
    Example:
    rm -r 0*
    rm *.tsv
Before deleting the Nextflow host, delete all files in the JuiceFS file system and then unmount the JuiceFS file system by entering the following commands.
rm -rf /mnt/jfs/*
umount /mnt/jfs

To Specify Fusion

Fusion is only used with Nextflow pipelines. Using Nextflow with MMCloud requires nf-float, the Nextflow plugin for MMCloud. A description of nf-float, its use, and the configuration changes required to use Fusion is available here.

When Fusion is used with MMCloud, SpotSurfer and WaveRider are not supported. To turn off WaveRider and to specify On-demand Instances when using Fusion, use the following Nextflow configuration file.
plugins {
  id 'nf-float'
}

workDir = 's3://S3_BUCKET'

// limit concurrent requests sent to MMCE
// by default, it's 100
executor {
    queueSize = 20
}

podman.registry = 'quay.io'

process {
    executor = 'float'
    errorStrategy = 'retry'
    cpus = 2
    memory = '4 GB'
}

wave {
  enabled = true
}

fusion {
  enabled                  = true
  exportStorageCredentials = true
  exportAwsAccessKeys      = true
}

float {
    address = 'OPCENTER_PRIVATE_IP'
    username = 'USERNAME'
    password = 'PASSWORD'
    vmPolicy = [
        onDemand: true,
        retryLimit: 3,
        retryInterval: '10m'
    ]
    migratePolicy = [
        disable: true
    ]
}

aws {
    accessKey = 'BUCKET_ACCESS_KEY'
    secretKey = 'BUCKET_SECRET_KEY'
}
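
Replace:
  • S3_BUCKET: S3 bucket (or bucket/folder) used as the Nextflow work directory
  • OPCENTER_PRIVATE_IP: private IP address of the OpCenter
  • USERNAME and PASSWORD: credentials to log in to the OpCenter
  • BUCKET_ACCESS_KEY and BUCKET_SECRET_KEY: security credentials for the S3 bucket, if required

With this file saved as mmcloud.config, submit a Nextflow pipeline in the usual way. For example, mirroring the sarek example above:

nextflow run nf-core/sarek -c mmcloud.config -profile test_full --outdir 's3://OUTPUT_BUCKET'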