New Features in MMCloud Bondi 1.2 Release

Date Released

Released on January 13, 2023

Supported Clouds

MMCloud is designed to work on any cloud infrastructure. The Bondi release supports the following clouds:

New Features

General

  • WaveRider
    • Zero-loss migration of a running job from one virtual machine to another virtual machine.
    • Automatic: migration triggered by rules-based policy configured per job (policy based on CPU and memory utilization thresholds).
    • Programmatic: migration occurs at specific program breakpoints defined by job script.
    • Ad hoc: migration occurs when initiated manually using CLI commands.
  • Integration with Nextflow

    Nextflow is a programming DSL (domain-specific language) widely used by data scientists to construct data-intensive computational pipelines. With the included plug-in, Nextflow scripts can directly engage MMCloud as an execution platform.

  • Additional commands to display information about running jobs
    • float ps -j <job_id> shows the complete process tree associated with the job identified by <job_id>
    • float top shows utilization metrics for all running jobs (similar to the Linux top command)
  • Modifying configuration while job is running
    A subset of the float submit options can be modified while the job is running. The options are:
    • vmPolicy — change the policy for selecting a VM instance if the job is "floated"
    • migratePolicy — turn the policy on or off, or adjust the parameters that define the policy
    • securityGroup — apply a firewall rule (security group) to the host running the container
    • extraOptions — add special instructions to the checkpoint process
  • Log to record significant job events

    A new log, called job.events, records significant events associated with a job.
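
The automatic WaveRider mode listed above is driven by a per-job policy based on CPU and memory utilization thresholds. As a rough illustration only, the Python sketch below shows how such a threshold rule could be evaluated for a single utilization sample. The class name, field names, default values, and the upper/lower threshold logic are all hypothetical; they are not taken from the OpCenter implementation or from the migratePolicy syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdPolicy:
    """Hypothetical per-job policy; names and defaults are illustrative only."""
    cpu_upper: float = 90.0  # percent; above this, ask for a larger VM
    cpu_lower: float = 10.0  # percent; below this, a smaller VM may suffice
    mem_upper: float = 90.0
    mem_lower: float = 10.0
    enabled: bool = True     # mirrors "turn the policy on or off"


def decide_migration(cpu_pct: float, mem_pct: float, policy: ThresholdPolicy) -> str:
    """Return "up", "down", or "stay" for one utilization sample."""
    if not policy.enabled:
        return "stay"
    # Either resource running hot is enough to request a larger VM.
    if cpu_pct > policy.cpu_upper or mem_pct > policy.mem_upper:
        return "up"
    # Only shrink when both resources are nearly idle.
    if cpu_pct < policy.cpu_lower and mem_pct < policy.mem_lower:
        return "down"
    return "stay"
```

A production policy would probably also require the threshold to be exceeded for some duration before acting, so that a short spike does not trigger a migration; that refinement is left out of this sketch.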

User Experience Improvements

  • Progress bar displayed while uploading image or upgrading the float binary.
  • A system configuration parameter is reset to its default value when the string "default" is used as the parameter value; that is, float config set <parameter> default returns <parameter> to its default value.
  • Additional filters for use with float squeue, for example, float squeue -f image=rstudio
  • Additional filters for use with float sinfo/host, for example, float sinfo -f job=<job_id>
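
The reset and filter commands in this section can also be driven from a script. The Python sketch below only assembles the argument lists for the three command forms quoted above; the helper names are hypothetical, and the filter keys are limited to the ones these notes mention (image and job).

```python
from typing import List


def config_reset_args(parameter: str) -> List[str]:
    """Arguments for: float config set <parameter> default"""
    return ["float", "config", "set", parameter, "default"]


def squeue_filter_args(key: str, value: str) -> List[str]:
    """Arguments for: float squeue -f <key>=<value>"""
    return ["float", "squeue", "-f", f"{key}={value}"]


def sinfo_filter_args(key: str, value: str) -> List[str]:
    """Arguments for: float sinfo -f <key>=<value>"""
    return ["float", "sinfo", "-f", f"{key}={value}"]
```

Passing one of the returned lists to subprocess.run(args, check=True) would run the command on a host where the float binary is installed and configured. Building the command as a list rather than a single string avoids shell quoting problems when a filter value contains spaces.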

Platform Improvements

  • Support for AWS EC2 rebalance recommendation

    An EC2 rebalance recommendation is a signal from AWS that your Spot Instance is at a higher risk of interruption. The signal is sent before the two-minute interruption signal so that you have more time to checkpoint and migrate your job to a new VM instance if you choose. The OpCenter always interprets the EC2 rebalance recommendation signal as a trigger to move the job to another virtual machine.

  • Support for image retrieval from the Amazon Elastic Container Registry (Amazon ECR) and from the Alibaba Cloud Container Registry (ACR).
  • Additional container images in the AppLibrary:
    • bismarck
    • jupyter_server
    • pantools
    • rstudio
    • supernova
  • Support for the options --cpu <minCPU>:<maxCPU> and --mem <minMem>:<maxMem> when float submit is used with AliCloud.
  • Command line option for setting polling interval to retrieve container metrics

    Container metrics are roughly equivalent to the metrics available for any Linux process. The --metricsInterval <interval_in_seconds> option can be used with float submit to adjust the polling interval from its default value of 10 seconds.

  • Optimize disk costs for On-demand Instances

    Spot Instances can be withdrawn with minimal warning (two minutes in AWS). In that time, the container must be checkpointed and the checkpoint image written to disk. To minimize set-up time, a data volume (disk) is mounted when the Spot Instance starts. An On-demand Instance is not under this time pressure, so the data volume can be mounted if and when a request to migrate is received. This minimizes the costs associated with data volumes for On-demand Instances.
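
The float submit options mentioned in this section (--cpu, --mem, and --metricsInterval) can be combined in one invocation. The Python sketch below assembles only those options and checks that each range is well formed. The function name and the validation rules are assumptions made for illustration, and a real submission needs further float submit options that these notes do not list.

```python
from typing import List, Optional


def submit_resource_args(
    min_cpu: int,
    max_cpu: int,
    min_mem: int,
    max_mem: int,
    metrics_interval: Optional[int] = None,
) -> List[str]:
    """Build the --cpu/--mem/--metricsInterval part of a float submit command."""
    # Assumed sanity checks: positive values and min not greater than max.
    if not 0 < min_cpu <= max_cpu:
        raise ValueError("CPU range must satisfy 0 < minCPU <= maxCPU")
    if not 0 < min_mem <= max_mem:
        raise ValueError("memory range must satisfy 0 < minMem <= maxMem")

    args = [
        "float", "submit",
        "--cpu", f"{min_cpu}:{max_cpu}",
        "--mem", f"{min_mem}:{max_mem}",
    ]
    # Omitting the option leaves the polling interval at its default (10 seconds).
    if metrics_interval is not None:
        if metrics_interval <= 0:
            raise ValueError("metrics interval must be a positive number of seconds")
        args += ["--metricsInterval", str(metrics_interval)]
    return args
```

Validating the ranges before invoking the CLI is a design choice of this sketch: it surfaces a swapped minimum and maximum locally instead of relying on whatever error the float binary reports.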

License Server

  • A license for use in the AWS China region can be requested from the Memory Machine Cloud Edition tab.

    Check the box marked AWS China Region.

  • The MemVerge license server sends status updates to the email address of the user who requested the license, for example, a reminder to accept the license if it has been in the pending state for 24 hours.