New Features in MMCloud Bondi 1.2 Release
Date Released
Released on 01-13-2023
Supported Clouds
New Features
General
- WaveRider
- Zero-loss migration of a running job from one virtual machine to another virtual machine.
- Automatic: migration triggered by rules-based policy configured per job (policy based on CPU and memory utilization thresholds).
- Programmatic: migration occurs at specific program breakpoints defined by job script.
- Ad hoc: migration occurs when initiated manually using CLI commands.
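To make the automatic mode concrete, the following is a minimal sketch of a rules-based policy evaluated against CPU and memory utilization thresholds. It is not the OpCenter implementation. The class name, field names, and the "up"/"down"/"stay" outcomes are all hypothetical, and the choice to scale up when either metric is high but scale down only when both are low is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigratePolicy:
    """Hypothetical per-job policy; field names are illustrative, not MMCloud options."""
    cpu_upper: float  # percent CPU utilization above which a larger VM is wanted
    cpu_lower: float  # percent CPU utilization below which a smaller VM may do
    mem_upper: float  # percent memory utilization above which a larger VM is wanted
    mem_lower: float  # percent memory utilization below which a smaller VM may do
    enabled: bool = True


def decide(policy: MigratePolicy, cpu_pct: float, mem_pct: float) -> str:
    """Return "up", "down", or "stay" for a single utilization sample."""
    if not policy.enabled:
        return "stay"
    # Assumption: either resource running hot is enough to justify a larger VM.
    if cpu_pct > policy.cpu_upper or mem_pct > policy.mem_upper:
        return "up"
    # Assumption: both resources must be idle before moving to a smaller VM.
    if cpu_pct < policy.cpu_lower and mem_pct < policy.mem_lower:
        return "down"
    return "stay"
```

A real policy would likely require a threshold to be crossed for some duration before acting, so that a short spike does not trigger a migration; the sketch omits that for brevity.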
- Integration with Nextflow
Nextflow is a programming DSL (domain-specific language) widely used by data scientists to construct data-intensive computational pipelines. With the included plug-in, Nextflow scripts can directly engage MMCloud as an execution platform.
- Additional commands to display information about running jobs
- float ps -j <job_id> shows the complete process tree associated with the job identified by <job_id>
- float top shows utilization metrics for all running jobs (similar to the Linux top command)
- Modifying configuration while job is running
A subset of the float submit options can be modified while the job is running. The options are:
- vmPolicy — change the policy for selecting a VM instance if the job is "floated"
- migratePolicy — turn the policy on or off, or adjust the parameters that define the policy
- securityGroup — apply a firewall rule (security group) to the host running the container
- extraOptions — add special instructions to the checkpoint process
- Log to record significant job events
A new log, called job.events, records significant events associated with a job.
User Experience Improvements
- Progress bar displayed while uploading image or upgrading the float binary.
- System configuration parameter reset to its default value when the string "default" is used as the parameter value, that is, float config set <parameter> default returns <parameter> to its default value.
- Additional filters for use with float squeue, for example, float squeue -f image=rstudio
- Additional filters for use with float sinfo/host, for example, float sinfo -f job=<job_id>
Platform Improvements
- Support for AWS EC2 rebalance recommendation
An EC2 rebalance recommendation is a signal from AWS that your Spot Instance is at a higher risk of interruption. The signal is sent before the two-minute interruption signal so that you have more time to checkpoint and migrate your job to a new VM instance if you choose. The OpCenter always interprets the EC2 rebalance recommendation signal as a trigger to move the job to another virtual machine.
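As an illustration of the two signals described above, the sketch below checks the instance metadata paths that AWS documents for the rebalance recommendation and the Spot interruption notice, and reports the more urgent one. How the OpCenter actually detects these signals is not stated here, so this is an assumption-laden sketch. The function names and the `fetch` callback are hypothetical, and the metadata paths and JSON keys should be verified against current AWS documentation.

```python
import json
from typing import Callable, Optional

# Instance metadata paths as documented by AWS (verify before relying on them).
REBALANCE_PATH = "/latest/meta-data/events/recommendations/rebalance"
INTERRUPTION_PATH = "/latest/meta-data/spot/instance-action"


def pending_signal(fetch: Callable[[str], Optional[str]]) -> Optional[dict]:
    """Return the most urgent pending signal, or None if there is none.

    `fetch` is a caller-supplied (hypothetical) function that returns the
    response body for a metadata path, or None when no signal is pending.
    """
    # The interruption notice is checked first because it leaves the least time.
    checks = (
        ("interruption", INTERRUPTION_PATH, "time"),
        ("rebalance", REBALANCE_PATH, "noticeTime"),
    )
    for kind, path, time_key in checks:
        body = fetch(path)
        if body is not None:
            return {"kind": kind, "at": json.loads(body)[time_key]}
    return None


def should_migrate(signal: Optional[dict]) -> bool:
    """Mirror the behavior described above: any signal, including a
    rebalance recommendation, is treated as a trigger to move the job."""
    return signal is not None
```

Passing `fetch` in as a parameter keeps the sketch testable without network access; on a real instance it would wrap an HTTP request to the metadata service.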
- Support image retrieval from the Amazon Elastic Container Registry (Amazon ECR) and from the Alibaba Cloud Container Registry (ACR).
- Additional container images in the AppLibrary:
- bismarck
- jupyter_server
- pantools
- rstudio
- supernova
- Support the options --cpu <minCPU>:<maxCPU> and --mem <minMem>:<maxMem> when float submit is used with AliCloud.
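The <min>:<max> form used by --cpu and --mem can be read as a simple range. The sketch below shows one way such a value might be parsed and validated. It is not the float parser; the function name is hypothetical, and accepting a bare minimum with no maximum is an assumption.

```python
from typing import Optional, Tuple


def parse_range(value: str) -> Tuple[int, Optional[int]]:
    """Parse "<min>:<max>" (or, by assumption, "<min>") into (min, max).

    max is None when no upper bound is given.
    """
    low_text, sep, high_text = value.partition(":")
    low = int(low_text)
    # int("") raises ValueError, so a trailing colon such as "4:" is rejected.
    high = int(high_text) if sep else None
    if low <= 0:
        raise ValueError(f"minimum must be positive: {value!r}")
    if high is not None and high < low:
        raise ValueError(f"maximum is below minimum: {value!r}")
    return low, high
```

For example, `parse_range("4:16")` yields `(4, 16)`, while `parse_range("16:4")` raises `ValueError`.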
- Command line option for setting polling interval to retrieve container metrics
Container metrics are roughly equivalent to the metrics available for any Linux process. The --metricsInterval <interval_in_seconds> option can be used with float submit to adjust the polling interval from its default value of 10 seconds.
- Optimize disk costs for On-demand Instances
Spot Instances can be withdrawn with nominal warning (two minutes in AWS). In that time, the container must be checkpointed and the checkpoint image written to disk. To minimize set-up time, a data volume (disk) is mounted when the Spot Instance starts. An On-demand Instance is not under this time pressure, so the data volume can be mounted if and when a request to migrate is received. This minimizes the costs associated with data volumes for On-demand Instances.
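The cost effect of deferring the mount can be shown with a toy model. The sketch below counts the hours a data volume is attached to one VM instance under the behavior described above. It is only an illustration: the enum, the function name, and the assumption that the volume stays attached from mount time until the instance stops running the job are all hypothetical.

```python
from enum import Enum
from typing import Optional


class Pricing(Enum):
    SPOT = "spot"
    ON_DEMAND = "on-demand"


def data_volume_hours(
    pricing: Pricing,
    run_hours: float,
    migrate_request_hour: Optional[float] = None,
) -> float:
    """Hours the data volume is attached to one instance (toy model)."""
    if pricing is Pricing.SPOT:
        # Mounted at start so a checkpoint can be written within the warning window.
        return run_hours
    if migrate_request_hour is None:
        # On-demand and never asked to migrate: the volume is never mounted.
        return 0.0
    # On-demand: mounted only once a migration request arrives.
    return max(0.0, run_hours - migrate_request_hour)
```

In this model, an On-demand Instance that runs for 10 hours and receives a migration request at hour 9.5 holds the volume for 0.5 hours, whereas a Spot Instance running the same job holds it for all 10.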
License Server
- Request for license to use in AWS China region available from the Memory Machine Cloud Edition tab.
Check the box marked AWS China Region.
- Status updates sent by the MemVerge license server to the email address of the user requesting the license, for example, a reminder to accept the license if it has been in the pending state for 24 hours.