Spot Instances

This section lists issues you may encounter when running jobs on spot instances.

My spot instances keep getting evicted, which noticeably increases wall-clock time. Is this avoidable?

Spot instance availability is subject to supply and demand. In periods of high demand, "spot reclaim storms" may occur. You have several options for dealing with this.

  • Move jobs to spot instances with more availability
  • Use an OpCenter in a region with more spot instance availability
  • Move to on-demand instances

Is checkpoint and restore supported on GPU-enabled instances?

Yes, as long as the GPU-enabled instances are supported by MMCloud.

My spot instances on GCP do not recover when the instance is reclaimed. Why not?

GCP reclaims spot instances with thirty seconds (or less) of warning. This is not enough time to snapshot the job's memory space, so the job cannot be checkpointed when the reclaim notice arrives. On GCP, use periodic snapshots instead.
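
As a rough illustration of why a thirty-second warning is often too short, the sketch below estimates how long a full memory checkpoint would take and compares it with the reclaim window. The function names and the write throughput of 1 GiB/s are assumptions made for this example; they are not MMCloud parameters, and real throughput depends on the instance and storage in use.

```python
GCP_RECLAIM_WINDOW_S = 30.0  # warning period stated above (may be shorter in practice)


def checkpoint_seconds(memory_gib: float, write_gib_per_s: float) -> float:
    """Estimate the time needed to write the full memory footprint to storage."""
    if write_gib_per_s <= 0:
        raise ValueError("write_gib_per_s must be positive")
    return memory_gib / write_gib_per_s


def fits_in_window(memory_gib: float, write_gib_per_s: float, window_s: float) -> bool:
    """Return True if the estimated checkpoint time fits in the reclaim window."""
    return checkpoint_seconds(memory_gib, write_gib_per_s) <= window_s


if __name__ == "__main__":
    throughput = 1.0  # GiB/s, an assumed value for illustration only
    # A 64 GiB job needs about 64 s, which exceeds the 30 s window.
    print(fits_in_window(64, throughput, GCP_RECLAIM_WINDOW_S))  # False
    # A 16 GiB job needs about 16 s, which fits.
    print(fits_in_window(16, throughput, GCP_RECLAIM_WINDOW_S))  # True
```

Because the warning can be shorter than thirty seconds, even a job that fits on paper may not checkpoint in time, which is why periodic snapshots are the safer choice on GCP.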

AWS reclaimed the spot instance my job was running on, but the checkpoint and restore was not successful. Why not?

While SpotSurfer works in most situations, there are cases where the checkpoint or the restore fails. Possible causes include the following.

  • Application incompatibility: some applications have unusual memory or file structures that conflict with the checkpoint or restore processes.
  • Overly large memory footprint: if the memory footprint is too large to be checkpointed within the reclaim window, the checkpoint fails. In these cases, use periodic or incremental snapshots.
  • Spot reclaim storm: excessive spot reclaims can reduce spot instance availability to the point where a new instance can't be found or the new instance is reclaimed before the restore process completes.
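
When choosing a periodic snapshot interval, there is a trade-off between snapshot overhead and the amount of work lost on a reclaim. The sketch below uses a deliberately simple model: it assumes the job pauses for the full duration of each snapshot and that a reclaim discards everything computed since the last completed snapshot. The function and field names are hypothetical and do not correspond to MMCloud settings.

```python
from typing import NamedTuple


class SnapshotTradeoff(NamedTuple):
    overhead_fraction: float      # share of wall-clock time spent snapshotting
    worst_case_lost_work_s: float  # compute time lost if reclaimed just before a snapshot


def snapshot_tradeoff(interval_s: float, snapshot_cost_s: float) -> SnapshotTradeoff:
    """Estimate overhead and worst-case lost work for a periodic snapshot schedule.

    interval_s: compute time between snapshots.
    snapshot_cost_s: time one snapshot takes (job assumed paused meanwhile).
    """
    if interval_s <= 0 or snapshot_cost_s < 0:
        raise ValueError("interval_s must be positive and snapshot_cost_s non-negative")
    cycle_s = interval_s + snapshot_cost_s
    return SnapshotTradeoff(
        overhead_fraction=snapshot_cost_s / cycle_s,
        worst_case_lost_work_s=interval_s,
    )


if __name__ == "__main__":
    # Illustrative values only: compare a short and a long interval.
    for interval in (600.0, 3600.0):
        result = snapshot_tradeoff(interval, snapshot_cost_s=60.0)
        print(interval, round(result.overhead_fraction, 3), result.worst_case_lost_work_s)
```

Under this model, shorter intervals limit the work lost per reclaim but spend a larger share of wall-clock time on snapshots. Incremental snapshots reduce the cost of each snapshot and so shift the balance toward shorter intervals.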