SpotSurfer

SpotSurfer removes the risk of running stateful applications on Spot Instances while reducing costs significantly.

Feature Description

Cloud Service Providers (CSPs) sell Spot Instances at a discount compared to On-demand Instances. The drawback is that the CSP can reclaim any Spot Instance with only nominal warning (usually two minutes or less). This is a problem for stateful applications without checkpoint capability, where all progress is lost if the job does not run to completion.

SpotSurfer is a feature that allows stateful applications to always run to completion by "surfing" to a new VM instance if the underlying Spot Instance is reclaimed by the CSP. Customers realize the cost savings of Spot Instances without the risk of a job restarting from the beginning if the Spot Instance is reclaimed.

The technology that makes SpotSurfer possible is AppCapsule, MMCloud's checkpoint/restore (C/R) capability. The AppCapsule is a moment-in-time snapshot of the application instance, including in-memory state and relevant files, which allows the workload to move to a new virtual machine and resume running. AppCapsule underlies both SpotSurfer and WaveRider — two independent mobility features that can be enabled simultaneously for a given job.

SpotSurfer is automatically enabled if the job runs on a Spot Instance and automatically disabled if the job runs on an On-demand Instance. AppCapsule creation is triggered automatically when the CSP signals that it is reclaiming the Spot Instance. Job execution pauses while a new Spot Instance is created and the existing data volumes are remounted (the data volumes hold the current state of the file systems). The job then resumes running.

Operation

To use SpotSurfer, you don't have to do anything. SpotSurfer's default behavior is to intercept the Spot Instance reclaim signal, create the AppCapsule, instantiate a new Spot Instance of the same type as the instance that is being reclaimed, and recover the job in the new instance.

Using the OpCenter or the CLI, it is possible to modify the VM creation policy while a job is running. If the VM creation policy is modified, the new instance is of the type specified by the new policy. In most cases, the new policy specifies a Spot Instance but it could specify an On-demand Instance.
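The default behavior and the effect of a modified VM creation policy can be sketched as follows. This is only an illustration of the sequence described above, not OpCenter code: VmPolicy, handle_reclaim, and the instance type names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VmPolicy:
    """Hypothetical stand-in for a VM creation policy."""
    instance_type: str
    spot: bool


def handle_reclaim(current: VmPolicy, modified: Optional[VmPolicy] = None) -> List[str]:
    """Return the ordered recovery steps for a Spot Instance reclaim event.

    If the policy was modified while the job was running, the new instance
    follows the modified policy; otherwise it matches the reclaimed instance.
    """
    target = modified if modified is not None else current
    kind = "Spot" if target.spot else "On-demand"
    return [
        "intercept reclaim signal",
        "create AppCapsule",
        f"launch {kind} instance of type {target.instance_type}",
        "recover job from AppCapsule",
    ]


# Default: same instance type as the one being reclaimed.
default_steps = handle_reclaim(VmPolicy("r5.4xlarge", spot=True))

# Policy changed mid-run to an On-demand Instance of a different type.
modified_steps = handle_reclaim(
    VmPolicy("r5.4xlarge", spot=True),
    modified=VmPolicy("r5.8xlarge", spot=False),
)
```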

Periodic Snapshots

In its default mode, SpotSurfer captures a snapshot of the current state of a running job only once, immediately before migrating the job to a new virtual machine instance. If a Spot Instance reclaim event causes the job to migrate, the short time available to save the snapshot prevents SpotSurfer from working with applications that have a large memory footprint, for example, more than 64 GB.
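A rough calculation shows why memory size matters. The sketch below assumes a 120-second reclaim window, decimal units (1 GB = 1000 MB), and an illustrative sustained write rate of 250 MB/s; none of these figures are measured MMCloud values, and the function names are hypothetical.

```python
def required_write_rate_mb_s(memory_gb: float, window_s: float) -> float:
    """Sustained write rate (MB/s) needed to save memory_gb within window_s."""
    return memory_gb * 1000 / window_s


def fits_in_window(memory_gb: float, write_rate_mb_s: float, window_s: float) -> bool:
    """True if a full memory snapshot can be written before the window closes."""
    return memory_gb * 1000 / write_rate_mb_s <= window_s


WINDOW_S = 120          # assumed two-minute reclaim warning
WRITE_RATE_MB_S = 250   # assumed storage throughput, for illustration only

rate_64 = required_write_rate_mb_s(64, WINDOW_S)       # about 533 MB/s
ok_16 = fits_in_window(16, WRITE_RATE_MB_S, WINDOW_S)  # 64 s, fits
ok_64 = fits_in_window(64, WRITE_RATE_MB_S, WINDOW_S)  # 256 s, does not fit
```

Under these assumptions, a 64 GB footprint needs more than twice the assumed throughput, which is the situation that periodic snapshots are meant to address.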

If periodic snapshots are enabled, complete snapshots (including memory and file system) are taken at fixed time intervals. The intervals are long enough to allow a snapshot to finish, that is, for an AppCapsule to be created and written to persistent storage. If a snapshot cannot finish in the Spot Instance reclaim time window, the most recent complete snapshot is used to recover the job in the new virtual machine. This means that the job resumes from the state it was in when the last complete snapshot was saved, and any work done after that snapshot is repeated.
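The fallback rule can be illustrated with a short sketch. Snapshot and recovery_snapshot are hypothetical names used only for this example; timestamps are seconds since the job started.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    taken_at: float  # seconds since job start
    complete: bool   # True once the AppCapsule is fully written to storage


def recovery_snapshot(snapshots: Sequence[Snapshot]) -> Optional[Snapshot]:
    """Pick the most recent complete snapshot, or None if there is none."""
    complete = [s for s in snapshots if s.complete]
    return max(complete, key=lambda s: s.taken_at, default=None)


# Hourly snapshots; the reclaim-time snapshot at 9000 s did not finish.
history = [Snapshot(3600, True), Snapshot(7200, True), Snapshot(9000, False)]
chosen = recovery_snapshot(history)
repeated_work_s = 9000 - chosen.taken_at  # work redone after recovery
```

In this example the job resumes from the snapshot taken at 7200 s, so 1800 s of work is repeated.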

Enable periodic snapshots from the Submit Job screen in the web interface by selecting the Start from Scratch tab, clicking the Misc. tab, and setting Periodic Snapshot to On. From the CLI, use the float submit command with the --dumpMode full --snapshotInterval <interval> options. The minimum value of <interval> is 10m, although an interval of at least 60m is recommended.
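When job submission is scripted, the interval limits can be checked before float submit is called. The helper below is a minimal sketch: it assumes that intervals are written as a whole number of minutes followed by m (the only form shown above), and the function names are hypothetical. It produces only the two options named above; the other float submit options that a job needs are not shown.

```python
import re
from typing import List

MIN_INTERVAL_MIN = 10          # documented minimum
RECOMMENDED_INTERVAL_MIN = 60  # documented recommendation

# Assumption: intervals look like "10m" or "60m"; other formats are rejected here.
_INTERVAL_RE = re.compile(r"^(\d+)m$")


def interval_minutes(interval: str) -> int:
    """Parse an interval such as '60m' into minutes."""
    match = _INTERVAL_RE.match(interval)
    if match is None:
        raise ValueError(f"unsupported interval format: {interval!r}")
    return int(match.group(1))


def periodic_snapshot_args(interval: str) -> List[str]:
    """Build the periodic-snapshot options to append to 'float submit'."""
    if interval_minutes(interval) < MIN_INTERVAL_MIN:
        raise ValueError(f"interval must be at least {MIN_INTERVAL_MIN}m")
    return ["--dumpMode", "full", "--snapshotInterval", interval]


def is_recommended(interval: str) -> bool:
    """True if the interval meets the recommended minimum of 60m."""
    return interval_minutes(interval) >= RECOMMENDED_INTERVAL_MIN


args = periodic_snapshot_args("60m")
```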

Note: Periodic snapshots are only supported on EBS volumes.

AppCapsule++

When a periodic snapshot is captured, the application's process tree is temporarily frozen so that the memory state does not change while the snapshot is captured and written to storage. As the memory footprint grows into the hundreds of gigabytes, the time that the application is frozen can have a measurable effect on the wall-clock time of the job run. To overcome this, AppCapsule++ improves the performance of AppCapsule by incorporating the following capabilities.
  • Instead of taking a complete snapshot every time, only incremental changes are captured after the first snapshot is taken.
  • Instead of taking snapshots at fixed intervals, an incremental snapshot is only taken when the number of changed memory pages reaches a certain limit (the limit depends on the I/O bandwidth of the device on which the snapshot is stored).
  • The application and the process that captures the snapshot (and writes it to disk) are asynchronous, which means that the application continues to run while the snapshot is saved.
  • The file system is not checkpointed when the incremental snapshot of the memory state is captured.

The complete snapshot is assembled only when the migration or reclaim event is triggered.
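The incremental scheme can be modeled with a toy example. This is a simplified sketch and not the AppCapsule++ implementation: memory is a dictionary of page numbers, the changed-page limit is a fixed number rather than a value derived from I/O bandwidth, and the asynchronous writer is not modeled. All names are hypothetical.

```python
from typing import Dict, List


class IncrementalSnapshotter:
    """Toy model: one full base snapshot, then deltas of changed pages."""

    def __init__(self, memory: Dict[int, bytes], dirty_limit: int) -> None:
        self.base = dict(memory)                 # first, complete snapshot
        self.deltas: List[Dict[int, bytes]] = [] # saved incremental snapshots
        self.dirty: Dict[int, bytes] = {}        # pages changed since last save
        self.dirty_limit = dirty_limit

    def write_page(self, page: int, data: bytes) -> None:
        """Record a page change; save a delta once the limit is reached."""
        self.dirty[page] = data
        if len(self.dirty) >= self.dirty_limit:
            self.flush()

    def flush(self) -> None:
        """Save the pending changed pages as one incremental snapshot."""
        if self.dirty:
            self.deltas.append(self.dirty)
            self.dirty = {}

    def assemble(self) -> Dict[int, bytes]:
        """On migration or reclaim: take the final delta, then rebuild the image."""
        self.flush()
        image = dict(self.base)
        for delta in self.deltas:  # apply oldest first so newer pages win
            image.update(delta)
        return image


snap = IncrementalSnapshotter({0: b"a", 1: b"b", 2: b"c"}, dirty_limit=2)
snap.write_page(0, b"x")   # one dirty page, below the limit
snap.write_page(1, b"y")   # limit reached, first delta saved
snap.write_page(0, b"z")   # pending until the reclaim event
image = snap.assemble()
```

The assemble step depends on the final flush succeeding, which is why the time available after the reclaim notice matters.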

Warning: In the case of a Spot Instance reclaim event, the final incremental snapshot is captured when the OpCenter receives the Spot Instance reclaim notice. If there is not enough time to save the final incremental snapshot, the OpCenter cannot assemble the complete snapshot. For this reason, AppCapsule++ is not supported in Google Cloud where, although the nominal warning is 30 seconds, the Spot Instance can in practice terminate in as little as a few seconds.

Enable AppCapsule++ on AWS from the Submit Job screen in the web interface by selecting the Start from Scratch tab, clicking the Misc. tab, and setting Incremental Snapshot to On. From the CLI, use the float submit command with the --dumpMode incremental option.

AppCapsule and AppCapsule++ features are compared in the following table.

| Feature | Snapshot Type | Snapshot Interval | File System Checkpoint |
|---|---|---|---|
| Standard AppCapsule | Full | N/A (snapshot taken only when reclaim signal received) | No. When reclaim signal received, file system frozen. Local storage volumes remounted on new instance. |
| Periodic AppCapsule | Full | Fixed (configurable) | Yes. With every snapshot (EBS only) except when reclaim signal received. |
| Incremental AppCapsule++ | Incremental (complete snapshot assembled after reclaim signal received) | Dynamic, automatically calculated | No. When reclaim signal received, file system frozen. Local storage volume remounted on new instance. |

AWS Rebalance Recommendation Signal

When the OpCenter runs on AWS, additional measures are available to handle memory-optimized applications. If AWS determines that a Spot Instance has an increased likelihood of being reclaimed, AWS may, at its discretion, send an AWS Rebalance Recommendation signal. The OpCenter intercepts the signal and uses a rules-based approach to decide whether to keep the current Spot Instance or to proactively capture a snapshot and move to a new Spot Instance. If the decision is to move, there is more time to complete the snapshot than in the normal reclaim window. This allows an application with a memory footprint larger than 64 GB to run without interruption on a Spot Instance.

The current release uses memory size as the criterion for reacting to the Rebalance Recommendation signal. If the memory size is below a threshold (default 64 GB), the rebalance signal is ignored. Above the threshold, the rebalance signal triggers a migration. You can change the default by changing the value of cloud.handleRebalanceMemThreshold in the OpCenter configuration parameters.
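The rule amounts to a single comparison, sketched below. The function name is hypothetical, and treating a memory size exactly equal to the threshold as a trigger is an assumption, because the behavior at the boundary is not stated above.

```python
DEFAULT_THRESHOLD_GB = 64  # default value of cloud.handleRebalanceMemThreshold


def migrate_on_rebalance(memory_gb: float,
                         threshold_gb: float = DEFAULT_THRESHOLD_GB) -> bool:
    """Decide whether a Rebalance Recommendation signal triggers a migration.

    Below the threshold the signal is ignored. The handling of a value
    exactly at the threshold (treated as a trigger here) is an assumption.
    """
    return memory_gb >= threshold_gb


small_job = migrate_on_rebalance(32)                    # signal ignored
large_job = migrate_on_rebalance(128)                   # proactive migration
tuned = migrate_on_rebalance(32, threshold_gb=16)       # lower threshold set
```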