SpotSurfer

SpotSurfer removes the risk of running stateful applications on Spot Instances while reducing costs significantly.

Feature Description

Cloud Service Providers (CSPs) sell Spot Instances at a deep discount compared to On-demand Instances. The drawback is that the CSP can reclaim any Spot Instance with only nominal warning (usually two minutes or less). This is a problem for stateful applications, where all progress is lost if the job does not run to completion.

SpotSurfer is a feature that allows stateful applications to always run to completion by "surfing" to a new VM instance if the underlying Spot Instance is reclaimed by the CSP. Customers get the cost savings of Spot Instances without the risk of a job restarting from the beginning if the Spot Instance is reclaimed.

The technology that makes SpotSurfer possible is AppCapsule, Memory Machine CE's checkpoint/restore (C/R) capability. The AppCapsule is a moment-in-time snapshot of the application instance, including in-memory state and relevant files, which allows the workload to move to a new virtual machine and resume running. AppCapsule underlies both SpotSurfer and WaveRider — two independent mobility features that can be enabled simultaneously for a given job.

SpotSurfer is automatically enabled if the job runs on a Spot Instance and automatically disabled if the job runs on an On-demand Instance. AppCapsule capture is triggered automatically when the CSP signals that it is reclaiming the Spot Instance. Job execution pauses and then resumes on a new Spot Instance.

Operation

To use SpotSurfer, you don't have to do anything. SpotSurfer's default behavior is to intercept the Spot Instance reclaim signal, create the AppCapsule, instantiate a new Spot Instance of the same type as the instance that is being reclaimed, and recover the job in the new instance.

Using the OpCenter or the CLI, it is possible to modify the VM creation policy while a job is running. If the VM creation policy is modified, the new instance is of the type specified by the new policy. In most cases, the new policy specifies a Spot Instance but it could specify an On-demand Instance.
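
The sequence described above can be pictured with a small Python sketch. It is illustrative only and does not reflect the actual OpCenter implementation: the Instance and Job classes, the handle_reclaim function, and the instance type names are all hypothetical. The sketch shows only the order of the steps and how a modified VM creation policy, if present, changes the type of the replacement instance.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instance:
    instance_type: str   # hypothetical example: "r5.4xlarge"
    pricing: str         # "spot" or "on-demand"


@dataclass
class Job:
    job_id: str
    instance: Instance
    events: List[str] = field(default_factory=list)


def handle_reclaim(job: Job, new_policy: Optional[Instance] = None) -> Job:
    """Hypothetical model of the steps SpotSurfer performs on a reclaim signal."""
    job.events.append("reclaim signal intercepted")
    job.events.append("AppCapsule created")
    # Default: a new Spot Instance of the same type as the reclaimed one.
    # If the VM creation policy was modified while the job was running,
    # the replacement follows the new policy instead.
    replacement = new_policy or Instance(job.instance.instance_type, "spot")
    job.events.append(
        f"new instance: {replacement.pricing} {replacement.instance_type}"
    )
    job.instance = replacement
    job.events.append("job recovered from AppCapsule")
    return job
```

With no policy change, the replacement matches the reclaimed instance type. Passing, for example, Instance("r5.8xlarge", "on-demand") as new_policy models a job whose VM creation policy was changed to an On-demand Instance before the reclaim occurred.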

SpotSurfer for Memory-optimized Applications

In its default mode, SpotSurfer captures a snapshot of the running job's state only once, immediately before migrating the job to a new virtual machine instance. When a Spot Instance reclaim event triggers the migration, the snapshot must complete within the short reclaim window. This prevents SpotSurfer from working with applications that have a large memory footprint, for example, larger than 128 GB.
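
A rough calculation shows why the reclaim window is the limiting factor. The sketch below assumes that the snapshot is about the same size as the application's memory footprint and that it is written at a constant rate. Both are simplifications, and the 0.5 GB/s write rate is a hypothetical figure, not a measured one.

```python
def required_throughput_gb_per_s(memory_gb: float, window_s: float) -> float:
    """Sustained write rate needed to save memory_gb within window_s seconds."""
    return memory_gb / window_s


def fits_in_window(memory_gb: float, throughput_gb_per_s: float,
                   window_s: float = 120.0) -> bool:
    """True if a snapshot of memory_gb can be written before the window closes."""
    return memory_gb / throughput_gb_per_s <= window_s


# 128 GB in a two-minute window needs a little over 1 GB/s, sustained.
rate = required_throughput_gb_per_s(128, 120)

# At an assumed 0.5 GB/s, 32 GB takes 64 s and fits; 128 GB takes 256 s and does not.
small_ok = fits_in_window(32, 0.5)
large_ok = fits_in_window(128, 0.5)
```

Under these assumptions, any storage path that cannot sustain roughly 1 GB/s for the full two minutes leaves a 128 GB snapshot incomplete when the instance is reclaimed.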

If periodic snapshots are enabled (set snapshotInterval to at least 10m when submitting a job), snapshots are taken at fixed intervals that are long enough for each snapshot to complete, that is, for an AppCapsule to be captured and written to persistent storage. If a snapshot cannot complete within the Spot Instance reclaim window, the most recent complete snapshot is used to recover the job on the new virtual machine. In that case, all job progress since that snapshot is lost.
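
The recovery rule and the resulting loss of progress can be sketched as follows. This is a simplified model, not the product's implementation: the Snapshot class and both functions are hypothetical, and times are measured in seconds from the start of the job.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snapshot:
    captured_at_s: float   # job time that the snapshot represents
    complete: bool         # True once it is fully written to persistent storage


def pick_recovery_snapshot(snapshots: List[Snapshot]) -> Optional[Snapshot]:
    """Return the most recent complete snapshot, or None if there is none."""
    complete = [s for s in snapshots if s.complete]
    return max(complete, key=lambda s: s.captured_at_s, default=None)


def lost_progress_s(snapshots: List[Snapshot], interrupted_at_s: float) -> float:
    """Job time that must be repeated after recovery."""
    snap = pick_recovery_snapshot(snapshots)
    if snap is None:
        # No usable snapshot: the job restarts from the beginning.
        return interrupted_at_s
    return interrupted_at_s - snap.captured_at_s


# 10-minute interval: snapshots at 600 s and 1200 s completed; the snapshot
# attempted at reclaim time (1500 s) did not finish inside the window.
history = [Snapshot(600, True), Snapshot(1200, True), Snapshot(1500, False)]
lost = lost_progress_s(history, interrupted_at_s=1500)   # 300 s of work repeated
```

In this model the loss is bounded by the snapshot interval, so a shorter interval reduces the worst-case loss as long as each snapshot still has time to complete.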

When the OpCenter runs on AWS, additional measures are available to handle memory-optimized applications. If AWS determines that a Spot Instance has an increased likelihood of being reclaimed, AWS may, at its discretion, send an AWS Rebalance Recommendation signal. The OpCenter intercepts the signal and uses a rules-based approach to decide whether to keep the current Spot Instance or to proactively capture a snapshot and move to a new Spot Instance. If the decision is to move, there is more time to complete the snapshot than in the normal reclaim window. This allows an application with a memory footprint larger than 128 GB to run without interruption on a Spot Instance.