Memory Machine CE Architecture

Memory Machine CE has several components that interact with cloud services to provide mobility for submitted jobs so that, for example, jobs run to completion even when they are placed on Spot Instances.

Design

Memory Machine CE enables users to run containerized applications on virtual machines leased from Cloud Service Providers. The virtual machines are usually Spot Instances but they can also be On-demand Instances (this is a per-job configuration option). All functions are controlled by the Memory Machine CE Operations Center, which receives CLI commands from users (clients) and manages resources in the cloud, as shown in Memory Machine CE Architecture.

Figure 1. Memory Machine CE Architecture
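
Whether a job runs on a Spot Instance or an On-demand Instance is a per-job choice made at submission time. The lines below are a minimal, illustrative sketch of that choice; the image name, script name, and flag spellings (including the policy option) are assumptions, so consult the float CLI help (for example, float submit --help) for the exact options in your OpCenter release.

    # Illustrative only: flag names are assumptions, not exact product syntax.
    float submit -i myrepo/myapp:1.0 -j myjob.sh --cpu 4 --mem 16                           # Spot Instance (typical default)
    float submit -i myrepo/myapp:1.0 -j myjob.sh --cpu 4 --mem 16 --vmPolicy onDemand=true  # request an On-demand Instance instead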

Components

The Memory Machine CE architecture includes the following components.
  • Users (Clients)

    Using the Memory Machine CE GUI or the Memory Machine CE CLI in a terminal shell, clients interact with the Memory Machine CE OpCenter (see the CLI example after this list).

  • Memory Machine CE OpCenter

    This provides the core functionality that allows Memory Machine CE to marshal resources for starting workloads and to migrate workloads if needed. If the OpCenter is not running, currently executing jobs continue (but are not migrated) and new jobs are not scheduled.

  • Application Library

    A container image registry is a service for hosting and distributing images. The default registry for Docker images is Docker Hub. A repository is a collection of images within a registry (one registry hosts many repositories). A private repository requires a username and access token to post or retrieve images; a public repository does not. The Application Library contains a database of information for accessing container images in various repositories (public and private).

  • Worker Nodes

    These are the compute engines provided by virtual machines running in the Cloud Service Provider's network. Worker nodes may have locally mounted file systems and attached storage. On-demand Instances or Spot Instances can be used as worker nodes.
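
Before the components above can be used from a terminal, the CLI must be pointed at a running OpCenter. The sketch below is illustrative only; the address placeholder and flag spellings are assumptions, so check float login --help for the exact syntax.

    # Illustrative only: log the CLI in to the OpCenter, then explore the available options.
    float login -a <opcenter-address> -u <username>
    float submit --help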

Workflow

The operation of Memory Machine CE proceeds as follows. A client submits a job to the OpCenter, using CLI command options (the GUI can generate the CLI command line for you) to select a container image from the Application Library and to specify the compute resources needed. The OpCenter uses this information to orchestrate the necessary resources in the Cloud Service Provider's network and schedules the job for execution. The cloud resources always include a compute node and may include block storage and file systems as well. One or more data sets usually accompany a job. The user-provided job script describes how these data sets are accessed, for example, by copying data from an AWS S3 bucket. The job script also describes where the output is placed — results are usually written to a persistent file system. When the job has run to completion, the user retrieves the results.
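
As a concrete sketch of this flow, the fragment below submits a containerized job and shows a job script that stages data from an S3 bucket and writes results back to persistent storage. The image name, script name, bucket paths, and CLI flags are all illustrative assumptions, not exact product syntax.

    # Illustrative only: submit the job, selecting an image and the compute resources it needs.
    float submit -i myrepo/myapp:1.0 -j myjob.sh --cpu 8 --mem 32

    # myjob.sh -- hypothetical user-provided job script
    #!/bin/bash
    aws s3 cp s3://my-bucket/input/ /data/input/ --recursive      # copy the input data set from S3
    myapp --in /data/input --out /data/output                     # run the containerized application
    aws s3 cp /data/output/ s3://my-bucket/results/ --recursive   # place results in persistent storage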

Workload Continuity

Using the AppCapsule feature, OpCenter automatically moves a job running on a Spot Instance to a new Spot Instance if the first Spot Instance is reclaimed, as illustrated in Memory Machine CE Operation.

Figure 2. Memory Machine CE Operation

In the example shown, the job starts executing on Spot Instance A. If the Cloud Service Provider signals that it intends to reclaim Spot Instance A, the OpCenter triggers the AppCapsule feature to capture the state of the running job and export the checkpoint image to persistent storage. A new Spot Instance is started (Spot Instance B in this case), the checkpoint image is imported from persistent storage, and the job resumes execution.

The job continues to run in this manner until completion, at which point the user retrieves the final results.
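
Throughout the checkpoint-and-restore sequence, the job remains visible to the user as a single job. One way to follow its progress is sketched below; the command and option names are assumptions, so verify them against the float CLI reference.

    # Illustrative only: monitor the job while it migrates between Spot Instances.
    float squeue                 # list jobs managed by the OpCenter
    float show -j <job-id>       # inspect one job's status, host instance, and history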

Workload Mobility

For jobs running on Spot Instances, job migration occurs automatically if the Spot Instance is reclaimed. The new Spot Instance is of the same type as the one that was reclaimed.

Using the CLI, you can manually migrate a job from one virtual machine to another, for example, from a Spot Instance to an On-demand Instance of a different type. CLI command options let you specify the new On-demand Instance by instance type (for example, c6i.xlarge in AWS) or by ranges of memory size and number of virtual CPUs.
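
For example, a manual migration might look like the sketch below; the job ID placeholder, flag names, and range syntax are assumptions, so check float migrate --help for the options your release supports.

    # Illustrative only: move a running job to a different virtual machine.
    float migrate -j <job-id> -t c6i.xlarge           # target a specific On-demand instance type
    float migrate -j <job-id> --cpu 4:8 --mem 16:32   # or give ranges of vCPUs and memory (GiB)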

Workload mobility is useful for "right-sizing" compute platforms: jobs can pass through several execution stages where the compute requirements are different; for example, one stage needs memory optimization whereas another stage needs compute optimization. Workload mobility moves the job from one compute platform to another as the resource demands change.

Job migration can be initiated in three ways:
  • Manually, using CLI commands.
  • Automatically, using a rules-based policy driven by resource utilization, for example, CPU or memory utilization.
  • Programmatically, by inserting float migrate commands at breakpoints specified in the job script, for example, after loading data (see the job-script sketch after this list).
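
A minimal sketch of the programmatic case is shown below, assuming the float CLI is available inside the job environment and can identify the current job without an explicit job ID (both assumptions); the helper script names are also hypothetical.

    #!/bin/bash
    # Illustrative job script: request a larger platform once data loading is finished.
    ./load_data.sh                            # stage the data sets (hypothetical helper)
    float migrate --cpu 16:32 --mem 64:128    # breakpoint: move to a compute-optimized instance
    ./run_analysis.sh                         # continue execution on the new instance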

Licensing

MemVerge maintains a license server where you request licenses for MemVerge products. Each Cloud Service Provider has its own mechanism for supporting the BYOL (Bring Your Own License) model. For example, AWS maintains its own License Manager so that you can apply licenses granted by third-party software vendors to applications you run in AWS's network. After the OpCenter license is granted, the MemVerge license server uses an API to load the license into the AWS License Manager, where you must accept and activate the license.

For some Cloud Service Providers, you must apply a license key or upload a license file directly, without using a license manager in the Cloud Service Provider's network.