Working with Snapshots (Checkpoint-Restore)

In the MemVerge AI Platform, Snapshots, also called Checkpoints, allow you to pause and resume a Workspace without losing the current state of a running job, such as an AI training process. This quick start guide demonstrates how to create a volume, launch a Jupyter Notebook environment for training, and safely pause and resume the Workspace while preserving progress.

1. Create a New Storage Volume

  1. Open the Volumes Dashboard
     • In the left navigation, select Storage → Volumes.
     • Click + New Volume.
  2. Specify Volume Details
     • Enter a descriptive name (e.g., training-data-vol) and select an appropriate Storage Class.
     • Allocate the required Size in GiB, and set the Access Mode (e.g., ReadWriteOnce).
  3. Confirm Creation
     • Click Create and verify the volume appears in the list with a Bound status.

2. Create a New Workspace

  1. Projects → + Create Workspace
     • In the Projects view, or in the Workspaces area if available, click + Create Workspace.
  2. Configure the Workspace
     • Give it a name (e.g., jupyter-ml-lab).
     • Attach the volume you just created so that code and output are stored on it.
     • Choose a Compute Resource profile (number of GPUs, CPU cores, and memory).
     • Enable Checkpointing if prompted, so that the Workspace can be paused and resumed safely.
  3. Create
     • Click Create, then wait for the Workspace status to become Ready.

3. Start a Simple Training Job

  1. Open Jupyter Notebook
     • Under Workspaces, click Connect for your jupyter-ml-lab Workspace.
     • Jupyter Notebook (or a terminal) opens in a new browser tab.
  2. Run Your Code
     • Run your training code in a notebook cell (or from the terminal).
     • For this walkthrough, a simple training loop that logs its progress (e.g., the epoch count) to the console or notebook cell is enough.
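
The guide does not include its own training snippet, so the following is a minimal stand-in, not the platform's official example. It uses only the Python standard library, fits a toy linear model with plain SGD, and prints one line per epoch. The function name `train`, its parameters, and the `epoch N/M` log format are all assumptions made for illustration. The loop deliberately saves no state of its own, because this guide attributes resumption to the Workspace checkpoint rather than to the training code.

```python
import random
import time


def train(num_epochs=300, steps_per_epoch=100, epoch_delay_s=1.0, seed=0):
    """Fit y = 2x + 1 with plain SGD and print one progress line per epoch.

    All names and defaults are illustrative; epoch_delay_s only slows the
    loop down so there is time to stop and resume the Workspace mid-run.
    """
    rng = random.Random(seed)

    # Synthetic data set: noisy samples of the line y = 2x + 1.
    data = []
    for _ in range(steps_per_epoch):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)))

    w, b, lr = 0.0, 0.0, 0.1
    history = []
    for epoch in range(1, num_epochs + 1):
        total_loss = 0.0
        for x, y in data:
            err = (w * x + b) - y
            total_loss += err * err
            # SGD step on the per-sample loss 0.5 * err**2.
            w -= lr * err * x
            b -= lr * err
        mean_loss = total_loss / len(data)
        history.append(mean_loss)
        # flush=True so each line shows up in the notebook immediately.
        print(f"epoch {epoch}/{num_epochs} loss={mean_loss:.4f}", flush=True)
        time.sleep(epoch_delay_s)
    return w, b, history
```

Calling `train()` in a notebook cell with these defaults runs for roughly five minutes (300 epochs at about one second each), which leaves enough time to stop and resume the Workspace while the loop is still printing.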

4. Stop the Running Workspace

  1. Navigate to the Workspaces Dashboard
     • Return to the MemVerge AI Platform UI and click Workspaces in the left navigation.
  2. Initiate Stop
     • Locate your running jupyter-ml-lab Workspace.
     • Click Stop and confirm in the popup dialog.
  3. Wait for Status
     • The Workspace transitions from Ready to NotReady.
     • Check your notebook page; the training job halts once the Workspace is fully stopped.

5. Resume the Workspace

  1. Click “Resume”
     • In the Workspaces dashboard, click Resume next to the stopped Workspace.
     • Wait for the Workspace status to change from NotReady back to Ready.
  2. Reopen or Refresh
     • If the Jupyter tab is still open, refresh the page. Otherwise, click Connect to open a new tab.
     • Because the Workspace was checkpointed, the training job resumes automatically from the epoch it had reached when the Workspace was stopped, rather than starting over.

6. Verify Training Progress

  • Continued Logging: Upon refreshing, you’ll see the training log picking up from where it left off.
  • State Persistence: All data, code, and model checkpoints remain intact within the attached volume, ensuring no progress is lost.
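
One way to check the "picking up from where it left off" claim is to confirm that the logged epoch numbers never jump back. The sketch below assumes the training output has been copied into a string and that each progress line contains text of the form `epoch N/M`. Both the log format and the helper names are assumptions for illustration, not something the platform provides.

```python
import re

# Assumed progress-line format: "epoch <current>/<total> ...".
EPOCH_RE = re.compile(r"epoch (\d+)/(\d+)")


def epochs_in_log(log_text):
    """Return the epoch numbers found in the log, in order of appearance."""
    return [int(m.group(1)) for m in EPOCH_RE.finditer(log_text)]


def resumed_without_restart(log_text):
    """True if every logged epoch is exactly one more than the previous one.

    A job that restarted from scratch would show the counter dropping back
    (e.g., ..., 41, 42, 1, 2), which makes this check return False.
    """
    epochs = epochs_in_log(log_text)
    if len(epochs) < 2:
        return False  # not enough lines to judge
    return all(b == a + 1 for a, b in zip(epochs, epochs[1:]))
```

For example, `resumed_without_restart(notebook_output)` on output copied after the resume should be `True` if the job continued, and `False` if the epoch counter started again from 1.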