Working with Snapshots (Checkpoint-Restore)

In the MemVerge AI Platform, Snapshots (also called Checkpoints) let you pause and resume a Workspace without losing the state of a running job, such as an AI training process. This quick start guide shows how to create a volume, launch a Jupyter Notebook environment for training, and pause and resume the Workspace while preserving progress.

1. Create a New Storage Volume

Open the Volumes Dashboard
In the left navigation, select Storage → Volumes.
Click + New Volume.
Specify Volume Details
Enter a descriptive name (e.g., training-data-vol) and select an appropriate Storage Class.
Allocate the required Size in GiB, and set the Access Mode (e.g., ReadWriteOnce).
Confirm Creation
Click Create and verify the volume appears in the list with a Bound status.

2. Create a New Workspace

Projects → + Create Workspace
In the Projects view, or in the Workspaces area if available, click + Create Workspace.
Configure the Workspace
Give it a name (e.g., jupyter-ml-lab).
Assign it to your newly created volume to store code and output.
Choose a Compute Resource profile (number of GPUs, CPU cores, and memory).
Enable Checkpointing if prompted, so that the Workspace can be paused and resumed safely.
Create
Click Create. Wait for the Workspace status to become Ready.

3. Start a Simple Training Job

Open Jupyter Notebook
Under Workspaces, click Connect for your jupyter-ml-lab Workspace.
Jupyter Notebook (or a terminal) opens in a new browser tab.
Run Your Code
Paste your training code into a notebook cell and run it.
The code should start a simple training loop that logs progress (for example, the epoch count) to the console or to the notebook cell output.
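
A minimal sketch of such a training loop, using only the Python standard library, might look like the following. Everything in it is an illustrative assumption rather than part of the platform: the toy linear model, the function names `make_data` and `train`, and the `delay_s` pause that slows each epoch so there is time to stop the Workspace mid-run. It calls no MemVerge API; any long-running loop that logs its epoch number would serve the same purpose.

```python
import random
import time


def make_data(n=200, seed=0):
    """Generate noisy samples of the line y = 3x + 2 (toy data, fixed seed)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 2.0 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def train(epochs=300, lr=0.1, delay_s=1.0, log=print):
    """Fit y = w*x + b by full-batch gradient descent, logging once per epoch."""
    xs, ys = make_data()
    n = len(xs)
    w, b = 0.0, 0.0
    for epoch in range(1, epochs + 1):
        grad_w = grad_b = loss = 0.0
        for x, y in zip(xs, ys):
            err = (w * x + b) - y
            loss += err * err          # squared error before this epoch's update
            grad_w += 2.0 * err * x
            grad_b += 2.0 * err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
        # The epoch counter is the value to watch when the Workspace resumes.
        log(f"epoch {epoch:4d}/{epochs}  loss={loss / n:.5f}  w={w:.3f}  b={b:.3f}")
        time.sleep(delay_s)            # slow the loop so it can be paused mid-run
    return w, b
```

Calling `train()` in a notebook cell prints one line per epoch for roughly five minutes with the default settings. Note the last epoch number shown before you stop the Workspace in the next step.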

4. Stop the Running Workspace

Navigate to the Workspaces Dashboard
Return to the MemVerge AI Platform UI and click Workspaces in the left navigation.
Initiate Stop
Locate your running jupyter-ml-lab Workspace.
Click Stop and confirm in the popup dialog.
Wait for Status
The Workspace transitions from Ready to NotReady.
Check your notebook page; the training job halts once the Workspace is fully stopped.

5. Resume the Workspace

Click Resume
In the Workspaces dashboard, click Resume next to the stopped Workspace.
Wait for the Workspace status to change from NotReady back to Ready.
Reopen or Refresh
If the Jupyter tab is still open, refresh the page. Otherwise, click Connect to open a new tab.
Because the Workspace was checkpointed, the training job resumes automatically from the epoch it had reached when the Workspace was stopped.

6. Verify Training Progress

Continued Logging: After you refresh the page, the training log continues from where it left off instead of restarting at the first epoch.
State Persistence: All data, code, and model checkpoints remain intact within the attached volume, ensuring no progress is lost.
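
If you prefer to check the log programmatically rather than by eye, a small sketch like the one below can help. It assumes the training output contains lines with the word `epoch` followed by a number (for example `epoch 12/300`); that log format and the helper names `epochs_in` and `is_continuous` are hypothetical and should be adapted to your own training code.

```python
import re

# Matches "epoch" followed by whitespace and an integer, e.g. "epoch   12/300".
EPOCH_RE = re.compile(r"epoch\s+(\d+)")


def epochs_in(log_text):
    """Return the epoch numbers found in the log, in order of appearance."""
    return [int(m.group(1)) for m in EPOCH_RE.finditer(log_text)]


def is_continuous(epochs):
    """True if every epoch follows the previous one by exactly 1.

    A job that resumed from a snapshot keeps counting (..., 41, 42, 43, ...);
    a job that restarted from scratch shows the counter dropping back to 1.
    """
    return all(b == a + 1 for a, b in zip(epochs, epochs[1:]))


# Example with two short, made-up logs.
resumed_log = "epoch 1/5\nepoch 2/5\nepoch 3/5\nepoch 4/5\n"
restarted_log = "epoch 1/5\nepoch 2/5\nepoch 1/5\nepoch 2/5\n"

print(is_continuous(epochs_in(resumed_log)))    # True
print(is_continuous(epochs_in(restarted_log)))  # False
```

Copy the notebook cell output into a string and pass it to `epochs_in` to confirm that the sequence has no gap or reset around the point where the Workspace was stopped.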