Working with Snapshots (Checkpoint-Restore)
In the MemVerge AI Platform, Snapshots, also called Checkpoints, allow you to pause and resume a Workspace without losing the current state of a running job, such as an AI training process. This quick start guide demonstrates how to create a volume, launch a Jupyter Notebook environment for training, and safely pause and resume the Workspace while preserving progress.
1. Create a New Storage Volume
- Open the Volumes Dashboard
- In the left navigation, select Storage → Volumes.
- Click + New Volume.
- Specify Volume Details
- Enter a descriptive name (e.g., `training-data-vol`) and select an appropriate Storage Class.
- Allocate the required Size in GiB, and set the Access Mode (e.g., `ReadWriteOnce`).
- Confirm Creation
- Click Create and verify the volume appears in the list with a Bound status.
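The fields in this step (Storage Class, Size in GiB, Access Mode, Bound status) resemble those of a Kubernetes PersistentVolumeClaim. As a rough illustration only, and assuming (this guide does not confirm it) that the platform provisions volumes this way, the same details could be expressed as the manifest built below. The helper name `build_pvc_manifest` and the storage class `standard` are hypothetical placeholders.

```python
import json

# Access modes defined by Kubernetes for PersistentVolumeClaims.
ALLOWED_ACCESS_MODES = {
    "ReadWriteOnce",
    "ReadOnlyMany",
    "ReadWriteMany",
    "ReadWriteOncePod",
}


def build_pvc_manifest(name, storage_class, size_gib, access_mode="ReadWriteOnce"):
    """Build a PersistentVolumeClaim manifest as a plain dict (hypothetical helper)."""
    if access_mode not in ALLOWED_ACCESS_MODES:
        raise ValueError(f"unknown access mode: {access_mode}")
    if size_gib <= 0:
        raise ValueError("size_gib must be positive")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gib}Gi"}},
        },
    }


# "standard" is a placeholder; use a Storage Class that exists in your cluster.
manifest = build_pvc_manifest("training-data-vol", "standard", 50)
print(json.dumps(manifest, indent=2))
```

This is only meant to show how the UI fields relate to one another; the Volumes Dashboard remains the documented way to create the volume.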
2. Create a New Workspace
- Projects → + Create Workspace
- In the Projects view, or in the Workspaces area if available, click + Create Workspace.
- Configure the Workspace
- Give it a name (e.g., `jupyter-ml-lab`).
- Attach your newly created volume to it to store code and output.
- Choose a Compute Resource profile (number of GPUs, CPU cores, and memory).
- Enable Checkpointing if prompted, ensuring the Workspace can be paused and resumed safely.
- Create
- Click Create. Wait for the Workspace status to become Ready.
3. Start a Simple Training Job
- Open Jupyter Notebook
- Under Workspaces, click Connect for your `jupyter-ml-lab` Workspace.
- Jupyter Notebook (or a terminal) opens in a new browser tab.
- Run Your Code
- (Placeholder for Python code snippet.)
- This code will kick off a simple training loop that logs progress (e.g., epoch count) to the console or notebook cells.
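As one possible stand-in for the training code, the sketch below fits a line to synthetic data with plain gradient descent and prints one log line per epoch. It is not the guide's official snippet: the function name `train`, its parameters, and the log format are assumptions chosen for illustration, and it uses only the Python standard library so that it runs in any notebook kernel.

```python
import random
import time


def train(epochs=20, pause_seconds=0.0, seed=0):
    """Fit y = w*x + b with gradient descent, printing one line per epoch."""
    rng = random.Random(seed)
    # Synthetic data drawn from y = 3x + 2 plus a little Gaussian noise.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    ys = [3.0 * x + 2.0 + rng.gauss(0.0, 0.05) for x in xs]

    w, b, learning_rate = 0.0, 0.0, 0.1
    history = []
    for epoch in range(1, epochs + 1):
        # Prediction errors with the parameters held at the start of the epoch.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / len(xs)
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        grad_b = 2.0 * sum(errors) / len(xs)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        history.append(loss)
        print(f"epoch {epoch:4d}  loss {loss:.6f}  w {w:.3f}  b {b:.3f}", flush=True)
        # Slow the loop down so there is time to stop and resume the Workspace.
        time.sleep(pause_seconds)
    return w, b, history
```

In a notebook cell, a call such as `train(epochs=600, pause_seconds=1.0)` keeps the loop running for about ten minutes, which leaves enough time to stop and resume the Workspace while the epoch counter is still advancing.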
4. Stop the Running Workspace
- Navigate to the Workspaces Dashboard
- Return to the MemVerge AI Platform UI and click Workspaces in the left navigation.
- Initiate Stop
- Locate your running `jupyter-ml-lab` Workspace.
- Click Stop and confirm in the popup dialog.
- Wait for Status
- The Workspace transitions from Ready to NotReady.
- Check your notebook page; the training job halts once the Workspace is fully stopped.
5. Resume the Workspace
- Click “Resume”
- In the Workspaces dashboard, click Resume next to the stopped Workspace.
- Wait for the Workspace status to change from NotReady back to Ready.
- Reopen or Refresh
- If the Jupyter tab is still open, refresh the page. Otherwise, click Connect to open a new tab.
- The training job automatically resumes from the epoch it had reached when the Workspace was stopped, because its state is restored from the checkpoint.
6. Verify Training Progress
- Continued Logging: Upon refreshing, you’ll see the training log picking up from where it left off.
- State Persistence: All data, code, and model checkpoints remain intact within the attached volume, ensuring no progress is lost.
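To check the log more rigorously than by eye, the epoch numbers can be verified programmatically. The sketch below assumes each log line contains the text `epoch N` (an assumed format, not one mandated by the platform) and confirms that the numbers increase by exactly one, so a restart from epoch 1 or a skipped epoch after resuming would be reported. The helper name `check_epoch_continuity` is hypothetical.

```python
import re

# Matches the word "epoch" followed by whitespace and an integer.
EPOCH_PATTERN = re.compile(r"epoch\s+(\d+)")


def check_epoch_continuity(log_lines):
    """Return (ok, message) describing whether the logged epochs are consecutive."""
    epochs = []
    for line in log_lines:
        match = EPOCH_PATTERN.search(line)
        if match:
            epochs.append(int(match.group(1)))
    if not epochs:
        return False, "no epoch lines found"
    # Every epoch must be exactly one more than the epoch logged before it.
    for previous, current in zip(epochs, epochs[1:]):
        if current != previous + 1:
            return False, f"epoch jumped from {previous} to {current}"
    return True, f"epochs {epochs[0]}..{epochs[-1]} are consecutive"
```

To use it, copy the cell output into a string and pass `text.splitlines()` to the function. A result of `(True, ...)` indicates that the job continued from where it was stopped rather than starting over.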