Workload Management¶
A workload is a collection of containers (e.g., a Kubernetes Deployment, DaemonSet, or StatefulSet) that run a particular application or service inside a Workspace. Most commonly, a Workspace and its workload have a one-to-one relationship, which simplifies management: starting a Workspace launches its associated workload, and stopping or deleting the Workspace terminates the underlying workload. Typical examples are interactive workloads such as Jupyter Notebook or VS Code environments.
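The one-to-one lifecycle described above can be pictured with a small Python model. This is only an illustrative sketch: the `Workspace`, `Workload`, and `Status` classes, the `-workload` naming, and the `Stopped` and `Deleted` states are assumptions made for the example, not the platform's actual API or status names.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # "Running" appears in the dashboard; the other two are assumed for illustration.
    STOPPED = "Stopped"
    RUNNING = "Running"
    DELETED = "Deleted"


@dataclass
class Workload:
    """Hypothetical stand-in for the containers behind a Workspace."""
    name: str
    status: Status = Status.STOPPED


@dataclass
class Workspace:
    """Hypothetical Workspace that owns exactly one workload."""
    name: str
    workload: Workload = field(init=False)

    def __post_init__(self) -> None:
        # One-to-one: the workload is created together with the Workspace.
        self.workload = Workload(name=f"{self.name}-workload")

    def start(self) -> None:
        if self.workload.status is Status.DELETED:
            raise RuntimeError("a deleted Workspace cannot be started")
        self.workload.status = Status.RUNNING

    def stop(self) -> None:
        if self.workload.status is Status.RUNNING:
            self.workload.status = Status.STOPPED

    def delete(self) -> None:
        self.workload.status = Status.DELETED


ws = Workspace("jupyter-demo")
ws.start()
print(ws.workload.name, ws.workload.status.value)  # jupyter-demo-workload Running
ws.stop()
print(ws.workload.name, ws.workload.status.value)  # jupyter-demo-workload Stopped
```

The point of the model is that every Workspace action has a direct effect on the workload, so there is no separate workload lifecycle to manage in the common case.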
Viewing and Controlling Workloads¶
- Workloads Dashboard
  - In the Workloads section (or via the GPU Cluster Manager UI’s “Workloads” page), you’ll see a list of running workloads.
  - Each row typically shows the workload's Name, Status (e.g., `Running`, `Pending`), Priority (e.g., `Lowest`, `Highest`), the associated Project and Node Groups, Requested Resources, and timestamps for when the workload was Created At and Finished At.
- Actions
  - From this view, you can Stop a workload (equivalent to stopping the corresponding Workspace) by clicking the Stop button.
    - Confirm the action in the pop-up dialog, or choose "Cancel" to return to the Workloads Dashboard.
  - You can also Delete a workload entirely if it’s no longer needed by clicking the Delete button.
    - Confirm the action in the pop-up dialog, or choose "Cancel" to return to the Workloads Dashboard.
    - Deleting removes the Workspace and its running container from the cluster.
- Detailed Workload Information
  - Clicking on a workload often reveals logs, events, or detailed container metrics. These insights help troubleshoot issues like unexpected restarts or resource bottlenecks.
Best Practices¶
- Use Projects: Assign your workloads (and hence Workspaces) to a relevant Project for logical grouping and easier management.
- Monitor Status: Keep an eye on workload statuses in case of errors (e.g., `CrashLoopBackOff`) that might require adjustments to your container image or resource limits.
- Resource Efficiency: Stop or delete workloads that are no longer active to free cluster resources (GPUs, CPUs, memory).
- Persistent Data: Attach volumes to any workload-based session (Workspace) that needs to retain data beyond the lifecycle of the container.
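
As a complement to watching the dashboard, status monitoring can also be scripted. The sketch below assumes you have direct access to the underlying Kubernetes cluster and can export pod state as JSON (for example with `kubectl get pods -o json`); whether your deployment grants that access is an assumption. The function name `find_crash_loops` and the sample pod names are hypothetical. The field layout (`status.containerStatuses[].state.waiting.reason`) follows the standard Kubernetes Pod status.

```python
from typing import Any


def find_crash_loops(pod_list: dict[str, Any]) -> list[tuple[str, str, int]]:
    """Return (pod, container, restart_count) for containers in CrashLoopBackOff."""
    flagged = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        # containerStatuses can be missing while a pod is still being scheduled.
        for cs in pod.get("status", {}).get("containerStatuses") or []:
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                flagged.append((pod_name, cs["name"], cs.get("restartCount", 0)))
    return flagged


# Hand-written sample in the shape of `kubectl get pods -o json` output.
sample = {
    "items": [
        {
            "metadata": {"name": "jupyter-demo-0"},
            "status": {"containerStatuses": [
                {"name": "notebook", "restartCount": 0, "state": {"running": {}}}
            ]},
        },
        {
            "metadata": {"name": "trainer-0"},
            "status": {"containerStatuses": [
                {"name": "train", "restartCount": 7,
                 "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
            ]},
        },
        # A pod that has not started any containers yet.
        {"metadata": {"name": "pending-0"}, "status": {"phase": "Pending"}},
    ]
}

for pod_name, container, restarts in find_crash_loops(sample):
    print(f"{pod_name}/{container}: CrashLoopBackOff after {restarts} restarts")
# prints: trainer-0/train: CrashLoopBackOff after 7 restarts
```

To run it against a live cluster, parse the saved `kubectl` output with `json.load` and pass the resulting dictionary to the function. A flagged container usually points to a problem with the container image or its resource limits, as noted above.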