
Preempting Jupyter Notebooks

In this section, we run through the steps to preempt a Jupyter notebook. We use the same example as in the Quick Start section, with two users, user-red and user-blue. The red user creates a Jupyter notebook and starts a training run, and the blue user preempts the red user.

Red User

Open a new browser window and log in as user-red.

dashboard

Create Volume

Head over to storage.

storage list

Create a default volume which will be used by the Jupyter notebook.

volume creation

Create Workspace

Once that's done, go to workspaces.

workspaces list

Create a Jupyter Notebook

Create a new workspace, provide a name and select a project (you should only have one project available).
Use Jupyter + PyTorch + Cuda as an environment.

workspace creation top, workspace creation bottom

A few seconds after creation, you will be able to access the Jupyter notebook.

WS created

Connect to Jupyter Notebook

Connect to the Jupyter notebook by clicking the connect link. You'll be greeted by the Launcher.
Select the Python 3 notebook.

Launcher

Paste the code below into a cell and run it. It trains a simple CNN model on synthetic data for 45 minutes.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import time
import datetime

# Use GPU if available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Define a synthetic dataset of random images and labels.
class RandomDataset(data.Dataset):
    def __init__(self, num_samples, input_size, num_classes):
        self.num_samples = num_samples
        self.input_size = input_size
        self.num_classes = num_classes

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Create a random image and a random label
        x = torch.randn(self.input_size)
        y = torch.randint(0, self.num_classes, (1,)).item()
        return x, y

# A simple CNN model.
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  # (B,32,64,64)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                             # (B,32,32,32)
            nn.Conv2d(32, 64, kernel_size=3, padding=1), # (B,64,32,32)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2)                              # (B,64,16,16)
        )
        # Flatten and classify
        self.classifier = nn.Sequential(
            nn.Linear(64 * 16 * 16, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # Flatten
        x = self.classifier(x)
        return x

# Parameters
num_samples = 10000         # Total number of synthetic samples
input_size = (3, 64, 64)      # 3-channel images of size 64x64
num_classes = 10
batch_size = 32

# Create dataset and dataloader.
dataset = RandomDataset(num_samples, input_size, num_classes)
dataloader = data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Initialize the model, loss function, and optimizer.
model = SimpleCNN(num_classes=num_classes).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

dur = 45  # Training duration in minutes
# Maximum training duration in seconds (45 minutes = 2700 seconds).
max_duration = dur * 60
start_time = time.time()
epoch = 0

print(f"Starting training for {dur} minutes...")
while time.time() - start_time < max_duration:
    epoch += 1
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(dataloader):
        # Move data to the GPU.
        inputs = inputs.to(device)
        labels = labels.to(device)

        # Zero the parameter gradients.
        optimizer.zero_grad()
        # Forward pass.
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        # Backward pass and optimize.
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        # Break once the maximum training duration is reached.
        if time.time() - start_time >= max_duration:
            break


    # End of epoch: print a timestamp and the average loss.
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
    print(f"{timestamp} - Epoch {epoch} completed, Average Loss: {running_loss/(i+1):.4f}")

print(f"Training completed in {dur} minutes.")

Run the Training

Once training starts, the cell prints the epoch number and the average loss after each epoch.

Training

On the admin dashboard you'll see the GPU usage going up.

GPU Usage
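If you want to watch the same trend from inside the notebook, a small helper around `nvidia-smi` can do it. This is a minimal sketch and rests on an assumption: that the `nvidia-smi` CLI is installed in the workspace image, which this tutorial does not state. The function names `parse_gpu_stats` and `query_gpu_stats` are illustrative and not part of the platform.

```python
import shutil
import subprocess

def parse_gpu_stats(csv_text):
    """Parse lines of 'utilization, memory used, memory total' (CSV, no header, no units)."""
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        util, mem_used, mem_total = (int(field.strip()) for field in line.split(","))
        stats.append({
            "util_percent": util,
            "mem_used_mib": mem_used,
            "mem_total_mib": mem_total,
        })
    return stats

def query_gpu_stats():
    """Return one dict per GPU, or None if nvidia-smi is missing or its output is unusable."""
    if shutil.which("nvidia-smi") is None:  # assumption: the CLI may not be in the image
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        )
        return parse_gpu_stats(result.stdout)
    except (subprocess.SubprocessError, ValueError, OSError):
        return None

print(query_gpu_stats())
```

Running the last line in a second cell while the training cell is busy will not work in a single notebook kernel, so you would call it from a separate notebook or a terminal in the same workspace.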

Blue User

Open a new browser window (different browser or private session) and log in as user-blue.

Repeat the steps from the red user to create a Jupyter notebook.

After you create the workspace as the blue user, the red session shows that the connection was lost.

Connection Lost

This happens because the blue user has a higher priority. The red user has to wait until the blue user is done with the training.

Blue user preempted red

Continue Red User

Once the blue user is done with the training and stops the workspace, the red user will be able to resume the training.

Red user resumes training
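Whether a run can resume depends on what the platform preserves across a preemption. In case the kernel's in-memory state does not survive in your setup, periodic checkpoints let the training restart from the last completed epoch. The following is a minimal sketch, not the platform's mechanism: the file name `checkpoint.pt` and the helpers `save_checkpoint` and `load_checkpoint` are hypothetical, and it assumes the notebook's working directory is on the volume created earlier.

```python
import os
import torch
import torch.nn as nn
import torch.optim as optim

def save_checkpoint(path, model, optimizer, epoch):
    """Write model and optimizer state to a temp file, then rename it into place."""
    tmp_path = path + ".tmp"
    torch.save(
        {"epoch": epoch,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        tmp_path,
    )
    # Renaming avoids leaving a half-written checkpoint if the workspace stops mid-save.
    os.replace(tmp_path, path)

def load_checkpoint(path, model, optimizer, device="cpu"):
    """Restore state if a checkpoint exists; return the last completed epoch (0 if none)."""
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location=device)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"]

# Tiny stand-in model so the sketch runs on its own. In the notebook you would
# pass the SimpleCNN model and the Adam optimizer from the training code.
model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=0.001)
epoch = load_checkpoint("checkpoint.pt", model, optimizer)
print("Resuming after epoch", epoch)
save_checkpoint("checkpoint.pt", model, optimizer, epoch + 1)
```

In the training loop, `load_checkpoint` would run once before the `while` loop to set `epoch`, and `save_checkpoint` would run at the end of each epoch.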

Question

The workspace needs to be restarted manually. Is there a way to autostart it after it was preempted? Or at least to get a notification that the workspace was preempted and can now be restarted?

Once the notebook is reconnected, you'll see that the training continues where it left off.

Epochs continue

As you can see in the output, there is a 10-minute gap between the epoch lines.
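To locate such a gap without reading the output by eye, the timestamps printed by the training loop can be compared pairwise. The sketch below parses lines in the format produced by the training code (`YYYY-MM-DD HH:MM - Epoch N completed, ...`). The function name `find_gaps`, the 5-minute threshold, and the sample lines are made up for illustration and are not taken from a real run.

```python
import datetime
import re

LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}) - Epoch (\d+) completed")

def find_gaps(lines, threshold_minutes=5):
    """Return (previous_epoch, next_epoch, gap_minutes) for each pair of
    consecutive epoch lines that are further apart than the threshold."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            ts = datetime.datetime.strptime(match.group(1), "%Y-%m-%d %H:%M")
            entries.append((ts, int(match.group(2))))
    gaps = []
    for (t0, e0), (t1, e1) in zip(entries, entries[1:]):
        minutes = (t1 - t0).total_seconds() / 60
        if minutes > threshold_minutes:
            gaps.append((e0, e1, minutes))
    return gaps

# Illustrative sample lines, not real output.
sample = [
    "2025-01-01 10:00 - Epoch 1 completed, Average Loss: 2.3051",
    "2025-01-01 10:01 - Epoch 2 completed, Average Loss: 2.3040",
    "2025-01-01 10:11 - Epoch 14 completed, Average Loss: 2.3032",
]
print(find_gaps(sample))
```

Each tuple reports the epoch numbers on both sides of the gap, so a jump in epoch numbers and a jump in time show up together.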

Question

The gap in epoch numbers is due to the notebook being disconnected, isn't it? The stdout produced in the meantime is lost.
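One way to keep the epoch output regardless of the browser connection is to write it to a file as well as to the cell output. This is a minimal sketch using Python's standard `logging` module. The logger name `training`, the file name `training.log`, and the helper `make_logger` are arbitrary choices, and it assumes the working directory is writable and persists across the preemption.

```python
import logging
import sys

def make_logger(path="training.log", name="training"):
    """Return a logger that writes each message to the cell output and to a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages out of the root logger
    if not logger.handlers:   # avoid duplicate handlers when the cell is re-run
        fmt = logging.Formatter("%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M")
        for handler in (logging.StreamHandler(sys.stdout), logging.FileHandler(path)):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger

logger = make_logger()
# Example values only; in the training loop these would be epoch and the average loss.
logger.info("Epoch %d completed, Average Loss: %.4f", 1, 2.3026)
```

In the training code, the `print` call at the end of each epoch would become a `logger.info` call, and the formatter takes over the timestamp that the loop currently builds by hand.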