Preempting Jupyter Notebooks

In this section, we will run through the steps to preempt a Jupyter notebook. We will use the same example as in the Quick Start section, where we have two users, user-red and user-blue. The red user will create a Jupyter notebook and start a training run, and the blue user will then preempt it.

Red User

Open a new browser window and log in as user-red.

Create Volume

Head over to storage.

Create a default volume which will be used by the Jupyter notebook.

Create Workspace

Once that's done go to workspaces.

Create a Jupyter Notebook

Create a new workspace, provide a name and select a project (you should only have one project available).
Use Jupyter + PyTorch + Cuda as an environment.


A few seconds after creation, you will be able to access the Jupyter notebook.

Connect to Jupyter Notebook

Connect to the Jupyter notebook by clicking the connect link. You'll be greeted by the Launcher.
Select the Python 3 notebook.

Copy in the code below and run it. This will start training a simple CNN model.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import time, datetime

# Use GPU if available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Define a synthetic dataset of random images and labels.
class RandomDataset(data.Dataset):
    def __init__(self, num_samples, input_size, num_classes):
        self.num_samples = num_samples
        self.input_size = input_size
        self.num_classes = num_classes

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Create a random image and a random label
        x = torch.randn(self.input_size)
        y = torch.randint(0, self.num_classes, (1,)).item()
        return x, y

# A simple CNN model.
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  # (B,32,64,64)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                             # (B,32,32,32)
            nn.Conv2d(32, 64, kernel_size=3, padding=1), # (B,64,32,32)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2)                              # (B,64,16,16)
        )
        # Flatten and classify
        self.classifier = nn.Sequential(
            nn.Linear(64 * 16 * 16, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # Flatten
        x = self.classifier(x)
        return x

# Parameters
num_samples = 10000         # Total number of synthetic samples
input_size = (3, 64, 64)      # 3-channel images of size 64x64
num_classes = 10
batch_size = 32

# Create dataset and dataloader.
dataset = RandomDataset(num_samples, input_size, num_classes)
dataloader = data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Initialize the model, loss function, and optimizer.
model = SimpleCNN(num_classes=num_classes).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Maximum training duration in minutes.
dur = 45
# Convert the maximum training duration to seconds.
max_duration = dur * 60
start_time = time.time()
epoch = 0

print(f"Starting training for {dur} minutes...")
while time.time() - start_time < max_duration:
    epoch += 1
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(dataloader):
        # Move data to the GPU.
        inputs = inputs.to(device)
        labels = labels.to(device)

        # Zero the parameter gradients.
        optimizer.zero_grad()
        # Forward pass.
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        # Backward pass and optimize.
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        # Break if we've reached the maximum training duration.
        if time.time() - start_time >= max_duration:
            break


    # At the end of each epoch, print a timestamped progress line.
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
    print(f"{timestamp} - Epoch {epoch} completed, Average Loss: {running_loss/(i+1):.4f}")

print(f"Training completed in {dur} minutes.")
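
The shape comments in SimpleCNN and the 64 * 16 * 16 input size of the first linear layer follow from the usual convolution and pooling size arithmetic. The following is a minimal sketch in plain Python (no PyTorch needed) that reproduces those numbers. The helper names conv2d_out and maxpool2d_out are hypothetical and exist only to illustrate the formula.

```python
def conv2d_out(size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


def maxpool2d_out(size, kernel_size):
    """Spatial output size of max pooling with stride equal to the kernel size."""
    return (size - kernel_size) // kernel_size + 1


size = 64                                           # input images are 3 x 64 x 64
size = conv2d_out(size, kernel_size=3, padding=1)   # first convolution keeps 64
size = maxpool2d_out(size, kernel_size=2)           # first pooling halves to 32
size = conv2d_out(size, kernel_size=3, padding=1)   # second convolution keeps 32
size = maxpool2d_out(size, kernel_size=2)           # second pooling halves to 16

# 64 output channels after the second convolution, each of size x size.
flat_features = 64 * size * size
print(flat_features)  # 16384, i.e. 64 * 16 * 16
```

If you change input_size or add layers, the first nn.Linear has to be adjusted to the new flattened size.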

Run the Training

Once training starts, the notebook prints the epoch number and the average loss after each epoch.

On the admin dashboard you'll see the GPU usage going up.
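
If you do not have access to the admin dashboard, GPU usage can usually also be checked from inside the workspace. The sketch below assumes that the nvidia-smi command-line tool is available in the workspace image, which is not confirmed here; the function name gpu_usage is hypothetical. Because the training cell keeps the kernel busy, run it from a second notebook or a terminal.

```python
import shutil
import subprocess


def gpu_usage():
    """Return a list of (utilization %, memory used MiB) strings per GPU.

    Returns None if nvidia-smi is missing or fails (assumption: the tool
    is on PATH inside the workspace).
    """
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu,memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    # One CSV line per GPU; values are kept as strings because some
    # devices report placeholders instead of numbers.
    return [
        tuple(field.strip() for field in line.split(","))
        for line in result.stdout.strip().splitlines()
    ]


print(gpu_usage())
```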

Blue User

Open a new browser window (different browser or private session) and log in as user-blue.

Repeat the steps from the red user section to create a volume, a workspace, and a Jupyter notebook.

After you have created the workspace, the red session will show that the connection was lost.

That's due to the higher priority of the blue user. The red user will have to wait until the blue user is done with the training.
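
The behaviour can be pictured with a toy model. The platform's actual scheduler is not described in this section, so the sketch below is only an illustration under stated assumptions: a single GPU, at most one running workload, and made-up priority numbers. The names Workload and SingleGpuScheduler are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    owner: str
    priority: int  # higher value wins; the numbers are illustrative only


class SingleGpuScheduler:
    """Toy model: one GPU, at most one running workload."""

    def __init__(self):
        self.running = None
        self.waiting = []

    def submit(self, workload):
        if self.running is None:
            self.running = workload
        elif workload.priority > self.running.priority:
            # Preempt: the lower-priority workload loses the GPU and waits.
            self.waiting.append(self.running)
            self.running = workload
        else:
            self.waiting.append(workload)

    def stop_running(self):
        """Free the GPU and return the workloads that could now be restarted."""
        self.running = None
        return list(self.waiting)


scheduler = SingleGpuScheduler()
scheduler.submit(Workload("user-red", priority=1))
scheduler.submit(Workload("user-blue", priority=2))
print(scheduler.running.owner)               # user-blue
print([w.owner for w in scheduler.waiting])  # ['user-red']
```

In this model, stopping the blue workload frees the GPU but does not restart the red one automatically.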

Continue Red User

Once the blue user is done with the training and stops the workspace, the red user will be able to resume the training.

Question

The workspace needs to be restarted manually. Is there a way to restart it automatically after it was preempted? Or at least to get a notification that the workspace was preempted and can now be restarted?

Once the notebook is reconnected you'll see that the training continues where it left off.

As you can see in the output, there's a 10-minute gap in the epoch output.
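
For longer runs, such gaps can be located programmatically instead of by eye. The following is a minimal sketch that parses lines in the format printed by the training script and reports consecutive timestamps that are further apart than a threshold. The function name find_gaps, the 5-minute threshold, and the sample lines are assumptions made for illustration; the sample is not taken from a real run.

```python
import datetime

LOG_FORMAT = "%Y-%m-%d %H:%M"  # same format as the training script's timestamp


def find_gaps(lines, threshold_minutes=5):
    """Return (previous, current) timestamp pairs further apart than the threshold."""
    threshold = datetime.timedelta(minutes=threshold_minutes)
    stamps = []
    for line in lines:
        if " - Epoch " not in line:
            continue  # skip anything that is not an epoch progress line
        stamp_text = line.split(" - Epoch ", 1)[0]
        stamps.append(datetime.datetime.strptime(stamp_text, LOG_FORMAT))
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > threshold]


# Made-up sample output, not taken from a real run.
sample = [
    "2025-01-01 10:00 - Epoch 1 completed, Average Loss: 2.3031",
    "2025-01-01 10:01 - Epoch 2 completed, Average Loss: 2.3028",
    "2025-01-01 10:11 - Epoch 3 completed, Average Loss: 2.3027",
]
for previous, current in find_gaps(sample):
    print(f"Gap of {current - previous} between {previous} and {current}")
```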

Question

Is the gap in the epoch numbers due to the notebook being disconnected, so that the stdout produced in the meantime is lost?
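
Independent of the answer, one common way to keep progress output that the browser session misses is to write it to a file as well as to the notebook. The sketch below uses Python's standard logging module. The function name make_logger and the log location are hypothetical; in practice the path would point at the volume created earlier, and whether a file there survives preemption on this platform is an assumption that has not been verified here.

```python
import logging
import os
import tempfile


def make_logger(log_path):
    """Create a logger that writes to both the notebook output and a file."""
    logger = logging.getLogger("training")
    logger.setLevel(logging.INFO)
    # Remove handlers from earlier runs of this cell to avoid duplicate lines.
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
        old_handler.close()
    formatter = logging.Formatter("%(asctime)s - %(message)s", "%Y-%m-%d %H:%M")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# Hypothetical location; replace it with a path on the mounted volume.
log_path = os.path.join(tempfile.gettempdir(), "training.log")
logger = make_logger(log_path)

# In the training script, this call would replace the print at the end of each epoch.
logger.info("Epoch %d completed, Average Loss: %.4f", 1, 2.3026)
```

After reconnecting, the file can be opened from the Jupyter file browser to check which epochs completed while the session was disconnected.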