Programmatic Job Migration

Initiate job migration by inserting the float migrate command into the job script.

Overview

Although not strictly required, most jobs are submitted with a job script that sets up the execution environment and tells the job scheduler what to do. Some jobs proceed in a series of stages: for example, an initial stage that reads in a large dataset, followed by a computationally intensive stage. These two stages may repeat several times until the computation is complete, after which the results are written to disk.

Each of these stages can have different resource requirements (CPU, memory, and storage). A float migrate command can be inserted at each point where the job moves to a stage with different requirements. In this manner, each stage runs on a virtual machine that matches its requirements.
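
The staged pattern described above can be sketched as a job script. This is a minimal sketch under stated assumptions, not a definitive implementation: the stage functions (load_dataset, run_compute, write_results), the migrate_to helper, and the FLOAT_BIN and MIGRATE_WAIT variables are hypothetical names, and the choice of instance type for each stage is only illustrative. The float path and the migrate options are the ones used in the example on this page.

```shell
#!/usr/bin/env bash
# Sketch of a staged job script. All function and variable names here are
# hypothetical; only the float path and migrate options come from this page.

FLOAT_BIN="${FLOAT_BIN:-/opt/memverge/bin/float}"
MIGRATE_WAIT="${MIGRATE_WAIT:-20}"   # seconds to pause after requesting migration

# Request migration to the given instance type before the next stage starts.
migrate_to() {
    local inst_type="$1"
    if [ -x "$FLOAT_BIN" ]; then
        "$FLOAT_BIN" migrate -f --instType "$inst_type"
        sleep "$MIGRATE_WAIT"
    else
        # Not running under float (for example, a local test run): skip.
        echo "float CLI not found, skipping migration to $inst_type" >&2
    fi
}

# Placeholder stages; a real job would do its work here.
load_dataset()  { echo "stage: load dataset"; }
run_compute()   { echo "stage: compute"; }
write_results() { echo "stage: write results"; }

run_job() {
    load_dataset              # runs on the instance type given at submission
    migrate_to c5.2xlarge     # more CPU and memory for the compute stage
    run_compute
    migrate_to c5.large       # a smaller instance is enough to write results
    write_results
}

run_job
```

Keeping the migration request in one helper means each stage boundary is a single line, and the script can still be run outside a float-managed VM, where the helper only prints a notice.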

Example

The example shown here uses a simple shell program to demonstrate how programmatic migration can be implemented. In the example job script, the job is migrated at an arbitrary point. In a real application, the job would migrate at the point where its resource requirements are expected to change.

The contents of the job script are as follows:
cat progmigrate.sh
#!/usr/bin/env bash
# The first argument is the directory that receives the output file
LOG_PATH=$1
LOG_FILE="$LOG_PATH/output"
touch "$LOG_FILE"
# Send stdout and stderr to the output file
exec >"$LOG_FILE" 2>&1
echo "Starting to execute shell script"
echo "Job migrates when count reaches 50"
for (( c=1; c<300; c++ ))
do
        if (( c % 3 == 0 )); then
                echo "$c is a multiple of three"
        else
                echo "$c is NOT a multiple of three" >&2
        fi
        if (( c == 50 )); then
                # Request migration to a larger instance type
                /opt/memverge/bin/float migrate -f --instType c5.2xlarge
                echo "Job migration initiated, wait for a while"
                sleep 20s
        fi
        sleep 1s
done
echo "Job is complete"
Submit the job script so that it runs on a centos9 image, starting on a c5.large instance:
float submit -i centos9 -j ./progmigrate.sh --instType c5.large --dataVolume [size=10]:/data
The output file shows the point at which the job migration occurs:
float log tail --follow output -j 4BZXIxWu6L5ioCSqEMD63
Starting to execute shell script
Job migrates when count reaches 50
1 is NOT a multiple of three
----[edited]
50 is NOT a multiple of three
4BZXIxWu6L5ioCSqEMD63 on i-0027c7d62eaabafa6 is now migrating, please use squeue/show to monitor migration progress. Check logs for details
Job migration initiated, wait for a while
51 is a multiple of three
----[edited]
299 is NOT a multiple of three
Job is complete
The job.events log records the events that occur as the job migrates from a c5.large instance to a c5.2xlarge instance:
float log cat job.events -j 4BZXIxWu6L5ioCSqEMD63
----[edited]
2023-01-24T20:33:24.997727759Z: Ready to migrate with instance type: c5.2xlarge, cpu: 8, memory: 16, zone: us-east-1b, last instance type: c5.large(Spot)
2023-01-24T20:33:24.997929752Z: Ready to checkpoint host i-0027c7d62eaabafa6
2023-01-24T20:33:26.29553621Z: Checkpointed host i-0027c7d62eaabafa6, result: &{map[5ee5df78f592:]}, duration 1.297557077s
2023-01-24T20:33:26.515069329Z: Ready to reclaim host i-0027c7d62eaabafa6
2023-01-24T20:34:15.524359376Z: Ready to create new host to recover
2023-01-24T20:34:22.200051613Z: Reclaimed host i-0027c7d62eaabafa6
2023-01-24T20:35:55.597906896Z: Created new host: i-03fb670fbb577639e(Spot)
2023-01-24T20:35:55.694463545Z: Got 1 containers on host i-03fb670fbb577639e
2023-01-24T20:35:55.694528662Z: Ready to recover &{5ee5df78f592 false true} on host i-03fb670fbb577639e
2023-01-24T20:35:55.700293372Z: Job floated
2023-01-24T20:35:56.362953748Z: Migrated to new VM: i-03fb670fbb577639e
----[edited]
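
The timestamps in job.events can be used to estimate how long a migration took. The following is a minimal sketch, not part of the float CLI: migration_seconds is a hypothetical helper that assumes the line format shown in the sample above, measures from the "Ready to migrate" event to the "Migrated to new VM" event, and assumes the migration completes within 24 hours.

```shell
#!/usr/bin/env bash
# Sketch: read job.events lines on stdin and print the elapsed seconds
# between "Ready to migrate" and "Migrated to new VM" for each migration.
# migration_seconds is a hypothetical helper name, not a float command.

migration_seconds() {
    awk '
        # Convert a timestamp field such as 2023-01-24T20:33:24.997727759Z:
        # into seconds since midnight.
        function tod(ts,    t, p) {
            split(ts, t, "T")          # t[2] holds the time-of-day part
            sub(/Z:?$/, "", t[2])      # drop the trailing "Z:" suffix
            split(t[2], p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        /Ready to migrate/ { start = tod($1); have_start = 1 }
        /Migrated to new VM/ {
            if (have_start) {
                d = tod($1) - start
                if (d < 0) d += 86400  # migration crossed midnight
                printf "%.3f\n", d
                have_start = 0
            }
        }
    '
}
```

With the function defined, the events log can be piped into it, for example: float log cat job.events -j 4BZXIxWu6L5ioCSqEMD63 | migration_seconds. For the events shown above, the two timestamps are 20:33:24.997727759 and 20:35:56.362953748, so the helper prints 151.365.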