Programmatic Job Migration
Initiate job migration by inserting the float migrate command into the job script.
Overview
Although not strictly required, most jobs are submitted with a job script that sets up the execution environment and tells the job scheduler what to do. Some jobs proceed in a series of stages, for example, an initial stage that reads in a large dataset followed by a computationally intensive stage. These two stages may repeat several times until the computation is complete, after which the results are written to disk.
The resource requirements (CPU, memory, and storage) for each of these stages are different. A float migrate command can be inserted at each point where the job moves to a new stage with different resource requirements. In this manner, the job runs on a virtual machine that matches the requirements for the given stage.
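As a rough illustration of this pattern, the sketch below requests a migration before each stage whose requirements differ. The helper `migrate_to`, the variables `FLOAT_BIN` and `MIGRATE_PAUSE`, and the three placeholder stage functions are hypothetical names introduced here, and the instance types are illustrative. Only the `float migrate -f --instType` invocation and the binary path come from the example job script in this document.

```bash
#!/usr/bin/env bash
# Sketch only: FLOAT_BIN, MIGRATE_PAUSE, migrate_to, and the stage
# functions are illustrative assumptions, not part of the float CLI.

# Path to the float binary as used in this document's example job script.
FLOAT_BIN="${FLOAT_BIN:-/opt/memverge/bin/float}"

# Request a migration to the given instance type, then pause briefly,
# as the example job script does after calling float migrate.
migrate_to() {
    local inst_type="$1"
    "$FLOAT_BIN" migrate -f --instType "$inst_type" || return 1
    sleep "${MIGRATE_PAUSE:-20}"
}

# Placeholder stages; a real job would do its work here.
load_dataset()  { echo "loading dataset"; }
compute()       { echo "computing"; }
write_results() { echo "writing results"; }

# Run the stages in order, migrating where the requirements change.
run_job() {
    load_dataset
    migrate_to c5.2xlarge || return 1   # larger instance for the compute stage
    compute
    migrate_to c5.large || return 1     # smaller instance for writing results
    write_results
}
```

Calling `run_job` at the end of a job script would run the three stages in order. If a migration request fails, the function returns a non-zero status instead of continuing to the next stage.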
Example
The example shown here uses a simple shell script to demonstrate how programmatic migration can be implemented. In the example job script, the job is migrated at an arbitrary point (when the loop counter reaches 50). In a real application, the job would migrate at the point where its resource requirements are expected to change.
cat progmigrate.sh
#!/usr/bin/env bash

LOG_PATH=$1
LOG_FILE=$LOG_PATH/output
touch $LOG_FILE
exec >$LOG_FILE 2>&1

echo "Starting to execute shell script"
echo "Job migrates when count reaches 50"
for(( c=1; c<300; c++))
do
  if [[ $(($c % 3)) == 0 ]]; then
    echo "$c is a multiple of three"
  else
    echo "$c is NOT a multiple of three" >&2
  fi
  if [[ $c == 50 ]]; then
    /opt/memverge/bin/float migrate -f --instType c5.2xlarge
    echo "Job migration initiated, wait for a while"
    sleep 20s
  fi
  sleep 1s
done
echo "Job is complete"
float submit -i centos9 -j ./progmigrate.sh --instType c5.large --dataVolume [size=10]:/data
float log tail --follow output -j 4BZXIxWu6L5ioCSqEMD63
Starting to execute shell script
Job migrates when count reaches 50
1 is NOT a multiple of three
----[edited]
50 is NOT a multiple of three
4BZXIxWu6L5ioCSqEMD63 on i-0027c7d62eaabafa6 is now migrating, please use squeue/show to monitor migration progress. Check logs for details
Job migration initiated, wait for a while
51 is a multiple of three
----[edited]
299 is NOT a multiple of three
Job is complete

The log file shows the events occurring when migrating the job from a c5.large instance to a c5.2xlarge instance.
float log cat job.events -j 4BZXIxWu6L5ioCSqEMD63
----[edited]
2023-01-24T20:33:24.997727759Z: Ready to migrate with instance type: c5.2xlarge, cpu: 8, memory: 16, zone: us-east-1b, last instance type: c5.large(Spot)
2023-01-24T20:33:24.997929752Z: Ready to checkpoint host i-0027c7d62eaabafa6
2023-01-24T20:33:26.29553621Z: Checkpointed host i-0027c7d62eaabafa6, result: &{map[5ee5df78f592:]}, duration 1.297557077s
2023-01-24T20:33:26.515069329Z: Ready to reclaim host i-0027c7d62eaabafa6
2023-01-24T20:34:15.524359376Z: Ready to create new host to recover
2023-01-24T20:34:22.200051613Z: Reclaimed host i-0027c7d62eaabafa6
2023-01-24T20:35:55.597906896Z: Created new host: i-03fb670fbb577639e(Spot)
2023-01-24T20:35:55.694463545Z: Got 1 containers on host i-03fb670fbb577639e
2023-01-24T20:35:55.694528662Z: Ready to recover &{5ee5df78f592 false true} on host i-03fb670fbb577639e
2023-01-24T20:35:55.700293372Z: Job floated
2023-01-24T20:35:56.362953748Z: Migrated to new VM: i-03fb670fbb577639e
----[edited]
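The timestamps in the event log can be used to estimate how long a migration took. The sketch below assumes the events have been saved to a local file with one event per line, each line starting with a UTC timestamp followed by a colon, as in the listing above. The function names `ts_to_ms`, `event_ts`, and `migration_ms` are hypothetical helpers written for this illustration and are not part of the float CLI.

```bash
#!/usr/bin/env bash
# Sketch only: ts_to_ms, event_ts, and migration_ms are illustrative helpers.

# Convert a UTC timestamp such as 2023-01-24T20:33:24.997727759Z to
# milliseconds since the Unix epoch, using integer arithmetic only.
ts_to_ms() {
    local ts="${1%Z}"
    local date_part="${ts%%T*}" time_part="${ts#*T}"
    local y m d hh mm ss frac="000"
    IFS=- read -r y m d <<< "$date_part"
    IFS=: read -r hh mm ss <<< "$time_part"
    if [[ $ss == *.* ]]; then
        frac="${ss#*.}000"      # pad so at least three digits exist
        ss="${ss%%.*}"
    fi
    frac="${frac:0:3}"          # keep millisecond precision
    # Force base 10 so fields such as "08" are not read as octal.
    y=$((10#$y)); m=$((10#$m)); d=$((10#$d))
    hh=$((10#$hh)); mm=$((10#$mm)); ss=$((10#$ss)); frac=$((10#$frac))
    # Days since 1970-01-01 (civil-date formula, valid for years >= 1970).
    if (( m <= 2 )); then y=$((y - 1)); m=$((m + 12)); fi
    local days=$(( 365*y + y/4 - y/100 + y/400 + (153*(m - 3) + 2)/5 + d - 719469 ))
    echo $(( ((days*24 + hh)*60 + mm)*60*1000 + ss*1000 + frac ))
}

# Print the timestamp of the first line in file $1 that contains text $2.
event_ts() {
    local line
    line="$(grep -F -m1 -- "$2" "$1")" || return 1
    line="${line%% *}"          # first space-delimited field
    echo "${line%:}"            # drop the trailing colon
}

# Milliseconds between "Ready to migrate" and "Migrated to new VM".
migration_ms() {
    local start end
    start="$(event_ts "$1" "Ready to migrate")" || return 1
    end="$(event_ts "$1" "Migrated to new VM")" || return 1
    echo $(( $(ts_to_ms "$end") - $(ts_to_ms "$start") ))
}
```

For example, after redirecting the output of `float log cat job.events -j 4BZXIxWu6L5ioCSqEMD63` to a file named `job.events`, running `migration_ms job.events` on the events listed above prints 151365, that is, about 151 seconds from the migration request to the job running on the new VM.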