Submitting a "Hello World" Job Using AWS Batch with MMBatch Installed
For this hello-world job we expect a Compute Environment backed by the Launch Template that installs the agent and mounts the checkpoint directory /mmc-checkpoint.
Create Job Definition
Go to the Job Definition console and create a new Job Definition.
Pick EC2 as the orchestration type.
Pick a name (e.g. loop) and set Job attempts to allow retries to happen.
If you want to retry only when an EC2 Spot reclaim occurs, you can add a retry strategy (see the sketch at the end of this step).
On the Step 2: Container configuration page, change the image name to registry.gitlab.com/qnib-pub-containers/qnib/loop:0.0.3 and remove the Command (we'll use the default for the container image).
Step through the rest of the creation steps and keep the defaults. At the end you'll land on the details page of the Job Definition. Click Actions and submit a new job.
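If you prefer to script this step, the same Job Definition can be registered through the SDK. The following is a minimal boto3 sketch, not the exact console output: the number of attempts and the vCPU/memory sizing are assumptions, and the Host EC2* status-reason pattern is one common way to retry only on Spot reclaims.

import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="loop",
    type="container",
    platformCapabilities=["EC2"],  # EC2 orchestration type
    containerProperties={
        "image": "registry.gitlab.com/qnib-pub-containers/qnib/loop:0.0.3",
        # No "command": the container image's default command is used.
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},       # assumed sizing
            {"type": "MEMORY", "value": "1024"},  # assumed sizing (MiB)
        ],
    },
    retryStrategy={
        "attempts": 5,  # assumed number of attempts
        "evaluateOnExit": [
            # Retry only when the host was reclaimed (EC2 Spot interruption),
            # exit on any other failure.
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            {"onReason": "*", "action": "EXIT"},
        ],
    },
)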
Submit a job
Provide a job name (e.g. loop-1) and pick a Job queue. Use the defaults for the rest of the wizard.
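The submission can also be scripted; a minimal boto3 sketch, where the job queue name is a placeholder for the queue backed by your MMBatch Compute Environment:

import boto3

batch = boto3.client("batch")

job = batch.submit_job(
    jobName="loop-1",
    jobQueue="mmbatch-spot-queue",  # placeholder queue name
    jobDefinition="loop",
)
print("Submitted job:", job["jobId"])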
Monitor Job
If you are able to log into the instance, you can watch the iterations take place (the CloudWatch logs are eventually consistent; they will trickle in).
The MemVerge logs in /var/log/memverge/pagent.log will eventually report a successful checkpoint.
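If you cannot log into the instance, you can poll the job and read its CloudWatch log stream instead. A minimal boto3 sketch, assuming the default /aws/batch/job log group; the job id is a placeholder for the one returned at submission:

import time
import boto3

batch = boto3.client("batch")
logs = boto3.client("logs")

job_id = "<job-id-from-submit_job>"  # placeholder

while True:
    job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
    print("status:", job["status"])
    stream = job.get("container", {}).get("logStreamName")
    if stream:
        events = logs.get_log_events(
            logGroupName="/aws/batch/job",  # AWS Batch default log group
            logStreamName=stream,
            startFromHead=False,
        )
        # Print the most recent few messages.
        for event in events["events"][-5:]:
            print(event["message"])
    if job["status"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(30)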
Initiate Reclaim
Head over to Spot Requests, find the active spot request for your EC2 instance and initiate an interruption.
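The console action uses AWS Fault Injection Service (FIS) under the hood, so the interruption can also be scripted. A sketch along these lines, where the IAM role ARN and instance ARN are assumptions and the role must be allowed to send Spot interruptions:

import boto3

fis = boto3.client("fis")

template = fis.create_experiment_template(
    clientToken="loop-spot-interruption",
    description="Interrupt the Spot instance running the loop job",
    roleArn="arn:aws:iam::123456789012:role/fis-spot-interruption",  # assumed role
    stopConditions=[{"source": "none"}],
    targets={
        "SpotInstance": {
            "resourceType": "aws:ec2:spot-instance",
            "resourceArns": [
                # assumed instance ARN; use the instance running your job
                "arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0"
            ],
            "selectionMode": "ALL",
        }
    },
    actions={
        "interrupt": {
            "actionId": "aws:ec2:send-spot-instance-interruptions",
            "parameters": {"durationBeforeInterruption": "PT2M"},
            "targets": {"SpotInstances": "SpotInstance"},
        }
    },
)
fis.start_experiment(
    clientToken="loop-spot-interruption-run",
    experimentTemplateId=template["experimentTemplate"]["id"],
)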
After the fault injection is completed, the agent on the instance will detect the EC2 Spot Interruption Warning and initiate another checkpoint to finalize the snapshot.
time="2025-01-27T13:58:08.197Z" level=warning msg="the spot instance is going to be interrupted"
time="2025-01-27T13:58:08.197Z" level=warning msg="triggers final checkpoint for all containers"
Restore
Once a new instance is up, the job is rescheduled onto it. Since the checkpoint directory holds a valid checkpoint for the given job, the runtime will restore from that checkpoint.
{"level":"info","msg":"(JobID: a333a02c-3182-45ce-8e15-c4a56be39459) Restoring container 577f6515f8ee23a18abeaefbd7b1e7d797d6b287b7845a2ddc416a5cf5360cc7 from /mmc-checkpoint/a333a02c-3182-45ce-8e15-c4a56be39459","time":"2025-01-27T14:06:37Z"}
{"level":"info","msg":"/opt/memverge/mmcloud-engine/libexec/tar --xattrs -I /opt/memverge/mmcloud-engine/libexec/pzstd.sh -xpf /mmc-checkpoint/a333a02c-3182-45ce-8e15-c4a56be39459/3/rootfs-diff.tar -C /var/lib/docker/overlay2/af51975d2cbad60a704880ec942ae98cf62460e1914a4b475e9579cd2f5e8479/merged",
"time":"2025-01-27T14:06:37Z"}
{"level":"info","msg":"(JobID: a333a02c-3182-45ce-8e15-c4a56be39459) Successfully restored container 577f6515f8ee23a18abeaefbd7b1e7d797d6b287b7845a2ddc416a5cf5360cc7 from /mmc-checkpoint/a333a02c-3182-45ce-8e15-c4a56be39459","time":"2025-01-27T14:06:38Z"}
By checking the logs of the container you can verify that it does not start the loop from scratch.
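Each retry appears as a separate attempt of the same Batch job, and each attempt has its own CloudWatch log stream, so one way to compare them programmatically is a sketch like the following (again assuming the default /aws/batch/job log group and a placeholder job id):

import boto3

batch = boto3.client("batch")
logs = boto3.client("logs")

job_id = "<job-id-from-submit_job>"  # placeholder

job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
for i, attempt in enumerate(job["attempts"]):
    stream = attempt["container"]["logStreamName"]
    events = logs.get_log_events(
        logGroupName="/aws/batch/job",
        logStreamName=stream,
        startFromHead=True,
        limit=3,
    )
    print(f"attempt {i} first lines:")
    for event in events["events"]:
        print(" ", event["message"])

The first lines of the second attempt should pick up at the counter value reached before the interruption rather than starting from zero.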