Submitting a "Hello World" Job Using AWS Batch with MMBatch Installed

This hello-world job assumes a Compute Environment backed by the Launch Template that installs the agent and mounts the checkpoint directory /mmc-checkpoint.

Create Job Definition

Go to the Job Definition Console and create a new Job Definition.

JD List

Pick EC2 as orchestration type.

JD OrchType

Pick a name (e.g. loop) and set Job attempts to a value greater than 1 so that the job can be retried.

JD General

JD General filled

If you want to retry only when the instance is reclaimed by EC2 Spot, you can add a retry strategy: retry on an EC2 host reclaim and exit on everything else.

spot reclaim

exit on everything else
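The two retry conditions above can also be written as the retryStrategy section of a job definition. The following is a minimal sketch, not the exact configuration shown in the screenshots: the default of 3 attempts and the function name spot_retry_strategy are assumptions, and the "Host EC2*" pattern follows the commonly used AWS Batch convention for matching the status reason of a terminated host.

```python
import json


def spot_retry_strategy(attempts: int = 3) -> dict:
    """Build a retryStrategy that retries only after an EC2 host reclaim.

    The attempts default (3) is an assumption; AWS Batch accepts 1 to 10.
    """
    if not 1 <= attempts <= 10:
        raise ValueError("AWS Batch allows between 1 and 10 attempts")
    return {
        "attempts": attempts,
        "evaluateOnExit": [
            # Retry when the status reason says the EC2 host went away.
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            # Every other failure ends the job without another attempt.
            {"onReason": "*", "action": "EXIT"},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(spot_retry_strategy(), indent=2))
```

The conditions are listed in the order they should be evaluated, so the catch-all exit rule comes last.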

On the Step 2: Container configuration page, change the image name to registry.gitlab.com/qnib-pub-containers/qnib/loop:0.0.3 and remove the Command (we'll use the default for the container image).

Step2 jd
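For readers who prefer scripting over the console, the settings chosen so far correspond roughly to the request sketched below. This is an illustration under assumptions: the vCPU and memory values, the attempts count, and the helper names loop_job_definition and register are not taken from the walkthrough. The register helper uses the boto3 Batch client and needs valid AWS credentials; building the payload does not.

```python
import json

# Image name taken from the walkthrough.
IMAGE = "registry.gitlab.com/qnib-pub-containers/qnib/loop:0.0.3"


def loop_job_definition(name: str = "loop", attempts: int = 3) -> dict:
    """Build a register_job_definition request for the loop container."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "platformCapabilities": ["EC2"],   # EC2 orchestration type
        "retryStrategy": {"attempts": attempts},  # assumed value
        "containerProperties": {
            "image": IMAGE,
            # No "command" key: the image's default command is used.
            "resourceRequirements": [
                {"type": "VCPU", "value": "1"},       # assumed value
                {"type": "MEMORY", "value": "2048"},  # MiB, assumed value
            ],
        },
    }


def register(definition: dict) -> str:
    """Register the definition and return its ARN (requires credentials)."""
    import boto3  # imported lazily so the payload can be built without the SDK

    response = boto3.client("batch").register_job_definition(**definition)
    return response["jobDefinitionArn"]


if __name__ == "__main__":
    print(json.dumps(loop_job_definition(), indent=2))
```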

Step through the rest of the creation steps and keep the defaults. At the end you'll land on the details page of the Job Definition. Click Actions and submit a new job.

JD Details

Submit a job

Provide a job name (e.g. loop-1) and pick a Job queue. Use the defaults for the rest of the wizard.

job name
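The same submission can be sketched with the boto3 Batch client. The queue name "mmbatch-queue" and the helper names are hypothetical; replace the queue with one attached to the Compute Environment described at the top of this page. Only the submit helper talks to AWS and needs credentials.

```python
def loop_job_submission(job_queue: str,
                        job_name: str = "loop-1",
                        job_definition: str = "loop") -> dict:
    """Build a submit_job request; everything else keeps its default."""
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
    }


def submit(request: dict) -> str:
    """Submit the job and return its job ID (requires credentials)."""
    import boto3  # imported lazily so the request can be built without the SDK

    return boto3.client("batch").submit_job(**request)["jobId"]


if __name__ == "__main__":
    # "mmbatch-queue" is a placeholder queue name, not one from this guide.
    print(loop_job_submission("mmbatch-queue"))
```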

Monitor Job

If you are able to log into the instance, you can watch the iterations take place (the CloudWatch logs are eventually consistent, so they will trickle in).

job start

The MemVerge logs in /var/log/memverge/pagent.log will eventually report a successful checkpoint.

job checkpoint

Initiate Reclaim

Head over to Spot Requests, find the active spot request for your EC2 instance and initiate an interruption.

Spot interruption

After the fault injection is completed, the agent on the instance will detect the EC2 Spot Interruption Warning and initiate another checkpoint to finalize the snapshot.

time="2025-01-27T13:58:08.197Z" level=warning msg="the spot instance is going to be interrupted"
time="2025-01-27T13:58:08.197Z" level=warning msg="triggers final checkpoint for all containers"
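If you want to spot these warnings automatically, the two lines above can be parsed as key=value pairs. The sketch below assumes the agent log keeps the format shown here; the function names and the matched message fragment are illustrative choices, not part of the agent.

```python
import re

# One key=value pair; the value is either double-quoted or a bare token.
PAIR = re.compile(r'(\w+)=(?:"((?:[^"\\]|\\.)*)"|(\S+))')

SAMPLE_LINES = [
    'time="2025-01-27T13:58:08.197Z" level=warning '
    'msg="the spot instance is going to be interrupted"',
    'time="2025-01-27T13:58:08.197Z" level=warning '
    'msg="triggers final checkpoint for all containers"',
]


def parse_line(line: str) -> dict:
    """Split one log line into a dict of its key=value fields."""
    fields = {}
    for key, quoted, bare in PAIR.findall(line):
        fields[key] = quoted if quoted else bare
    return fields


def is_spot_interruption(line: str) -> bool:
    """True if the line is the interruption warning shown above."""
    fields = parse_line(line)
    return (fields.get("level") == "warning"
            and "going to be interrupted" in fields.get("msg", ""))


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(is_spot_interruption(line), parse_line(line)["msg"])
```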

Restore

Once a new instance is up, the job is scheduled on it. Since the checkpoint directory holds a valid checkpoint for the given job, the runtime restores the job from that checkpoint.

{"level":"info","msg":"(JobID: a333a02c-3182-45ce-8e15-c4a56be39459) Restoring container 577f6515f8ee23a18abeaefbd7b1e7d797d6b287b7845a2ddc416a5cf5360cc7 from /mmc-checkpoint/a333a02c-3182-45ce-8e15-c4a56be39459","time":"2025-01-27T14:06:37Z"}
{"level":"info","msg":"/opt/memverge/mmcloud-engine/libexec/tar --xattrs -I /opt/memverge/mmcloud-engine/libexec/pzstd.sh -xpf /mmc-checkpoint/a333a02c-3182-45ce-8e15-c4a56be39459/3/rootfs-diff.tar -C /var/lib/docker/overlay2/af51975d2cbad60a704880ec942ae98cf62460e1914a4b475e9579cd2f5e8479/merged","time":"2025-01-27T14:06:37Z"}
{"level":"info","msg":"(JobID: a333a02c-3182-45ce-8e15-c4a56be39459) Successfully restored container 577f6515f8ee23a18abeaefbd7b1e7d797d6b287b7845a2ddc416a5cf5360cc7 from /mmc-checkpoint/a333a02c-3182-45ce-8e15-c4a56be39459","time":"2025-01-27T14:06:38Z"}
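To confirm a restore without reading the log by eye, the JSON lines above can be scanned for the success message. This is a sketch that assumes the message wording stays as shown; the regular expression and the find_restore helper are illustrative and not part of the runtime.

```python
import json
import re

# Matches the "Successfully restored" message from the log excerpt above.
RESTORED = re.compile(
    r"\(JobID: (?P<job_id>[0-9a-f-]+)\) Successfully restored container "
    r"(?P<container>[0-9a-f]+) from (?P<path>\S+)"
)

SAMPLE_LOG = [
    '{"level":"info","msg":"(JobID: a333a02c-3182-45ce-8e15-c4a56be39459) '
    'Restoring container 577f6515f8ee23a18abeaefbd7b1e7d797d6b287b7845a2ddc416a5cf5360cc7 '
    'from /mmc-checkpoint/a333a02c-3182-45ce-8e15-c4a56be39459","time":"2025-01-27T14:06:37Z"}',
    '{"level":"info","msg":"(JobID: a333a02c-3182-45ce-8e15-c4a56be39459) '
    'Successfully restored container 577f6515f8ee23a18abeaefbd7b1e7d797d6b287b7845a2ddc416a5cf5360cc7 '
    'from /mmc-checkpoint/a333a02c-3182-45ce-8e15-c4a56be39459","time":"2025-01-27T14:06:38Z"}',
]


def find_restore(lines):
    """Return job ID, container ID, path and time of the first restore, or None."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        match = RESTORED.search(record.get("msg", ""))
        if match:
            return {**match.groupdict(), "time": record.get("time")}
    return None


if __name__ == "__main__":
    print(find_restore(SAMPLE_LOG))
```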

By checking the logs of the container you can verify that it does not start the loop from scratch.

$ docker logs 577f6515f8ee
Iteration 46
Iteration 47
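The same check can be scripted against the container output. The sketch below assumes that a fresh run starts counting at 0 or 1, which this guide does not state, so adjust fresh_start_max if the loop image behaves differently; both function names are illustrative.

```python
import re


def iteration_numbers(log_text: str) -> list:
    """Extract the numbers from lines of the form 'Iteration N'."""
    return [int(n) for n in
            re.findall(r"^Iteration (\d+)\b", log_text, flags=re.MULTILINE)]


def resumed_from_checkpoint(log_text: str, fresh_start_max: int = 1) -> bool:
    """True if the first logged iteration is past a fresh start.

    fresh_start_max is an assumption: the highest number a run that
    started from scratch would print first.
    """
    numbers = iteration_numbers(log_text)
    return bool(numbers) and numbers[0] > fresh_start_max


if __name__ == "__main__":
    sample = "Iteration 46\nIteration 47\n"
    print(iteration_numbers(sample), resumed_from_checkpoint(sample))
```

The log text can come from the docker logs command shown above, for example by capturing its output into a string.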