Prerequisites and Considerations

When deploying MMBatch, consider the key requirements and prerequisites below to ensure that MMBatch runs smoothly within the Cloud Service Provider's Batch service.

Prerequisite: Using GPU instances with MMBatch

Using GPU instances with MMBatch is now available, but optional. Doing so has many benefits, but it also comes with its own considerations, user environment requirements, and limitations compared to CPU instances.

User Environment

  • Environment: NVIDIA Linux driver 570.86 or later

  • AWS service quotas: Make sure you have a sufficient AWS Service Quota for All G and VT Spot Instance Requests in the region of your deployment. The default is 0; the recommended value depends on your anticipated usage (see the CLI sketch after this list).

    For example: the AWS g4dn.4xlarge instance has 16 vCPUs. Since at least two instances are needed (the old instance and the new instance to migrate to), the quota should be at least 32 vCPUs.

  • MMBatch Version: 1.4 or later
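
As a quick way to check and raise the quota from the command line, a sketch like the following can be used. It assumes the AWS CLI with Service Quotas permissions is available, and the desired value of 32 vCPUs is only the example from above; adjust it to your anticipated usage.

# List the Spot Instance request quotas for G and VT instances in the current region.
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'G and VT')].[QuotaName,QuotaCode,Value]" \
  --output table

# Request an increase once you know the quota code reported above.
# The desired value here (32 vCPUs) is only an example.
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code <quota-code-from-previous-command> \
  --desired-value 32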

Considerations and Limitations

When deploying GPU instances for MMBatch, please consider the following:

  • Limitation: Single-GPU checkpoint and restore. Checkpoint and restore must take place on the same GPU model (see the verification sketch after this list).

    For example: a checkpoint taken on g4dn.2xlarge (NVIDIA T4) can be restored on g4dn.4xlarge (NVIDIA T4) but can NOT be restored on g5.2xlarge (NVIDIA A10G).

  • Consideration: Space overhead. System memory equal in size to the GPU memory must be reserved for checkpointing.

    For example: on a g4dn.2xlarge instance with 32 GiB of system memory, 16 GiB must be reserved for checkpointing and the remaining 16 GiB can be used by applications and the system.

  • Consideration: Time overhead. Copying GPU memory to system memory adds time to the checkpoint, and copying system memory back to GPU memory adds time to the restore.

  • Checkpoint: Keep in mind that MMBatch supports incremental backups for CPUs. For GPUs, MMBatch only supports full backups.
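
To confirm that two instances use the same GPU model (and a supported driver version) before relying on checkpoint and restore between them, a quick check such as the following can be run on each instance. The output shown is illustrative only.

# Print the GPU model and driver version on this instance.
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# Illustrative output on a g4dn instance: Tesla T4, 570.86.15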

Prerequisite: AWS IAM Policy Configuration

For MMBatch to function correctly, the IAM user or role it uses must have specific AWS permissions. Verify that the assigned IAM policy includes all necessary actions.

Required Permissions

Functionality    Required AWS IAM Actions
Dashboard        pricing:GetProducts, ec2:DescribeSpotPriceHistory
Auto EBS         ec2:CreateTags, ec2:CreateVolume, ec2:CreateSnapshot, ec2:DeleteVolume, ec2:DeleteSnapshot, ec2:DescribeVolumes, ec2:DescribeSnapshots, ec2:AttachVolume, ec2:DetachVolume, batch:DescribeJobs
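
As a sketch, the actions from the table can be granted as an inline policy on the role MMBatch uses. The role name, policy name, and the broad "Resource": "*" scoping below are placeholders; an administrator may want to scope resources more tightly.

# Hypothetical role name; substitute the role your MMBatch deployment uses.
ROLE_NAME=mm-batch-instance-role

# Write the permission set from the table above into a policy document.
cat > mmbatch-permissions.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MMBatchDashboard",
      "Effect": "Allow",
      "Action": [
        "pricing:GetProducts",
        "ec2:DescribeSpotPriceHistory"
      ],
      "Resource": "*"
    },
    {
      "Sid": "MMBatchAutoEBS",
      "Effect": "Allow",
      "Action": [
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteVolume",
        "ec2:DeleteSnapshot",
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:AttachVolume",
        "ec2:DetachVolume",
        "batch:DescribeJobs"
      ],
      "Resource": "*"
    }
  ]
}
EOF

# Attach the document as an inline policy on the role.
aws iam put-role-policy \
  --role-name "$ROLE_NAME" \
  --policy-name MMBatchPermissions \
  --policy-document file://mmbatch-permissions.json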

How to Verify AWS IAM Policies

Use the AWS Management Console to confirm policy permissions.

Method 1: IAM Policy Simulator (Recommended for quick checks)

  1. Log in to the AWS Console, go to IAM > Policy simulator.

  2. Under Users, Groups, and Roles, select the IAM identity MMBatch uses.

  3. Under Service and action selection:

     • Select Pricing and add GetProducts.

     • Select EC2 and add DescribeSpotPriceHistory, CreateTags, CreateVolume, CreateSnapshot, DeleteVolume, DeleteSnapshot, DescribeVolumes, DescribeSnapshots, AttachVolume, DetachVolume.

  4. Click Run simulation.

  5. Ensure all listed actions show "Allowed". If any are "Denied," the policy needs modification.
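
The same check can also be run from the AWS CLI using the policy simulator API. The role ARN below is a placeholder; substitute the IAM identity MMBatch uses.

# Simulate the required actions against the role MMBatch uses.
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/mm-batch-instance-role \
  --action-names pricing:GetProducts ec2:DescribeSpotPriceHistory \
      ec2:CreateTags ec2:CreateVolume ec2:CreateSnapshot ec2:DeleteVolume \
      ec2:DeleteSnapshot ec2:DescribeVolumes ec2:DescribeSnapshots \
      ec2:AttachVolume ec2:DetachVolume batch:DescribeJobs \
  --query "EvaluationResults[].[EvalActionName,EvalDecision]" \
  --output table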

Method 2: Directly Inspecting Attached Policies

  1. Log in to the AWS Console, go to IAM > Users or Roles.
  2. Click on the IAM user or role name used by MMBatch.
  3. Go to the Permissions tab.
  4. Expand each attached policy and review its JSON document.
  5. Confirm that all required actions from the "Required Permissions" table are present within Action elements and have "Effect": "Allow".
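
The same inspection can be done from the AWS CLI. The role name and policy ARN below are placeholders.

# List the managed and inline policies attached to the role.
aws iam list-attached-role-policies --role-name mm-batch-instance-role
aws iam list-role-policies --role-name mm-batch-instance-role

# Fetch a managed policy document to review its Action and Effect elements.
aws iam get-policy-version \
  --policy-arn arn:aws:iam::123456789012:policy/MMBatchPermissions \
  --version-id v1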

If permissions are missing:

An AWS administrator must create or modify an IAM policy to include the missing actions and attach it to the IAM user or role.

Prerequisite: CPU/GPU Compute Instance Types Must Match

A checkpoint created on a CPU or GPU with one architecture must be restored on a CPU or GPU with a compatible architecture. For example, a checkpoint created on an Intel Xeon Platinum 8000 series processor cannot be restored on a Graviton processor. The allowed instance types must therefore be architecturally compatible.

AWS Approved CPU/GPU Matching Instance Types

Below are groups of architecture-compatible instance types. Instance types within the same group can be used together and specified in the allowed instance types of an AWS Batch Compute Environment (a configuration sketch follows the list).

Group 1: r5.large - r5.24xlarge

Group 2: r7i.large - r7i.24xlarge

Group 3: m5.large - m5.24xlarge

Group 4: m6i.large - m6i.32xlarge

Group 5: m7i.large - m7i.24xlarge

Group 6: c7i.large - c7i.24xlarge

Group 7: g4dn.2xlarge - g4dn.4xlarge

Group 8: g5.2xlarge - g5.4xlarge

Group 9: g6.2xlarge - g6.4xlarge
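
As a sketch of how a single compatible group can be expressed, the compute environment below restricts its allowed instance types to Group 4 (m6i). The environment name, subnet, instance role, and vCPU limits are placeholders, and additional fields (security groups, a launch template, and so on) may be required in your account.

# Placeholder compute resources restricted to one architecture-compatible group (m6i).
cat > compute-resources.json <<'EOF'
{
  "type": "SPOT",
  "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
  "minvCpus": 0,
  "maxvCpus": 256,
  "instanceTypes": ["m6i.large", "m6i.xlarge", "m6i.2xlarge", "m6i.4xlarge"],
  "subnets": ["subnet-0123456789abcdef0"],
  "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole"
}
EOF

aws batch create-compute-environment \
  --compute-environment-name mmbatch-m6i-group \
  --type MANAGED \
  --compute-resources file://compute-resources.json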

Storage Options and Considerations

Knowing where and how to store different types of data is crucial to the success of your MMBatch deployment. Below we offer guidance on the different types of data and their storage needs.

Space Considerations for using Managed EBS

During a restore when using Managed EBS, the new instance must first perform a docker pull of the container image. Because the image is saved to the root volume, the root volume must be large enough to hold it. For large containers, such as those used for GPU workloads, this additional docker pull increases the time required for checkpoint-restore by 5-10 minutes, as measured with a 45 GB image. Managed EBS does not use space in the root volume for application data.
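
One way to account for this is to size the root volume in the EC2 Launch Template used by the compute environment. The sketch below is only an illustration: the template name, device name, and 200 GiB volume size are placeholders to adjust for your images and AMI.

# Placeholder Launch Template with a root volume sized for large container images.
aws ec2 create-launch-template \
  --launch-template-name mmbatch-large-root \
  --launch-template-data '{
    "BlockDeviceMappings": [
      {
        "DeviceName": "/dev/xvda",
        "Ebs": { "VolumeSize": 200, "VolumeType": "gp3", "DeleteOnTermination": true }
      }
    ]
  }'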

Storing Checkpoint Data

The following storage and file systems are supported for storing checkpoint data:

  • AWS EFS

  • JuiceFS with AWS S3

  • AWS FSx Lustre

These can be configured in the AWS EC2 Launch Template as a mount point. Check out our CloudFormation Deployment for a complete deployment walkthrough.

Below are code examples for each storage and file system:

  • AWS EFS

    • We will create a mount point and mount EFS:
mkdir -p /mmc-checkpoint 
mount -t efs ${BatchEFSFileSystem}:/ /mmc-checkpoint 
echo "${BatchEFSFileSystem}:/ /mmc-checkpoint efs defaults,_netdev 0 0" >> /etc/fstab 
chown ec2-user:ec2-user /mmc-checkpoint
  • JuiceFS with AWS S3

    • The following code creates the IAM roles, S3, Redis and Required Infra for JuiceFS:
BatchInstanceRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: ec2.amazonaws.com
          Action: sts:AssumeRole
    Path: "/"
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
      - arn:aws:iam::aws:policy/AmazonS3FullAccess
      - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
    Policies:
      - PolicyName: "JuiceFSpolicy"
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - "elasticache:*"
              Resource: !Sub "arn:aws:elasticache:${AWS::Region}:${AWS::AccountId}:cluster/mm-engine-${UniquePrefix}"
            - Effect: Allow
              Action:
                - "s3:*"
              Resource: !Sub "arn:aws:s3:::mm-engine-juice-fs-${UniquePrefix}/*"
    RoleName: !Sub "mm-batch-instance-role-${UniquePrefix}"

JuiceFSS3Bucket:
  Type: AWS::S3::Bucket
  Properties:
    BucketName: !Sub "mm-engine-juice-fs-${UniquePrefix}"
    BucketEncryption:
      ServerSideEncryptionConfiguration:
        - ServerSideEncryptionByDefault:
            SSEAlgorithm: AES256
    PublicAccessBlockConfiguration:
      BlockPublicAcls: true
      BlockPublicPolicy: true
      IgnorePublicAcls: true
      RestrictPublicBuckets: true
  • Next, in the Launch Template user data, we create the mount point and install the JuiceFS client:
   $ mkdir -p /mmc-checkpoint
   $ chmod 777 /mmc-checkpoint
   $ curl -sSL https://d.juicefs.com/install | sh -
  • And now we will format and mount JuiceFS:
/usr/local/bin/juicefs format --storage s3 \
  --bucket https://${JuiceFSS3BucketName}.s3.${AWS::Region}.amazonaws.com \
  --trash-days=0 \
  "rediss://${RedisClusterEndpoint}:6379/1" juicefs-metadata

nohup /usr/local/bin/juicefs mount \
  "rediss://${RedisClusterEndpoint}:6379/1" \
  --cache-dir /mnt/jfs_cache \
  --cache-size 102400 \
  /mnt/jfs > /tmp/juicefs-mount.log 2>&1 &

  echo "Waiting for /mnt/jfs to be mounted..."
  while ! mountpoint -q /mnt/jfs; do
    sleep 2
    echo "Still waiting for /mnt/jfs..."
  done
  echo "/mnt/jfs is now mounted."

  MOUNTPOINT=/mnt/jfs
  CHECKPOINT_DIR=$MOUNTPOINT/mmc-checkpoint
  • We will now ensure the mount point and subdirectories exist:
$ mkdir -p $CHECKPOINT_DIR
$ chmod 777 $CHECKPOINT_DIR
  • And finally, we will handle the /mmc-checkpoint symlink:
if [ -e /mmc-checkpoint ]; then
    echo "/mmc-checkpoint exists. Deleting it to recreate as symlink."
    rm -rf /mmc-checkpoint
fi
ln -s $CHECKPOINT_DIR /mmc-checkpoint
echo "Symlink created: /mmc-checkpoint -> $CHECKPOINT_DIR"
  • AWS FSx Lustre

    • Configure mmc-checkpoint through the RESTful API.
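
For reference, a generic FSx for Lustre client mount of the checkpoint path might look like the sketch below. The file system DNS name and mount name are placeholders from your FSx console, and the MMBatch-side configuration itself still happens through the RESTful API as noted above.

# Install the Lustre client (Amazon Linux 2; the package topic can vary by kernel).
amazon-linux-extras install -y lustre

# Mount the FSx for Lustre file system at the checkpoint path.
# fs-0123456789abcdef0 and "fsxmount" are placeholders.
mkdir -p /mmc-checkpoint
mount -t lustre -o relatime,flock \
  fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com@tcp:/fsxmount /mmc-checkpoint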

Storing User Scratch Data

MMBatch supports the following storage and file systems for user scratch data:

  • EBS

  • JuiceFS with AWS S3

  • AWS FSx Lustre