Prerequisites and Considerations
When deploying MMBatch, review the following requirements and prerequisites to ensure that MMBatch runs smoothly within your Cloud Service Provider's Batch service.
Prerequisite: Using GPU instances with MMBatch
Please note: using GPU instances with MMBatch is now supported, but optional. Doing so has many benefits, but it also carries its own considerations, user environment requirements, and limitations compared to CPU instances.
User Environment
- Environment: NVIDIA Linux driver 570.86 or later.
- AWS service quotas: Make sure you have a sufficient AWS Service Quota for All G and VT Spot Instance Requests in the region of your deployment. The default is 0; the recommended quota depends on your anticipated usage. For example, the AWS g4dn.4xlarge instance has 16 vCPUs. Since at least 2 instances are needed during a migration (the old instance and the new instance being migrated to), the quota should be at least 32 vCPUs.
- MMBatch version: 1.4 or later.
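On a running instance, the driver requirement can be sanity-checked with a version-aware comparison. The sketch below assumes `nvidia-smi` is installed; `driver_ok` is a hypothetical helper, not part of MMBatch:

```shell
# driver_ok VERSION — succeeds when VERSION >= 570.86 (version-aware sort).
# Hypothetical helper for illustration, not part of MMBatch.
driver_ok() {
  required="570.86"
  [ "$(printf '%s\n%s\n' "$required" "$1" | sort -V | head -n1)" = "$required" ]
}

# On a GPU instance, check the installed driver with:
#   driver_ok "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)"
driver_ok "570.86" && echo "driver OK"
```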
Considerations and Limitations
When deploying GPU instances for MMBatch, please consider the following:
- Limitation: Single-GPU checkpoint and restore. Checkpoint and restore must use the same GPU model. For example, a checkpoint taken on g4dn.2xlarge (NVIDIA T4) can be restored on g4dn.4xlarge (NVIDIA T4) but can NOT be restored on g5.2xlarge (NVIDIA A10G).
- Consideration: Space overhead. An amount of system memory equal to the GPU memory must be reserved for checkpointing. For example, on a g4dn.2xlarge instance with 32 GiB of system memory, 16 GiB must be reserved, and the remaining 16 GiB can be used by applications and the system.
- Consideration: Time overhead. Copying GPU memory to system memory during checkpoint, and system memory back to GPU memory during restore, can add noticeable time to both operations.
- Checkpoint: Keep in mind that MMBatch supports incremental backups for CPUs; for GPUs, MMBatch only supports full backups.
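The space-overhead rule can be illustrated with simple arithmetic: usable application memory is system memory minus GPU memory. The figures below are the g4dn.2xlarge numbers from the example above:

```shell
# Space-overhead estimate: system memory equal to the GPU memory is reserved
# for checkpointing; the remainder is usable (g4dn.2xlarge figures).
system_gib=32   # g4dn.2xlarge system memory
gpu_gib=16      # NVIDIA T4 GPU memory
echo "reserved for checkpointing: ${gpu_gib} GiB"
echo "usable by applications and the system: $((system_gib - gpu_gib)) GiB"
```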
Prerequisite: AWS IAM Policy Configuration
For MMBatch to function correctly, the IAM user or role it uses must have specific AWS permissions. Verify that the assigned IAM policy includes all necessary actions.
Required Permissions
| Functionality | Required AWS IAM Actions |
|---|---|
| Dashboard | `pricing:GetProducts`, `ec2:DescribeSpotPriceHistory` |
| Auto EBS | `ec2:CreateTags`, `ec2:CreateVolume`, `ec2:CreateSnapshot`, `ec2:DeleteVolume`, `ec2:DeleteSnapshot`, `ec2:DescribeVolumes`, `ec2:DescribeSnapshots`, `ec2:AttachVolume`, `ec2:DetachVolume`, `batch:DescribeJobs` |
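One way to grant these is a single policy statement covering every action in the table. The sketch below is a minimal example, not a definitive policy; scope the `Resource` element down to meet your security requirements:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MMBatchDashboardAndAutoEBS",
      "Effect": "Allow",
      "Action": [
        "pricing:GetProducts",
        "ec2:DescribeSpotPriceHistory",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteVolume",
        "ec2:DeleteSnapshot",
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:AttachVolume",
        "ec2:DetachVolume",
        "batch:DescribeJobs"
      ],
      "Resource": "*"
    }
  ]
}
```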
How to Verify AWS IAM Policies
Use the AWS Management Console to confirm policy permissions.
Method 1: IAM Policy Simulator (Recommended for quick checks)
- Log in to the AWS Console, go to IAM > Policy simulator.
- Under Users, Groups, and Roles, select the IAM identity [Your Application Name] uses.
- Under Service and action selection:
  - Select Pricing and add GetProducts.
  - Select EC2 and add DescribeSpotPriceHistory, CreateTags, CreateVolume, CreateSnapshot, DeleteVolume, DeleteSnapshot, DescribeVolumes, DescribeSnapshots, AttachVolume, DetachVolume.
  - Select Batch and add DescribeJobs.
- Click Run simulation.
- Ensure all listed actions show "Allowed". If any are "Denied," the policy needs modification.
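The same check can be run from the command line with the AWS CLI's policy simulator. In the sketch below, the role ARN is a placeholder for the identity your deployment actually uses; the query prints any action that is not allowed:

```shell
# CLI equivalent of the Policy Simulator check. The role ARN below is a
# placeholder; an empty result means all actions are allowed.
aws iam simulate-principal-policy \
  --policy-source-arn "arn:aws:iam::123456789012:role/mmbatch-instance-role" \
  --action-names pricing:GetProducts ec2:DescribeSpotPriceHistory \
    ec2:CreateTags ec2:CreateVolume ec2:CreateSnapshot ec2:DeleteVolume \
    ec2:DeleteSnapshot ec2:DescribeVolumes ec2:DescribeSnapshots \
    ec2:AttachVolume ec2:DetachVolume batch:DescribeJobs \
  --query 'EvaluationResults[?EvalDecision!=`allowed`].EvalActionName' \
  --output text
```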
Method 2: Directly Inspecting Attached Policies
- Log in to the AWS Console, go to IAM > Users or Roles.
- Click on the IAM user or role name used by [Your Application Name].
- Go to the Permissions tab.
- Expand each attached policy and review its JSON document.
- Confirm that all required actions from the "Required Permissions" table are present within Action elements and have "Effect": "Allow".
If permissions are missing:
An AWS administrator must create or modify an IAM policy to include the missing actions and attach it to the IAM user or role.
Prerequisite: CPU/GPU Compute Instance Types Must Match
A checkpoint created on a CPU or GPU of one architecture must be restored on a CPU or GPU with a compatible architecture. For example, a checkpoint created on an Intel Xeon Platinum 8000 series processor cannot be restored on a Graviton processor. The allowed instance types must therefore be CPU-architecture compatible.
AWS Approved CPU/GPU Matching Instance Types
Below are groups of CPU-architecture-compatible instance types. Instance types within the same group can be used together and specified in the allowed instance types of an AWS Batch Compute Environment.
Group 1: r5.large - r5.24xlarge
Group 2: r7i.large - r7i.24xlarge
Group 3: m5.large - m5.24xlarge
Group 4: m6i.large - m6i.32xlarge
Group 5: m7i.large - m7i.24xlarge
Group 6: c7i.large - c7i.24xlarge
Group 7: g4dn.2xlarge - g4dn.4xlarge
Group 8: g5.2xlarge - g5.4xlarge
Group 9: g6.2xlarge - g6.4xlarge
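For the groups listed above, compatibility reduces to a family-prefix check: two instance types whose family (the part before the dot) matches fall in the same group. The sketch below encodes that simplification; `same_group` is a hypothetical helper, not part of MMBatch:

```shell
# same_group TYPE_A TYPE_B — succeeds when both instance types share the
# same family prefix (e.g. g4dn), matching the groups listed above.
# Hypothetical helper for illustration, not part of MMBatch.
same_group() {
  [ "${1%%.*}" = "${2%%.*}" ]
}

same_group g4dn.2xlarge g4dn.4xlarge && echo "compatible"
same_group g4dn.2xlarge g5.2xlarge || echo "incompatible"
```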
Storage Options and Considerations
Knowing where and how to store different types of data is crucial to getting the most out of MMBatch. Below we offer guidance on different types of data and their storage needs.
Space Considerations for using Managed EBS
During restore when using Managed EBS, the new instance must first perform a docker pull of the container image. Because the image is saved to the root volume, the root volume must be large enough to hold it. For large containers, such as those used for GPU workloads, the additional docker pull increases the time required to perform checkpoint-restore by 5-10 minutes, as measured with a 45 GB image. Managed EBS does not use root-volume space for application data.
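A rough root-volume sizing estimate is base AMI footprint plus container image plus headroom. In the sketch below, the 45 GB image size comes from the measurement above; the other figures are assumptions to adjust for your environment:

```shell
# Back-of-the-envelope root-volume sizing for Managed EBS restores.
ami_gb=10        # OS / AMI footprint (assumption)
image_gb=45      # container image pulled during restore (from example above)
headroom_gb=10   # scratch headroom for docker layers (assumption)
echo "root volume should be at least $((ami_gb + image_gb + headroom_gb)) GB"
# → prints "root volume should be at least 65 GB"
```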
Storing Checkpoint Data
The following storage and file systems are supported for storing checkpoint data:
- AWS EFS
- JuiceFS with AWS S3
- AWS FSx Lustre

These can be configured in the AWS EC2 Launch Template as a mount point. Check out our CloudFormation Deployment for a complete deployment walkthrough.
Below are code examples for each storage and file system:
- AWS EFS — we will create a mount point and mount EFS (note: the efs mount type requires the amazon-efs-utils package):
mkdir -p /mmc-checkpoint
mount -t efs ${BatchEFSFileSystem}:/ /mmc-checkpoint
echo "${BatchEFSFileSystem}:/ /mmc-checkpoint efs defaults,_netdev 0 0" >> /etc/fstab
chown ec2-user:ec2-user /mmc-checkpoint
- JuiceFS with AWS S3 — the following CloudFormation creates the IAM role, S3 bucket, and related policies required for JuiceFS:
BatchInstanceRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: ec2.amazonaws.com
          Action: sts:AssumeRole
    Path: "/"
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
      - arn:aws:iam::aws:policy/AmazonS3FullAccess
      - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
    Policies:
      - PolicyName: "JuiceFSpolicy"
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - "elasticache:*"
              Resource: !Sub "arn:aws:elasticache:${AWS::Region}:${AWS::AccountId}:cluster/mm-engine-${UniquePrefix}"
            - Effect: Allow
              Action:
                - "s3:*"
              Resource: !Sub "arn:aws:s3:::mm-engine-juice-fs-${UniquePrefix}/*"
    RoleName: !Sub "mm-batch-instance-role-${UniquePrefix}"

JuiceFSS3Bucket:
  Type: AWS::S3::Bucket
  Properties:
    BucketName: !Sub "mm-engine-juice-fs-${UniquePrefix}"
    BucketEncryption:
      ServerSideEncryptionConfiguration:
        - ServerSideEncryptionByDefault:
            SSEAlgorithm: AES256
    PublicAccessBlockConfiguration:
      BlockPublicAcls: true
      BlockPublicPolicy: true
      IgnorePublicAcls: true
      RestrictPublicBuckets: true
- Next, in our Launch Template user data, we will create the mount point and install JuiceFS:
mkdir -p /mmc-checkpoint
chmod 777 /mmc-checkpoint
curl -sSL https://d.juicefs.com/install | sh -
- And now we will format and mount JuiceFS:
/usr/local/bin/juicefs format --storage s3 \
  --bucket https://${JuiceFSS3BucketName}.s3.${AWS::Region}.amazonaws.com \
  --trash-days=0 "rediss://${RedisClusterEndpoint}:6379/1" juicefs-metadata
nohup /usr/local/bin/juicefs mount \
"rediss://${RedisClusterEndpoint}:6379/1" \
--cache-dir /mnt/jfs_cache \
--cache-size 102400 \
/mnt/jfs > /tmp/juicefs-mount.log 2>&1 &
echo "Waiting for /mnt/jfs to be mounted..."
while ! mountpoint -q /mnt/jfs; do
sleep 2
echo "Still waiting for /mnt/jfs..."
done
echo "/mnt/jfs is now mounted."
MOUNTPOINT=/mnt/jfs
CHECKPOINT_DIR=$MOUNTPOINT/mmc-checkpoint
- We will now ensure the mount point and subdirectories exist:
mkdir -p $CHECKPOINT_DIR
- And finally, we will handle the /mmc-checkpoint symlink:
if [ -e /mmc-checkpoint ]; then
echo "/mmc-checkpoint exists. Deleting it to recreate as symlink."
rm -rf /mmc-checkpoint
fi
ln -s $CHECKPOINT_DIR /mmc-checkpoint
echo "Symlink created: /mmc-checkpoint -> $CHECKPOINT_DIR"
- AWS FSx Lustre — configure mmc-checkpoint through the RESTful API.
Storing User Scratch Data
MMBatch supports the following storage and file systems for user scratch data:
- EBS
- JuiceFS with AWS S3
- AWS FSx Lustre