mmaictl job create
Create a Kubernetes Job
Synopsis
Create a Kubernetes Job with the specified name, image, and optional command/args. The job will be created and ready to run by default. Use --suspend to create it in a suspended state for Kueue workload management.
Examples
# Create a simple job (ready to run)
mmaictl job create my-job --image=nginx:latest
# Create a job with command and args
mmaictl job create my-job --image=python:3.9 --command=python --args="--version"
# Create a job with AMD GPU
mmaictl job create my-job --image=tensorflow/tensorflow:latest --gpu=amd.com/gpu=2
# Create a job with specific NVIDIA GPU type
mmaictl job create my-job --image=tensorflow/tensorflow:latest --gpu=nvidia.com/gpu=1
# Create a job in a specific project
mmaictl job create my-job --image=nginx:latest --project=my-project
# Create a job with image pull secret
mmaictl job create my-job --image=private-registry/my-app:latest --image-pull-secret=my-secret
# Create an indexed job with custom restart policy
mmaictl job create my-job --image=python:3.9 --completion-mode=Indexed --restart-policy=OnFailure
# Create a job with environment variables
mmaictl job create my-job --image=python:3.9 --env=DEBUG=true,LOG_LEVEL=info
# Create a job with data volumes and FS group
mmaictl job create my-job --image=python:3.9 --data-volume=my-data=/data --data-volume=logs=/var/log --fs-group=1000
# Create a job with checkpoint support
mmaictl job create my-job --image=tensorflow/tensorflow:latest --enable-checkpointing
# Create a parallel job with multiple completions
mmaictl job create batch-job --image=python:3.9 --parallelism=5 --completions=10
# Create a comprehensive job with all options
mmaictl job create complex-job \
--image=tensorflow/tensorflow:latest \
--command=python \
--args=train.py,--epochs=100 \
--cpu=4 --memory=8Gi --gpu=nvidia.com/gpu=2 \
--project=ml-training \
--parallelism=2 --completions=1 \
--completion-mode=NonIndexed \
--restart-policy=OnFailure \
--backoff-limit=5 \
--active-deadline-seconds=7200 \
--data-volume=training-data=/data \
--data-volume=model-output=/models \
--fs-group=1000 \
--env=CUDA_VISIBLE_DEVICES=0,1,BATCH_SIZE=32 \
--image-pull-secret=docker-registry-secret \
--enable-checkpointing
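The Synopsis also mentions creating the job suspended for Kueue workload management; a minimal sketch using only the documented --suspend flag (not verified against a live cluster):
# Create a job in a suspended state so Kueue can admit it later
mmaictl job create my-job --image=nginx:latest --suspend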
Options
--active-deadline-seconds int32 Maximum time in seconds for the job to run
--args strings Arguments to pass to the command
--backoff-limit int32 Number of retries before marking the job as failed (default 6)
--command string Command to run in the container
--completion-mode string Completion mode for the job; can be NonIndexed, Indexed (default "NonIndexed")
--completions int32 Number of successful completions required (default 1)
--cpu string CPU resource request and limit (e.g., 100m, 1)
--data-volume stringToString Data volumes to mount (format: name=mountPath) (default [])
--enable-checkpointing Enable checkpoint support for the job
--env stringToString Environment variables to set (key=value pairs) (default [])
--fs-group int FS group ID for the security context (default -1)
--gpu string GPU resource request and limit (e.g., 1, nvidia.com/gpu=1, amd.com/gpu=2)
-h, --help help for create
--image string Container image to run
--image-pull-secret string Name of the image pull secret
--memory string Memory resource request and limit (e.g., 128Mi, 1Gi)
--parallelism int32 Number of pods to run in parallel (default 1)
--project string Project that this job belongs to (defaults to the project in the current context)
--restart-policy string Restart policy for the job pods; can be Never, OnFailure (default "Never")
--suspend Create the job in a suspended state (default false)
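The --gpu description above also lists a bare count (e.g., 1) alongside the vendor-qualified forms; assuming that form is accepted as written, a minimal sketch:
# Request two GPUs using the plain-count form from the --gpu description
mmaictl job create my-job --image=tensorflow/tensorflow:latest --gpu=2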
Options inherited from parent commands
-c, --config string Path to mmaictl config directory (default "~/.mmaictl")
--warnings-as-errors Treat warnings received from the server as errors and exit with a non-zero exit code
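Both inherited flags can be combined with job create; for instance (the config path here is illustrative):
# Point at an alternate config directory and treat server warnings as errors
mmaictl job create my-job --image=nginx:latest --config=/path/to/config --warnings-as-errors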
SEE ALSO
- mmaictl job - Operations on jobs