mmaictl job create
Create a Kubernetes Job
Synopsis
Create a Kubernetes Job with the specified name, image, and optional command/args. The job will be created and ready to run by default. Use --suspend to create it in a suspended state for Kueue workload management.
Examples
# Create a simple job (ready to run)
mmaictl job create my-job --image=nginx:latest
# Create a job with command and args
mmaictl job create my-job --image=python:3.9 --command=python --args="--version"
# Create a job with AMD GPU
mmaictl job create my-job --image=tensorflow/tensorflow:latest --gpu=amd.com/gpu=2
# Create a job with specific NVIDIA GPU type
mmaictl job create my-job --image=tensorflow/tensorflow:latest --gpu=nvidia.com/gpu=1
# Create a job in a specific project
mmaictl job create my-job --image=nginx:latest --project=my-project
# Create a job with image pull secret
mmaictl job create my-job --image=private-registry/my-app:latest --image-pull-secret=my-secret
# Create an indexed job with custom restart policy
mmaictl job create my-job --image=python:3.9 --completion-mode=Indexed --restart-policy=OnFailure
# Create a job with environment variables
mmaictl job create my-job --image=python:3.9 --env=DEBUG=true,LOG_LEVEL=info
# Create a job with data volumes and FS group
mmaictl job create my-job --image=python:3.9 --data-volume=my-data=/data --data-volume=logs=/var/log --fs-group=1000
# Create a job with checkpoint support
mmaictl job create my-job --image=tensorflow/tensorflow:latest --enable-checkpointing
# Create a parallel job with multiple completions
mmaictl job create batch-job --image=python:3.9 --parallelism=5 --completions=10
# Create a comprehensive job with all options
mmaictl job create complex-job \
--image=tensorflow/tensorflow:latest \
--command=python \
--args=train.py,--epochs=100 \
--cpu=4 --memory=8Gi --gpu=nvidia.com/gpu=2 \
--project=ml-training \
--parallelism=2 --completions=1 \
--completion-mode=NonIndexed \
--restart-policy=OnFailure \
--backoff-limit=5 \
--active-deadline-seconds=7200 \
--data-volume=training-data=/data \
--data-volume=model-output=/models \
--fs-group=1000 \
--env=CUDA_VISIBLE_DEVICES=0,1,BATCH_SIZE=32 \
--image-pull-secret=docker-registry-secret \
--enable-checkpointing
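The Synopsis also mentions creating the job suspended for Kueue workload management; a minimal sketch using only the documented --suspend flag (not verified against a live cluster):
# Create a job in a suspended state so Kueue can admit it later
mmaictl job create my-job --image=nginx:latest --suspend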
Options
--active-deadline-seconds int32 Maximum time in seconds for the job to run
--args strings Arguments to pass to the command
--backoff-limit int32 Number of retries before marking the job as failed (default 6)
--command string Command to run in the container
--completion-mode string Completion mode for the job; can be NonIndexed, Indexed (default "NonIndexed")
--completions int32 Number of successful completions required (default 1)
--cpu string CPU resource request and limit (e.g., 100m, 1)
--data-volume stringToString Data volumes to mount (format: name=mountPath) (default [])
--enable-checkpointing Enable checkpoint support for the job
--env stringToString Environment variables to set (key=value pairs) (default [])
--fs-group int FS group ID for the security context (default -1)
--gpu string GPU resource request and limit (e.g., 1, nvidia.com/gpu=1, amd.com/gpu=2)
-h, --help help for create
--image string Container image to run
--image-pull-secret string Name of the image pull secret
--memory string Memory resource request and limit (e.g., 128Mi, 1Gi)
--parallelism int32 Number of pods to run in parallel (default 1)
--project string Project that this job belongs to (defaults to the project in the current context)
--restart-policy string Restart policy for the job pods; can be Never, OnFailure (default "Never")
--suspend Create the job in a suspended state (default false)
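The --gpu description above also lists a bare count (e.g., 1) alongside the vendor-qualified forms; assuming that form is accepted as written, a minimal sketch:
# Request two GPUs using the plain-count form from the --gpu description
mmaictl job create my-job --image=tensorflow/tensorflow:latest --gpu=2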
Options inherited from parent commands
-c, --config string Path to mmaictl config directory (default "~/.mmaictl")
--warnings-as-errors Treat warnings received from the server as errors and exit with a non-zero exit code
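Both inherited flags can be combined with job create; for instance (the config path here is illustrative):
# Point at an alternate config directory and treat server warnings as errors
mmaictl job create my-job --image=nginx:latest --config=/path/to/config --warnings-as-errors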
SEE ALSO
- mmaictl job - Operations on jobs