OpCenter Configuration Parameters
Configurable parameters control the behavior of the OpCenter.
Introduction
The OpCenter has configuration parameters that apply to the operation of the OpCenter server or that provide the default settings for jobs submitted to the OpCenter. You can change the values for most of the configuration parameters.
You can view or change parameter values using the CLI or the web interface. For some changes to take effect, you must restart the OpCenter.
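
Because only some parameters need a restart, it can help to check a planned set of changes before applying them. The following Python sketch copies the restart flags for a few keys from the parameter table in the next section. The `RESTART_REQUIRED` mapping and the `needs_restart` helper are hypothetical names used for illustration; they are not part of OpCenter or its CLI.

```python
# Restart flags copied from the "Restart Required?" column for a few keys.
# Illustrative subset only; flags may differ between OpCenter releases.
RESTART_REQUIRED = {
    "address": True,
    "maxProc": True,
    "sessionTTL": False,
    "cloud.createVMPolicy": False,
    "cloud.reqRetryLimit": True,
    "log.level": False,
    "security.cacheTTL": True,
}


def needs_restart(changed_keys):
    """Return the sorted list of changed keys that require an OpCenter restart."""
    unknown = [key for key in changed_keys if key not in RESTART_REQUIRED]
    if unknown:
        raise KeyError(f"not in the illustrative subset: {unknown}")
    return sorted(key for key in changed_keys if RESTART_REQUIRED[key])


if __name__ == "__main__":
    # Only maxProc in this set is marked as requiring a restart.
    print(needs_restart(["log.level", "maxProc", "sessionTTL"]))
```

An empty result means that, according to the table, none of the planned changes should need a restart.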
Configuration Parameters
The following table shows the OpCenter configuration parameters.
Note
Default values for parameters may differ between OpCenter releases.
Key | Default Value | Editable | Restart Required? | Definition |
---|---|---|---|---|
address | 0.0.0.0:443 | Yes | Yes | Address(es) on which OpCenter listens for HTTPS requests. The default means all interfaces. |
maxProc | 2 | Yes | Yes | Maximum number of virtual CPUs used by OpCenter processes |
sessionTTL | 168h0m0s | Yes | No | Duration until login token becomes invalid |
sessionTimeout | 1h0m0s | Yes | No | Maximum session inactivity time (minimum value allowed is 1h) |
cloud.candidateInstTypeLimit | 16 | Yes | Yes | Maximum number of compute instance types selected that match the CPU and memory constraints |
cloud.cpuCompatibleMode | loose | Yes | No | Policy for checking whether CPU of new VM is compatible with the snapshot image from previous VM |
cloud.createVMPolicy | spotFirst | Yes | No | Policy that defines which VM pay type is selected first |
cloud.createVMRetryInterval | 10m0s | Yes | No | If VM creation fails in one cycle of attempts, wait this interval before retrying |
cloud.createVMRetryLimit | 3 | Yes | No | Number of VM creation attempt cycles before VM creation is abandoned |
cloud.createVolumeRetryLimit | 6 | Yes | No | Maximum number of attempts to create storage volume |
cloud.enableCarbonEmission | true | Yes | No | Enable the carbon emissions calculator |
cloud.floatVolSizeLimit | 128 | Yes | No | Worker node memory size threshold below which OpCenter automatically creates a volume to store snapshot images |
cloud.handleRebalanceMemThreshold | 64G | Yes | No | Threshold memory size above which AWS Rebalance Recommendation signal triggers job migration |
cloud.imageVolumeSize | 6 | Yes | No | Size of volume used as container root volume |
cloud.imageVolumeType | gp2 | Yes | No | Type of volume used as container root volume |
cloud.instTypeBanTTL | 3h | Yes | Yes | Duration for which a VM instance type is quarantined after an attempt to create a VM of this type fails |
cloud.instTypeRetryLimit | 10 | Yes | No | Number of VM instance types included in one VM creation attempt cycle |
cloud.maxSpotReclaim | 0 (unlimited) | Yes | No | Number of spot reclaim events allowed for each job |
cloud.miUpdateInterval | 1h0m0s | Yes | No | Interval between checks for updates to the machine image repository |
cloud.recreateVMRetryLimit | 120h0m0s | Yes | No | Maximum time allowed for OpCenter to create a VM to restore a snapshot to a running state |
cloud.reqRetryInterval | 1s | Yes | Yes | Interval between VM instance creation attempts within a cycle |
cloud.reqRetryLimit | 6 | Yes | Yes | Maximum number of VM creation attempts within one cycle |
cloud.securityGroups | sg-*** | Yes | No | Security group(s) applied to every worker node |
cloud.securityRole | quota-mvWorkerNodeProfile-** | Yes | No | IAM role assigned to worker node |
cloud.snapLocation | local | Yes | No | Location where snapshot images are stored |
cloud.swapFileSize | 4G | Yes | No | Size of memory swap space configured for each worker node |
cloud.vmInitTimeout | 20m0s | Yes | No | Maximum time allowed to create VM |
gui.autoRefreshInterval | 5m0s | Yes | No | Interval between automatic refreshes of OpCenter web interface display |
history.enabled | true | Yes | No | Flag to enable (or disable) the service that compiles a history of job metadata |
image.cachePath | s3://opcenter-bucket-***/images | Yes | No | Location where container images are cached |
image.imageUpdateInterval | 10m | Yes | Yes | Interval between refreshes of the container image library |
license.licenseCheckInterval | 30m0s | Yes | Yes | Interval between checks of license status |
license.licenseServer | https://license.memverge.com | Yes | No | URL to access MemVerge license server |
log.file | /var/log/memverge/opcenter.log | No | NA | Path to OpCenter log file (on OpCenter server) |
log.hostLogRetainTime | 168h0m0s | Yes | No | Maximum age of any host log (older logs are automatically deleted) |
log.level | info | Yes | No | Linux-style log level for recording OpCenter events |
log.maxBackups | 10 | Yes | No | Maximum number of logs of each type |
log.maxSize | 10 | Yes | No | Maximum size of each log |
metrics.ocMetricsInterval | 10s | Yes | No | Interval between updates to the OpCenter metrics |
metrics.ocMetricsRetention | 2160h0m0s | Yes | Yes | Maximum age of OpCenter metric files (files older than this value are deleted) |
migrate.cpuDisable | true | Yes | No | Option to disable (or enable if set to false) WaveRider based on CPU utilization |
migrate.cpuLimit | 0 | Yes | No | Upper limit on the number of virtual CPUs when migrating to a larger CPU (0 means no limit) |
migrate.cpuLowerBoundDuration | 5m0s | Yes | No | Time that CPU utilization must remain below the lower threshold for CPU utilization to trigger job migration to a smaller CPU |
migrate.cpuLowerBoundRatio | 5 | Yes | No | Lower threshold (measured as a percentage of the maximum utilization) for CPU utilization |
migrate.cpuLowerLimit | 0 | Yes | No | Lower limit on the number of virtual CPUs when migrating to a smaller CPU (0 means no limit) |
migrate.cpuMigrateStep | 50 | Yes | No | Percentage increase (or decrease) in the number of virtual CPUs when migrating to a larger (or smaller) CPU |
migrate.cpuUpperBoundDuration | 2m0s | Yes | No | Time that CPU utilization must remain above the upper threshold for CPU utilization to trigger job migration to a larger CPU |
migrate.cpuUpperBoundRatio | 90 | Yes | No | Upper threshold (measured as a percentage of the maximum utilization) for CPU utilization |
migrate.createVMFirst | true | Yes | No | Option to create new VM instance before capturing snapshot. Setting this to false means that the snapshot is captured before the new VM instance is created. |
migrate.diskReadyTimeout | 10m0s | Yes | No | Maximum time allowed to attach a volume to store snapshot images in cases where the snapshot volume is not created automatically when the job starts |
migrate.enableAutoMigrate | true | Yes | No | Option to turn WaveRider on (or off) |
migrate.evadeOOM | true | Yes | No | Option to turn out-of-memory (OOM) protection on (or off). OOM protection means that any use of memory swap space triggers a job migration to a VM with more memory. |
migrate.memDisable | true | Yes | No | Option to ignore (true) or respond to (false) memory utilization when evaluating whether to migrate a job |
migrate.memLimit | 0 | Yes | No | Upper limit on memory size when migrating to a VM with more memory (0 means no limit) |
migrate.memLowerBoundDuration | 5m0s | Yes | No | Time that memory utilization must remain below the lower threshold for memory utilization to trigger job migration to a VM with less memory |
migrate.memLowerBoundRatio | 5 | Yes | No | Lower threshold (measured as a percentage of the maximum utilization) for memory utilization |
migrate.memLowerLimit | 0 | Yes | No | Lower limit on memory size when migrating to a VM with less memory (0 means no limit) |
migrate.memMigrateStep | 50 | Yes | No | Percentage increase (or decrease) in memory size when migrating to a VM with more (or less) memory |
migrate.memUpperBoundDuration | 2m0s | Yes | No | Time that memory utilization must remain above the upper threshold for memory utilization to trigger job migration to a VM with more memory |
migrate.memUpperBoundRatio | 90 | Yes | No | Upper threshold (measured as a percentage of the maximum utilization) for memory utilization |
migrate.oomCheckpointTimeout | 1h0m0s | Yes | No | Maximum time allowed to capture a memory snapshot in cases where OOM protection triggers job migration |
migrate.oomNoInstanceTypePolicy | | Yes | No | Action taken when OOM protection is triggered and no suitable VM instance is found to migrate to. An example action is "autoSuspend". |
migrate.stepAuto | true | Yes | No | Automatically calculate the step size (in the number of virtual CPUs or memory size) when migrating to a larger (or smaller) VM |
provider.allowList | [*] | Yes | No | List of VM instance types that specifies which instances are allowed when creating a new VM |
provider.denyList | [ ] | Yes | No | List of VM instance types that specifies which instances are NOT allowed when creating a new VM |
provider.gpuNameAllowList | [h100 v100 a100 t4 t4g m60 a10g] | Yes | No | List of GPU types that specifies which instances are allowed when creating a new VM |
provider.gpuVendorAllowList | [nvidia] | Yes | No | List of GPU vendors that specifies which instances are allowed when creating a new VM |
quota.autoResume | true | Yes | No | Whether a job that was suspended because the quota limit was reached resumes automatically after the quota is replenished |
quota.calcInterval | 1h0m0s | Yes | No | Interval between checks of the current job cost against the quota limit |
quota.coldSuspend | false | Yes | No | Type of suspend mode applied when quota limit reached or exceeded |
quota.notifyThreshold | 80 | Yes | No | Threshold (measured as a percentage of the quota limit) that triggers an alert to users that the quota limit is approaching |
quota.overageAction | cancel | Yes | No | Action applied to job when quota limit reached or exceeded |
report.updateInterval | 1h0m0s | Yes | No | Interval between reports of job usage metrics (core hours) to the license server |
scheduler.cloudParamsTTL | 3m | Yes | No | Lifetime of cache that stores the mapping of job parameters (CPU and memory) to VM instance type (used to create VM when job state changes from "submitted" to "initializing") |
scheduler.defaultDumpMode | full | Yes | No | Type of memory snapshot (full or incremental) |
scheduler.dirtyPageCheckInterval | 10s | Yes | No | Interval between checks of the dirty memory page count (dirty pages are pages whose content has changed) |
scheduler.dirtyPageThreshold | 9G | Yes | No | Threshold (determined by aggregate size of dirty memory pages) that triggers an incremental memory snapshot |
scheduler.enableResourceCleanup | true | Yes | No | Enable resource clean-up service (checks that all resources associated with completed, failed or canceled jobs are deleted) |
scheduler.executorPollInterval | 10ms | Yes | No | Interval between checks of the OpCenter executor status to ensure that queued jobs can be processed in time |
scheduler.extWorkPath | | Yes | No | URI that identifies path to external jobs |
scheduler.jobArchiveInterval | 168h0m0s | Yes | No | Maximum age of jobs in the "normal" state. Jobs older than this are in the "archive" state. |
scheduler.jobCleanupInterval | 1m0s | Yes | No | Interval between runs of the job resource clean-up service |
scheduler.jobCloudParamsCacheTTL | 1h | Yes | No | Lifetime of cache that stores verified job parameters |
scheduler.jobExecutorLimit | 128 | Yes | No | Maximum number of jobs processed in parallel |
scheduler.jobOptimizeInterval | 10m0s | Yes | No | Interval between attempts to migrate a job from an on-demand instance to a spot instance |
scheduler.jobTTL | 8640h0m0s | Yes | No | Maximum duration allowed for any job |
scheduler.jobUpdateInterval | 10s | Yes | No | Interval between checks of job status |
scheduler.resourceCleanupInterval | 24h0m0s | Yes | No | Interval between runs of the OpCenter resource clean-up service |
scheduler.workPath | /mnt/memverge/slurm/work | No | No | Path to NFS-shared directory required by the Slurm scheduler |
security.cacheTTL | 1m0s | Yes | Yes | Lifetime of cache holding authentication tokens |
security.certificateFolder | /etc/memverge/certs | Yes | Yes | Path to folder where security certificates are stored |
security.inlineUidBoundary | 0 | Yes | No | Offset applied to user UID when mapping user UID on OpCenter to user UID on worker node |
security.persistToken | false | Yes | No | Action applied to login authentication tokens when OpCenter restarts. Persist means tokens are saved. |
storage.updateInterval | 1h | Yes | No | Duration after which an inactive file system based on a registered storage service is unmounted by OpCenter |
template.templateSyncInterval | 24h0m0s | Yes | No | Interval between synchronization checks between OpCenter and MemVerge template repository |
template.templateUri | s3://mmce-data/templates-production | Yes | No | Location of MemVerge template repository |
upgrade.cacheFolder | /tmp/opcenter_builds | Yes | No | Path to stage new release (and associated metadata) before upgrading |
upgrade.checkInterval | 1h0m0s | Yes | No | Interval between checks for new OpCenter releases |
upgrade.cloudStorePath | s3://opcenter-bucket-*** | Yes | Yes | Location where OpCenter upgrade package is cached so it can be downloaded by worker nodes |
upgrade.releaseUri | s3://float-package | Yes | No | Location where available float releases are stored |
workflow.updateInterval | 5s | Yes | No | Interval between updates to the workflow view displayed in the OpCenter web interface |
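
Many defaults in the table are durations written as number-and-unit pairs, such as `168h0m0s`, `10m`, or `10ms`. This resembles the duration strings used by Go's `time` package, although the table does not state which formats OpCenter accepts. The Python sketch below converts such strings so that they are easier to compare. It assumes that only whole numbers and the units `h`, `m`, `s`, and `ms` occur, and `parse_duration` is a hypothetical helper, not an OpCenter function.

```python
import re
from datetime import timedelta

# Microseconds per unit; "ms" is listed before "m" in the pattern so that
# "10ms" is not read as 10 minutes followed by a stray "s".
_MICROSECONDS = {"h": 3_600_000_000, "m": 60_000_000, "s": 1_000_000, "ms": 1_000}
_TOKEN = re.compile(r"(\d+)(ms|h|m|s)")


def parse_duration(text):
    """Convert a string such as '168h0m0s' into a timedelta."""
    if not text:
        raise ValueError("empty duration")
    total = 0
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unrecognized duration: {text!r}")
        total += int(match.group(1)) * _MICROSECONDS[match.group(2)]
        pos = match.end()
    return timedelta(microseconds=total)


if __name__ == "__main__":
    # Default values taken from the table above.
    for key, value in [("sessionTTL", "168h0m0s"),
                       ("cloud.createVMRetryInterval", "10m0s"),
                       ("scheduler.executorPollInterval", "10ms")]:
        print(key, parse_duration(value))
```

With this conversion, the `sessionTTL` default of `168h0m0s` works out to seven days, and the `scheduler.jobTTL` default of `8640h0m0s` to 360 days.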
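
The `migrate.cpu*` parameters combine a utilization threshold, a duration, and a percentage step. The Python sketch below is one possible reading of how these defaults interact; it is not OpCenter's implementation. The function names, the fixed sampling interval, and the rounding rules are assumptions. In practice OpCenter selects from available instance types, `migrate.stepAuto` may override the fixed step, and CPU-based migration is off by default (`migrate.cpuDisable` is `true`).

```python
import math


def migration_direction(samples, interval_s=10,
                        lower_ratio=5, lower_duration_s=300,
                        upper_ratio=90, upper_duration_s=120):
    """Return 'up', 'down', or None for utilization samples (percent, newest last).

    Defaults mirror migrate.cpuLowerBoundRatio/Duration and
    migrate.cpuUpperBoundRatio/Duration; interval_s is a hypothetical sampling period.
    """
    def sustained(duration_s, predicate):
        needed = math.ceil(duration_s / interval_s)
        if needed == 0 or len(samples) < needed:
            return False
        return all(predicate(s) for s in samples[-needed:])

    if sustained(upper_duration_s, lambda s: s > upper_ratio):
        return "up"
    if sustained(lower_duration_s, lambda s: s < lower_ratio):
        return "down"
    return None


def next_vcpu_count(current, direction, step_percent=50,
                    upper_limit=0, lower_limit=0):
    """Apply migrate.cpuMigrateStep to a vCPU count; a limit of 0 means no limit."""
    delta = current * step_percent / 100
    if direction == "up":
        target = math.ceil(current + delta)       # assumed rounding: round up
        if upper_limit > 0:
            target = min(target, upper_limit)     # migrate.cpuLimit
    elif direction == "down":
        target = max(1, math.floor(current - delta))  # assumed: never below 1 vCPU
        if lower_limit > 0:
            target = max(target, lower_limit)     # migrate.cpuLowerLimit
    else:
        raise ValueError("direction must be 'up' or 'down'")
    return target


if __name__ == "__main__":
    busy = [95] * 12  # two minutes of samples above 90 percent
    print(migration_direction(busy), next_vcpu_count(8, "up"))
```

Under these assumptions, an 8-vCPU job that stays above 90 percent utilization for two minutes would be a candidate for a 12-vCPU instance. The `migrate.mem*` parameters follow the same pattern for memory size.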