OpCenter Configuration Parameters
Configurable parameters control the behavior of the OpCenter.
Introduction
The OpCenter has configuration parameters that apply to the operation of the OpCenter server or that provide the default settings for jobs submitted to the OpCenter. You can change the values for most of the configuration parameters.
You can view or change parameter values using the CLI or the web interface. For some changes to take effect, you must restart the OpCenter.
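As an illustration of the restart rule, the following minimal sketch checks a set of planned changes against a small excerpt of the "Restart Required?" column from the parameter table in this document. The dictionary name `RESTART_REQUIRED` and the helper `needs_restart` are hypothetical and are not part of the OpCenter CLI or API; only the five keys and their restart flags are taken from the table.

```python
# Hypothetical helper, not part of OpCenter: decides whether a planned set of
# parameter changes requires an OpCenter restart.

# Excerpt of the parameter table: key -> True if "Restart Required?" is Yes.
RESTART_REQUIRED = {
    "address": True,
    "maxProc": True,
    "sessionTTL": False,
    "cloud.instTypeBanTTL": True,
    "log.level": False,
}


def needs_restart(changed_keys):
    """Return the sorted list of changed keys that require a restart."""
    unknown = [key for key in changed_keys if key not in RESTART_REQUIRED]
    if unknown:
        # Refuse to guess for keys that are missing from the excerpt.
        raise KeyError(f"not in excerpt: {unknown}")
    return sorted(key for key in changed_keys if RESTART_REQUIRED[key])


if __name__ == "__main__":
    pending = needs_restart(["sessionTTL", "address", "log.level"])
    if pending:
        print("Restart OpCenter to apply:", ", ".join(pending))
    else:
        print("No restart needed.")
```

An empty result means that, according to the table, every change in the set takes effect without a restart.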
Configuration Parameters
The following table shows the OpCenter configuration parameters.
Note
Default values for parameters may differ between OpCenter releases.
Key | Default Value | Editable | Restart Required? | Definition |
---|---|---|---|---|
address | 0.0.0.0:443 | Yes | Yes | Address(es) on which OpCenter listens for HTTPS requests. The default value means all interfaces. |
maxProc | 2 | Yes | Yes | Maximum number of virtual CPUs used by OpCenter processes |
sessionTTL | 168h0m0s | Yes | No | Duration until login token becomes invalid |
sessionTimeout | 1h0m0s | Yes | No | Maximum session inactivity time (minimum value allowed is 1h) |
cloud.candidateInstTypeLimit | 16 | Yes | Yes | Maximum number of compute instance types selected that match the CPU and memory constraints |
cloud.cpuCompatibleMode | loose | Yes | No | Policy for checking whether CPU of new VM is compatible with the snapshot image from previous VM |
cloud.createVMPolicy | spotFirst | Yes | No | Policy that defines which VM pay type is selected first |
cloud.createVMPriceLimit | 0 | Yes | No | Maximum hourly rate above which VM is not a candidate for the job |
cloud.createVMPriceLimitPercent | 0 | Yes | No | Maximum spot instance hourly rate, measured as a percentage of the equivalent on-demand hourly rate, above which VM is not a candidate for the job |
cloud.createVMRetryInterval | 10m0s | Yes | No | If VM creation fails in one cycle of attempts, wait this interval before retrying |
cloud.createVMRetryLimit | 1 | Yes | No | Number of VM creation attempt cycles before VM creation is abandoned |
cloud.createVolumeRetryLimit | 6 | Yes | No | Maximum number of attempts to create storage volume |
cloud.enableCarbonEmission | true | Yes | No | Enable the carbon emissions calculator |
cloud.floatVolSizeLimit | 128 | Yes | No | Worker node memory size threshold below which OpCenter automatically creates a volume to store snapshot images |
cloud.handleRebalanceMemThreshold | 64G | Yes | No | Threshold memory size above which AWS Rebalance Recommendation signal triggers job migration |
cloud.imageVolumeSize | 6 | Yes | No | Size of volume used as container root volume |
cloud.imageVolumeType | gp2 | Yes | No | Type of volume used as container root volume |
cloud.instTypeBanTTL | 3h | Yes | Yes | Duration for which a VM instance type is quarantined after an attempt to create a VM of this type fails |
cloud.instTypeCachePath | | Yes | No | File path (local or in the cloud) to store instance type information. Use in cases where the OpCenter is deployed in a private VPC and cannot query the cloud API to get instance types. In such a case, the OpCenter retrieves instance type information from the cache. |
cloud.instTypeOrderMethod | price | Yes | Yes | Criterion used to order candidate instance types |
cloud.instTypeRetryLimit | 10 | Yes | No | Number of VM instance types included in one VM creation attempt cycle |
cloud.interruptionPoss | true | Yes | No | When set to true, check the likelihood of spot instance reclaim when determining candidate instance types. (Not used currently.) |
cloud.maxSpotReclaim | 3 | Yes | No | Number of spot reclaim events allowed for each job before the job moves to an on-demand instance |
cloud.miUpdateInterval | 1h0m0s | Yes | No | Interval between checks for updates to the machine image repository |
cloud.nameserver | | Yes | No | IP address of domain name server (use to override the default domain name server) |
cloud.recreateVMRetryLimit | 120h0m0s | Yes | No | Maximum time allowed for OpCenter to create a VM to restore a snapshot to a running state |
cloud.reqRetryInterval | 1s | Yes | Yes | Interval between VM instance creation attempts within a cycle |
cloud.reqRetryLimit | 6 | Yes | Yes | Maximum number of VM creation attempts within one cycle |
cloud.securityGroups | sg-*** | Yes | No | Security group(s) applied to every worker node |
cloud.securityRole | [OpCenter_name]-mvWorkerNodeProfile-** | Yes | No | IAM role assigned to worker node |
cloud.snapLocation | local | Yes | No | Location where snapshot images are stored |
cloud.snapSkipOpenFileList | | Yes | No | On checkpoint or restore, skip file size checks on files in this list (wildcards supported) |
cloud.subnetIPCountLowerLimit | 5 | Yes | Yes | Lower limit on the number of available IP addresses in a subnet. If a subnet has fewer available IP addresses than this limit, the subnet is ignored. |
cloud.subnetList | [ ] | Yes | No | List of subnets in which to create VMs for jobs |
cloud.swapDurationOnOOM | 0s | Yes | No | Time threshold to trigger OOM migration after swap space usage passes threshold |
cloud.swapFileSize | 4G | Yes | No | Size of swap space configured for each worker node |
cloud.swapUsageOnOOM | 0.5 | Yes | No | Amount of swap space (measured as a fraction of the total swap space) that must be used before the OOM duration counter starts. If both thresholds are crossed, OOM migration is triggered. |
cloud.swapVolSizeLimit | 16 | Yes | No | Maximum swap capacity (in GB) above which a dedicated volume for swap space is created |
cloud.swapVolType | gp3 | Yes | No | Type of volume used if a dedicated swap space volume is created |
cloud.vmInitTimeout | 20m0s | Yes | No | Maximum time allowed to create VM |
gui.autoRefreshInterval | 5m0s | Yes | No | Interval between automatic refreshes of OpCenter web interface display |
gui.defaultJobFilterUpdate | 2016h0m0s | Yes | No | Default filter applied to job listings in the Jobs dashboard is update<=duration where duration is specified by gui.defaultJobFilterUpdate |
history.enabled | true | Yes | No | Flag to enable (or disable) the service that compiles a history of job metadata |
image.cachePath | file:///mnt/memverge/image | Yes | No | Location where container images are cached |
image.defaultFilter | | Yes | No | Default filter applied to listing of container images, for example, "category=data_science" |
image.imageUpdateInterval | 10m | Yes | Yes | Interval between refreshes of the container image library |
license.licenseCheckInterval | 30m0s | Yes | Yes | Interval between checks of license status |
license.licenseServer | https://license.memverge.com | Yes | No | URL to access MemVerge license server |
log.file | /var/log/memverge/opcenter.log | No | NA | Path to OpCenter log file (on OpCenter server) |
log.hostLogRetainTime | 168h0m0s | Yes | No | Maximum age of any host log (older logs are automatically deleted) |
log.level | info | Yes | No | Linux-style log level for recording OpCenter events |
log.logPruneFreeSpaceRatio | 0.4 | Yes | No | Threshold for the free space ratio on the disk supporting logs (/mnt/memverge); log files are pruned when the free space ratio crosses this threshold |
log.logPruneMinSpaceRatio | 3 | Yes | No | Minimum free disk space capacity (in GB) that triggers log pruning |
log.maxBackups | 10 | Yes | No | Maximum number of logs of each type |
log.maxSize | 10 | Yes | No | Maximum size of each log (in GB) |
metrics.ocMetricsInterval | 10s | Yes | No | Interval between updates to the OpCenter metrics |
metrics.ocMetricsRetention | 2160h0m0s | Yes | Yes | Maximum age of OpCenter metric files (files older than this value are deleted) |
migrate.abortUnderOOMKiller | false | Yes | No | If set to false, OpCenter ignores OOM scores assigned by the Linux kernel. If set to true, OpCenter kills jobs with high OOM scores rather than migrating them because of an OOM trigger. |
migrate.cpuDisable | true | Yes | No | Option to disable (or enable if set to false) WaveRider based on CPU utilization |
migrate.cpuLimit | 0 | Yes | No | Upper limit on the number of virtual CPUs when migrating to a larger CPU (0 means no limit) |
migrate.cpuLowerBoundDuration | 5m0s | Yes | No | Time that CPU utilization must remain below the lower threshold for CPU utilization to trigger job migration to a smaller CPU |
migrate.cpuLowerBoundRatio | 5 | Yes | No | Lower threshold (measured as a percentage of the maximum utilization) for CPU utilization |
migrate.cpuLowerLimit | 0 | Yes | No | Lower limit on the number of virtual CPUs when migrating to a smaller CPU (0 means no limit) |
migrate.cpuMigrateStep | 50 | Yes | No | Percentage increase (or decrease) in the number of virtual CPUs when migrating to a larger (or smaller) CPU |
migrate.cpuUpperBoundDuration | 2m0s | Yes | No | Time that CPU utilization must remain above the upper threshold for CPU utilization to trigger job migration to a larger CPU |
migrate.cpuUpperBoundRatio | 90 | Yes | No | Upper threshold (measured as a percentage of the maximum utilization) for CPU utilization |
migrate.createVMFirst | true | Yes | No | Option to create new VM instance before capturing snapshot. Setting this to false means that the snapshot is captured before the new VM instance is created. |
migrate.diskReadyTimeout | 10m0s | Yes | No | Maximum time allowed to attach a volume to store snapshot images in cases where the snapshot volume is not created automatically when the job starts |
migrate.enableAutoMigrate | true | Yes | No | Option to turn WaveRider on (or off) |
migrate.evadeOOM | true | Yes | No | Option to turn out-of-memory (OOM) protection on (or off). OOM protection means that use of memory swap space triggers a job migration to a VM with more memory. |
migrate.incompatibleInstTypeRetryLimit | 16 | Yes | No | Maximum number of attempts to create a compatible VM instance when migrating a job |
migrate.memDisable | true | Yes | No | Option to ignore (true) or respond to (false) memory utilization when evaluating whether to migrate job |
migrate.memLimit | 0 | Yes | No | Upper limit on memory size when migrating to a VM with more memory (0 means no limit) |
migrate.memLowerBoundDuration | 5m0s | Yes | No | Time that memory utilization must remain below the lower threshold for memory utilization to trigger job migration to a VM with less memory |
migrate.memLowerBoundRatio | 5 | Yes | No | Lower threshold (measured as a percentage of the maximum utilization) for memory utilization |
migrate.memLowerLimit | 0 | Yes | No | Lower limit on memory size when migrating to a VM with less memory (0 means no limit) |
migrate.memMigrateStep | 50 | Yes | No | Percentage increase (or decrease) in memory size when migrating to a VM with more (or less) memory |
migrate.memUpperBoundDuration | 2m0s | Yes | No | Time that memory utilization must remain above the upper threshold for memory utilization to trigger job migration to a VM with more memory |
migrate.memUpperBoundRatio | 90 | Yes | No | Upper threshold (measured as a percentage of the maximum utilization) for memory utilization |
migrate.oomCheckpointTimeout | 1h0m0s | Yes | No | Maximum time allowed to capture a memory snapshot in cases where OOM protection triggers job migration |
migrate.oomNoInstanceTypePolicy | | Yes | No | Action taken when OOM protection is triggered and no suitable VM instance is found to migrate to. An example action is "autoSuspend". |
migrate.optimizeCost | false | Yes | No | If set to true, enable cost optimization policy. |
migrate.optimizeThreshold | 0.9 | Yes | No | Migrate job to a new instance if current instance cost is more than migrate.optimizeThreshold (measured in $ per hour) |
migrate.stepAuto | true | Yes | No | Automatically calculate the step size (in the number of virtual CPUs or memory size) when migrating to a larger (or smaller) VM |
provider.allowList | [*] | Yes | No | List of VM instance types that specifies which instances are allowed when creating a new VM |
provider.denyList | [ ] | Yes | No | List of VM instance types that specifies which instances are NOT allowed when creating a new VM |
provider.gpuNameAllowList | [h100 v100 a100 t4 t4g m60 a10g] | Yes | No | List of GPU types that specifies which instances are allowed when creating a new VM |
provider.gpuVendorAllowList | [nvidia] | Yes | No | List of GPU vendors that specifies which instances are allowed when creating a new VM |
quota.autoResume | true | Yes | No | Action applied, after the quota is replenished, to a job that was suspended because the quota limit was reached |
quota.calcInterval | 1h0m0s | Yes | No | Interval between checks of the current job cost against the quota limit |
quota.coldSuspend | false | Yes | No | Type of suspend mode applied when quota limit reached or exceeded |
quota.notifyThreshold | 80 | Yes | No | Threshold (measured as a percentage of the quota limit) that triggers an alert to users that quota limit approaching |
quota.overageAction | cancel | Yes | No | Action applied to job when quota limit reached or exceeded |
report.customerDefFile | | Yes | Yes | Path to file that defines customer-specific ratios used to calculate customer bill. Ratios are: scheduler, compute, and storage. |
report.externalCostFolder | | Yes | Yes | Location of information included in external cost in customer bills |
report.timeDiff | 0s | Yes | No | Adjustment to UTC to produce reports specific to customer's time zone |
report.updateInterval | 1h0m0s | Yes | No | Interval between reports of job usage metrics (core hours) to the license server |
scheduler.cloudParamsTTL | 3m | Yes | No | Lifetime of cache that stores the mapping of job parameters (cpu and memory) to VM instance type (used to create VM when job state changes from "submitted" to "initializing") |
scheduler.defaultDumpMode | full | Yes | No | Type of memory snapshot (full or incremental) |
scheduler.dirtyPageCheckInterval | 10s | Yes | No | Interval between checks of the dirty memory page count (dirty pages are pages whose content has changed) |
scheduler.dirtyPageThreshold | 9G | Yes | No | Threshold (determined by aggregate size of dirty memory pages) that triggers an incremental memory snapshot |
scheduler.enableResourceCleanup | true | Yes | No | Enable resource clean-up service (checks that all resources associated with completed, failed or canceled jobs are deleted) |
scheduler.executorPollInterval | 10ms | Yes | No | Interval between checks of the OpCenter executor status to ensure that queued jobs can be processed in time |
scheduler.extWorkPath | | Yes | No | URI that identifies path to external jobs |
scheduler.jobArchiveInterval | 30m0s | Yes | No | Maximum age of jobs in the "normal" state. Jobs older than this are in the "archive" state. |
scheduler.jobCleanupInterval | 1m0s | Yes | No | Interval between runs of the job resource clean-up service |
scheduler.jobCloudParamsCacheTTL | 1h | Yes | No | Lifetime of cache that stores verified job parameters |
scheduler.jobExecutorLimit | 128 | Yes | No | Maximum number of jobs processed in parallel |
scheduler.jobOptimizeInterval | 10m0s | Yes | No | Interval between attempts to migrate a job from an on-demand instance to a spot instance |
scheduler.jobTTL | 8640h0m0s | Yes | No | Maximum duration allowed for any job |
scheduler.jobUpdateInterval | 10s | Yes | No | Interval between checks of job status |
scheduler.resourceCleanupInterval | 24h0m0s | Yes | No | Interval between runs of the OpCenter resource clean-up service |
scheduler.workPath | /mnt/memverge/slurm/work | No | No | Path to NFS-shared directory required by the Slurm scheduler |
security.cacheTTL | 1m0s | Yes | Yes | Lifetime of cache holding authentication tokens |
security.certificateFolder | /etc/memverge/certs | Yes | Yes | Path to folder where security certificates are stored |
security.inlineUidBoundary | 0 | Yes | No | Offset applied to user UID when mapping user UID on OpCenter to user UID on worker node |
security.persistToken | false | Yes | No | Action applied to login authentication tokens when OpCenter restarts. Persist means tokens are saved. |
storage.updateInterval | 1h | Yes | No | Duration after which an inactive file system based on a registered storage service is unmounted by OpCenter |
template.templateSyncInterval | 24h0m0s | Yes | No | Interval between synchronization checks between OpCenter and MemVerge template repository |
template.templateUri | s3://mmce-data/templates-production | Yes | No | Location of MemVerge template repository |
upgrade.cacheFolder | /tmp/opcenter_builds | Yes | No | Path to stage new release (and associated metadata) before upgrading |
upgrade.checkInterval | 1h0m0s | Yes | No | Interval between checks for new OpCenter releases |
upgrade.cloudStorePath | s3://opcenter-bucket-*** | Yes | Yes | Location where OpCenter upgrade package is cached so it can be downloaded by worker nodes |
upgrade.releaseUri | s3://float-package | Yes | No | Location where available float releases are stored |
workflow.updateInterval | 5s | Yes | No | Interval between updates to the workflow view displayed in the OpCenter web interface |
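
Many default values in the table are durations written as a sequence of number and unit pairs, for example 168h0m0s, 3h, or 10ms. When comparing or validating such values in a script, it can help to convert them to a standard time type. The following is a minimal sketch, assuming only the unit suffixes that appear in the table (h, m, s, ms) and whole-number quantities; `parse_duration` is a hypothetical helper and is not part of OpCenter.

```python
import re
from datetime import timedelta

# Microseconds per unit, limited to the suffixes seen in the table above.
# Assumption: quantities are whole numbers, as in every table entry.
_UNIT_MICROSECONDS = {
    "h": 3_600_000_000,
    "m": 60_000_000,
    "s": 1_000_000,
    "ms": 1_000,
}

# "ms" is listed first so that "10ms" is not read as 10 minutes plus "s".
_TOKEN = re.compile(r"(\d+)(ms|h|m|s)")


def parse_duration(text: str) -> timedelta:
    """Convert a string such as '168h0m0s' or '10ms' to a timedelta."""
    position = 0
    total = 0
    for match in _TOKEN.finditer(text):
        if match.start() != position:
            # Something between two tokens did not parse.
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_MICROSECONDS[match.group(2)]
        position = match.end()
    if position == 0 or position != len(text):
        # Empty input, no tokens at all, or trailing text.
        raise ValueError(f"invalid duration: {text!r}")
    return timedelta(microseconds=total)


if __name__ == "__main__":
    for value in ("168h0m0s", "10m0s", "3h", "10ms"):
        print(value, "=", parse_duration(value))
```

Using integer microseconds avoids floating-point rounding, and rejecting any unparsed text keeps a mistyped value such as "10 m" from being silently accepted.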