Policy-driven Job Migration

When enabled, WaveRider uses a rules-based policy to determine when to migrate a job to a different virtual machine.

About this task

Use the --migratePolicy option with the float submit command to define the rules that determine when a job is migrated to a new virtual machine. To add a policy to a running job, or to change the policy already associated with a running job, use the float modify command. Equivalent actions are available in the web interface.

The policy works as follows. If the upper threshold for CPU or memory utilization is crossed and utilization remains elevated for a specified interval, the job is migrated to a virtual machine that has more virtual CPUs or more memory. The increase in size is measured as a percentage defined by the step parameter. Similar behavior occurs if the lower threshold is crossed and utilization remains low for a specified interval: the job is moved to a smaller virtual machine.
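
As an illustration of the rule described above, the following Python sketch models the threshold-plus-duration check and the step calculation. It is not WaveRider's implementation: the names Rule, decide, and target_capacity are hypothetical, and the sample format (time in seconds, utilization as a percentage) is an assumption. Only the default values are taken from this page. The virtual machine that is actually selected also depends on the instance types on offer, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Rule:
    """Thresholds for one resource (CPU or memory); defaults mirror this page."""
    upper_ratio: float = 90.0      # percent
    lower_ratio: float = 5.0       # percent
    upper_duration: float = 30.0   # seconds (30s)
    lower_duration: float = 300.0  # seconds (5m0s)
    step: float = 50.0             # percent change in capacity


def decide(samples: Iterable[Tuple[float, float]], rule: Rule) -> Optional[str]:
    """Return "up", "down", or None for (time_s, utilization_pct) samples.

    Assumed model: migration triggers once utilization has stayed
    continuously beyond a threshold for at least the matching duration.
    """
    above_since = None  # time at which utilization first exceeded the upper threshold
    below_since = None  # time at which utilization first dropped under the lower threshold
    for t, util in samples:
        if util > rule.upper_ratio:
            if above_since is None:
                above_since = t
        else:
            above_since = None  # dipping back resets the timer
        if util < rule.lower_ratio:
            if below_since is None:
                below_since = t
        else:
            below_since = None
        if above_since is not None and t - above_since >= rule.upper_duration:
            return "up"
        if below_since is not None and t - below_since >= rule.lower_duration:
            return "down"
    return None


def target_capacity(current: float, direction: str, rule: Rule) -> float:
    """Apply the step percentage to the current vCPU count or memory size."""
    if direction == "up":
        return current * (1 + rule.step / 100.0)
    if direction == "down":
        return current * (1 - rule.step / 100.0)
    raise ValueError(f"unknown direction: {direction!r}")


# Utilization stays above 90% for 30 seconds, so the sketch reports "up".
print(decide([(0, 95), (10, 96), (20, 97), (30, 98)], Rule()))
# With step=50, a 4-vCPU job would target 6 vCPUs.
print(target_capacity(4, "up", Rule()))
```

In this model a single sample back inside the thresholds resets the timer, which is one reasonable reading of "remains elevated for a specified interval".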

Procedure

  1. Turn policy-driven migration on (default is off).
    • CLI: To enable policy-driven migration, use --migratePolicy [enable=true] or --migratePolicy [disable=false].
    • Web interface: On the Submit Job screen, go to the Advanced section and toggle the Auto-Migration Policy setting from Off to On.
  2. If needed, override default values for policy rules.
    • CLI: Attach parameters to --migratePolicy as a string enclosed in square brackets (only include parameters that have values different from the default).
      The following parameters can be included in the string (default values listed in parentheses). If a unit is not shown, the value is a percentage of the maximum possible.
      • cpu.upperBoundRatio (90): upper threshold for utilization per virtual CPU (percentage)
      • cpu.lowerBoundRatio (5): lower threshold for utilization per virtual CPU (percentage)
      • cpu.upperBoundDuration (30s): time that utilization per virtual CPU must remain above the upper threshold before migration is triggered
      • cpu.lowerBoundDuration (5m0s): time that utilization per virtual CPU must remain below the lower threshold before migration is triggered
      • cpu.step (50): The percentage increase (or decrease) in the number of virtual CPUs in the new virtual machine versus the original virtual machine
      • cpu.limit (use 0 for unlimited): The maximum number of vCPUs allowed. If a job migrates to a VM with this number of vCPUs, then migration to a VM with more vCPUs is not permitted. Migration to a VM with fewer vCPUs is permitted.
      • mem.upperBoundRatio (90): upper threshold for memory utilization (percentage)
      • mem.lowerBoundRatio (5): lower threshold for memory utilization (percentage)
      • mem.upperBoundDuration (30s): time that memory utilization must remain above the upper threshold before migration is triggered
      • mem.lowerBoundDuration (5m0s): time that memory utilization must remain below the lower threshold before migration is triggered
      • mem.step (50): The percentage increase (or decrease) in memory capacity of the new virtual machine versus the original virtual machine
      • mem.limit (use 0 for unlimited): The maximum memory capacity (in GB) allowed. If a job migrates to a VM with this memory capacity, then migration to a VM with more memory is not permitted. Migration to a VM with less memory is permitted.
      • stepAuto (false): If stepAuto is set to true, then values for cpu.step and mem.step are calculated dynamically before each migration. Do not combine with cpu.step and mem.step.
      • evade.OOM (false): If set to true, then the values set for cpu.limit and mem.limit are overridden. If an application touches swap space, migration continues until the largest VM offered by the CSP is reached.
    • Web interface: In the Advanced section of the Submit Job screen, change values in fields pre-populated with default values. You can view the changes in the generated command line on the right-hand side.
  3. Submit the job with the auto-migration policy included.
    • CLI: Use the float submit command with the --migratePolicy option.
      Example:
      float submit -i python -j ./python_job_script.sh --dataVolume [size=10]:/data -c 4 -m 8 --migratePolicy [enable=true,mem.upperBoundRatio=60]
      id: 8dF5j3yTnuzbHFKUIJTYt
      name: python-
      user: tester
      imageID: docker.io/bitnami/python:latest
      status: Submitted
      submitTime: "2023-01-22T23:25:47Z"
      duration: 7s
      inputArgs: ' -j ./python_job_script.sh  -i python  --migratePolicy [enable=true,mem.upperBoundRatio=60]  -m 8  -c 4  --dataVolume [size=10]:/data '
      vmPolicy:
          policy: spotFirst
          retryLimit: 3
          retryInterval: 10m0s
      migratePolicy:
          cpu:
              upperBoundRatio: 90
              lowerBoundRatio: 5
              upperBoundDuration: 30s
              lowerBoundDuration: 5m0s
              step: 50
          mem:
              upperBoundRatio: 60
              lowerBoundRatio: 5
              upperBoundDuration: 30s
              lowerBoundDuration: 5m0s
              step: 50
    • Web interface: Fill in the fields (including those in the Advanced section) in the Submit Job screen and then click on Submit.
  4. Modify the auto-migration policy associated with a running job.
    • CLI: To modify the auto-migration policy associated with a running job or to turn auto-migration on, use float modify --migratePolicy <policy-string> -j <job_id>
      Example (turn auto-migration on and change one parameter from its default value):
      float modify --migratePolicy [enable=true,mem.upperBoundRatio=70] -j eFehjVpxNFet426BzCFgz
      Warning: Are you sure you want to modify eFehjVpxNFet426BzCFgz?
      New migratePolicy may impact auto-migration behavior.(yes/No): yes
      Successfully modified eFehjVpxNFet426BzCFgz:  --migratePolicy [enable=true,mem.upperBoundRatio=70]
    • Web interface: Go to the Jobs screen and locate your job by ID or Name. Under the Actions column, click on the Modify Job icon. Fill out the fields in the pop-up screen and then click on Modify.
  5. View a record of any migration events.
    • CLI: Use the float log cat job.events -j <job_id> command.
    • Web interface: On the Jobs screen, click on your job, and then go to the Attachments tab. Click on the Preview icon next to the job.events log.
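
The policy strings used in the procedure above can also be assembled in a script before they are passed to float submit or float modify. The Python sketch below is one possible way to do this. The helper build_migrate_policy is hypothetical and is not part of the float CLI; it only reproduces the bracketed, comma-separated key=value format shown in the examples, checks names against the parameters listed in step 2, and rejects stepAuto combined with cpu.step or mem.step.

```python
# Parameter names as listed in step 2 of the procedure.
KNOWN_PARAMS = {
    "cpu.upperBoundRatio", "cpu.lowerBoundRatio",
    "cpu.upperBoundDuration", "cpu.lowerBoundDuration",
    "cpu.step", "cpu.limit",
    "mem.upperBoundRatio", "mem.lowerBoundRatio",
    "mem.upperBoundDuration", "mem.lowerBoundDuration",
    "mem.step", "mem.limit",
    "stepAuto", "evade.OOM",
}


def _fmt(value) -> str:
    """Render booleans as true/false; everything else via str()."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_migrate_policy(overrides: dict, enable: bool = True) -> str:
    """Hypothetical helper: build a --migratePolicy string from overrides."""
    unknown = set(overrides) - KNOWN_PARAMS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    if overrides.get("stepAuto") is True and {"cpu.step", "mem.step"} & set(overrides):
        raise ValueError("do not combine stepAuto with cpu.step or mem.step")
    parts = [f"enable={_fmt(enable)}"]
    parts += [f"{key}={_fmt(value)}" for key, value in overrides.items()]
    return "[" + ",".join(parts) + "]"


policy = build_migrate_policy({"mem.upperBoundRatio": 60})
# Same options as the float submit example in step 3.
argv = ["float", "submit", "-i", "python", "-j", "./python_job_script.sh",
        "-c", "4", "-m", "8", "--migratePolicy", policy]
print(" ".join(argv))
```

Keeping the policy as a single element of an argument list (for example, when passing argv to subprocess.run) prevents a shell from treating the square brackets as a glob pattern.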