Handling Job Failures

The error policy determines how job failures are handled.

Feature Description

If a job fails (or fails to complete and times out), the OpCenter's default behavior is to reclaim all cloud resources, including any local storage volumes (such as EBS volumes in AWS). This means that any files that store configuration data or metadata are no longer available. For most applications this is the desired behavior, but for some applications, such as RStudio, recreating all the configuration data (for example, usernames and passwords) can be onerous.

To change the default behavior, you can submit the job with an error policy that specifies whether to retain the storage volumes and whether to restart the application.

Configuring Error Policy

You must specify the error policy using the float submit command from the CLI. The web interface does not support a configuration option for error policy.

Submit the job from the CLI as follows.

float submit CMDS_AND_OPTIONS --errPolicy ERROR_POLICY

Replace:

CMDS_AND_OPTIONS with the required and optional subcommands and flags to submit a job, for example, image name, job script, and instance type.
ERROR_POLICY with one of the following.
- reclaimAll: If the job fails to complete, reclaim the virtual machine and all local storage volumes (the default).
- retainVolumes: If the job fails to complete, reclaim the virtual machine and retain all local storage volumes. You must manually resume the job with float resume -j JOB_ID (or cancel the job). When the job resumes, the retained local storage volumes are mounted on a new virtual machine instance.
- restart: If the job fails to complete, retain the virtual machine and all local storage volumes. The OpCenter removes any snapshot files and restarts the job on the existing virtual machine with the local storage volumes in the state they were in when the job failed.
  Warning: With the "restart" policy, a job that keeps failing may restart repeatedly without stopping unless the application limits the number of restarts.
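
One way an application could limit the number of restarts is to keep a run counter on the retained data volume. The following is a minimal sketch, not part of the float CLI: the function name run_allowed, the counter file, and the limit are all hypothetical. It also assumes that a job script exiting with status 0 is treated as a normal completion rather than a failure.

```shell
#!/bin/bash
# Hypothetical guard; the counter file location and the limit are assumptions.
# Usage: run_allowed COUNTER_FILE MAX_RUNS
# Records one more run in COUNTER_FILE and succeeds only while the
# recorded total is at most MAX_RUNS.
run_allowed() {
    local counter_file="$1" max_runs="$2" runs=0
    if [ -f "$counter_file" ]; then
        runs=$(cat "$counter_file")
    fi
    runs=$((runs + 1))
    echo "$runs" > "$counter_file"
    [ "$runs" -le "$max_runs" ]
}
```

A job script could call this before doing any work, for example run_allowed /data/.run_count 3 || exit 0. Because the "restart" policy keeps the local storage volumes in the state they were in when the job failed, a counter stored under /data would persist across restarts.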

Modifying Error Policy

You can modify the error policy associated with a running job by using the following command.

float modify -j JOB_ID --errPolicy ERROR_POLICY

Replace:

JOB_ID with the identifier of the job whose error policy you want to change.
ERROR_POLICY with the new error policy to apply to the job.
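
Because only three policy names are documented here, a wrapper script can check the value before passing it to float. This is a small sketch under that assumption; the function name is_valid_err_policy is hypothetical and not part of the float CLI.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only for the three policy names listed above.
is_valid_err_policy() {
    case "$1" in
        reclaimAll|retainVolumes|restart) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_valid_err_policy "$POLICY" && float modify -j "$JOB_ID" --errPolicy "$POLICY" would run the modify command only when POLICY holds one of the listed values.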

Example of retainVolumes policy

To demonstrate how the --errPolicy retainVolumes policy works, you can simulate an error that forces a job from the "Executing" state to the "Stopped" state. In the example shown, a shell script counts numbers starting from 1.

Submit a job with the error policy set to retainVolumes.

float submit -i quay.io/centos/centos:stream9 -j progretrains3.sh -c2 -m 4 --dataVolume [size=10]:/data --errPolicy retainVolumes -n retain

The job starts normally.

float squeue
+-----------------------+---------------------+------------------------------------+-------+-----------------+...
|          ID           |        NAME         |            WORKING HOST            | USER  |     STATUS      | 
+-----------------------+---------------------+------------------------------------+-------+-----------------+...
| ce2xmrgiife4nf68c50m3 | retain              | 54.162.12.226 (2Core4GB/OnDemand)  | admin | Executing   ...

float hosts
+---------------------+-----------+-------------------------+----------+----------------+...
|       ENTITY        |  STATUS   |      INSTANCETYPE       | PAYTYPE  |    PUBLICIP    |
+---------------------+-----------+-------------------------+----------+----------------+...
| i-0ad134ebc35da8351 | normal    | t3.medium(2 vCPU, 4 GB) | OnDemand | 54.162.12.226  |...

The shell script writes output to a file on the local EBS volume (/data).

this is a test of retainVolumes
1 is NOT a multiple of three
2 is NOT a multiple of three
3 is a multiple of three
...
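
The job script itself is not reproduced in this example. The following sketch shows how a script could produce counting output of this form. It is a hypothetical stand-in, not the original progretrains3.sh: the function name count_multiples, its arguments, and the output file path are assumptions.

```shell
#!/bin/bash
# Hypothetical stand-in for the job script; the original progretrains3.sh is not shown.
# Usage: count_multiples OUT_FILE [MAX_COUNT] [DELAY_SECONDS]
# A MAX_COUNT of 0 (the default) counts until the job is stopped.
count_multiples() {
    local out_file="$1" max_count="${2:-0}" delay="${3:-1}" i=1
    while [ "$max_count" -eq 0 ] || [ "$i" -le "$max_count" ]; do
        # Append, so that output from a resumed run lands in the same file.
        if [ $((i % 3)) -eq 0 ]; then
            echo "$i is a multiple of three" >> "$out_file"
        else
            echo "$i is NOT a multiple of three" >> "$out_file"
        fi
        i=$((i + 1))
        sleep "$delay"
    done
}
```

A job script would call it with a path on the data volume, for example count_multiples /data/output.txt (the file name is illustrative). Because the function appends to the file and always starts counting at 1, a resumed run adds a fresh sequence after the lines written before the failure.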

The simulated error forces the job into the "Stopped" state. The virtual machine is reclaimed.

float squeue
+-----------------------+---------------------+------------------------------------+-------+-----------------+...
|          ID           |        NAME         |            WORKING HOST            | USER  |     STATUS      | 
+-----------------------+---------------------+------------------------------------+-------+-----------------+...
| ce2xmrgiife4nf68c50m3 | retain              | 54.162.12.226 (2Core4GB/OnDemand)  | admin | Stopped   ...

float hosts
+---------------------+-----------+-------------------------+----------+----------------+...
|       ENTITY        |  STATUS   |      INSTANCETYPE       | PAYTYPE  |    PUBLICIP    |
+---------------------+-----------+-------------------------+----------+----------------+...
| i-0ad134ebc35da8351 | reclaimed | t3.medium(2 vCPU, 4 GB) | OnDemand | 54.162.12.226  |...

The shell script stops writing output.

...
98 is NOT a multiple of three
99 is a multiple of three
100 is NOT a multiple of three
Terminated

Resume the job with the following command:

float resume -j ce2xmrgiife4nf68c50m3
Resume request has been submitted. Please check job status to see resuming progress.
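
To check the job status from a script rather than by eye, the STATUS column can be pulled out of the float squeue table. The sketch below assumes the column layout shown in this example (ID in the first column, STATUS in the fifth); the function name job_status is hypothetical, and the parsing would need adjusting if the table layout differs.

```shell
#!/bin/bash
# Hypothetical helper: reads a float squeue table on stdin and prints the
# STATUS value of the row whose ID matches the first argument.
# Assumes the layout shown in this example: | ID | NAME | WORKING HOST | USER | STATUS | ...
job_status() {
    local job_id="$1"
    awk -F'|' -v id="$job_id" '
        {
            row_id = $2; status = $6
            gsub(/^[ \t]+|[ \t]+$/, "", row_id)   # trim padding around the ID
            gsub(/^[ \t]+|[ \t]+$/, "", status)   # trim padding around the status
            if (row_id == id) print status
        }'
}
```

For example, float squeue | job_status ce2xmrgiife4nf68c50m3 would print the status of the resumed job. Border and header rows are skipped because their first column never matches a job ID.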

The job resumes on a new virtual machine.

float squeue
+-----------------------+---------------------+------------------------------------+-------+-----------------+...
|          ID           |        NAME         |            WORKING HOST            | USER  |     STATUS      | 
+-----------------------+---------------------+------------------------------------+-------+-----------------+...
| ce2xmrgiife4nf68c50m3 | retain              | 54.243.14.97 (2Core4GB/OnDemand)   | admin | Executing   ...

float hosts
+---------------------+-----------+-------------------------+----------+----------------+...
|       ENTITY        |  STATUS   |      INSTANCETYPE       | PAYTYPE  |    PUBLICIP    |
+---------------------+-----------+-------------------------+----------+----------------+...
| i-0ac9f4e01c82a03da | normal    | t3.medium(2 vCPU, 4 GB) | OnDemand | 54.243.14.97   |...

The shell script starts from the beginning and resumes writing output to the same file (which confirms that the EBS volume attached to the original virtual machine is remounted on the new virtual machine).
```
97 is NOT a multiple of three
98 is NOT a multiple of three
99 is a multiple of three
100 is NOT a multiple of three
1 is NOT a multiple of three
2 is NOT a multiple of three
3 is a multiple of three
...
```

Example of restart policy

To demonstrate how the --errPolicy restart policy works, you can insert an exit 1 statement in your code to simulate a job failure. In the example shown, a simple R program fails and restarts.

Submit a job with the error policy set to restart.

float submit -i tidyverse -j run_error.sh -c2 -m 4 --dataVolume [size=10]:/data --errPolicy restart -n restartexit1

The job starts normally.

float squeue
+-----------------------+---------------------+------------------------------------+-------+-----------------+...
|          ID           |        NAME         |            WORKING HOST            | USER  |     STATUS      | 
+-----------------------+---------------------+------------------------------------+-------+-----------------+...
| 5x53vth2ovo1n3n1r66db | restartexit1        | 54.152.156.41 (2Core4GB/OnDemand)  | admin | Executing       |...

float hosts
+---------------------+-----------+-------------------------+----------+----------------+-----------------------...
|       ENTITY        |  STATUS   |      INSTANCETYPE       | PAYTYPE  |    PUBLICIP    |    TAGS
+---------------------+-----------+-------------------------+----------+----------------+------------------------...
| i-0c3e15e4f442ffd56 | normal    | t3.medium(2 vCPU, 4 GB) |   Spot   | 54.152.156.41  |MMCE-JOB-ID:5x53vth2ovo1 ... 

The R program generates output.

Downloaded R script to /data/genericr, ready to test
[1] "Ready to write to log file"
[1] "1 starting"
[1] "Tue Dec 26 07:15:50 PM 2023"
[1] "1 is not 3 times"
[1] "Tue Dec 26 07:15:53 PM 2023"
[1] "2 is not 3 times"
[1] "Tue Dec 26 07:15:56 PM 2023"
[1] "3 is 3 times"
    ...

The exit 1 command causes the job to fail and restart on the same host with the same local EBS volume (/data).

[1] "100 is not 3 times"
Cause the job to exit with error code 1
Downloaded R script to /data/genericr, ready to test
[1] "Ready to write to log file"
[1] "1 starting"
[1] "Tue Dec 26 06:08:21 PM 2023"
[1] "1 is not 3 times"
[1] "Tue Dec 26 06:08:24 PM 2023"
[1] "2 is not 3 times"
[1] "Tue Dec 26 06:08:27 PM 2023"
[1] "3 is 3 times"
...

The job continues to run on the same virtual machine.

float hosts
+---------------------+-----------+-------------------------+----------+----------------+-----------------------...
|       ENTITY        |  STATUS   |      INSTANCETYPE       | PAYTYPE  |    PUBLICIP    |    TAGS
+---------------------+-----------+-------------------------+----------+----------------+------------------------...
| i-0c3e15e4f442ffd56 | normal    | t3.medium(2 vCPU, 4 GB) |   Spot   | 54.152.156.41  |MMCE-JOB-ID:5x53vth2ovo1 ... 