Multiphase Support

Multiphase support is provided by two additional command arguments to nc run:
-multiphase [1|0] Enables multiphase jobs.
-mpres "resource string" Sets the resources that will be used for each phase.
The '%' character is used as a delimiter between the resources of each phase; for example, -mpres "linux64 foo%linux64 bar%linux64 baz".
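As a minimal sketch of how such a string might be assembled in a submission script, the helper below joins one resource list per phase with '%'. The function name join_phases and the job script path ./myjob.sh are hypothetical; only -multiphase and -mpres come from the description above. The nc command is printed rather than executed.

```shell
# join_phases (hypothetical helper): join one resource list per phase with '%'.
join_phases() {
    joined=""
    first=1
    for phase_resources in "$@"; do
        if [ "$first" -eq 1 ]; then
            joined=$phase_resources
            first=0
        else
            joined="$joined%$phase_resources"
        fi
    done
    printf '%s\n' "$joined"
}

MPRES=$(join_phases "linux64 foo" "linux64 bar" "linux64 baz")

# Dry run: print the submission command instead of executing it.
# ./myjob.sh is a placeholder for the real job script.
echo "nc run -multiphase 1 -mpres \"$MPRES\" -- ./myjob.sh"
```

Keeping each phase as a separate argument avoids hand-editing one long string when a phase is added or removed.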

In addition, the autoRescheduleCount server configuration parameter must be set to the maximum number of job phases or higher. The default is 4, so the parameter only needs to be raised for jobs with 5 or more phases.
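The check can be sketched before submission by counting the phases in an -mpres string. This is an illustration under assumptions: count_phases is a hypothetical helper, and AUTO_RESCHEDULE_COUNT is a local variable that you would set by hand to match your server's autoRescheduleCount; the script does not query the server.

```shell
# count_phases (hypothetical helper): number of phases in an -mpres string,
# that is, the number of '%' delimiters plus one.
count_phases() {
    delimiters=$(printf '%s' "$1" | tr -cd '%' | wc -c)
    echo $((delimiters + 1))
}

# Assumed value: set this to the autoRescheduleCount configured on your server.
AUTO_RESCHEDULE_COUNT=4

MPRES="linux64 a%linux64 b%linux64 c%linux64 d%linux64 e"
PHASES=$(count_phases "$MPRES")

if [ "$PHASES" -gt "$AUTO_RESCHEDULE_COUNT" ]; then
    echo "job has $PHASES phases: raise autoRescheduleCount to at least $PHASES"
else
    echo "job has $PHASES phases: current autoRescheduleCount is sufficient"
fi
```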

Specifying Resources

By specifying the resources of each phase and designating that certain resources are only allocated to certain taskers, you can run different phases of a job on different taskers. The following options can be set with the nc run command:
-mpres RESLIST
Specifies the resources required by a multiphase job. RESLIST contains the resource lists for all phases of the job, with % characters delimiting the phases. Within a phase, resources are separated by spaces.
Example: -mpres "RAM/200 CORES/2%RAM/20 CORES/3"
-mpres+ <rsrc>
Appends one resource to the multiphase resource list. This option must follow the "-mpres" or "-mpres<n>" option; otherwise, the resources specified in "-mpres+" will be overwritten.
-mpres1 RESLIST, -mpres2 RESLIST, -mpres<n> RESLIST
Specifies the resources for phase <n> of a multiphase job. The number <n> is in the range from 1 to 9.
Example:
-mpres1 "RAM/200"
-mpres2 "CORES/4 RAM/10"
For example, suppose you have two taskers named tasker1 and tasker2, and you want to run phases 1 and 3 on tasker1 and phase 2 on tasker2. Your resource map might look like:
vtk_resourcemap_set License:blue UNLIMITED License:blue_tasker1 
vtk_resourcemap_set License:red UNLIMITED License:red_tasker2 
vtk_resourcemap_set License:blue_tasker1 1 tasker1 
vtk_resourcemap_set License:red_tasker2 1 tasker2
You could then run a multiphase job as:
nc run -multiphase 1 -mpres "linux64 License:blue%linux64 License:red%linux64 License:blue" -- -e BASE -D /home/jjmcwill/testDir/testMultistage.sh 
A multiphase job will have two new Job Properties set:
MPRESOURCES Contains the same resources passed in -mpres, and is used to reset the job resources for each phase.
MPCURRENTPHASE Contains an integer indicating the current job phase. It starts at one, and has a max value of 9.
The running job script will see an environment variable named VOV_JOB_PHASE which is set to the current phase. The script writer will need to use that to decide what work to do for that phase.
  • If the script exits with an exit code of 216, Accelerator will increment the job phase, change the job resources, and reschedule the job to run again.
  • If the script exits with an exit code of 0, the job is considered "Done", and MPCURRENTPHASE is reset to 1.
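A job script following this protocol might be structured as in the sketch below, for a three-phase job. The function name run_current_phase and the work done in each phase are hypothetical; only VOV_JOB_PHASE and the exit codes 216 and 0 come from the description above. Defaulting an unset VOV_JOB_PHASE to 1 is an assumption made so the script can be tried outside Accelerator.

```shell
# run_current_phase (hypothetical): do the work for the phase named in
# VOV_JOB_PHASE and return the code the job script should exit with.
run_current_phase() {
    case "${VOV_JOB_PHASE:-1}" in
        1)
            echo "phase 1: prepare inputs"
            return 216   # ask Accelerator to advance to phase 2
            ;;
        2)
            echo "phase 2: main computation"
            return 216   # ask Accelerator to advance to phase 3
            ;;
        3)
            echo "phase 3: collect results"
            return 0     # last phase: the job is Done
            ;;
        *)
            echo "unexpected phase: ${VOV_JOB_PHASE}" >&2
            return 1     # any code other than 0 or 216 fails the job
            ;;
    esac
}

# In a real job script, the last line would be:
#   run_current_phase; exit $?
```

Keeping the phase dispatch in one function makes it easy to test each phase locally by setting VOV_JOB_PHASE by hand.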

Failed Jobs

If a job exits during a phase with a code other than 0 or 216, it is considered FAILED and MPCURRENTPHASE will not increment. If the job is invalidated and re-run (for example, with nc rerun -f JOBID), it will re-run starting at MPCURRENTPHASE, and further phases will run if the job exits with code 216, as described above.
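The exit-code rules can be summarized as a small model. This is only an illustration of the behavior described in this section, not Accelerator code: next_state is a hypothetical function, and the RESCHEDULED label is invented for the example.

```shell
# next_state PHASE EXIT_CODE (hypothetical model, not Accelerator code):
# prints the resulting job status and the value MPCURRENTPHASE would hold.
next_state() {
    phase=$1
    code=$2
    case "$code" in
        216) echo "RESCHEDULED $((phase + 1))" ;;  # advance to the next phase
        0)   echo "DONE 1" ;;                      # finished; phase resets to 1
        *)   echo "FAILED $phase" ;;               # phase is kept for a re-run
    esac
}

next_state 1 216   # prints: RESCHEDULED 2
next_state 2 1     # prints: FAILED 2
next_state 2 216   # re-run of phase 2 succeeds, prints: RESCHEDULED 3
next_state 3 0     # prints: DONE 1
```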

Logging

After the first phase is run, subsequent phases of the job will have the command rewritten so that the wrappers are passed -a -A, telling the wrappers to append to the job log. This ensures that all phases of the job log their stdout and stderr to the same file. Otherwise, each phase of the job would overwrite the log, and you would only see the output from the last phase that was run.

If Accelerator does not detect one of the standard vov wrappers at the beginning of the command line, it will assume the command is not using a wrapper. In this case, it will look for the standard > redirect symbol in the command and replace it with >>.
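The difference between the two redirect forms can be seen with plain shell redirection. This sketch only illustrates why appending matters across phases; it does not reproduce how Accelerator rewrites the command, and the temporary file stands in for the job log.

```shell
# Illustration only: why later phases must append to the log.
LOG=$(mktemp)

echo "phase 1 output" >  "$LOG"   # first phase creates or truncates the log
echo "phase 2 output" >> "$LOG"   # later phase appends, keeping phase 1 output
APPENDED_LINES=$(grep -c "" "$LOG")

echo "phase 1 output" > "$LOG"
echo "phase 2 output" > "$LOG"    # overwriting loses the phase 1 output
OVERWRITTEN_LINES=$(grep -c "" "$LOG")

echo "append: $APPENDED_LINES lines, overwrite: $OVERWRITTEN_LINES line"
rm -f "$LOG"
```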

REST Support

In the payload for submitting a job via REST, two new fields are allowed: multiphase and mpres. Setting multiphase = True enables multiphase job support. The mpres field behaves the same as the -mpres command line argument described above. Re-running a failed multiphase job via the REST re-run API behaves the same as re-running a failed multiphase job from the command line: the job restarts at the phase stored in MPCURRENTPHASE.