NVIDIA™ GPUs Support in Accelerator

Start a "vovtaskerroot" in each machine with one consumable hardware resource for each of the GPUs on that machine. Call these resources GPU:Tesla<N> where N is an index starting from 0.

# In taskers.tcl
set res4gpus "GPU:Tesla0/1 GPU:Tesla1/1 GPU:Tesla2/1 GPU:Tesla3/1"
vtk_tasker_define nv001 -resources $res4gpus
vtk_tasker_define nv002 -resources $res4gpus
# ... 
vtk_tasker_define nvXXX -resources $res4gpus

Define job resources of type G:Tesla<J> where J is the number of GPUs that are requested by the job. For each J you need to define the maps from the job resource to the HW resources of type GPU:TeslaN.

For example, G:Tesla1 is easy and it needs to map to the OR of any of the available GPUs, while G:Tesla4 is also easy because it needs to map to the AND of all 4 devices.

# In resources.tcl
set mapOR  "GPU:Tesla0/1 OR GPU:Tesla1/1 OR GPU:Tesla2/1 OR GPU:Tesla3/1"
set mapAND "GPU:Tesla0/1    GPU:Tesla1/1    GPU:Tesla2/1    GPU:Tesla3/1"
vtk_resourcemap_set G:Tesla1 -max unlimited -map $mapOR 
# ...
vtk_resourcemap_set G:Tesla4 -max unlimited -map $mapAND

The other maps for G:Tesla2 and G:Tesla3 are more complex, and for larger values of N it is not feasible to use all possible combinations of devices. To help compute those maps, and to reduce the number of combinations to a workable subset, we provide a procedure called findCombinations in $VOVDIR/scripts/hero/nvidia/hero_nvidia_resources.tcl. Feel free to copy that file into your resources.tcl file.

proc findCombinations { list n } {
    #
    # Recursive procedure to find all combinations of 'n' elements from 'list'. 
    # 
    set result {}
    set l [llength $list]
    if { $l >= $n } {
        set inc 1
        for { set i 0 } { $i < $l } { incr i $inc } {
            set elem      [lindex $list $i] 
            set subList   [lreplace $list 0 $i]
            set subCombos {}
            if { $n > 1 } {
                set subCombos [findCombinations $subList [expr $n-1]]
                if { $l > 2 && [llength $subCombos] > 1 } {
                    ### When the combinations are too-many, use only the first combo.
                    set subCombos [lrange $subCombos 0 0]
                }
                foreach subCombo $subCombos {
                    lappend result [concat $elem $subCombo]
                }
            } else {
                lappend result $elem
            }
        }
    }
    return $result
}

## Maximum number of GPUS in any of the farm machines.
## Call the GPUS  "GPU:Tesla0 ... GPU:Tesla3 ..."
set MAX  8
set GPUS {}
for { set i 0 } { $i < $MAX } { incr i } {
    lappend GPUS "GPU:Tesla$i/1"
}

# A resource to count how many GPUs are in use.
vtk_resourcemap_set G:TeslaNum -total unlimited 

for { set i 1 } { $i <= $MAX } { incr i } {
    set options [findCombinations $GPUS $i ]
    set optionsWithParentheses ""
    set sep ""
    foreach opt $options {
        if { [llength $opt] > 1 } { 
            append optionsWithParentheses "$sep ( $opt )"
        } else {
            append optionsWithParentheses "$sep $opt"
        }
        set sep " OR"
    }
    vtk_resourcemap_set G:Tesla$i -total unlimited -map "G:TeslaNum#$i ( $optionsWithParentheses )"
}

Submit your workload using the wrapper vovgpu which is a script that interprets the "SOLUTION" computed by the scheduler and passes the selected list of devices to the application via the environment variable VOV_GPUSET or with the macro @GPUSET@. Note that VOV_GPUSET gives you a space-separated list of devices, while @GPUSET@ gives you a comma-separated list.

With this setup, you can submit arbitrary workloads which request any number of GPUs.

% nc run -r G:Tesla1 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla2 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla3 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla4 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2

If you are new to Accelerator, it is worth remembering that you can get project tracking if you use the -jobproj option. For FairShare, use the -g and -sg options in nc run, as in this example:

% nc run -r G:Tesla4  -jobproj MachineLearnAboutCats  -g /bu/ai/rd   vovgpu my_ml_app