NVIDIA™ GPU Support in Accelerator

If you have machines with multiple GPUs, you can harness the power of those devices by following these guidelines, which have been tested with up to 8 GPUs per machine.

  1. Start a "vovtaskerroot" on each machine with one consumable hardware resource for each of the GPUs on that machine. Call these resources GPU:Tesla<N>, where N is an index starting from 0.
    # In taskers.tcl
    set res4gpus "GPU:Tesla0/1 GPU:Tesla1/1 GPU:Tesla2/1 GPU:Tesla3/1"
    vtk_tasker_define nv001 -resources $res4gpus
    vtk_tasker_define nv002 -resources $res4gpus
    # ... 
    vtk_tasker_define nvXXX -resources $res4gpus
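    Machines do not all need to offer the same number of GPUs; each tasker should declare only the devices its host actually has. A sketch, using a hypothetical two-GPU host nv003:

    ```tcl
    # In taskers.tcl: a machine with only two GPUs (hypothetical host name).
    set res2gpus "GPU:Tesla0/1 GPU:Tesla1/1"
    vtk_tasker_define nv003 -resources $res2gpus
    ```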

  2. Define job resources of type G:Tesla<J>, where J is the number of GPUs requested by the job. For each value of J you need to define a map from the job resource to the hardware resources of type GPU:Tesla<N>.
    For example, G:Tesla1 is easy: it maps to the OR of any of the available GPUs. G:Tesla4 is also easy, because it maps to the AND of all 4 devices.
    # In resources.tcl
    set mapOR  "GPU:Tesla0/1 OR GPU:Tesla1/1 OR GPU:Tesla2/1 OR GPU:Tesla3/1"
    set mapAND "GPU:Tesla0/1    GPU:Tesla1/1    GPU:Tesla2/1    GPU:Tesla3/1"
    vtk_resourcemap_set G:Tesla1 -max unlimited -map $mapOR 
    # ...
    vtk_resourcemap_set G:Tesla4 -max unlimited -map $mapAND 
    The other maps, for G:Tesla2 and G:Tesla3, are more complex, and for larger values of N it is not feasible to enumerate all possible combinations of devices. To compute those maps, and to reduce the number of combinations to a workable subset, we provide a procedure called findCombinations in $VOVDIR/scripts/hero/nvidia/hero_nvidia_resources.tcl. Feel free to copy the contents of that file into your resources.tcl file.
    proc findCombinations { list n } {
        #
        # Recursive procedure to find combinations of 'n' elements from 'list'.
        # To keep the result to a workable size, it prunes the recursion and
        # returns only a reduced subset of all possible combinations.
        #
        set result {}
        set l [llength $list]
        if { $l >= $n } {
            for { set i 0 } { $i < $l } { incr i } {
                set elem      [lindex $list $i]
                set subList   [lreplace $list 0 $i]
                set subCombos {}
                if { $n > 1 } {
                    set subCombos [findCombinations $subList [expr {$n-1}]]
                    if { $l > 2 && [llength $subCombos] > 1 } {
                        ### When there are too many combinations, use only the first one.
                        set subCombos [lrange $subCombos 0 0]
                    }
                    foreach subCombo $subCombos {
                        lappend result [concat $elem $subCombo]
                    }
                } else {
                    lappend result $elem
                }
            }
        }
        return $result
    }
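    To see which subsets the pruning keeps, it helps to call the procedure on a small, hypothetical list. With four elements and n = 2 it returns only the three "adjacent" pairs instead of all six combinations:

    ```tcl
    # Quick check in any Tcl shell, after defining findCombinations above.
    puts [findCombinations {A B C D} 2]
    # prints: {A B} {B C} {C D}
    ```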
    
    ## Maximum number of GPUs in any of the farm machines.
    ## Call the GPUs "GPU:Tesla0" ... "GPU:Tesla<MAX-1>".
    set MAX  8
    set GPUS {}
    for { set i 0 } { $i < $MAX } { incr i } {
        lappend GPUS "GPU:Tesla$i/1"
    }
    
    # A resource to count how many GPUs are in use.
    vtk_resourcemap_set G:TeslaNum -total unlimited 
    
    for { set i 1 } { $i <= $MAX } { incr i } {
        set options [findCombinations $GPUS $i ]
        set optionsWithParentheses ""
        set sep ""
        foreach opt $options {
            if { [llength $opt] > 1 } { 
                append optionsWithParentheses "$sep ( $opt )"
            } else {
                append optionsWithParentheses "$sep $opt"
            }
            set sep " OR"
        }
        vtk_resourcemap_set G:Tesla$i -total unlimited -map "G:TeslaNum#$i ( $optionsWithParentheses )"
    }
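    As a concrete illustration (assuming MAX were set to 4 instead of 8), the loop above would issue a G:Tesla2 definition equivalent to the following; the map offers only the "adjacent" pairs kept by findCombinations:

    ```tcl
    # Sketch of the string generated for i = 2 when MAX is 4; you do not
    # need to type this, the loop above produces it.
    vtk_resourcemap_set G:Tesla2 -total unlimited -map \
        "G:TeslaNum#2 ( ( GPU:Tesla0/1 GPU:Tesla1/1 ) OR ( GPU:Tesla1/1 GPU:Tesla2/1 ) OR ( GPU:Tesla2/1 GPU:Tesla3/1 ) )"
    ```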
  3. Submit your workload using the wrapper vovgpu, a script that interprets the "SOLUTION" computed by the scheduler and passes the selected list of devices to the application via the environment variable VOV_GPUSET or the macro @GPUSET@. Note that VOV_GPUSET gives you a space-separated list of devices, while @GPUSET@ gives you a comma-separated list.
With this setup, you can submit arbitrary workloads that request any number of GPUs.
% nc run -r G:Tesla1 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla2 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla3 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla4 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2 
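Inside the job, the wrapper-provided variable can be consumed like any environment variable. A minimal sketch of a job script, assuming vovgpu exports VOV_GPUSET as described above; the stand-in device list and its exact name format are assumptions for illustration:

```shell
# Hypothetical job script, e.g. launched as:
#   % nc run -r G:Tesla2 -- vovgpu sh ./my_gpu_job.sh
# Use a stand-in value when not running under vovgpu (illustration only;
# the real device-name format may differ on your site).
VOV_GPUSET=${VOV_GPUSET:-"Tesla0 Tesla1"}

# VOV_GPUSET is space-separated, so a plain shell loop iterates over it.
for dev in $VOV_GPUSET; do
    echo "job will use device: $dev"
done
```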
If you are new to Accelerator, it is worth remembering that you can get project tracking with the -jobproj option. For FairShare, use the -g and -sg options of nc run, as in this example:
% nc run -r G:Tesla4 -jobproj MachineLearnAboutCats -g /bu/ai/rd -- vovgpu my_ml_app