NVIDIA™ GPU Support in Accelerator
If you have machines with multiple GPUs, you can harness the power of those devices by following these guidelines, which have been tested with up to 8 GPUs per machine.
- Start a "vovtaskerroot" in each machine with one consumable hardware resource for each of the GPUs on that machine. Call these resources GPU:Tesla<N>, where N is an index starting from 0.

    # In taskers.tcl
    set res4gpus "GPU:Tesla0/1 GPU:Tesla1/1 GPU:Tesla2/1 GPU:Tesla3/1"
    vtk_tasker_define nv001 -resources $res4gpus
    vtk_tasker_define nv002 -resources $res4gpus
    # ...
    vtk_tasker_define nvXXX -resources $res4gpus
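If your machines do not all have the same number of GPUs, it may help to generate the resource string instead of typing it. The following shell sketch is not part of Accelerator: the function name gpu_resources is made up for illustration, and it only prints a string in the GPU:Tesla<N>/1 form shown above for a given device count.

```shell
# Hypothetical helper (not shipped with Accelerator): print one
# "GPU:Tesla<N>/1" entry per device for a machine with the given GPU count.
gpu_resources() {
    count=$1
    out=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Append the next entry, separated by a space after the first one.
        out="${out:+$out }GPU:Tesla$i/1"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# Example: a machine with 4 GPUs.
# On a host with the NVIDIA driver installed, the count can typically be
# obtained with: nvidia-smi -L | wc -l   (assumption: one line per device)
gpu_resources 4
# prints: GPU:Tesla0/1 GPU:Tesla1/1 GPU:Tesla2/1 GPU:Tesla3/1
```

The printed string can then be pasted into the -resources argument for the corresponding tasker.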
- Define job resources of type G:Tesla<J>, where J is the number of GPUs that are requested by the job. For each J, you need to define the map from the job resource to the HW resources of type GPU:Tesla<N>. For example, G:Tesla1 is easy: it needs to map to the OR of all the available GPUs. G:Tesla4 is also easy, because it needs to map to the AND of all 4 devices.

    # In resources.tcl
    set mapOR "GPU:Tesla0/1 OR GPU:Tesla1/1 OR GPU:Tesla2/1 OR GPU:Tesla3/1"
    set mapAND "GPU:Tesla0/1 GPU:Tesla1/1 GPU:Tesla2/1 GPU:Tesla3/1"
    vtk_resourcemap_set G:Tesla1 -max unlimited -map $mapOR
    # ...
    vtk_resourcemap_set G:Tesla4 -max unlimited -map $mapAND
The other maps for G:Tesla2 and G:Tesla3 are more complex, and for larger values of N it is not feasible to use all possible combinations of devices. To help compute those maps, and to reduce the number of combinations to a workable subset, we provide a procedure called findCombinations in $VOVDIR/scripts/hero/nvidia/hero_nvidia_resources.tcl. Feel free to copy that file into your resources.tcl file.

    proc findCombinations { list n } {
        #
        # Recursive procedure to find all combinations of 'n' elements from 'list'.
        #
        set result {}
        set l [llength $list]
        if { $l >= $n } {
            set inc 1
            for { set i 0 } { $i < $l } { incr i $inc } {
                set elem [lindex $list $i]
                set subList [lreplace $list 0 $i]
                set subCombos {}
                if { $n > 1 } {
                    set subCombos [findCombinations $subList [expr $n-1]]
                    if { $l > 2 && [llength $subCombos] > 1 } {
                        ### When the combinations are too-many, use only the first combo.
                        set subCombos [lrange $subCombos 0 0]
                    }
                    foreach subCombo $subCombos {
                        lappend result [concat $elem $subCombo]
                    }
                } else {
                    lappend result $elem
                }
            }
        }
        return $result
    }

    ## Maximum number of GPUS in any of the farm machines.
    ## Call the GPUS "GPU:Tesla0 ... GPU:Tesla3 ..."
    set MAX 8
    set GPUS {}
    for { set i 0 } { $i < $MAX } { incr i } {
        lappend GPUS "GPU:Tesla$i/1"
    }

    # A resource to count how many GPUs are in use.
    vtk_resourcemap_set G:TeslaNum -total unlimited

    for { set i 1 } { $i <= $MAX } { incr i } {
        set options [findCombinations $GPUS $i ]
        set optionsWithParentheses ""
        set sep ""
        foreach opt $options {
            if { [llength $opt] > 1 } {
                append optionsWithParentheses "$sep ( $opt )"
            } else {
                append optionsWithParentheses "$sep $opt"
            }
            set sep " OR"
        }
        vtk_resourcemap_set G:Tesla$i -total unlimited -map "G:TeslaNum#$i ( $optionsWithParentheses )"
    }
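To get a feel for what the reduced subset looks like: tracing findCombinations by hand on a four-element list suggests that the pruning step keeps only runs of adjacent devices (for 2 out of 4 devices: 0-1, 1-2 and 2-3). The shell sketch below is not the shipped procedure; it simply prints that adjacent-runs pattern so you can preview the alternatives. The function name window_map is hypothetical, and the whitespace differs slightly from what the Tcl loop produces.

```shell
# Hypothetical illustration (not part of Accelerator): OR together every run
# of J adjacent devices out of N, using the GPU:Tesla<N>/1 naming convention.
window_map() {
    n=$1   # number of GPUs on the machine
    j=$2   # number of GPUs requested by the job
    map=""
    start=0
    while [ $((start + j)) -le "$n" ]; do
        # Build one group of J consecutive devices starting at index 'start'.
        group=""
        k=$start
        while [ "$k" -lt $((start + j)) ]; do
            group="${group:+$group }GPU:Tesla$k/1"
            k=$((k + 1))
        done
        # Multi-device groups are wrapped in parentheses (an AND of devices).
        if [ "$j" -gt 1 ]; then
            group="( $group )"
        fi
        map="${map:+$map OR }$group"
        start=$((start + 1))
    done
    printf '%s\n' "$map"
}

window_map 4 2
# prints: ( GPU:Tesla0/1 GPU:Tesla1/1 ) OR ( GPU:Tesla1/1 GPU:Tesla2/1 ) OR ( GPU:Tesla2/1 GPU:Tesla3/1 )
```

With J equal to 1 the sketch prints the plain OR of all devices, and with J equal to N it prints a single group containing every device, which matches the two easy cases described earlier.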
- Submit your workload using the wrapper vovgpu, a script that interprets the "SOLUTION" computed by the scheduler and passes the selected list of devices to the application, either via the environment variable VOV_GPUSET or with the macro @GPUSET@. Note that VOV_GPUSET gives you a space-separated list of devices, while @GPUSET@ gives you a comma-separated list.
% nc run -r G:Tesla1 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla2 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla3 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla4 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
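If your application is started from its own script and reads the device list from the environment rather than from the command line, the script may need to convert the space-separated VOV_GPUSET into the comma-separated form. The following is a minimal sketch under that assumption. The function name gpuset_csv is hypothetical, and the value "0 2" is made up for the example; this page does not specify what the individual device entries look like.

```shell
# Hypothetical job-side helper (not part of vovgpu): convert a
# space-separated device list into a comma-separated one.
gpuset_csv() {
    # Unquoted on purpose: let the shell split the argument on whitespace.
    set -- $1
    old_ifs=$IFS
    IFS=,
    # "$*" joins the positional parameters with the first character of IFS.
    csv="$*"
    IFS=$old_ifs
    printf '%s\n' "$csv"
}

# Example with a made-up value; under vovgpu the variable is set for you.
VOV_GPUSET="0 2"
gpuset_csv "$VOV_GPUSET"
# prints: 0,2
```

Splitting on whitespace instead of replacing each space with a comma means that stray leading, trailing, or repeated spaces do not produce empty entries.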
If you are new to Accelerator, it is worth remembering that you can get project tracking if you use the -jobproj option. For FairShare, use the -g and -sg options in nc run, as in this example:
% nc run -r G:Tesla4 -jobproj MachineLearnAboutCats -g /bu/ai/rd vovgpu my_ml_app