NVIDIA™ GPUs Support in Accelerator
If you have machines with multiple GPUs, you can harness the power of those devices by following these guidelines, which have been tested with up to 8 GPUs per machine.
-
Start a "vovtaskerroot" in each machine with one consumable hardware resource for each of the GPUs on that machine. Call these resources GPU:Tesla<N>, where N is an index starting from 0.

    # In taskers.tcl
    set res4gpus "GPU:Tesla0/1 GPU:Tesla1/1 GPU:Tesla2/1 GPU:Tesla3/1"
    vtk_tasker_define nv001 -resources $res4gpus
    vtk_tasker_define nv002 -resources $res4gpus
    # ...
    vtk_tasker_define nvXXX -resources $res4gpus

-
Define job resources of type G:Tesla<J>, where J is the number of GPUs that are requested by the job. For each J you need to define the maps from the job resource to the HW resources of type GPU:TeslaN. For example, G:Tesla1 is easy and it needs to map to the OR of any of the available GPUs, while G:Tesla4 is also easy because it needs to map to the AND of all 4 devices.

    # In resources.tcl
    set mapOR  "GPU:Tesla0/1 OR GPU:Tesla1/1 OR GPU:Tesla2/1 OR GPU:Tesla3/1"
    set mapAND "GPU:Tesla0/1 GPU:Tesla1/1 GPU:Tesla2/1 GPU:Tesla3/1"
    vtk_resourcemap_set G:Tesla1 -max unlimited -map $mapOR
    # ...
    vtk_resourcemap_set G:Tesla4 -max unlimited -map $mapAND

The other maps for G:Tesla2 and G:Tesla3 are more complex, and for larger values of N it is not feasible to use all possible combinations of devices. To help compute those maps, and to reduce the number of combinations to a workable subset, we provide a procedure called findCombinations in $VOVDIR/scripts/hero/nvidia/hero_nvidia_resources.tcl. Feel free to copy that file into your resources.tcl file.

    proc findCombinations { list n } {
        #
        # Recursive procedure to find all combinations of 'n' elements from 'list'.
        #
        set result {}
        set l [llength $list]
        if { $l >= $n } {
            set inc 1
            for { set i 0 } { $i < $l } { incr i $inc } {
                set elem    [lindex $list $i]
                set subList [lreplace $list 0 $i]
                set subCombos {}
                if { $n > 1 } {
                    set subCombos [findCombinations $subList [expr $n-1]]
                    if { $l > 2 && [llength $subCombos] > 1 } {
                        ### When the combinations are too-many, use only the first combo.
                        set subCombos [lrange $subCombos 0 0]
                    }
                    foreach subCombo $subCombos {
                        lappend result [concat $elem $subCombo]
                    }
                } else {
                    lappend result $elem
                }
            }
        }
        return $result
    }

    ## Maximum number of GPUS in any of the farm machines.
    ## Call the GPUS "GPU:Tesla0 ... GPU:Tesla3 ..."
    set MAX 8
    set GPUS {}
    for { set i 0 } { $i < $MAX } { incr i } {
        lappend GPUS "GPU:Tesla$i/1"
    }

    # A resource to count how many GPUs are in use.
    vtk_resourcemap_set G:TeslaNum -total unlimited
    for { set i 1 } { $i <= $MAX } { incr i } {
        set options [findCombinations $GPUS $i ]
        set optionsWithParentheses ""
        set sep ""
        foreach opt $options {
            if { [llength $opt] > 1 } {
                append optionsWithParentheses "$sep ( $opt )"
            } else {
                append optionsWithParentheses "$sep $opt"
            }
            set sep " OR"
        }
        vtk_resourcemap_set G:Tesla$i -total unlimited -map "G:TeslaNum#$i ( $optionsWithParentheses )"
    }

-
Submit your workload using the wrapper vovgpu, which is a script that interprets the "SOLUTION" computed by the scheduler and passes the selected list of devices to the application via the environment variable VOV_GPUSET or with the macro @GPUSET@. Note that VOV_GPUSET gives you a space-separated list of devices, while @GPUSET@ gives you a comma-separated list.
% nc run -r G:Tesla1 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla2 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla3 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla4 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2

If you are new to Accelerator, it is worth remembering that you can get project tracking if you use the -jobproj option. For FairShare, use the -g and -sg options in nc run, as in this example:
% nc run -r G:Tesla4 -jobproj MachineLearnAboutCats -g /bu/ai/rd vovgpu my_ml_app
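
As noted earlier, VOV_GPUSET is space-separated while @GPUSET@ is comma-separated. A job script that reads the environment variable but needs the comma-separated form could convert it along the following lines. This is only a minimal sketch: the helper name gpuset_to_csv is hypothetical and not part of Accelerator, and it assumes the entries in VOV_GPUSET are device identifiers that the application accepts directly (for example, numeric indices), which is not specified here. Exporting the result as CUDA_VISIBLE_DEVICES is likewise an illustration for CUDA applications, not a documented requirement of vovgpu.

```shell
#!/bin/sh
# Hypothetical helper (not part of Accelerator): convert a space-separated
# device list, such as the value of VOV_GPUSET, into a comma-separated list.
gpuset_to_csv() {
    # Leaving $1 unquoted lets the shell split it on whitespace and drop
    # empty fields; the fields become the function's positional parameters.
    set -- $1
    # Inside a subshell, "$*" joins the parameters with the first character
    # of IFS, here a comma, without changing IFS for the caller.
    ( IFS=,; printf '%s\n' "$*" )
}

# Assumption: VOV_GPUSET holds device indices such as "0 2".
# Outside a job the variable is unset, so the result is simply empty.
CUDA_VISIBLE_DEVICES=$(gpuset_to_csv "${VOV_GPUSET:-}")
export CUDA_VISIBLE_DEVICES
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```

With this helper, a value of "0 2 3" becomes "0,2,3". When the application takes the device list as a command-line argument, the @GPUSET@ macro shown in the examples above is the simpler route, since it is already comma-separated.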