If you have machines with multiple GPUs, you can harness the power of those devices
by following these guidelines, which have been tested with up to 8 GPUs per machine.
- Start a vovtaskerroot on each machine, with one consumable hardware resource
  for each GPU on that machine. Name these resources GPU:Tesla<N>, where N is
  an index starting from 0. For example, for machines with four GPUs each:
# In taskers.tcl
set res4gpus "GPU:Tesla0/1 GPU:Tesla1/1 GPU:Tesla2/1 GPU:Tesla3/1"
vtk_tasker_define nv001 -resources $res4gpus
vtk_tasker_define nv002 -resources $res4gpus
# ...
vtk_tasker_define nvXXX -resources $res4gpus
- Define job resources of type G:Tesla<J>, where J is the number of GPUs
  requested by the job. For each J, you need to define the map from the job
  resource to the hardware resources of type GPU:Tesla<N>. For example, on a
  machine with 4 GPUs, G:Tesla1 is easy because it maps to the OR of all the
  available GPUs, and G:Tesla4 is also easy because it maps to the AND of all
  4 devices.
# In resources.tcl
set mapOR "GPU:Tesla0/1 OR GPU:Tesla1/1 OR GPU:Tesla2/1 OR GPU:Tesla3/1"
set mapAND "GPU:Tesla0/1 GPU:Tesla1/1 GPU:Tesla2/1 GPU:Tesla3/1"
vtk_resourcemap_set G:Tesla1 -max unlimited -map $mapOR
# ...
vtk_resourcemap_set G:Tesla4 -max unlimited -map $mapAND
The other maps, for G:Tesla2 and G:Tesla3, are more complex, and for larger
numbers of GPUs it is not feasible to use all possible combinations of devices.
To help compute those maps, and to reduce the number of combinations to a
workable subset, we provide a procedure called findCombinations in
$VOVDIR/scripts/hero/nvidia/hero_nvidia_resources.tcl.
Feel free to copy the contents of that file into your resources.tcl file.
proc findCombinations { list n } {
    #
    # Recursive procedure to find combinations of 'n' elements from 'list'.
    # Not every combination is returned: for each leading element only the
    # first sub-combination is kept, so for n > 1 the result is the runs of
    # 'n' consecutive elements of 'list'.
    #
    set result {}
    set l [llength $list]
    if { $l >= $n } {
        set inc 1
        for { set i 0 } { $i < $l } { incr i $inc } {
            set elem [lindex $list $i]
            # Elements after position i are the candidates for the rest.
            set subList [lreplace $list 0 $i]
            set subCombos {}
            if { $n > 1 } {
                set subCombos [findCombinations $subList [expr {$n - 1}]]
                if { $l > 2 && [llength $subCombos] > 1 } {
                    ### When there are too many combinations, use only the first combo.
                    set subCombos [lrange $subCombos 0 0]
                }
                foreach subCombo $subCombos {
                    lappend result [concat $elem $subCombo]
                }
            } else {
                lappend result $elem
            }
        }
    }
    return $result
}
## Maximum number of GPUs in any of the farm machines.
## Call the GPUs "GPU:Tesla0 ... GPU:Tesla3 ..."
set MAX 8
set GPUS {}
for { set i 0 } { $i < $MAX } { incr i } {
    lappend GPUS "GPU:Tesla$i/1"
}
# A resource to count how many GPUs are in use.
vtk_resourcemap_set G:TeslaNum -total unlimited
for { set i 1 } { $i <= $MAX } { incr i } {
    set options [findCombinations $GPUS $i ]
    set optionsWithParentheses ""
    set sep ""
    foreach opt $options {
        if { [llength $opt] > 1 } {
            append optionsWithParentheses "$sep ( $opt )"
        } else {
            append optionsWithParentheses "$sep $opt"
        }
        set sep " OR"
    }
    vtk_resourcemap_set G:Tesla$i -total unlimited -map "G:TeslaNum#$i ( $optionsWithParentheses )"
}
- Submit your workload using the wrapper vovgpu, a script that interprets the
  "SOLUTION" computed by the scheduler and passes the selected list of devices
  to the application, either via the environment variable VOV_GPUSET or with
  the macro @GPUSET@. Note that VOV_GPUSET gives you a space-separated list of
  devices, while @GPUSET@ gives you a comma-separated list.
With this setup, you can submit arbitrary workloads that request any number
of GPUs.
% nc run -r G:Tesla1 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla2 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla3 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
% nc run -r G:Tesla4 -- vovgpu some_job_requiring_gpu -devices=@GPUSET@ -arg1 -arg2
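When an application cannot take the device list on its command line, a small
job script can read VOV_GPUSET instead. The following is a minimal sketch, not
part of the product: the function name gpuset_to_csv and the sample value
"0 2" are hypothetical, and the exact form of the device names in VOV_GPUSET
is an assumption. It only shows how to turn the space-separated form into the
comma-separated form.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the product): convert a
# space-separated device list, as in VOV_GPUSET, into a comma-separated
# list, as produced by the @GPUSET@ macro.
gpuset_to_csv() {
    csv=""
    # $1 is intentionally unquoted so the shell splits it on whitespace.
    for dev in $1; do
        # Add a comma only when csv already holds a device.
        csv="${csv:+$csv,}$dev"
    done
    printf '%s\n' "$csv"
}

# Assumption: inside a job, vovgpu has already set VOV_GPUSET.
# The fallback "0 2" is a sample value so the sketch also runs standalone.
VOV_GPUSET="${VOV_GPUSET:-0 2}"
DEVICES=$(gpuset_to_csv "$VOV_GPUSET")
echo "devices=$DEVICES"
```

A job script built this way could pass $DEVICES to the application in whatever
form it expects. If the application accepts a command-line option, using
@GPUSET@ as in the examples above is simpler.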
If you are new to Accelerator, remember that you can get project tracking
with the -jobproj option. For FairShare, use the -g and -sg options of
nc run, as in this example:
% nc run -r G:Tesla4 -jobproj MachineLearnAboutCats -g /bu/ai/rd vovgpu my_ml_app