Distributed Parallel Support

Accelerator supports jobs that require multiple CPUs that can be located on different machines.

When a Distributed Parallel (DP) job is submitted to the scheduler, the job is divided into partial jobs. Each partial job can require a different set of resources and cumulatively requires all the resources of the Distributed Parallel job. Accelerator schedules each partial job separately. When all partial jobs have been dispatched, one designated partial job executes the actual Distributed Parallel job. Depending on the submission method, there are several ways to take advantage of the computing resources assigned to the other partial jobs.

Submission Methods

There are three common methods to submit Distributed Parallel jobs.

Use -dp to submit a Distributed Parallel job with N partial jobs

With -dp 2, as in the example below, Accelerator creates a Distributed Parallel job with two partial jobs. When both partial jobs have been dispatched, the active partial job becomes the master and begins executing the command sleep 10, while the other partial job waits for the active partial job to terminate. By default, the active partial job is the first one.

For example:

% nc run -dp 2 sleep 10

This method is rarely used due to the burden it puts on the tool integrator to find a way to use the other partial jobs.

Use -dp N with vovparallel lsbhosts

When all N partial jobs have been dispatched, the active partial job executes vovparallel, which sets the environment variable LSB_HOSTS before invoking the master command. LSB_HOSTS contains the list of all hosts, possibly with repetitions, currently set aside to run the partial jobs of the Distributed Parallel job. The master command is expected to use rsh or ssh to reach those hosts and launch the appropriate software.

In the following example, the active partial job prints the value of the LSB_HOSTS environment variable, showing the list of hosts assigned to the Distributed Parallel job.

For example:

% nc run -dp 2 -wl vovparallel lsbhosts printenv LSB_HOSTS
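
As a rough sketch of what a master command might do with LSB_HOSTS, the following script counts how many partial jobs landed on each host and prints the remote launch commands rather than running them. It assumes LSB_HOSTS is a space-separated list; my_worker and its --slots flag are hypothetical names used only for illustration.

```sh
#!/bin/sh
# Sketch of a master command run under "vovparallel lsbhosts".
# Assumption: LSB_HOSTS is a space-separated host list in which a host
# appears once for every partial job placed on it.

# Print "host count" for each distinct host in the list given as $1.
slots_per_host() {
    # $1 is left unquoted on purpose so the list splits into words.
    printf '%s\n' $1 | sort | uniq -c | awk '{ print $2, $1 }'
}

# Print one launch command per host instead of executing it.
# my_worker and --slots are hypothetical; replace "echo" with the real
# ssh or rsh invocation once the tool's command line is known.
plan_launch() {
    slots_per_host "$1" | while read -r host slots; do
        echo "ssh $host my_worker --slots $slots"
    done
}

# Act only when LSB_HOSTS is set, so the file can also be sourced.
if [ -n "${LSB_HOSTS:-}" ]; then
    plan_launch "$LSB_HOSTS"
fi
```

Printing the commands first makes it easy to check the host list before any remote process is started.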

Use -dp N together with vovparallel clone

This method clones the specified wrapper script across the selected hosts. The wrapper script then executes the tool-specific commands on the appropriate hosts and will have access to the Distributed Parallel environment variables and job properties.

For example:

% nc run -dp 2 vovparallel clone mywrapper

In this case, the active partial job executes vovparallel clone ... which in turn connects to the other partial jobs to invoke N similar instances of the script mywrapper on all partial jobs.

This is the preferred method because it does not require ssh/rsh functionality to activate the partial jobs, and it allows Accelerator to track memory and CPU utilization for each partial job.

At execution time, each partialTool job sets environment variables that tell each partial job of the Distributed Parallel job which role to play. The variables are:

VOV_DP_TOPJOBID, with the VovId of the top-level Distributed Parallel job
VOV_DP_COUNT, the overall number of partial jobs used for the Distributed Parallel job
VOV_DP_RANK, a number from 1 to VOV_DP_COUNT used to identify each partial job
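
One way a wrapper script could use these variables is to divide a fixed amount of work among the partial jobs by rank. The sketch below is an illustration only; the function name rank_slice and the figure of 100 work items are hypothetical.

```sh
#!/bin/sh
# Sketch: split TOTAL work items across the partial jobs by rank.
# Prints "first last" (1-based, inclusive) for the given rank.
# If TOTAL is smaller than COUNT, later ranks get an empty range
# (first greater than last).
# Usage: rank_slice TOTAL COUNT RANK
rank_slice() {
    total=$1
    count=$2
    rank=$3
    base=$((total / count))      # items that every rank receives
    extra=$((total % count))     # the first "extra" ranks get one more
    if [ "$rank" -le "$extra" ]; then
        first=$(( (rank - 1) * (base + 1) + 1 ))
        last=$(( first + base ))
    else
        first=$(( extra * (base + 1) + (rank - 1 - extra) * base + 1 ))
        last=$(( first + base - 1 ))
    fi
    echo "$first $last"
}

# Inside a wrapper started by vovparallel clone, both variables are set.
if [ -n "${VOV_DP_RANK:-}" ] && [ -n "${VOV_DP_COUNT:-}" ]; then
    # 100 is a hypothetical number of work items.
    echo "rank $VOV_DP_RANK handles items $(rank_slice 100 "$VOV_DP_COUNT" "$VOV_DP_RANK")"
fi
```

Because every partial job computes its own range from the same inputs, no communication is needed to agree on the split.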

Partial Job Rank and Resources

The active partial job, which starts execution once all partial jobs have been dispatched, is normally partial job number 1. This can be overridden with the -dpactive option.

% nc run -dp 8 -dpactive 8 uname -a

To define the resource list for each component of the Distributed Parallel job, use the -dpres option. This option takes a comma-separated list of simple resource lists. The first element in the list is used for the first partial job, the second element for the second partial job, and so on. If there are more partial jobs than elements in the list, the last element in the list is used for all remaining partial jobs.

In the following example, the first partial job is dispatched to a Linux machine, the second partial job to a macosx machine, and all other partial jobs go to UNIX machines. The active partial job is number 2, which is the one running on macosx, so the tasker running macosx becomes the master for the Distributed Parallel job set. This also works with tasker names or any other tasker resource (kernel, RAM, and so on).

% nc run -dp 10 -dpres "linux,macosx,unix" -dpactive 2 vovparallel clone mywrapper
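
The mapping rule for -dpres can be illustrated with a small helper. This is only a model of the rule described above, not Accelerator's own code, and the function name dpres_for_rank is hypothetical.

```sh
#!/bin/sh
# Illustration of the -dpres rule: element K of the comma-separated list
# applies to partial job K, and the last element is reused for all
# remaining partial jobs.
# Usage: dpres_for_rank LIST RANK
dpres_for_rank() {
    list=$1
    rank=$2
    # Number of elements in the list = number of comma-separated fields.
    n=$(printf '%s\n' "$list" | awk -F, '{ print NF }')
    # Ranks beyond the end of the list fall back to the last element.
    if [ "$rank" -gt "$n" ]; then
        rank=$n
    fi
    printf '%s\n' "$list" | cut -d, -f"$rank"
}

# Show the resources chosen for the first four partial jobs.
for i in 1 2 3 4; do
    echo "partial job $i: $(dpres_for_rank "linux,macosx,unix" "$i")"
done
```

Elements may contain spaces, so a list such as "RAM/200 CORES/2,RAM/400 CORES/4" yields one multi-resource entry per partial job.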

Distributed Parallel Properties

When a Distributed Parallel job is submitted, the partialTool job sets the following properties on the top-level job:

DP_ACTIVERANK: The rank of the active partial job, which is the partial job that launches the top-level job.
DP_ATTEMPTS: The number of times the partial jobs have failed to be allocated within the allowed time.
DP_COHORTWAIT: Controls whether each cohort task waits for the primary job to complete. By default the value is 1, and each cohort task waits for the primary job. Passing -nocohortwait to nc run sets DP_COHORTWAIT to zero (0), which instructs partialTool to finish each cohort task as soon as its own subtask process has finished.
DP_FAILURE: Explanation of failures.
DP_HOSTS: List of hosts for the Distributed Parallel job. This property is set when the last partial job is launched. The list may contain duplicates.
DP_PORT_X: Set by the partialTool of rank X to the pair host dp_port.
DP_SEMAPHORE: Used by partial jobs to count the rank.
DP_SEMAPHORE2: A second synchronization counter used by partialTool.
DP_SETID: The ID of the set that contains the list of partial jobs.
DP_WAIT: Tells the partial job how long to wait before giving up the slot (default 30 seconds).
DP_WAIT_MAX: The maximum allowed value of DP_WAIT for a job (default 30 minutes).

In addition to these properties on the top-level job, the following property is set on each of the partial jobs:

DP_PART_OF: The top-level job ID.

Properties can be accessed via the vovprop command or via the Tcl API with vtk_prop_get. Once obtained, they can be used in conjunction with rsh/ssh, for example, to interact with the chosen hosts.
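
As a minimal sketch of this approach, the following functions read DP_HOSTS with vovprop, using the same "vovprop GET" form as the example script shipped in $VOVDIR/training/vnc, and reduce the list to its distinct hosts. The function names are hypothetical, and the sketch assumes the property value is a whitespace-separated host list.

```sh
#!/bin/sh
# Sketch: read DP_HOSTS from the top-level job and list each host once.
# Assumption: the property value is a whitespace-separated host list.

# Print the distinct hosts of the list given as $1, in first-seen order.
unique_hosts() {
    # $1 is left unquoted on purpose so the list splits into words.
    printf '%s\n' $1 | awk '!seen[$0]++'
}

# Print the distinct hosts assigned to the Distributed Parallel job.
# Requires VOV_DP_TOPJOBID, which is set for every partial job.
dp_unique_hosts() {
    hosts=$(vovprop GET "$VOV_DP_TOPJOBID" DP_HOSTS) || return 1
    unique_hosts "$hosts"
}
```

A wrapper could then loop over the output of dp_unique_hosts to contact each host exactly once, even when several partial jobs share a machine.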

Using Different Resources and Jobclasses

When specifying DP jobs, you can stack jobclasses so that the primary and component jobs have different resources and job classes.

This is done by setting VOV_JOB_DESC(dp,resources) and VOV_JOB_DESC(dp,jobclasses) to specify the resources and jobclass labels for the master and subcomponent DP jobs.

For example, two jobclass definitions mycalibre_a.tcl and mycalibre_b.tcl are defined as follows:


::::::::::::::
mycalibre_a.tcl
::::::::::::::
# Copyright (c) 1995-2021, Altair Engineering
# All Rights Reserved.

# $Id: //vov/trunk/src/scripts/jobclass/short.tcl#3 $


set classDescription "My Calibre job resources"

puts "This is mycalibre_a jobclass"
if { [ info exists VOV_JOB_DESC(dp,resources) ] } { 
  set VOV_JOB_DESC(dp,resources) [ string cat $VOV_JOB_DESC(dp,resources) ",RAM/200 CORES/2" ]
  set VOV_JOB_DESC(dp,jobclasses) [ string cat $VOV_JOB_DESC(dp,jobclasses) ",mycalibre_a" ]
} else { 
  set VOV_JOB_DESC(dp,resources) "RAM/200 CORES/2"
  set VOV_JOB_DESC(dp,jobclasses) "mycalibre_a"
}

proc initJobClass {} {
}
::::::::::::::
mycalibre_b.tcl
::::::::::::::
# Copyright (c) 1995-2021, Altair Engineering
# All Rights Reserved.

# $Id: //vov/trunk/src/scripts/jobclass/short.tcl#3 $

set classDescription "My Calibre job resources"

puts "This is mycalibre_b jobclass"
if { [info exists VOV_JOB_DESC(dp,res,*)] }  { 
  puts "VOV_JOB_DESC(dp,res,*) already exists.  This is good!" 
} else { 
  puts "VOV_JOB_DESC(dp,res,*) does not appear to exist.  This is bad."
}

if { [ info exists VOV_JOB_DESC(dp,resources) ] } { 
  set VOV_JOB_DESC(dp,resources) [ string cat $VOV_JOB_DESC(dp,resources) ",RAM/400 CORES/4" ]
  set VOV_JOB_DESC(dp,jobclasses) [ string cat $VOV_JOB_DESC(dp,jobclasses) ",mycalibre_b" ]

} else { 
  set VOV_JOB_DESC(dp,resources) "RAM/400 CORES/4"
  set VOV_JOB_DESC(dp,jobclasses) "mycalibre_b"
}

proc initJobClass {} {
}

You then call nc run as:

% nc run -v 5 -C mycalibre_a -C mycalibre_b -e BASE -J jeffjob -dp 4 vovparallel clone sleep 1

This results in the primary job having the jobclass "mycalibre_a" and the resources "RAM/200 CORES/2" while the secondary jobs have a jobclass of "mycalibre_b" and the resources "RAM/400 CORES/4".

Distributed Parallel Slot Timeout

With all of the methods described above, each partial job waits a finite amount of time for all other components to show up. If the time elapses, the partial job gives up, fails, and returns the slot to the farm. The wait time is 30 seconds by default; a larger value may be needed if the farm is heavily loaded. The wait time can be specified using the -dpwait option.

Regardless of the specified wait time, there is a maximum wait time (default 30 minutes) that can be changed by manually setting the DP_WAIT_MAX property on the job. Use the -P option with nc run, such as

-P DP_WAIT_MAX=12.0

vovparallel Clone

An example of a script used with vovparallel clone is available in $VOVDIR/training/vnc/simple_dp_script.csh.

#!/bin/csh -f
#
# Example of a script to be used with  vovparallel clone.
# 
# Example of usage:
# % nc run -dp 4 vovparallel clone simple_dp_script.csh
#

set sleepTime = 30
if ( $#argv >= 1 ) then
   set sleepTime = $1
endif

# Each application has its own way to determine a rendezvous port.
# In this example, it is a fixed port.
set APPLICATION_PORT = 2345

if ( ! $?VOV_DP_RANK ) then
    echo "ERROR: Variable VOV_DP_RANK not defined."
    echo "       This script needs to be run with vovparallel clone ..."
    exit 1
endif

echo "Hello! I am component $VOV_DP_RANK"

if ( $VOV_DP_RANK == 1 ) then
    echo "This is the master. "
    echo "This should open a socket for communication e.g. $APPLICATION_PORT"
    set DP_HOSTS      = `vovprop GET $VOV_DP_TOPJOBID DP_HOSTS`
    echo $DP_HOSTS
    sleep $sleepTime
else 
    echo "This is the tasker"
    set DP_HOSTS      = `vovprop GET $VOV_DP_TOPJOBID DP_HOSTS`
    set masterHost = $DP_HOSTS[1]
    echo "This should communicate with master through ${masterHost}:$APPLICATION_PORT"
    sleep $sleepTime

endif

exit 0

A more advanced script to use with vovparallel clone is available in $VOVDIR/eda/MentorGraphics/vovcalibremt.

OpenMPI Support

If you have an application that uses OpenMPI, you can submit it as a Distributed Parallel application (options -dp, -dpres, etc.), provided that you launch it through the wrapper "vovmpirun".

For example, if you want the application to run with N components (possibly on different hosts), you can submit it with:

% nc run -dp <N> vovmpirun ./path/to/application

Note: This support relies on the fact that OpenMPI uses ssh to start orted on the remote hosts. OpenMPI is forced to use $VOVDIR/hidden_mpi/ssh.