Theory of Operation

This section describes how Accelerator Plus operates on top of Accelerator and how to address problems should they occur.

Theory of Operation: Connecting Accelerator Plus to Accelerator

In general, Accelerator Plus connects to a base scheduler by means of an elastic grid daemon.

Accelerator Plus connects to Accelerator with the daemon vovwxd. The run directory for this daemon is .swd/vovwxd. The three significant files are vovwxd.log, config.tcl, and vovnc.tcl.
  • The config.tcl file allows the elastic parameters to be configured.
  • A fresh timestamp on the log file confirms that the elastic daemon is running (see the check below).
    Note: It is essential that the elastic daemon is running for jobs to be submitted to the base scheduler.
A change to the config.tcl file is automatically picked up by vovwxd. A crashed or halted daemon can be restarted with the command:
% wx cmd vovdaemonmgr start
vovresourced             ALREADY RUNNING
vovdbd                   ALREADY RUNNING
vovwxd                   STARTING...
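
After restarting (or at any time), a simple way to confirm the daemon is alive is to check for a fresh timestamp on vovwxd.log, as noted above. A minimal check, assuming the current directory is the Accelerator Plus server directory that contains .swd:
% ls -l .swd/vovwxd/vovwxd.log
% tail .swd/vovwxd/vovwxd.log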

The vovwxd daemon watches the Accelerator Plus workload; if some bucket is waiting for hardware or software resources, then the daemon issues a request to the base scheduler for more resources.

When the underlying scheduler dispatches this job to an execution host, the job fires a new vovtasker that connects back to Accelerator Plus, thereby advertising new computing resources that can be used to process the jobs queued in Accelerator Plus.

When this occurs, an extra box appears in the LED monitor of the vovconsole (located on the upper bar of the vovconsole main window). Each tasker started by the elastic daemon appears as an additional box; the boxes shrink as more taskers are added.

Key Notes

  • vovwxd sends "tasker job requests" to the base scheduler; the actual workload stays in Accelerator Plus and is executed as the base scheduler satisfies the requests from vovwxd.
  • The base scheduler sees only tasker requests. It has no direct visibility into the details of the Accelerator Plus workload, although each tasker's resource request (license, limits, memory, cores, etc.) precisely matches the underlying workload.
  • Accelerator Plus sees a "private" pool of taskers to which it can submit jobs. Being elastic, the pool may grow or shrink.
  • Typically, the taskers requested by vovwxd offer only one slot, so each of these taskers runs only one job at a time.
Accelerator Plus taskers terminate when one of two conditions is met:
  1. The maximum idle time is exceeded. This is controlled by the parameter "Max Idle Time", represented by the variable CONFIG(tasker,maxidle) in config.tcl, which is typically 30s.
  2. The maximum life is exceeded and the tasker is idle. This is controlled by the parameter CONFIG(tasker,maxlife), which has a default value of 1w (1 week), although typical values are 1h or 2h. This allows FairShare policies on the underlying queue to be respected by keeping the maximum job duration in Accelerator within a reasonable range. Both parameters are set in config.tcl; see the sketch after this list.
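
The following fragment is a minimal sketch of how these two parameters might appear in .swd/vovwxd/config.tcl; the values shown are illustrative only, not recommended defaults:
# Illustrative elastic tasker tuning (example values only).
set CONFIG(tasker,maxidle)  30s   ;# terminate an idle tasker after 30 seconds
set CONFIG(tasker,maxlife)  2h    ;# retire a tasker after 2 hours so FairShare on the base queue is respected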

When the maximum idle time is exceeded, the tasker stops accepting jobs by declaring itself "DONE"; the tasker turns blue in the Tasker Monitor status column. After a few seconds the tasker exits and its entry disappears from the monitor. At that time, the job representing the request for that tasker terminates in Accelerator.

When the maximum life is exceeded, the tasker declares itself suspended (purple) and no new jobs are sent to it, while all jobs currently running on that suspended tasker are allowed to continue and terminate normally. The tasker exits when no more jobs are running on it. This behavior is essential for maintaining adherence to FairShare policies.

A normal Accelerator Plus tasker will go through the following states:
REQUESTED
Request for a new tasker resource has been issued to the base scheduler.
NOLIC
Tasker is connecting and is waiting to be given a license.
READY
Tasker is ready to receive a job.
FULL
Tasker is running a job.
OVERLOADED
The load on the tasker is higher than a specified threshold.
SUSPENDED
Tasker is no longer accepting jobs.
PAUSED
Tasker has been paused by the base scheduler.
DONE
Tasker is done and about to be deleted.

When Things Go Well

In Accelerator Plus's Tasker Monitor, you will see a reasonable number of taskers listed; most are classed as FULL and some as REQUESTED. Normally, each tasker has one slot and that slot is occupied. Freshly created taskers may go into a NOLIC state for a brief period while they establish their connection, license, and capabilities. Some taskers may be SUSPENDED/purple: they are finishing the last job they were allotted before the maximum life takes effect. Some will be READY/green for a brief period; most will be FULL/yellow (running a job). A few may be red, indicating an OVERLOADED tasker.

From the Accelerator perspective, you can look at a set called "WXTaskers:wx" to see all tasker requests coming from vovwxd. Each such set should contain some number of valid jobs (taskers that have exited normally), a few running/orange jobs (the work currently being done), and a small number of queued/cyan jobs. Only a small number of queued jobs should ever be present (e.g. 10-50), even though the Accelerator Plus session may have several hundred thousand jobs ready to run. The vovwxd daemon continues to replenish the pending requests as they are executed, so the Accelerator queue size remains small. This does not impact FairShare negatively, because a single waiting job is enough to make FairShare kick in; teams that use Accelerator Plus are not at a disadvantage just because only a handful of their jobs are present in Accelerator at any one time, compared with teams that run their entire workload in Accelerator.
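
To make the replenishment idea concrete, here is a conceptual sketch in plain Tcl. It is not the actual vovwxd code; the procedure name, its arguments, and the target of 20 queued requests are illustrative assumptions.
# Conceptual sketch only -- not the vovwxd implementation.
# Keep a small, fixed number of tasker requests queued in Accelerator,
# regardless of how many buckets of jobs are waiting in Accelerator Plus.
proc computeNewRequests {queuedNow waitingBuckets targetQueued} {
    # Top up the queued tasker requests to the target, but never
    # request more taskers than there are waiting buckets of jobs.
    set deficit [expr {$targetQueued - $queuedNow}]
    if {$deficit <= 0} { return 0 }
    if {$deficit > $waitingBuckets} { set deficit $waitingBuckets }
    return $deficit
}
# Example: 8 requests queued, 300000 buckets waiting, target 20 -> issue 12 more.
puts [computeNewRequests 8 300000 20]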