vovwxd

vovwxd is a daemon that dynamically requests taskers from another job scheduler, known as the base queue. This process is referred to as elastic computing. The daemon is the key component of the Accelerator Plus product and is also used with FlowTracer when interfacing with an external job scheduler.

In general, vovwxd connects to a base queue and watches the local Accelerator Plus or FlowTracer workload. If a job bucket is waiting for hardware or software resources, vovwxd asks the base queue for a tasker that will provide more resources to the local scheduler. This request is referred to as a launcher job. The launcher job runs on a local system tasker named WXLauncher and is simply a job submission command that asks the base scheduler to run the wxagent utility. As long as the base queue provides the resources the job is waiting for, the wxagent utility runs in the base queue and spawns a vovtasker that connects back to the local vovserver. The workload is executed as the base scheduler satisfies the requests for more taskers from vovwxd. Typically, the taskers requested by vovwxd offer only one slot, so each of these taskers runs only one job at a time.

The daemon is normally started automatically by the SWD/autostart/start_vovwxd.tcl script and is managed on demand with the vovdaemonmgr utility. It can also be run as a foreground process for debugging purposes.

Configuration

To run jobs successfully in an elastic computing environment, some configuration must be done in advance. The required files are config.tcl and a base-queue-specific driver script, such as vovnc.tcl for Accelerator. Both files reside in the SWD/vovwxd directory. To install these files with an initial configuration, use the vovwxconnect utility, which provides options to configure the base queue type and name. Any change made to the config.tcl file or the driver script after the product has started is picked up automatically by vovwxd.

The WXLauncher tasker is automatically configured and started when connecting to Accelerator Plus queues via vovwxconnect -wx.

In Accelerator

From the Accelerator perspective, you can look at a set called "WXTaskers:wx" to see all requested jobs coming from vovwxd. Each set should contain some valid jobs (taskers that have exited normally), a few running (orange) jobs (the work currently being done), and a small number of queued (cyan) jobs. Only a small number of queued jobs should ever be present (for example, 10 to 50), even though the Accelerator Plus session may have several hundred thousand jobs ready to run. The vovwxd daemon replenishes the queued jobs as they are executed, so that the Accelerator queue size remains small. This does not affect FairShare negatively, because a single waiting job is enough for FairShare to take effect. Teams that use Accelerator Plus are therefore not at a disadvantage compared with teams that run their entire workload in Accelerator, even though only a handful of their jobs are present in Accelerator at any one time.
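The replenishing behavior described above can be pictured with a small sketch. This is only an illustrative model, not the actual vovwxd logic: the function name `launchers_to_submit` and the `max_queued` cap are hypothetical, and the real daemon takes its limits from its own configuration.

```python
def launchers_to_submit(waiting_jobs, queued_launchers, max_queued=20):
    """Return how many new launcher jobs to submit in this cycle.

    waiting_jobs     -- local jobs still waiting for a tasker
    queued_launchers -- launcher jobs already queued in the base queue
    max_queued       -- hypothetical cap on queued launcher jobs
    """
    if waiting_jobs <= 0:
        return 0
    # Never ask for more taskers than there are waiting jobs, and
    # never let the backlog in the base queue grow past the cap.
    target = min(waiting_jobs, max_queued)
    return max(0, target - queued_launchers)


if __name__ == "__main__":
    print(launchers_to_submit(300_000, 0))   # empty base queue: fill up to the cap
    print(launchers_to_submit(300_000, 15))  # 5 launchers started: top up by 5
    print(launchers_to_submit(3, 0))         # small workload: request only 3
```

In this model the base queue never holds more than `max_queued` pending requests, no matter how large the local workload is, which is the property the paragraph above relies on.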

Migrating from 2016.09 Version

When elastic computing was introduced in version 2016.09, it was based on Tcl daemons called vovelasticd and vovlsfd. These have been replaced by a new process based on the binary daemon vovwxd.

Upon starting, vovwxd checks the WXSWD/autostart directory for scripts to start legacy daemons, such as vovelasticd or vovlsfd. If such legacy autostart scripts exist, vovwxd prints a warning stating that the vovwxd autostart script will not be installed. The legacy daemon will need to be manually stopped, and its autostart script will need to be disabled (renamed or removed). The vovwxd autostart script will then need to be copied from $VOVDIR/etc/autostart into WXSWD/autostart. If a legacy daemon autostart script does not exist, the autostart script for the new vovwxd daemon will be installed into WXSWD/autostart and the daemon will be automatically started.
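The startup check can be sketched as follows. This is a rough model of the decision only, under stated assumptions: the legacy autostart script names are not given in this document, so the sketch simply matches any file whose name contains a legacy daemon name, and the function names and return values are hypothetical.

```python
from pathlib import Path
from typing import List

# Legacy daemon names taken from the text above.
LEGACY_DAEMONS = ("vovelasticd", "vovlsfd")


def find_legacy_autostart(autostart_dir: Path) -> List[str]:
    """List autostart files whose names mention a legacy daemon."""
    return sorted(
        p.name
        for p in autostart_dir.iterdir()
        if p.is_file() and any(d in p.name for d in LEGACY_DAEMONS)
    )


def startup_action(autostart_dir: Path) -> str:
    """Decide what to do with the vovwxd autostart script.

    Returns "skip_install" when a legacy script is present (a warning
    is expected and manual migration is needed), otherwise "install".
    """
    if find_legacy_autostart(autostart_dir):
        return "skip_install"
    return "install"
```

A caller would pass the WXSWD/autostart directory as a `Path`. On "skip_install", the manual steps described above still apply: stop the legacy daemon, disable its autostart script, and copy the vovwxd script from $VOVDIR/etc/autostart.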

Three environment variables control how a stuck vovwxd daemon is handled:
VOV_WX_STUCK_VOVWXD_AGE
Age at which the vovwxd daemon is considered stuck and an alert is generated. Set in wx.swd/setup.tcl. The age can be given in timespec or seconds format.

VOV_WX_STUCK_VOVWXD_RESTART
If set to 1, the vovwxd daemon is restarted when the stuck age has been exceeded.

VOV_WX_STUCK_VOVWXD_ALERT_AGE
If set to 1, the vovwxd daemon is traced with strace when the stuck age has been exceeded.
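The way these three variables combine can be sketched as below. This is a minimal model, not the product's implementation: the timespec grammar (a sequence of number and unit pairs such as `10m` or `1h30m`) is an assumption, and the function names and action labels are hypothetical.

```python
import re

# Assumed timespec units; the real timespec grammar may differ.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def age_to_seconds(value):
    """Convert an age given in seconds ("600") or timespec ("10m")."""
    value = value.strip()
    if value.isdigit():
        return int(value)
    parts = re.findall(r"(\d+)([smhdw])", value)
    # Reject anything that is not purely number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError("unrecognized age: %r" % value)
    return sum(int(n) * _UNITS[u] for n, u in parts)


def stuck_actions(seconds_since_progress, env):
    """Return the actions to take for a possibly stuck daemon."""
    age = env.get("VOV_WX_STUCK_VOVWXD_AGE")
    if age is None or seconds_since_progress <= age_to_seconds(age):
        return []
    actions = ["alert"]
    if env.get("VOV_WX_STUCK_VOVWXD_ALERT_AGE") == "1":
        actions.append("strace")
    if env.get("VOV_WX_STUCK_VOVWXD_RESTART") == "1":
        actions.append("restart")
    return actions
```

For example, with the age set to `10m` and the restart variable set to 1, a daemon that has made no progress for 700 seconds would yield an alert followed by a restart in this model.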