vovwxd
vovwxd is a daemon that dynamically requests taskers from another job scheduler, referred to as the base queue. This process is called elastic computing. The daemon is the key component of the Accelerator Plus product and is also used with FlowTracer when interfacing with an external job scheduler.
In general, vovwxd connects to a base queue and watches the local Accelerator Plus or FlowTracer workload; if a job bucket is waiting for hardware or software resources, vovwxd issues a request to the base queue for a tasker that will provide more resources to the local scheduler. This request is referred to as a launcher job. The launcher job runs on a local system tasker running under the name WXLauncher and is simply a job submission command to the base scheduler for the wxagent utility. As long as the base queue provides the resources the job is waiting for, the wxagent utility will run in the base queue and will spawn a vovtasker that connects back to the local vovserver. The workload is executed as the base scheduler satisfies the requests for more taskers from vovwxd. Typically, the taskers requested by vovwxd offer only one slot, so that only one job at a time can be run on one of these taskers.
The daemon is normally started automatically via the SWD/autostart/start_vovwxd.tcl script and is managed on demand with the vovdaemonmgr utility. It can also be run as a foreground process for debugging purposes.
Configuration
To run jobs successfully in an elastic computing environment, some configuration must be done up front. The required files are config.tcl and a base-queue-specific driver script, such as vovnc.tcl for Accelerator. Both files reside in the SWD/vovwxd directory. To install them with an initial configuration, use the vovwxconnect utility, which provides options for setting the base queue type and name. Any change made to config.tcl or the driver script after the product has started is picked up automatically by vovwxd.
The WXLauncher tasker is automatically configured and started when connecting to Accelerator Plus queues via vovwxconnect -wx.
In Accelerator
From the Accelerator perspective, you can look at a set called "WXTaskers:wx" to see all requested jobs coming from vovwxd. Each set should contain some number of valid jobs (taskers that have exited normally), a few running (orange) jobs (the work currently being done), and a small number of queued (cyan) jobs. The number of queued jobs should stay small (for example, 10-50) even though the Accelerator Plus session may have several hundred thousand jobs ready to run. The vovwxd daemon replenishes the pending jobs as they are executed, so the Accelerator queue size remains small. This does not affect FairShare negatively, because a single waiting job is enough for FairShare to take effect; teams that use Accelerator Plus are not at a disadvantage just because only a handful of their jobs are present in Accelerator at any one time, compared with teams that run their entire workload in Accelerator.
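The replenishing behavior described above can be illustrated with a small simulation. This is only a sketch of the idea, not the algorithm vovwxd actually uses: the function names, the cap of 20 queued requests, and the fixed number of requests the base queue starts per cycle are all assumptions made for illustration.

```python
def launcher_requests_needed(waiting_jobs: int, queued_requests: int,
                             max_queued: int = 20) -> int:
    """Return how many new launcher jobs to submit to the base queue.

    waiting_jobs    -- local jobs not yet dispatched to a tasker
    queued_requests -- launcher jobs still queued in the base queue
    max_queued      -- hypothetical cap on queued launcher jobs
    """
    if min(waiting_jobs, queued_requests, max_queued) < 0:
        raise ValueError("counts must be non-negative")
    # Never keep more requests queued than there are jobs left to run.
    target = min(waiting_jobs, max_queued)
    return max(0, target - queued_requests)


def simulate(waiting_jobs: int, max_queued: int, started_per_cycle: int):
    """Run the top-up loop until the workload drains.

    Assumes each started tasker offers one slot and runs exactly one job.
    Returns (peak number of queued requests, number of cycles).
    """
    if max_queued <= 0 or started_per_cycle <= 0:
        raise ValueError("max_queued and started_per_cycle must be positive")
    queued = peak = cycles = 0
    while waiting_jobs > 0:
        # Top up the base queue to the cap.
        queued += launcher_requests_needed(waiting_jobs, queued, max_queued)
        peak = max(peak, queued)
        # The base queue starts some of the queued requests this cycle.
        started = min(started_per_cycle, queued)
        queued -= started
        waiting_jobs -= started
        cycles += 1
    return peak, cycles


if __name__ == "__main__":
    peak, cycles = simulate(waiting_jobs=1000, max_queued=20, started_per_cycle=5)
    print(f"peak queued requests: {peak}, cycles: {cycles}")
```

In this model the base queue never holds more than the cap, regardless of how large the local workload is, which is the property the paragraph above describes.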
Migrating from 2016.09 Version
When elastic computing was introduced in version 2016.09, it was based on Tcl daemons called vovelasticd and vovlsfd. These have been replaced by a new process based on the binary daemon vovwxd.
Upon starting, vovwxd checks the WXSWD/autostart directory for scripts to start legacy daemons, such as vovelasticd or vovlsfd. If such legacy autostart scripts exist, vovwxd prints a warning stating that the vovwxd autostart script will not be installed. The legacy daemon will need to be manually stopped, and its autostart script will need to be disabled (renamed or removed). The vovwxd autostart script will then need to be copied from $VOVDIR/etc/autostart into WXSWD/autostart. If a legacy daemon autostart script does not exist, the autostart script for the new vovwxd daemon will be installed into WXSWD/autostart and the daemon will be automatically started.
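As a rough aid for the migration described above, the following sketch lists the manual steps that apply to a given autostart directory. It is not part of the product. The legacy script names start_vovelasticd.tcl and start_vovlsfd.tcl are assumptions modelled on start_vovwxd.tcl, so check the actual file names in your WXSWD/autostart directory before relying on it.

```python
from pathlib import Path
from typing import List

# Assumed legacy autostart script names; verify against your installation.
LEGACY_SCRIPTS = ("start_vovelasticd.tcl", "start_vovlsfd.tcl")
NEW_SCRIPT = "start_vovwxd.tcl"


def migration_steps(autostart_dir) -> List[str]:
    """Return the manual migration steps needed for an autostart directory.

    An empty list means no legacy autostart script was found, in which
    case vovwxd installs its own autostart script without manual work.
    """
    autostart = Path(autostart_dir)
    legacy = [name for name in LEGACY_SCRIPTS if (autostart / name).is_file()]
    steps: List[str] = []
    for name in legacy:
        daemon = name[len("start_"):-len(".tcl")]
        steps.append(f"stop {daemon} manually")
        steps.append(f"disable {name} (rename or remove)")
    # The copy is only needed when a legacy script blocked the installation
    # and the new script is not already in place.
    if legacy and not (autostart / NEW_SCRIPT).is_file():
        steps.append(
            f"copy {NEW_SCRIPT} from $VOVDIR/etc/autostart into {autostart}"
        )
    return steps


if __name__ == "__main__":
    import sys
    for step in migration_steps(sys.argv[1] if len(sys.argv) > 1 else "."):
        print("-", step)
```

The script only reports what needs doing; stopping daemons and moving files is left to the administrator.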
The following environment variables control how a stuck vovwxd daemon is detected and handled:
- VOV_WX_STUCK_VOVWXD_AGE
- Age at which the vovwxd daemon should be considered stuck and an alert should be generated. Set in wx.swd/setup.tcl. The age can be given in timespec or seconds format.
- VOV_WX_STUCK_VOVWXD_RESTART
- If set to 1, the vovwxd daemon will be restarted if the stuck age has been exceeded.
- VOV_WX_STUCK_VOVWXD_ALERT_AGE
- If set to 1, the vovwxd daemon will be traced with strace if the stuck age has been exceeded.
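How these three settings combine can be sketched as follows. This is an illustrative model only, not the product's implementation. It assumes that a timespec is a sequence of number and unit pairs such as 5m or 1h30m (units s, m, h, d, w), that a plain integer means seconds, and that "stuck" is measured as a number of idle seconds supplied by the caller. The function names and the order of the returned actions are likewise assumptions.

```python
import re
from typing import List, Mapping

# Assumed timespec units; check the product documentation for the real set.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_age(value: str) -> int:
    """Convert an age in seconds ("600") or assumed timespec ("1h30m") to seconds."""
    text = value.strip()
    if text.isdigit():
        return int(text)
    parts = re.findall(r"(\d+)([smhdw])", text)
    # Reject input with leftover characters that the pattern did not consume.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognized age: {value!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def stuck_actions(idle_seconds: float, settings: Mapping[str, str]) -> List[str]:
    """Return the actions implied by the settings for a daemon idle this long."""
    age = settings.get("VOV_WX_STUCK_VOVWXD_AGE")
    if age is None or idle_seconds <= parse_age(age):
        return []
    actions = ["alert"]
    # Trace before restarting so the trace can observe the stuck process.
    if settings.get("VOV_WX_STUCK_VOVWXD_ALERT_AGE") == "1":
        actions.append("strace")
    if settings.get("VOV_WX_STUCK_VOVWXD_RESTART") == "1":
        actions.append("restart")
    return actions


if __name__ == "__main__":
    cfg = {"VOV_WX_STUCK_VOVWXD_AGE": "5m", "VOV_WX_STUCK_VOVWXD_RESTART": "1"}
    print(stuck_actions(301, cfg))
```

The variable names are taken from the list above; everything else in the sketch is hypothetical.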