Troubleshooting

When a problem occurs, first run vovcheck and correct the reported errors:
% vovcheck
vovcheck: message: Creating report /usr/tmp/vovcheck.report.15765
Test: BasicVariables                                              [OK]
Test: EnvBase                                                     [OK]
Test: FlowTracerLicense                                           [OK]
Test: FlowTracerPermissions                                       [OK]
Test: GuiCustomization                                            [OK]
Test: HostPortConflicts                                           [ERROR]
Test: Installation                                                [OK]
Test: OldMakeDefault                                              [OK]
Test: Rhosts                                                      [OK]
Test: Rsh                                                         [OK]
Test: SecurityPermissions                                         [WARN]
Test: TaskerRoot                                                  [OK]
Test: WritableLocal                                               [WARN]
Test: WritableRegistry                                            [OK]
Test: vovrc                                                       [OK]
vovcheck: message: Detailed report available in /usr/tmp/vovcheck.report.15765

The Server Does Not Start

Recommended checkpoints:
  • Make sure you have a valid RLM license. Type:
    % rlmstat -a
  • Check if the server for your project is already running on the same machine. Do not start an Accelerator Plus project server more than once. For example, you can try:
    % vovproject enable project
    % vsi
  • Check if the server is trying to use a port number that is already used by another vovserver or by another application. VOV computes the port number in the range [6200, 6455] by hashing the project name. If necessary, select another project name, change the host, or set the environment variable VOV_PORT_NUMBER to a known unused port number. The best place to set this variable is in the setup.tcl file for the project.
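    For example, the variable can be set in setup.tcl. The following is a minimal sketch: the port number 6399 is a placeholder, and it assumes the setenv helper available in VOV Tcl configuration files.
    # Fragment of setup.tcl for the project.
    # 6399 is a placeholder; choose a port known to be unused at your site.
    setenv VOV_PORT_NUMBER 6399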
  • Check if the server is trying to use an inactive port number that cannot be bound. This can happen when an application, perhaps the server itself, terminates without closing all its sockets.
    The server will exit with a message similar to the following:
    ...more output from vovserver...
    vs52 Nov 02 17:34:55          0                3      /home/john/vov
    vs52 Nov 02 17:34:55 Adding licadm@venus to notification manager
    vs52 Nov 02 17:34:55 Socket address 6437 (net=6437)
    vs52 ERROR Nov 02 17:34:55 Binding TCP socket: retrying 3
    vs52 Nov 02 17:34:55 Forcing reuse...
    vs52 ERROR Nov 02 17:34:58 Binding TCP socket: retrying 2
    vs52 Nov 02 17:34:58 Forcing reuse...
    vs52 ERROR Nov 02 17:35:01 Binding TCP socket: retrying 1
    vs52 Nov 02 17:35:01 Forcing reuse...
    vs52 ERROR Nov 02 17:35:04 Binding TCP socket: retrying 0
    vs52 Nov 02 17:35:04 Forcing reuse...
    vs52 ERROR Nov 02 17:35:04 
    PROBLEM:  The TCP/IP port with address   6437   is already being used.
    
    POSSIBLE EXPLANATION:
            - A VOV server is already running   (please check)
            - The old server is dead but some
              of its old clients are still alive  (common)
            - Another application is using the
              address  (unlikely)
    
    ACTION: Do you want to force the reuse of the address?

    In this case:

    1. List all VOV processes that may be running on the server host and that may still be using the port. For example, you can use:
      % /usr/ucb/ps auxww | grep vov
      john   3732  0.2  1.5 2340 1876 pts/13   S 17:36:18  0:00 vovproxy -p acprose -f - -b
      john   3727  0.1  2.2 4816 2752 pts/13   S 17:36:16  0:01 vovsh -t /opt/rtda/latest/linux64/tcl/vtcl/vovresourced.tcl -p acprose
      ...
    2. You can wait for the process to die on its own, or you can kill it, for example with vovkill:
      % vovkill pid
    3. Restart the server.
  • Make sure you run the server as the Accelerator Plus administrator user. Check the ownership of the file security.tcl in the server configuration directory vwx.swd.
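    For example, on UNIX the ownership can be checked with a standard listing (the path below assumes the current directory contains vwx.swd):
    % ls -l vwx.swd/security.tcl
    The file should normally be owned by the Accelerator Plus administrator account.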

The UNIX Taskers Do Not Start

Accelerator Plus normally relies on remote shell execution, using either rsh or ssh, to start the taskers.
  • If using rsh, try the following:
    % rsh host vovarch
    where host is the name of a machine on which there are problems starting a tasker.

    This command should return a platform-dependent string (such as "linux") and nothing else. Otherwise, there are problems with either the remote execution permission or the shell start-up script.

  • If the error message is similar to "Permission denied", check the file .rhosts in your home directory. The file should contain a list of host names from which remote execution is allowed. You may have to work with your system administrators to find out if your network configuration allows remote execution.
  • If using ssh, perform the test above but use ssh instead of rsh. For more details about ssh, see SSH Setup.
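    For example, the equivalent test with ssh is the following; as with rsh, the only expected output is the platform string:
    % ssh host vovarch
    linux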
  • If you get extraneous output from the above command, the problem is probably in your shell start-up script. If you are a C-shell user, check your ~/.cshrc file. Following are guidelines for a remote-execution-friendly .cshrc file:
    • Echo messages only if the calling shell is interactive. You can test if a shell is interactive by checking the existence of the variable prompt, which is defined for interactive shells. Example:
      # Fragment of .cshrc file.
      if ( $?prompt ) then
          echo "I am interactive"
      endif
    • Many .cshrc scripts exit early if they detect a non-interactive shell. It is possible that the scripts exit before sourcing ~/.vovrc, which causes Accelerator Plus to not be available in non-interactive shells. Compare the following fragments of .cshrc files and make sure the code in your file works properly:
      The following example will not work properly, because a non-interactive shell exits before sourcing ~/.vovrc:
      if ( ! $?prompt ) exit
      source ~/.vovrc
      This example is correct; source ~/.vovrc first and then check the prompt variable:
      source ~/.vovrc
      if ( ! $?prompt ) exit
      This example is also correct:
      if ( $?prompt ) then
          # Define shell aliases
          ...
      endif
      source ~/.vovrc
    • Do not exec a sub-shell from the .cshrc file; this will cause the rsh command to hang.
      # Do not do this in a .cshrc file
      exec tcsh

License Violation

Accelerator Plus is licensed by restricting the number of tasker slots. This is the sum of the tasker slots from both elastic and statically defined taskers (if any).

You can find out the capacity of your license (wx_slots) with the following command:
% rlmstat -avail 

The file $VOVDIR/../../vnc/vwx.swd/taskers.tcl defines the list of static taskers that are managed by the server. Make sure the total number of tasker slots is within the license capacity.
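For reference, a hypothetical taskers.tcl fragment is sketched below. The host names and options are placeholders, and the sketch assumes the vtk_tasker_define command used in VOV tasker configuration files:

# Fragment of taskers.tcl: two static taskers (placeholder host names).
vtk_tasker_define host1
# A tasker with an explicit resource list and a load limit.
vtk_tasker_define host2 -resources "linux" -maxload 4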

Crash Recovery

In the event of a crash or failover, you can find a checklist of what to do at http://wx-host:wx-port/cgi/sysrecovery.cgi.

This address can be found with the command:
% wx cmd vovbrowser -url /cgi/sysrecovery.cgi

When Things Go Wrong

The following problems may occur after a major infrastructure event, a problem with submission scripts, or a change to the workload or to the configuration (such as Limits).

The most significant symptom is "stuck jobs"; the most likely cause is that the elastic daemon has stopped. Check the elastic daemon log.
Note: Both of the cases below are common Accelerator debugging scenarios.
If the log timestamp is not fresh, you will need to restart the daemon:
  • Save off the existing log file (and send that file to Altair for diagnosis).
  • Restart the daemon:
    % nc -f $WxQueueName cmd vovautostart
If the elastic daemon is running:
  • Check that tasker jobs are being submitted and that they are being executed by the base scheduler. To do this, connect to the correct Accelerator cluster through the web browser and locate the job set called vovelasticd. This set contains other sets, one for each Accelerator Plus session.
  • Locate the set for your Accelerator Plus session by looking at the set names.
  • If you see only cyan (scheduled) jobs, the problem is that the base scheduler cannot schedule these tasker jobs. You need to debug why these jobs are not being run by the underlying scheduler.
  • If the jobs are getting run but keep failing (turning red), debug and determine the reason for those failures.

When Jobs Are Not Running

You may discover that a tasker job is asking for an impossible resource. This is often due to an error in the resource requirements or in the configuration. You fix the problem, but Accelerator Plus continues to not run jobs. In this case, those "impossible" jobs are still seen by the base scheduler (and are therefore not runnable), but they are also seen by the elastic daemon, which assumes that these jobs are runnable and that no new jobs should be submitted: a live-lock scenario.

The recommendation here is to dequeue any queued jobs in the base scheduler after changing the problematic resource request. This lets the elastic daemon launch replacement jobs.
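For example, if the base scheduler is Accelerator, a sequence along the following lines can be used; the job ID is a placeholder for the IDs of the queued tasker jobs found with nc list:
% nc list
% nc stop 012345678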

A different scenario is tasker jobs being scheduled, dispatched, and run by the base scheduler while no jobs are executed inside the Accelerator Plus session. The symptom is a growing list of green (valid) nodes with a small number of orange (running) and cyan (scheduled) jobs. In this case, the base scheduler is dispatching the tasker jobs without a problem, but Accelerator Plus is not making use of them. The result: the taskers execute and wait for end-user jobs that never come, until they hit their maximum idle limit.

For this case, we recommend checking the Accelerator Plus session using either a vovconsole (wx -q $WxQueueName cmd vovconsole) or an Accelerator Plus monitor. If you see no activity in the LED bar (taskers connecting, waiting, and then terminating, often yellow-to-green-to-black), most likely there is a problem with the configuration: the taskers that the base scheduler is executing are not connecting back to the desired Accelerator Plus session.

For the configuration, verify that the taskers are being launched with the right parameters and check that the config files are correct:
  • If the LED monitor is active, then taskers are connecting and the failure to launch is most likely in Accelerator Plus. A common problem is that the job requests a limit or a special resource: the limit must be satisfied in both Accelerator Plus and the base scheduler. In this case, the elastic daemon tries to detect expandable limits such as Limit:foo_@USER@_N, sets the limit to unlimited in Accelerator Plus, and then passes the limit request on to the base scheduler, where it is honored. Occasionally, this process goes wrong.
  • If the limit does not exist or is set to 0 in Accelerator Plus, the job will not launch. In this state, the job appears to be queued (cyan), but no bucket is created for it (wx mon can help diagnose that case). To add the missing resources, edit resources.tcl. In general, set the resource value to "unlimited" in Accelerator Plus; the more restrictive value in the base scheduler will still be honored.
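    For example, a resources.tcl fragment along the following lines defines the missing limit. The name Limit:foo_john is a placeholder, and the sketch assumes the vtk_resourcemap_set command used in VOV resources.tcl files:
    # Fragment of resources.tcl (placeholder limit name).
    vtk_resourcemap_set Limit:foo_john unlimited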

What to Check When Jobs Are Not Being Submitted

Accelerator Plus reports the tasker as SICK, with a missing heartbeat. Check the base scheduler to see whether the tasker has been suspended. In this case, someone or something has suspended the tasker: both the tasker and its underlying job are suspended (T state, as with CTRL-Z). Find out why; it could be a user or a RAM sentry, and the tasker job may have property information telling you which. When resumed, the tasker becomes healthy (that is, no longer SICK) and continues normally within the base scheduler and Accelerator Plus.
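To confirm the suspension on the tasker host, standard UNIX tools are sufficient; the PID 12345 is a placeholder, and a process state of T indicates a stopped process:
% ps -o pid,stat,args -p 12345
A stopped tasker is normally resumed by resuming its job in the base scheduler; at the process level, a continue signal has the same effect:
% kill -CONT 12345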

If everything seems fine but jobs are not being submitted, check the base scheduler. If you find that the tasker jobs are not running for FairShare reasons, it is possible that a regression is using the same FairShare node and subgroup (either from Accelerator Plus or natively in the base scheduler). In this case, the jobs are treated on a first-come, first-served basis.

Avoid Suspending Accelerator Plus Taskers in the Base Scheduler

Elastic tasker jobs in the base scheduler should not be subjected to suspension events from users or the preemption system. Instead, Accelerator Plus supports a form of preemption called modulation, in which the base scheduler queue requests the Accelerator Plus tasker to terminate at the next Accelerator Plus job boundary.