Crash Recovery Mode

Crash Recovery Mode is activated the next time the server is restarted, if the server was not shut down cleanly. Crash Recovery Mode is part of the Failover Server capability of VOV, which is mainly used in Accelerator. This capability allows VOV to start a new server to manage the queue of a server which has crashed unexpectedly.

When you shut down a VOV Subsystem instance cleanly using the ncmgr stop command, the server saves its database to disk just before exiting. When the server is restarted, it reads the state of the trace from disk and is immediately ready for new work.

Sometimes the vovserver is stopped unexpectedly, for example by a hardware problem such as a machine crash or memory exhaustion. In such cases, the server has no chance to save the project database before terminating.

  • In VOV, the main concern is usually the state of the trace, which stores the status of all the jobs in your project.
  • In Accelerator, there is no trace, and the important thing to preserve is the state of the queue, so that jobs neither lose their position nor need to be re-queued after a server crash.

Journal Files

The vovserver keeps crash recovery journal files of the events that affect the state of the server. These 'CR' files are flushed whenever the trace data are saved to disk. During crash recovery, the vovserver first reads the last saved state of the trace from the disk data, then applies the events from the CR files.
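
The recovery order described above (load the last saved state, then apply the journaled events) can be illustrated with a minimal sketch. This is not the vovserver's actual file format or code: the event tuples, the job names, and the recover function are hypothetical and only show the general snapshot-plus-journal idea.

```python
# Hypothetical illustration of snapshot-plus-journal recovery.
# Event layout and names are assumptions, not the real CR file format.

def recover(snapshot, journal):
    """Rebuild state from the last saved snapshot, then apply journal events in order."""
    state = dict(snapshot)  # copy so the saved snapshot is left untouched
    for op, job_id, value in journal:
        if op == "set":
            state[job_id] = value      # job created or its status changed
        elif op == "forget":
            state.pop(job_id, None)    # job removed after the snapshot was taken
    return state

# State as of the last save to disk.
snapshot = {"job1": "DONE", "job2": "QUEUED"}

# Events journaled after that save, before the crash.
journal = [
    ("set", "job2", "RUNNING"),
    ("set", "job3", "QUEUED"),
    ("forget", "job1", None),
]

recovered = recover(snapshot, journal)
print(recovered)
```

Because the journal only needs to cover events since the last save, it can be emptied each time the full state is written to disk, which matches the flushing behavior described above.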

Crash Recovery Restart

When the server is next restarted after such a crash, the server enters what is called Crash Recovery Mode, which usually lasts about two minutes, but may take longer if the CR files are very large. During this period:
  • The server waits for vovtaskers with running jobs to reconnect.
  • No jobs are dispatched.
  • The server does not accept VOV or HTTP connections from vovsh or browser clients.
  • At the end, the server performs a global sanity check.
  • A crash_recovery_report <timestamp> logfile is written. It logs any jobs lost by the crash recovery sequence.

If the server is shut down properly, it does not enter Crash Recovery Mode the next time it restarts, and it is functional immediately.

Note: If the sanity check command cannot connect, the server is still recovering its state from the DB and CR files. Check the size of the vnc.swd/trace.db/CR* files; recovery takes longer when these files are very large.

Example commands:
% vovproject enable <project-name>
% vovproject sanity
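
To check the CR file sizes mentioned in the note, a small script can total them. The following is a minimal sketch, assuming it is run from the directory that contains vnc.swd and that the journal files match the vnc.swd/trace.db/CR* pattern given above; the function name total_cr_bytes is hypothetical.

```python
import glob
import os

def total_cr_bytes(pattern="vnc.swd/trace.db/CR*"):
    """Return (total size in bytes, sorted list of files) for paths matching pattern."""
    files = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    return sum(os.path.getsize(p) for p in files), files

# Report how much journal data is on disk; prints zero files if none match.
size, files = total_cr_bytes()
print(f"{len(files)} CR file(s), {size} bytes total")
```

A large total suggests that the server may need more than the usual two minutes before it accepts connections again.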