Failover Server Candidates
If a server crashes suddenly, VOV has the capability to start a replacement server on a pre-selected host. This capability requires that the pre-selected host is configured as a failover server.
- Edit or create the file servercandidates.tcl in the server configuration directory. Use the vovserverdir command with the -p option to find the pathname to this file.
% vovserverdir -p servercandidates.tcl
/home/john/vov/myProject.swd/servercandidates.tcl
The servercandidates.tcl file should set the Tcl variable ServerCandidates to a list of possible failover hosts. This list may include the original host on which the server was started.
set ServerCandidates { host1 host2 host3 }
- Install the autostart/failover.sh script as follows:
% cd `vovserverdir -p .`
% mkdir autostart
% cp $VOVDIR/etc/autostart/failover.sh autostart/failover.sh
% chmod a+x autostart/failover.sh
- Activate the failover facility by running vovautostart.
% vovautostart
For example:
% vovtaskermgr show -taskergroups
ID         taskername   hostname  taskergroup
000404374  localhost-2  titanus   g1
000404375  localhost-1  titanus   g1
000404376  localhost-5  titanus   g1
000404377  localhost-3  titanus   g1
000404378  localhost-4  titanus   g1
000404391  failover     titanus   failover
Note: Each machine listed as a server candidate must be a vovtasker machine; the vovtasker running on that machine acts as its agent in selecting a new server host. Taskers can be configured as dedicated failover candidates that are not allowed to run jobs by using the -failover option in the tasker definition. Preventing jobs from running on the candidate machine removes the risk that demanding jobs affect the machine's stability. The -failover option also enables some failover configuration validation checks. Finally, failover taskers are started before the regular queue taskers, which helps ensure that a failover tasker is available as soon as possible for future failover events.
Refer to the tasker definition documentation for details on the -failover option.
How vovserver Failover Works
If the vovserver crashes, after a period of time, the vovtasker process on each machine notices that it has had no contact from the server, and it initiates a server machine election.
In this election, each vovtasker votes for itself (precisely, the host that this particular tasker runs on) as a server candidate. The election is conducted by running the script vovservsel.tcl.
When the interval during which the vovtaskers vote expires (60 seconds by default), the host that appears earliest on the list is selected to start a new vovserver.
set ServerCandidates {
host1
host2
host3
}
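The selection rule described above can be sketched in shell. This is only an illustration, not the logic of vovservsel.tcl itself: the function name pick_server_host is hypothetical, and the sketch assumes that a host is eligible only if its vovtasker cast a vote.

```shell
# Hypothetical helper (not part of VOV): print the first host in the
# ordered candidate list that also appears among the hosts that voted.
# Prints nothing and returns 1 if no candidate voted.
pick_server_host() {
    candidates=$1   # space-separated, in ServerCandidates order
    voters=$2       # space-separated hosts whose vovtasker voted
    for c in $candidates; do
        for v in $voters; do
            if [ "$c" = "$v" ]; then
                printf '%s\n' "$c"
                return 0
            fi
        done
    done
    return 1
}

# host1 has no vovtasker running, so it casts no vote.
pick_server_host "host1 host2 host3" "host3 host2"   # prints: host2
```

The order in which votes arrive does not matter in this sketch; only the position in the candidate list does.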
The vovtaskers that take part in the election are those running on the hosts in the ServerCandidates list. Usually, these vovtaskers have been defined with the -failover option, so they cannot accept any jobs, and are members of the failover taskergroup.
The failover vovserver reads the most recently saved PR file from the .swd/trace.db directory, and then reads the transactions from the 'cr*' (crash recovery) files to recover as much of the pre-crash state as possible.
The vovserver writes a new serverinfo.tcl file in the .swd directory, which vovtaskers read to determine the port and host. When it starts, the failover vovserver appends the new host and port information to $NC_CONFIG_DIR/<queue-name>.tcl as well as to setup.tcl in the server configuration directory. The vovserver then runs the scripts in the autostart directory. These should include the failover.sh script, which resets the failover directory so that failover can repeat. This script removes the registry entry, removes the server_election directory, and creates a new empty one. At the end, it calls vovproject reread to force the failover vovserver to create an updated registry entry.
- For Accelerator, Accelerator Plus, Monitor and Allocator, vovtaskers wait up to 4 days for a new server to start.
- For FlowTracer, vovtaskers wait up to 3 minutes for a new server to start.
After reconnecting to vovserver, vovtaskers automatically exit after all of their running jobs are completed. After the vovserver transitions from crash recovery mode to normal mode, it will try to restart any configured vovtaskers that are not yet running.
Automatic failover does not take place in any of the following situations:
- The filesystem holding the .swd directory is unavailable.
- The file servercandidates.tcl does not exist.
- The ServerCandidates list is empty.
- There is no vovtasker running on any host in the ServerCandidates list when the server crashes.
- The autostart/failover.sh script file is not in place.
In these cases, the failover server is not started automatically; the server has to be started manually.
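Some of the conditions above can be checked ahead of time. The following is a minimal sketch, assuming a POSIX shell and a servercandidates.tcl that contains a single "set ServerCandidates { ... }" statement without comments. The function name check_failover_config and its messages are hypothetical, not part of VOV. It does not check whether the filesystem stays available or whether a vovtasker is running on a candidate host.

```shell
# Hypothetical pre-flight check (not part of VOV).
# Usage: check_failover_config <server-config-dir>
# Returns 0 if the static failover files look usable, 1 otherwise.
check_failover_config() {
    dir=$1
    file="$dir/servercandidates.tcl"
    script="$dir/autostart/failover.sh"
    status=0

    if [ ! -f "$file" ]; then
        echo "missing: $file"
        status=1
    else
        # Flatten the file to one line, then keep the text between the braces.
        hosts=$(tr '\n' ' ' < "$file" |
            sed -n 's/.*set[[:space:]]*ServerCandidates[[:space:]]*{\([^}]*\)}.*/\1/p')
        # An empty or whitespace-only list means there are no candidates.
        if [ -z "$(printf '%s' "$hosts" | tr -d '[:space:]')" ]; then
            echo "no hosts found in ServerCandidates in $file"
            status=1
        fi
    fi

    # The install step makes the script executable, so check for that too.
    if [ ! -f "$script" ] || [ ! -x "$script" ]; then
        echo "missing or not executable: $script"
        status=1
    fi

    return $status
}
```

From within a project shell it could be invoked as check_failover_config "`vovserverdir -p .`", using the same vovserverdir call as the install step.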
Tips for Configuring Failover
- Make the first failover host the regular one. This way, if the vovserver dumps core or is killed by mistake, it will restart on the regular host.
- Configure special vovtaskers only for failover by passing the -failover option to vtk_tasker_define.
- Test that failover works before depending on it.
Migrating vovserver to a New Host
ncmgr rehost -host NEWHOST
The specified NEWHOST must be one of the hosts eligible for failover of vovserver.
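A guard along the following lines could be run before the rehost command. It is a sketch under one assumption: that "eligible for failover" at least means the host is named in servercandidates.tcl. A running vovtasker on that host may also be required, which this does not verify. The function name is_failover_candidate is hypothetical.

```shell
# Hypothetical guard (not part of VOV): succeed only if the given host
# is named in the ServerCandidates list of the given file.
# Usage: is_failover_candidate <host> <path-to-servercandidates.tcl>
is_failover_candidate() {
    host=$1
    file=$2
    [ -f "$file" ] || return 1
    # Flatten the file and extract the text between the braces.
    list=$(tr '\n' ' ' < "$file" |
        sed -n 's/.*ServerCandidates[[:space:]]*{\([^}]*\)}.*/\1/p')
    for h in $list; do
        if [ "$h" = "$host" ]; then
            return 0
        fi
    done
    return 1
}

# Example use, left as comments because it needs a live project:
#   NEWHOST=host2
#   if is_failover_candidate "$NEWHOST" "`vovserverdir -p servercandidates.tcl`"; then
#       ncmgr rehost -host "$NEWHOST"
#   fi
```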