Cold Upgrade

A cold upgrade is usually performed alongside a physical overhaul, as part of a major IT event. It terminates all running jobs, which can be highly disruptive.

To reduce the impact, Accelerator can be instructed to stop accepting and dispatching new jobs for a period prior to the shutdown event, enabling some of the running jobs to complete. How many jobs will complete depends on the jobs' duration and the time allowed before shutdown.

  1. Download the Accelerator upgrade software.
  2. Install the new software.
  3. Using the new version, create a separate, temporary test queue (to validate the new version while production continues).
  4. Validate the installation using the test queue that you created.
  5. Schedule and announce the upgrade.
  6. If you have multiple Accelerator queues, it is recommended that you set the NC_QUEUE environment variable to the name of the queue that is undergoing maintenance. This helps prevent accidentally shutting down the wrong queue. Use the command:
    setenv NC_QUEUE vncNameOfQueue
  7. Optional: Suspend the vovtaskers with the command below. This command puts the vovtaskers in the SUSP state; running jobs will continue, but the vovtaskers will not accept new jobs. When a vovtasker completes its current set of jobs, it will exit.
    nc cmd vovtaskermgr stop
  8. Optional: At the point of the scheduled downtime, record the IDs of the running jobs, because those jobs will be forcefully terminated. This list can be used to inform users that their jobs were terminated by the maintenance event.
    nc list -r -a -O @ID@ @USER@ @COMMAND@
  9. Optional: To automatically identify jobs when the queue is restarted, place the jobs in a special set.
    Example:
    nc cmd vovset create "ImpactedByQueueRestart" "isjob status==RETRACING"
  10. Optional: Terminate these running jobs with the -force option, which should terminate the remaining taskers within a few minutes.
    Example:
    vovtaskermgr stop -force -all
  11. Stop the vovserver of the queue with the command ncmgr stop.
  12. Proceed with any necessary infrastructure maintenance.
  13. Restart the queue.
  14. Ensure that your shell is configured to support the correct number of file descriptors.
    Note: This value cannot be changed after starting the queue.
  15. Ensure that you are pointing to the appropriate version of Accelerator with the command which nc.
  16. Start the queue with the command ncmgr start.
    A confirmation dialog will open.
  17. Review the parameters carefully (especially the number of file descriptors) before replying 'yes'. (Starting the queue automatically starts the taskers, but this takes some time; be patient.)
  18. Validate that jobs are dispatching normally.
    Note: Restarting a large compute farm will take several minutes.
  19. Optional: Re-queue the jobs that were impacted by the shutdown. Use the following command:
    nc rerun -f -set ImpactedByQueueRestart
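
The shutdown half of the procedure (steps 6 to 11) and the file-descriptor check in step 14 can be sketched as a small Bourne-shell helper. This is only an illustration under stated assumptions: the names DRY_RUN, run, begin_drain, final_shutdown and check_fd_limit are hypothetical and not part of Accelerator, the required descriptor count is a value you supply for your site, and `export` is used as the sh equivalent of the csh `setenv` shown in step 6. The Accelerator commands themselves are copied unchanged from the steps above. By default the helper only prints the commands, so the sequence can be reviewed before it is run for real.

```shell
#!/bin/sh
# Sketch only: wraps the shutdown commands from the procedure above.
# All function and variable names here are hypothetical helpers.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; 0 = also execute them

# Print a command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Steps 6-7: select the queue named in $1, then stop taskers from
# accepting new jobs so that running jobs can drain.
begin_drain() {
    NC_QUEUE="$1"
    export NC_QUEUE          # sh equivalent of: setenv NC_QUEUE <name>
    run nc cmd vovtaskermgr stop
}

# Steps 8-11: call this at the scheduled downtime, after the drain window.
final_shutdown() {
    run nc list -r -a -O @ID@ @USER@ @COMMAND@
    run nc cmd vovset create "ImpactedByQueueRestart" "isjob status==RETRACING"
    run vovtaskermgr stop -force -all
    run ncmgr stop
}

# Step 14: succeed only if this shell's file-descriptor limit is at
# least $1 (a site-specific number chosen by the administrator).
check_fd_limit() {
    required="$1"
    current="$(ulimit -n)"
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    [ "$current" -ge "$required" ]
}
```

After sourcing the file, calling `begin_drain vncNameOfQueue` with the default DRY_RUN=1 prints `+ nc cmd vovtaskermgr stop` without contacting the queue. Keeping the drain and the final shutdown in separate functions mirrors the procedure, in which the two are separated by the time allowed for running jobs to complete.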