Rerun Jobs

The nc rerun command schedules and executes jobs that are already in the server database.

By default, this command affects only jobs that are idle or queued. To force a rerun of jobs that are complete or failed, use the -F option.

Examples of rerunning jobs:
% nc rerun 2345 2355
nc: message: Scheduled jobs: 1 Total estimated time: 1m13s
nc: message: Scheduled jobs: 1 Total estimated time: 58s
% nc rerun -p 8 2345
nc: message: Scheduled jobs: 1 Total estimated time: 1m13s
% nc rerun -dir .
nc: message: Scheduled jobs: 44       Total estimated time: 20m23s
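
The default selection rule described above can be modelled in a few lines. The following Python sketch is illustrative only: the Job class, the status strings, and select_for_rerun are hypothetical names, not part of nc, and skipping any status other than the four named in the text is an assumption.

```python
from dataclasses import dataclass

# Hypothetical status names, taken from the wording of this page.
DEFAULT_STATUSES = {"idle", "queued"}
FORCED_STATUSES = DEFAULT_STATUSES | {"complete", "failed"}


@dataclass
class Job:
    job_id: int
    status: str


def select_for_rerun(jobs, force=False):
    """Return the IDs of the jobs a rerun request would affect.

    Without force, only idle or queued jobs are selected; with force
    (modelling the -F option), complete and failed jobs are selected too.
    Statuses not named in the text are skipped (an assumption).
    """
    allowed = FORCED_STATUSES if force else DEFAULT_STATUSES
    return [job.job_id for job in jobs if job.status in allowed]


jobs = [Job(2345, "failed"), Job(2355, "queued"), Job(2360, "complete")]
print(select_for_rerun(jobs))              # default behavior
print(select_for_rerun(jobs, force=True))  # models -F
```

In this model, the failed and complete jobs are picked up only when force is set, which mirrors why -F is needed to rerun them.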

nc rerun

Rerun jobs already in the system.


nc: Usage Message

 NC RERUN:
        Rerun jobs already in the system
 USAGE:
        % nc rerun [OPTIONS] jobId ...

 OPTIONS:
        -dir <directory>   -- Rerun all jobs in the given directory.
        -f                 -- Same as -F (backwards compatibility).
        -F                 -- Force to rerun all specified jobs.
        -first             -- Put job at top of its bucket.
        -force             -- Same as -F.
        -h                 -- Help usage message.
        -J   <jobname>     -- Rerun all jobs with the given jobname.
        -mine              -- Rerun all my jobs.
        -p   <priority>    -- Set scheduling priority.
        -after <time>      -- Rerun job after specified time.
        -r <aux_resources> -- Additional resources to be used.
        -set <setname>     -- Rerun all jobs in the given set.
        -v                 -- Increase verbosity level.

 OBSOLETE OPTIONS:
        -all               -- Rerun all jobs.
    

Automatic Rerunning of Failed Jobs

If a job fails quickly, a faulty tasker machine may be to blame. On a faulty machine, many jobs can start and fail in quick succession, potentially emptying the job queue and requiring those jobs to be rescheduled manually.

The main method of protecting against this scenario is Black Hole Detection. If for any reason this method of protection does not provide adequate coverage, a secondary method of protection is provided: auto-rescheduling.

When a job fails within a configurable time threshold, this feature instructs the scheduler to rerun it on a host or tasker (depending on the autoRescheduleOnNewHost parameter) other than the one on which it failed.
Note: This feature may cause unwanted results when legitimate job durations are shorter than the configured threshold. For this reason, the feature is disabled by default.

To enable auto-rescheduling, set the threshold to a non-zero value with config(autoRescheduleThreshold) in the policy.tcl file. The value is a timespec; for example, 3s means 3 seconds. Typical values for this parameter are 2 to 4 seconds.
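
The threshold check can be pictured with a small Python model. This is a sketch, not the scheduler's code: qualifies_for_auto_reschedule is a hypothetical name, times are plain seconds rather than timespecs, and treating a run time exactly equal to the threshold as qualifying is an assumption.

```python
def qualifies_for_auto_reschedule(run_seconds, threshold_seconds):
    """Model the threshold check for a job that has just failed.

    A threshold of 0 disables the feature. Treating a run time exactly
    equal to the threshold as qualifying is an assumption.
    """
    if threshold_seconds <= 0:
        return False
    return run_seconds <= threshold_seconds


# With a 2-second threshold, a job that fails after 1 second qualifies,
# while a job that fails after 30 seconds does not.
print(qualifies_for_auto_reschedule(1, 2))
print(qualifies_for_auto_reschedule(30, 2))
# A threshold of 0 leaves the feature disabled.
print(qualifies_for_auto_reschedule(1, 0))
```

The model also shows the risk mentioned in the note: a legitimately short job that fails for its own reasons looks the same as a job that failed because of a bad machine.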

By default, the job is rerun on a different host, but the scheduler can be configured to avoid only the failing tasker instead, via the config(autoRescheduleOnNewHost) setting. The default value is 1, which avoids the entire host. Set the value to 0 to avoid only the tasker. Normally, only one tasker runs on a given host, but in some scenarios multiple taskers may be present on the same host. With a value of 0 in that situation, the job may be routed to another tasker on the same physical machine as before. If the problem that caused the failure was temporary, the job may succeed. If the problem still exists, the job will likely fail again and be auto-rescheduled to another tasker in the pool.
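
The difference between the two settings can be sketched as a filter over the tasker pool. The Tasker class, eligible_taskers, and the tasker names below are hypothetical and exist only for illustration; the real scheduler applies the exclusion through the job's resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tasker:
    name: str
    host: str


def eligible_taskers(taskers, failed_on, on_new_host=1):
    """Return the taskers a rescheduled job may still use.

    on_new_host=1 excludes every tasker on the failed tasker's host;
    on_new_host=0 excludes only the failed tasker itself.
    """
    if on_new_host:
        return [t for t in taskers if t.host != failed_on.host]
    return [t for t in taskers if t.name != failed_on.name]


pool = [
    Tasker("t1", "broken-host"),
    Tasker("t2", "broken-host"),  # second tasker on the same machine
    Tasker("t3", "jupiter"),
]
failed = pool[0]
print([t.name for t in eligible_taskers(pool, failed, on_new_host=1)])
print([t.name for t in eligible_taskers(pool, failed, on_new_host=0)])
```

With the value 0, the second tasker on broken-host stays eligible, which is how a job can land on the same physical machine again.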

Before a job is automatically resubmitted, the negated name of the host or tasker where the job previously failed is appended to the job's requested resources. The job will not run again on a host or tasker where it has failed.

For example, if the job requests the resources linux hsim and fails on a host named broken-host, it will be resubmitted with the resources:
    linux hsim !broken-host
If the job fails again quickly on a host named jupiter, the resources will become:
    linux hsim !broken-host !jupiter
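
The way the resource string grows after each quick failure can be reproduced with a short Python sketch. The function exclude_failed_location is a hypothetical helper, not a scheduler API, and skipping a name that is already negated is an assumption.

```python
def exclude_failed_location(resources, failed_name):
    """Append the negated host or tasker name to a resource string.

    Skipping a name that is already negated is an assumption, made so
    that repeated calls do not duplicate entries.
    """
    negated = "!" + failed_name
    tokens = resources.split()
    if negated not in tokens:
        tokens.append(negated)
    return " ".join(tokens)


resources = "linux hsim"
resources = exclude_failed_location(resources, "broken-host")
print(resources)  # linux hsim !broken-host
resources = exclude_failed_location(resources, "jupiter")
print(resources)  # linux hsim !broken-host !jupiter
```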

Automatic resubmission is allowed a limited number of times, between 0 and 10. The limit is controlled by the following parameters, in the order shown:
  1. The property MAX_RESCHEDULE attached to the job (if any);
  2. The property MAX_RESCHEDULE attached to the jobclass set. This property can be set with vtk_jobclass_set_max_reschedule;
  3. The parameter autoRescheduleCount (default value 4), which can be set in policy.tcl. An example follows:
    # Fragment of policy.tcl
    set config(autoRescheduleCount)      4
    set config(autoRescheduleThreshold)  2s
    set config(autoRescheduleOnNewHost)  1
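
The lookup order for the resubmission limit can be modelled as follows. This Python sketch reads "in the order shown" as "the first source that is set wins", which is an assumption, as is clamping the result to the 0 to 10 range. The names effective_max_reschedule and may_resubmit are hypothetical.

```python
def effective_max_reschedule(job_value=None, jobclass_value=None, config_count=4):
    """Pick the resubmission limit from the three sources, in order.

    The first source that is set (not None) wins. Clamping the result
    to the 0..10 range is an assumption based on the stated limits.
    """
    for value in (job_value, jobclass_value, config_count):
        if value is not None:
            return max(0, min(10, int(value)))
    return 0


def may_resubmit(times_rescheduled, limit):
    """True while the job has been resubmitted fewer times than the limit."""
    return times_rescheduled < limit


# No properties set: the policy.tcl value (4 in the fragment above) applies.
print(effective_max_reschedule())
# A job-level MAX_RESCHEDULE is consulted before the jobclass set property.
print(effective_max_reschedule(job_value=2, jobclass_value=6))
print(may_resubmit(times_rescheduled=4, limit=4))
```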