Preemption Rules to Speed Up FairShare
Every node in the FairShare tree represents a FairShare group. Each job belongs to one and only one FairShare group. Every node in the FairShare tree is assigned a target share, which depends on both the weights assigned to the nodes in the tree and on the activity of the nodes. A FairShare node is considered active if it has at least one job that is queued, running or suspended. All nodes that are not active are assigned a FairShare target of zero.
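As a rough illustration of how activity affects targets, the sketch below splits a share among sibling nodes at a single level. It assumes a simple proportional split by weight among the active nodes, which is an assumption for illustration and not a statement of the scheduler's actual algorithm. The function name `sibling_targets`, the weights, and the node `/class/idle` are hypothetical.

```python
def sibling_targets(weights, active):
    """Split a share of 1.0 among sibling nodes at one level of the tree.

    Assumption (not from the product documentation): active nodes split the
    share in proportion to their weights. Inactive nodes get a target of zero,
    as described in the text above.
    """
    total = sum(w for name, w in weights.items() if name in active)
    return {
        name: (w / total if name in active and total > 0 else 0.0)
        for name, w in weights.items()
    }


# Hypothetical weights; /class/idle has no queued, running or suspended jobs.
weights = {"/class/hsim": 100, "/class/vcs": 100, "/class/idle": 200}
targets = sibling_targets(weights, active={"/class/hsim", "/class/vcs"})
for name, target in targets.items():
    print(f"{name}: {target:.2f}")
```

Note how the inactive node receives no share even though it has the largest weight.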
The target share of a FairShare node is accessible by the field FS_TARGET for any job that belongs to that FairShare node. The FairShare target is a fractional number less than 1.0, but the FS_TARGET field is an integer in the range from 0 to 10,000 obtained by scaling up the FairShare target by 10,000. For example, a FS_TARGET of 8000 indicates that the FairShare node has a target share of 80% (= 0.8).
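The scaling can be sketched as a pair of conversion helpers. The helper names and the use of rounding to obtain the integer are assumptions made for illustration; the text only states that the target is scaled up by 10,000.

```python
FS_SCALE = 10_000  # scaling factor stated in the text


def to_fs_field(fraction):
    """Convert a fractional share (e.g. 0.8) to the scaled integer form.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return round(fraction * FS_SCALE)


def from_fs_field(value):
    """Convert a scaled integer field (e.g. FS_TARGET) back to a fraction."""
    return value / FS_SCALE


print(from_fs_field(8000))  # 0.8
print(to_fs_field(0.8))     # 8000
```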
FS_RUNNING represents the fraction of running jobs in a FairShare group, relative to all running jobs in the system. This field is also scaled up by a factor of 10,000. The difference between FS_RUNNING and FS_TARGET is FS_EXCESS_RUNNING.
FS_EXCESS_RUNNING := FS_RUNNING - FS_TARGET
This measures how much a group is above or below its target. A positive value of FS_EXCESS_RUNNING means that the FairShare group is running more jobs than it should.
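A small worked example of the definition above, using made-up job counts. The function names are hypothetical, and rounding the scaled fraction to an integer is an assumption of this sketch.

```python
FS_SCALE = 10_000  # scaling factor stated in the text


def fs_running(group_running, total_running):
    """Scaled fraction of the system's running jobs that belong to the group."""
    if total_running == 0:
        return 0
    return round(FS_SCALE * group_running / total_running)


def fs_excess_running(group_running, total_running, fs_target):
    """FS_EXCESS_RUNNING := FS_RUNNING - FS_TARGET (all values scaled)."""
    return fs_running(group_running, total_running) - fs_target


# Hypothetical group: runs 6 of the 10 running jobs, with a 40% target.
excess = fs_excess_running(group_running=6, total_running=10, fs_target=4000)
print(excess)  # 2000, so the group is 20 percentage points above its target
```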
The field FS_RUNNING_COUNT is the number of jobs a FairShare group is running.
FS_HISTORY represents the fraction of jobs that have been run by a FairShare group in the FairShare window (typically 2 hours), relative to all jobs that have been run in the system during that window. The difference between FS_HISTORY and FS_TARGET is FS_EXCESS_HISTORY, which is similar to FS_EXCESS_RUNNING explained above.
FS_EXCESS_HISTORY := FS_HISTORY - FS_TARGET
The field FS_RANK is computed by the scheduler and assigned to each FairShare group that has jobs in the queue. Jobs are dispatched to taskers in ascending order of rank, starting from the group of rank zero (0). Groups that have no jobs in the queue are assigned the conventional rank -1. For FairShare and preemption to work harmoniously, the rank of the preempted job must be greater than the rank of the preempting job, which is why the preemption rules should contain a term of the form FSRANK>@FSRANK@ in the -preemptable option. Since you also want to allow the preemption of jobs that are running in a group that has no queued jobs, use the field FS_RANK9, which is the same as FS_RANK except that its value for groups with no queued jobs is 9,999,999 instead of -1. This makes for an easier comparison in the preemptable rule: FSRANK9>@FSRANK9@.
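The reason for preferring FS_RANK9 can be seen in a short sketch. The function names below are hypothetical; only the two conventional values (-1 and 9,999,999) come from the text.

```python
NO_QUEUED_JOBS_RANK = -1          # conventional FS_RANK for groups with no queued jobs
NO_QUEUED_JOBS_RANK9 = 9_999_999  # corresponding FS_RANK9 value


def fs_rank9(fs_rank):
    """Map FS_RANK to FS_RANK9 as described in the text."""
    return NO_QUEUED_JOBS_RANK9 if fs_rank == NO_QUEUED_JOBS_RANK else fs_rank


def rank_allows_preemption(preemptable_rank, preempting_rank, use_rank9=True):
    """Evaluate the rank term of the -preemptable option.

    With use_rank9=False this mirrors FSRANK>@FSRANK@;
    with use_rank9=True it mirrors FSRANK9>@FSRANK9@.
    """
    if use_rank9:
        return fs_rank9(preemptable_rank) > fs_rank9(preempting_rank)
    return preemptable_rank > preempting_rank


# A running job whose group has no queued jobs (rank -1) versus a
# preempting job from a group of rank 2.
print(rank_allows_preemption(-1, 2, use_rank9=False))  # False: -1 > 2 fails
print(rank_allows_preemption(-1, 2, use_rank9=True))   # True: 9999999 > 2
```

With the plain rank, a group that has nothing queued would never qualify as preemptable, because -1 compares lower than every real rank.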
The Special Field FS_EXCESS_RUNNING_LOCAL
The picture below illustrates the difference between the FS_EXCESS_RUNNING field and FS_EXCESS_RUNNING_LOCAL. While the first considers the total number of running jobs in the system, the second only considers the balance of running jobs at each local level. In the pictures, the nodes of interest are /class/hsim and /class/vcs.
/class/vcs has a total of 4 running jobs and 2 children, with user u1 running 3 jobs and user u5 running 1. Assuming that all weights are the same in all branches, the target share for /class/vcs.u1 and /class/vcs.u5 is exactly the same. FS_EXCESS_RUNNING is negative for both nodes because the node /class/hsim has a large proportion of the running jobs. In this scenario, a preemption rule based on FS_EXCESS_RUNNING, as shown below, will not fire:
VovPreemptRule -rulename RuleThatDoesNotFire \
-preempting "JOBCLASS==vcs FS_EXCESS_RUNNING<0" \
-preemptable "JOBCLASS==vcs FS_EXCESS_RUNNING>0 FSRANK9>@FSRANK9@" \
-pool fastfairshare -ruletype FAST_FAIRSHARE

Looking locally at /class/vcs, however, it is apparent that the distribution of jobs is not balanced. To use preemption to speed up the achievement of balance, the FS_EXCESS_RUNNING_LOCAL field can be used as follows:
VovPreemptRule -rulename RuleThatFires \
-preempting "JOBCLASS==vcs FS_EXCESS_RUNNING_LOCAL<0" \
-preemptable "JOBCLASS==vcs FS_EXCESS_RUNNING_LOCAL>0 FSRANK9>@FSRANK9@" \
-pool fastfairshare -ruletype FAST_FAIRSHARE
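The scenario can be made concrete with numbers. The sketch below assumes that /class/hsim is running 12 jobs (the text only says "a large proportion"), that /class has exactly these two children, and that the local value compares a node only with its siblings under the same parent. All three are assumptions for illustration, not the scheduler's documented computation.

```python
FS_SCALE = 10_000  # scaling factor stated in the text


def scaled(fraction):
    """Scale a fraction to the integer form used by the FS_ fields."""
    return round(fraction * FS_SCALE)


hsim_running = 12                  # assumed count for /class/hsim
vcs_running = {"u1": 3, "u5": 1}   # counts from the text

total_running = hsim_running + sum(vcs_running.values())  # 16 jobs in the system
vcs_total = sum(vcs_running.values())                     # 4 jobs under /class/vcs

# Equal weights everywhere: /class/vcs gets half of the system, and each of
# its two children gets half of that globally, or half of the local level.
global_target = 0.5 / len(vcs_running)   # 0.25
local_target = 1.0 / len(vcs_running)    # 0.50

report = {}
for user, running in vcs_running.items():
    excess_global = scaled(running / total_running) - scaled(global_target)
    excess_local = scaled(running / vcs_total) - scaled(local_target)
    report[user] = (excess_global, excess_local)
    print(f"/class/vcs.{user}: FS_EXCESS_RUNNING={excess_global} "
          f"FS_EXCESS_RUNNING_LOCAL={excess_local}")
```

Under these assumptions both users have a negative FS_EXCESS_RUNNING, so no job matches the -preemptable condition of the first rule, while FS_EXCESS_RUNNING_LOCAL is positive for u1 and negative for u5, which is the imbalance the second rule acts on.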
Practical FairShare Driven Preemption
- FS_EXCESS_RUNNING_LOCAL
- FS_RANK and FS_RANK9
- FS_RUNNING_COUNT
Other fields are described in Node Fields; those fields also begin with FS_.
Preemption Based on FairShare
Preemption can be used to accelerate the FairShare mechanism: instead of waiting for a job to finish and a slot to open up, the preemption daemon can detect imbalances in the FairShare and preempt a job of a group that has an excess share in favor of another group that has a deficit in its share.
# Copyright (c) 1995-2020, Altair Engineering
# All Rights Reserved.
# $Id: $
# Use of preemption to speed up fairshare.
#
# We assume a workload organized in jobclasses where each jobclass
# has its own fairshare node called /class/$JOBCLASS.
#
# Preemption is triggered if there is a job that locally has a
# deficit in the number of running jobs (FSEXCESSRUNNINGLOCAL<0) and
# has been waiting for at least 10 seconds.
# Also, if a fairshare group already has at least 4 jobs running, do not preempt.
#
# We do preemption within the same jobclass (JOBCLASS==@JOBCLASS@)
# and we target the groups that have an excessive share of running jobs
# (FSEXCESSRUNNINGLOCAL>0) and also a higher rank (FSRANK9>@FSRANK9@).
# We use FSRANK9 instead of FSRANK to simplify the comparison of the
# ranks to include groups that have no rank.
# We do not want to preempt if the group has only one running job (FS_RUNNING_COUNT>1).
# We also consider priority (PRIORITY<=@PRIORITY@) to avoid preempting a job of higher priority.
#
VovPreemptRule -rulename FastFairshare \
-preempting "FSEXCESS<0 GROUP~/class FS_RUNNING_COUNT<=3" \
-bucketage 10 \
-preemptable "FSEXCESS>0 JOBCLASS==@JOBCLASS@ FS_RUNNING_COUNT>1 FSRANK9>@FSRANK9@ PRIORITY<=@PRIORITY@" \
-killage 2m \
-pool FastFairshare \
-ruletype FAST_FAIRSHARE
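To make the logic of the rule easier to follow, the sketch below restates its conditions as a plain predicate over two job records. It is only a reading aid, not how the preemption daemon is implemented: the dictionary keys and the function name are hypothetical, `GROUP~/class` is approximated by a substring test, the 10-second wait stands in for `-bucketage 10`, and `-killage`, `-pool` and `-ruletype` are not modeled. Terms of the form `@FIELD@` are read as the value of that field on the preempting job.

```python
def can_preempt(waiting, running, min_wait_s=10):
    """Return True if `running` may be preempted in favor of `waiting`.

    Mirrors the -preempting and -preemptable conditions of the FastFairshare
    rule. Field names are hypothetical dictionary keys for this sketch.
    """
    preempting_ok = (
        waiting["fs_excess"] < 0                    # FSEXCESS<0
        and "/class" in waiting["group"]            # GROUP~/class (approximation)
        and waiting["fs_running_count"] <= 3        # FS_RUNNING_COUNT<=3
        and waiting["waited_s"] >= min_wait_s       # waited at least 10 seconds
    )
    preemptable_ok = (
        running["fs_excess"] > 0                              # FSEXCESS>0
        and running["jobclass"] == waiting["jobclass"]        # JOBCLASS==@JOBCLASS@
        and running["fs_running_count"] > 1                   # FS_RUNNING_COUNT>1
        and running["fs_rank9"] > waiting["fs_rank9"]         # FSRANK9>@FSRANK9@
        and running["priority"] <= waiting["priority"]        # PRIORITY<=@PRIORITY@
    )
    return preempting_ok and preemptable_ok


# Hypothetical jobs: u5 is below its share and waiting, u1 is above its share.
waiting = {"group": "/class/vcs.u5", "jobclass": "vcs", "fs_excess": -2500,
           "fs_running_count": 1, "fs_rank9": 0, "priority": 4, "waited_s": 15}
running = {"group": "/class/vcs.u1", "jobclass": "vcs", "fs_excess": 2500,
           "fs_running_count": 3, "fs_rank9": 9_999_999, "priority": 4}

print(can_preempt(waiting, running))  # True
```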