FairShare

FairShare allocates CPU cycles among groups and users according to policies defined by the administrators. FairShare is the dominant criterion in the scheduler, more important than job priorities. The fairness of CPU time allocation is computed using a multi-level FairShare tree. Each node in the tree is called a FairShare Group (abbreviated fsgroup) and is characterized by a name, a weight, and a time window.

Each job belongs to one and only one fsgroup. The contribution of each job to the FairShare mechanism is controlled by the parameter fstokens, which is 1 by default. If fstokens is 0, then the job will not contribute anything to the actual share (this is only rarely useful). If fstokens is 2, then the job contributes twice as much as a regular job with fstokens set to 1.

An fsgroup is "active" if some of its jobs are either running or queued, or, recursively, if any of its children is active.
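
The recursive definition of "active" can be sketched in a few lines of Python. The FsGroup class and its fields are hypothetical, introduced only for illustration; they are not part of the product's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FsGroup:
    """Hypothetical model of one node in the FairShare tree."""
    name: str
    running: int = 0   # running jobs attached directly to this node
    queued: int = 0    # queued jobs attached directly to this node
    children: List["FsGroup"] = field(default_factory=list)


def is_active(group: FsGroup) -> bool:
    """A node is active if it has running or queued jobs, or any active child."""
    if group.running > 0 or group.queued > 0:
        return True
    return any(is_active(child) for child in group.children)


# Only joe has jobs, so /time/users and /time are active but /time/regr is not.
joe = FsGroup("/time/users.joe", queued=3)
users = FsGroup("/time/users", children=[joe])
regr = FsGroup("/time/regr")
root = FsGroup("/time", children=[users, regr])
```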

The FairShare tree can be surprisingly large. Some organizations have more than 16,000 nodes in the tree, on account of the large number of projects and users, although typically fewer than 100 fsgroups are active at any one time.

The goal of the FairShare algorithm is to allocate resources to the active fsgroups so that the actual share of resources is as close as possible to the target share defined by the weights. To be clear, the fsgroups that are not active are not considered in the FairShare algorithm.
  • The target share is computed from the fsgroup weights and allocated to all active fsgroups. The inactive fsgroups get a target of 0%.
  • The actual share is computed from the contribution of each job, weighted by its fstokens parameter. It consists of two components: the "running actual share", based on the number of jobs currently running, and the "historical actual share", based on the overlap of each job's execution time with the time window.
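
As a rough illustration of the historical component, the sketch below sums the overlap of each job with the time window, weighted by fstokens, and then normalizes across groups. The function names, the job tuple layout, and the normalization into fractions are assumptions made for this example, not the scheduler's actual implementation.

```python
from typing import Dict, List, Optional, Tuple

# (start, end, fstokens) in seconds; end is None for a job that is still running.
Job = Tuple[float, Optional[float], float]


def historical_contribution(jobs: List[Job], now: float, window: float) -> float:
    """Sum of fstokens-weighted overlap between each job and [now - window, now]."""
    window_start = now - window
    total = 0.0
    for start, end, fstokens in jobs:
        stop = now if end is None else min(end, now)
        overlap = max(0.0, stop - max(start, window_start))
        total += overlap * fstokens
    return total


def actual_shares(jobs_by_group: Dict[str, List[Job]], now: float, window: float) -> Dict[str, float]:
    """Normalize per-group contributions into fractions (an assumed convention)."""
    contrib = {g: historical_contribution(j, now, window) for g, j in jobs_by_group.items()}
    total = sum(contrib.values())
    return {g: (c / total if total > 0 else 0.0) for g, c in contrib.items()}


now, window = 10_000.0, 7_200.0  # the window is the 2-hour default, in seconds
jobs = {
    "/time/users": [(9_000.0, None, 1.0)],    # running for 1000 s, fstokens 1
    "/time/regr": [(1_000.0, 3_800.0, 2.0)],  # 1000 s inside the window, fstokens 2
}
shares = actual_shares(jobs, now, window)
```

In this example both groups have 1000 seconds of execution inside the window, but the /time/regr job counts double because its fstokens is 2, so that group ends up with two thirds of the historical share.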

Each active fsgroup is assigned a 'rank' computed from the difference between its target share and actual share. That rank is then assigned implicitly to all jobs that belong to that fsgroup. The fsgroup with the highest deficit gets a rank of 0, while the fsgroup with the largest excess share gets a large rank (how large depends on the current number of active fsgroups). The rank determines which jobs are preferred for dispatch: the scheduler first considers dispatching the jobs that have lower rank, i.e. jobs from fsgroups under their target share. If those jobs cannot be dispatched because of other constraints, such as RAM or limits, the scheduler then considers jobs with higher rank.
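
The ordering described above can be sketched as follows. This is a simplified illustration that uses a single deficit (target minus actual) per group; the group names and share values are made up, and the real scheduler combines a running and a historical distance.

```python
from typing import Dict


def assign_ranks(target: Dict[str, float], actual: Dict[str, float]) -> Dict[str, int]:
    """Rank 0 goes to the group with the largest deficit (target - actual)."""
    deficit = {g: target[g] - actual.get(g, 0.0) for g in target}
    ordered = sorted(deficit, key=lambda g: deficit[g], reverse=True)
    return {g: rank for rank, g in enumerate(ordered)}


# Hypothetical shares for three active fsgroups.
target = {"/proj/a": 0.50, "/proj/b": 0.30, "/proj/c": 0.20}
actual = {"/proj/a": 0.60, "/proj/b": 0.35, "/proj/c": 0.05}
ranks = assign_ranks(target, actual)
```

Here /proj/c is furthest below its target and gets rank 0, so its jobs are considered first; /proj/a has the largest excess and gets the highest rank.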

The default FairShare window is 2 hours, meaning that only the time interval starting 2 hours before the present is considered. Normally all nodes in the FairShare tree have the same time window, but that is not a requirement. In particular, setting the time window of a node to zero disables FairShare for the nodes under that node: all of its children are given the same rank.

Selecting the appropriate window size is a balance between responsiveness and accuracy. A wide time window requires more computation than a narrow window. The average job length and overall daily workload should be taken into account when selecting an appropriate window size.

As a rule of thumb, if your workload is small, i.e. under 100,000 jobs per day, you do not need to worry about the FairShare window. If your workload exceeds 100,000 jobs per day, consider a shorter time window, such as 10 minutes. Workloads of millions of jobs per day can benefit from a time window of 2 minutes. Also relevant is how frequently the actual shares are updated, which is controlled by the parameter fairshare.updatePeriod. The default value of this parameter is 0, meaning that the FairShare data is updated as frequently as needed, possibly multiple times a second. For large workloads it may be a good idea to set that parameter to 3 or 5 seconds.
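
The rule of thumb can be written down as a small helper. The function name is hypothetical, and the exact cut-off of one million jobs per day for the 2-minute window is an assumption, since the text only says "millions".

```python
def suggested_window_seconds(jobs_per_day: int) -> int:
    """Return a FairShare window, in seconds, following the rule of thumb above."""
    if jobs_per_day >= 1_000_000:   # assumed threshold for "millions of jobs per day"
        return 2 * 60               # 2 minutes
    if jobs_per_day > 100_000:
        return 10 * 60              # 10 minutes
    return 2 * 60 * 60              # keep the 2-hour default
```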

FairShare Rank

An fsgroup's rank ranges from zero upward. Jobs from fsgroups ranked closest to zero are preferred for dispatch. The rank is computed by ordering the fsgroups by a combined 'distance' between their target share and actual share. This distance has a 'running' component and a 'historical' component, based on jobs in the window. The vovserver configuration parameters fairshare.relative and fairshare.relative_alpha control the influence of the historical versus running distance on the actual rank. Refer to Server Configuration for details.

FairShare features:
  • Multiple multi-level FairShare trees are supported. The default number of levels is 2.
  • Each node in a FairShare tree has its own window size and weight.
  • Ability to disable FairShare for a sub-tree by setting the window size to zero.
  • Privileges are controlled with Access Control Lists (ACLs) for fine-grained control.

FairShare Tree Naming Conventions

Each fsgroup has a hierarchical name where the components are separated by a "/", similar to a file name. The default fsgroup is /time/users. The name can take one of the following three forms:

Type         Form                                               Example
FS-Group     HIERARCHICAL_GROUP_NAME                            /time/users
FS-User      HIERARCHICAL_GROUP_NAME.USER_NAME                  /time/users.joe
FS-Subgroup  HIERARCHICAL_GROUP_NAME.USER_NAME:SUBGROUP_NAME    /time/users.joe:myregression1

Each component in the name has to be alphanumeric, and can also contain _. The . character is not allowed except in the FS-User component. The / and : characters act as separators and are not allowed inside a component.

Type         Example
FS-Group     /proj/sanjose/library/qa
FS-User      /proj/sanjose/library/qa.john
FS-Subgroup  /proj/sanjose/library/qa.john:mytest1
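
The three name forms can be told apart with a regular expression. The sketch below is an illustration only: the helper names are hypothetical, the pattern does not accept the root node "/", and it assumes that user names may contain "." while group and subgroup components may not.

```python
import re
from typing import NamedTuple, Optional

_NAME = re.compile(
    r"^(?P<group>(?:/[A-Za-z0-9_]+)+)"        # /comp/comp/...
    r"(?:\.(?P<user>[A-Za-z0-9_.]+)"          # optional .USER_NAME
    r"(?::(?P<subgroup>[A-Za-z0-9_]+))?)?$"   # optional :SUBGROUP_NAME
)


class FsName(NamedTuple):
    group: str
    user: Optional[str]
    subgroup: Optional[str]


def parse_fsgroup_name(name: str) -> FsName:
    """Split an fsgroup name into its group, user, and subgroup parts."""
    m = _NAME.match(name)
    if m is None:
        raise ValueError(f"not a valid fsgroup name: {name!r}")
    return FsName(m.group("group"), m.group("user"), m.group("subgroup"))


parsed = parse_fsgroup_name("/proj/sanjose/library/qa.john:mytest1")
```

For the name above, the parts are the group /proj/sanjose/library/qa, the user john, and the subgroup mytest1. A plain FS-Group name yields None for both the user and the subgroup.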

Each node in the FairShare tree has an owner who has the authority to set the weights for all the subnodes in the tree. For example, the owner of group /time/med can set the weights for /time/med/sanjose and any other nodes of the form /time/med/*.

Define FairShare Groups

The FairShare tree is dynamic and can be changed at any time. If you like a configuration, you can save it to a file and reload it at a later time. The main tool for these actions is vovfsgroup.

FairShare Command Line Utilities

To view all FairShare groups, use vovfsgroup show:
ID        GROUP                                           OWNER WEIGHT   WINDOW RUNNING   QUEUED
000000016 /                                            (server)      0    1h00m       0        0
000001012 /system                                        cadmgr    100       0s       0        0
000001050 /system/taskers                                 cadmgr    100       0s       0        0
000001053 /system/taskers/messages                        cadmgr    100       0s       0        0
000001056 /system/taskers/reservations                    cadmgr    100       0s       0        0
000001006 /time                                          cadmgr    100    1h00m       0        0
000001009 /time/users                                    cadmgr    100    1h00m       0        0
000001081 /time/users.cadmgr                             cadmgr    100    1h00m       0        0
An older method still available is to use vovshow -groups:
% vovshow -groups
  ID       GROUP                          WEIGHT   WINDOW
  02223424 /system                           100    1m00s
  02223422 /time                             100    1h00m
  02223423 /time/users                         1    2h00m
  02223435 /time/users.cadmgr                100    1h00m
For a specific fsgroup, pass its name as an additional argument, as in vovfsgroup show FSGROUPNAME:
% vovfsgroup show /time/users
Id:       000001009
FullName: /time/users
Owner:    cadmgr
Weight:   100
Window:   1h00m
Rank:     -1
        ACL  1: OWNER      ""   ATTACH DETACH EDIT VIEW STOP FORGET DELEGATE EXISTS
        ACL  2: EVERYBODY  ""   ATTACH VIEW
 000001081 /time/users.cadmgr                          100          cadmgr
Permission is required to create fsgroups and to change their weight. You can try the following commands as the ADMIN user for your Accelerator instance:
% vovproject enable vnc
% vovfsgroup create /app/primetime 
% vovfsgroup create /app/spice
% vovfsgroup create /app/other 
% vovfsgroup modify /app/primetime weight 300
% vovfsgroup modify /app/spice     weight 100
% vovfsgroup modify /app/other     weight  20
% vovfsgroup modrec /app           window 1h
% vovfsgroup exists /app/other
% vovfsgroup exists /app/not_there
% vovfsgroup genconfig  saved_my_cool_config.tcl    ### Important to use the .tcl extension
% vovfsgroup delete /app
% vovfsgroup loadconfig saved_my_cool_config.tcl    

For more information, refer to Configure FairShare via the vovfsgroup Utility.

Monitor FairShare

From the command line, start the monitors with:
% nc monitor

From the browser interface, visit the Project Home page and then select the FairShare link.

Target Share Example

In the following example, two groups are defined: the default group /time/users and another group named /time/regr. The users maureen and murali are members of the /time/users group. User john is a member of the /time/regr group. All users are assumed to have jobs queued. The target shares are determined as follows using the two-tier method.

There are two groups in the queue, each with a weight of 100. Therefore, at the start, the FairShare percentage of each group is calculated as follows:
/time/regr             share = 100/(100+100)    50%
/time/users            share = 100/(100+100)    50%
Now assume that within the /time/regr group, only john has jobs. That user gets 100% of the group's share or 50% of the overall cycles.
/time/regr.john        share = 100/(100+100)    50% * 100% = 50%
/time/users            share = 100/(100+100)    50%
Within the /time/users group, two users have jobs as shown below:
/time/users.maureen    share = 10/(10+10)       50% * 50% grp = 25%
/time/users.murali     share = 10/(10+10)       50% * 50% grp = 25%
If another user, suresh, who is also a member of the /time/users group, submits jobs that are queued, the target shares change as follows:
/time/users.maureen    share = 10/(10+10+10)    33% * 50% grp = 16.7%
/time/users.murali     share = 10/(10+10+10)    33% * 50% grp = 16.7%
/time/users.suresh     share = 10/(10+10+10)    33% * 50% grp = 16.7%
Because the user suresh just entered the queue, his actual share will probably be much less than the target share. Therefore, his jobs will be launched ahead of the other users as the system tries to bring his actual share up to his target share. An example of the overall FairShare picture is shown below (target shares shown):
/time/regr.john        share = 100/(100+100)    50% * 100% = 50.0%
/time/users.maureen    share = 10/(10+10+10)    33% * 50% grp = 16.7%
/time/users.murali     share = 10/(10+10+10)    33% * 50% grp = 16.7%
/time/users.suresh     share = 10/(10+10+10)    33% * 50% grp = 16.7%
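
The arithmetic of this example can be reproduced with a short Python sketch. The function names are hypothetical, only active groups and users are passed in, and the weight of 100 for john is an assumption (any positive weight gives the same result, since he is the only active user in his group).

```python
from typing import Dict


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Turn weights into fractions that sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def two_tier_targets(
    group_weights: Dict[str, float],
    user_weights: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Target share of each user = group share * user share within the group."""
    group_share = normalize(group_weights)
    targets: Dict[str, float] = {}
    for group, users in user_weights.items():
        for user, fraction in normalize(users).items():
            targets[f"{group}.{user}"] = group_share[group] * fraction
    return targets


targets = two_tier_targets(
    {"/time/regr": 100, "/time/users": 100},
    {
        "/time/regr": {"john": 100},  # assumed weight
        "/time/users": {"maureen": 10, "murali": 10, "suresh": 10},
    },
)
```

With these inputs, /time/regr.john gets a target of 50% and each of the three /time/users members gets about 16.7%, matching the listing above.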
