Black Hole Detection

A black hole is a tasker that appears healthy but is unable to execute jobs. Every job sent to that tasker fails quickly, so the tasker immediately appears ready to execute the next job in the queue, which then fails as well.

On a given tasker, if a number (blackholeFailedJobs) of consecutive jobs fail within a relatively short time (blackholeDiscardTime), the tasker is potentially a black hole. The tasker is confirmed to be a black hole only when a large fraction (blackholeFailRate) of those jobs are known to succeed on other taskers. A potential black hole is suspended for a short time (blackholeMaybeTime), typically around 10 seconds. A confirmed black hole is suspended for a longer period (blackholeSuspendTime), typically around 10 minutes.
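
The decision rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the server's actual implementation. The class and function names are hypothetical, the default values for the failure count, discard time, and fail rate are assumptions chosen for the example, and the sketch assumes that blackholeDiscardTime acts as a time window within which the consecutive failures must fall.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Policy:
    # Each field mirrors a policy parameter named in the text.
    # Only the two suspension times come from the text ("typically");
    # the other defaults are assumptions for illustration.
    failed_jobs: int = 5          # blackholeFailedJobs (assumed value)
    discard_time: float = 60.0    # blackholeDiscardTime, seconds (assumed value)
    fail_rate: float = 0.8        # blackholeFailRate (assumed value)
    maybe_time: float = 10.0      # blackholeMaybeTime, seconds
    suspend_time: float = 600.0   # blackholeSuspendTime, seconds


@dataclass
class Failure:
    job_id: str
    time: float                                 # when the job failed on this tasker
    succeeded_elsewhere: Optional[bool] = None  # None: not yet rerun on another tasker


def classify(failures: List[Failure], now: float, policy: Policy) -> Tuple[str, float]:
    """Classify one tasker from its consecutive failures (no success in between).

    Returns (state, suspend_seconds), where state is "ok", "maybe", or "blackhole".
    """
    # Keep only failures that fall inside the discard window.
    window = [f for f in failures if now - f.time <= policy.discard_time]
    if len(window) < policy.failed_jobs:
        return ("ok", 0.0)

    # Confirm only when enough of the same jobs are known to succeed elsewhere.
    succeeded = sum(1 for f in window if f.succeeded_elsewhere is True)
    if succeeded / len(window) >= policy.fail_rate:
        return ("blackhole", policy.suspend_time)
    return ("maybe", policy.maybe_time)
```

With these assumed defaults, five consecutive failures inside the window yield the state "maybe" and a 10-second suspension. Once four of those five jobs are known to have succeeded on other taskers, the same call yields "blackhole" and a 600-second suspension.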

To activate the functionality, use:
% nc cmd vovsh -x 'vtk_server_config blackholedetection 1'
To disable the functionality, use:
% nc cmd vovsh -x 'vtk_server_config blackholedetection 0'
To check whether black hole detection is active, use:
% nc cmd vovsh -x 'vtk_generic_get policy a; parray a' | grep blackhole