Job Runtime - Monitor and Profile

When a job is running through a vovtasker, the tasker automatically monitors RAM and CPU utilization of the job, including all of its children.

Job statistics are sampled about once a minute. At this sampling rate, jobs that complete in less time than the sampling period are not captured.

MAXRAM is expressed in megabytes (MB), where 1 MB = 1<<20 bytes (the value 1 shifted left by 20 bit positions, that is, 1,048,576 bytes). The CPU time is stored in milliseconds (ms) but is expressed in seconds (s).
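As a quick illustration of these units, the following Python sketch converts a raw byte count to MB and a stored millisecond value to seconds. The function names are hypothetical and are not part of Accelerator; the sketch assumes that 1 MB means 1<<20 bytes, as described above.

```python
# Hypothetical helpers; not part of the Accelerator API.
BYTES_PER_MB = 1 << 20  # 1 shifted left by 20 positions = 1,048,576 bytes


def bytes_to_mb(num_bytes: int) -> float:
    """Convert a byte count to megabytes, with 1 MB = 1<<20 bytes."""
    return num_bytes / BYTES_PER_MB


def cpu_ms_to_seconds(cpu_ms: int) -> float:
    """Convert CPU time stored in milliseconds to seconds for display."""
    return cpu_ms / 1000.0


if __name__ == "__main__":
    print(BYTES_PER_MB)                      # 1048576
    print(bytes_to_mb(512 * BYTES_PER_MB))   # 512.0
    print(cpu_ms_to_seconds(90500))          # 90.5
```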

CPU Progress and Run Status Indicators

Accelerator monitors CPU and RAM utilization for all the running jobs. The CPU utilization information is available in four fields:
CPUTIME
The total accumulated CPU time in milliseconds.
CPUPROGRESS
The percentage of CPU time accumulated per unit of elapsed time. For example, if a job uses 60 seconds of CPU time in 60 seconds, CPUPROGRESS is 100. This field can be 0 (zero) for jobs that are stuck: such a job holds onto its CPU resource without running, which makes the CPU unavailable to other jobs. This field can also be greater than 100 for multi-threaded jobs.
LASTCPUPROGRESS
A timestamp indicating the last time the CPU usage increased. This is used to identify stuck jobs.
RUNSTATUS
A descriptive text field that shows how well the job is doing. Some typical values are Good, Paging, NoCpu. The complete list of values is shown below.
Table 1. Values of the RUNSTATUS Field
Value     Description
n/a       Insufficient information to determine CPU progress. Typical for jobs that have just started.
Good      The progress is greater than 70%.
Medium    The progress is between 10% and 70%.
Poor      Less than 10% CPU utilization, but no swapping of pages.
Paging    The progress is less than 10% and the job is swapping at a rate greater than 1000 pages per second.
NoCpu     The job is not accumulating any CPU time.
Susp      The job is suspended.
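The relationship between CPUPROGRESS and RUNSTATUS can be sketched in Python as follows. This is only an illustration of the thresholds listed in Table 1, not the logic Accelerator actually uses. The function names, the order of the checks, the handling of the exact 10% and 70% boundaries, and the treatment of paging rates of 1000 pages per second or less as Poor are all assumptions.

```python
from typing import Optional


def cpu_progress(cpu_ms_delta: float, wall_ms_delta: float) -> float:
    """Percentage of CPU time accumulated over an elapsed-time interval."""
    if wall_ms_delta <= 0:
        raise ValueError("wall_ms_delta must be positive")
    return 100.0 * cpu_ms_delta / wall_ms_delta


def run_status(progress: Optional[float],
               pages_per_sec: float = 0.0,
               suspended: bool = False) -> str:
    """Map a CPU progress percentage to a RUNSTATUS-like label (assumed logic)."""
    if suspended:
        return "Susp"
    if progress is None:          # not enough samples yet
        return "n/a"
    if progress == 0:             # no CPU time accumulated
        return "NoCpu"
    if progress > 70:
        return "Good"
    if progress >= 10:            # boundary handling is an assumption
        return "Medium"
    if pages_per_sec > 1000:      # below 10% and heavy paging
        return "Paging"
    return "Poor"


if __name__ == "__main__":
    p = cpu_progress(60_000, 60_000)   # 60 s of CPU in 60 s of elapsed time
    print(p, run_status(p))            # 100.0 Good
    print(cpu_progress(120_000, 60_000))          # 200.0 (multi-threaded job)
    print(run_status(5.0, pages_per_sec=2500))    # Paging
```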

Job Profiling

When job profiling is activated, Accelerator tracks and plots performance statistics over the time the job is running.

The profiling plots show, in order, the following performance data over time:
  1. RAM usage
  2. VM size
  3. CPU utilization
  4. Cumulative Read I/O
  5. Cumulative Write I/O
  6. License checkouts (one plot per license)
The output of job profiling is a set of plots as shown below:


Figure 1.
To activate profiling on a single job, use the option -profile of nc run as shown below:
% nc run -profile myJob

To view a profile, use the browser interface and visit the specific page for the job.

To activate job profiling for a job class, set the following:
# In a job class definition
set VOV_JOB_DESC(profile) 1
To activate job profiling for all jobs, use the file $VOVDIR/local/vncrun.config.tcl and add a line like this:
# In the file $VOVDIR/local/vncrun.config.tcl
...
set VOV_JOB_DESC(profile) 1
...

Job I/O Profiling

The job I/O profiling feature enables Accelerator to track and plot I/O performance statistics over the time the job is running. After the job completes, a summary of this information can be displayed with the nc info -ioprofile JOBID command.

This feature can only be activated with a Mistral license, which must be installed at $VOVDIR/local/mistral.dat. Check with your system administrator for accessing and installing the license. You can get more information about the Mistral license at Altair License Management.

The summary information shows I/O performance metrics broken down by filesystem. The statistics are aggregated from all processes in a job. The statistics reported include:
  • Data Xferred - The total amount of data transferred in the specified I/O direction.
  • Throughput - A data rate calculated as the total data transferred divided by the job run time.
  • Effective BW - A data rate calculated as the total data transferred divided by the total latency.
  • I/O Ops - The number of I/O operations issued by the job for the specified I/O direction.
  • Total Latency - The elapsed time for all I/O system calls, summed.
  • Latency/Op - The total latency divided by the number of I/O operations for the specified I/O direction.
I/O Profiling Results - Read 

Filesystem      Data Xfered     Throughput      Effective BW    I/O Ops         Total Latency   Latency/Op 
/dev            1048576000B     349525333B/s    11351MB/s       1000            92372us         92us/op 
/users/kfeind   0B              0B/s            N/A             0               0us             N/A 

I/O Profiling Results - Write 

Filesystem      Data Xfered     Throughput      Effective BW    I/O Ops         Total Latency   Latency/Op 
/dev            0B              0B/s            N/A             0               0us             N/A 
/users/kfeind   1048576000B     349525333B/s    500MB/s         1000            2094136us       2094us/op 

Filesystems Accessed 

Filesystem      Type            Source 
/dev            devtmpfs        devtmpfs 
/users/kfeind   nfs             sdc-storenado1:/export/user_home/kfeind 
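The derived columns in the sample output above can be reproduced from the definitions of Throughput, Effective BW, and Latency/Op. The Python sketch below is a worked example only: the function name is hypothetical, the 3-second job run time is inferred from the reported throughput rather than shown in the output, and the sample figures appear consistent with decimal megabytes (1 MB = 10^6 bytes) for Effective BW, which is also an inference.

```python
from typing import Optional, Dict


def io_metrics(data_bytes: int, run_time_s: float,
               total_latency_us: int, io_ops: int) -> Dict[str, Optional[float]]:
    """Derive I/O summary ratios from raw totals (illustrative only)."""
    metrics: Dict[str, Optional[float]] = {
        # Throughput: total data divided by job run time, in bytes per second.
        "throughput_b_per_s": data_bytes / run_time_s,
        "effective_bw_mb_per_s": None,   # reported as N/A when there is no I/O
        "latency_us_per_op": None,
    }
    if io_ops > 0 and total_latency_us > 0:
        # Bytes per microsecond equals 10^6 bytes per second, i.e. decimal MB/s.
        metrics["effective_bw_mb_per_s"] = data_bytes / total_latency_us
        metrics["latency_us_per_op"] = total_latency_us / io_ops
    return metrics


if __name__ == "__main__":
    # Read row for /dev; run_time_s=3.0 is inferred, not taken from the output.
    read = io_metrics(1_048_576_000, 3.0, 92_372, 1000)
    print(int(read["throughput_b_per_s"]))      # 349525333
    print(int(read["effective_bw_mb_per_s"]))   # 11351
    print(int(read["latency_us_per_op"]))       # 92
```

Applying the same function to the Write row for /users/kfeind (2094136 us of total latency) gives 500 MB/s and 2094 us/op, matching the sample output. The gap between Throughput and Effective BW shows how much of the run time was spent outside I/O system calls.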
To activate I/O profiling on a single job, use the option -ioprofile of nc run as shown below:
% nc run -ioprofile ./myjob

To view the summary profiling statistics, invoke the following CLI command after the job has completed:

% nc info -ioprofile JOBID
To view a graphical display of I/O time series statistics while the job is running, invoke the nc gui client as follows.
Note: This is a preview feature.
% nc gui -ioprofile JOBID