Cluster Dashboard

The dashboard provides insights into node performance, including average utilization, uptime percentage, and the number of nodes that are currently down. It also reports overall resource utilization (CPU cores, memory, and GPUs) and aggregates data across multiple clusters when applicable. Users can review job queue states and waiting times to optimize workload distribution and system efficiency.

The following metrics provide insights into job statuses, node utilization, and system performance over a 24-hour period:

Job Stats

24hr Total Jobs: The total number of jobs that have been queued, executed, or completed in the past 24 hours.
Running Jobs: The current total number of active running jobs.
Queued Jobs: The current total number of jobs in the queue.
Unrunnable Jobs: The total number of jobs currently in the queue but marked as unable to run.
24hr Successful Jobs: The total number of jobs that successfully completed without exit code errors in the past 24 hours.
24hr Failed Jobs: The total number of jobs that failed due to an exit code error in the past 24 hours.

Note: To view the list of jobs filtered by type, right-click an attribute under Job Stats and click Show Jobs. The Job Table then displays only jobs of that type.

For example, you can view only running jobs in the Job Table.
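
The job counters defined above can be illustrated with a minimal Python sketch. The record fields (state, unrunnable, end_time, exit_code) and the function name job_stats are hypothetical and are not the InsightPro data model. The sketch assumes that a job counts as successful when its exit code is 0, and it omits 24hr Total Jobs because that counter depends on how queued and executed activity is tracked.

```python
from datetime import datetime, timedelta

def job_stats(jobs, now):
    """Count jobs per dashboard attribute from hypothetical job records."""
    window_start = now - timedelta(hours=24)
    stats = {"running": 0, "queued": 0, "unrunnable": 0,
             "successful_24h": 0, "failed_24h": 0}
    for job in jobs:
        state = job["state"]
        if state == "running":
            stats["running"] += 1
        elif state == "queued":
            stats["queued"] += 1
            # Unrunnable jobs are still in the queue, so they count in both.
            if job.get("unrunnable"):
                stats["unrunnable"] += 1
        elif state == "finished" and job["end_time"] >= window_start:
            # Assumption: exit code 0 means success, anything else is a failure.
            if job["exit_code"] == 0:
                stats["successful_24h"] += 1
            else:
                stats["failed_24h"] += 1
    return stats
```

In this sketch a job that finished more than 24 hours ago is ignored, which mirrors the 24-hour window in the attribute names.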

Node/Tasker Stats

24hr Node Utilization (%): The average node utilization, as a percentage, over the last 24 hours.
24hr Node Uptime (%): The percentage of node-hours in the environment that were operational, with values below 100% indicating downtime on at least one node.
Down Nodes: The current number of nodes recognized as down by the workload manager (WLM).

Note: To view the list of nodes filtered by a specific node type, right-click an attribute under Node/Tasker Stats and click Show Jobs. The Node/Tasker Table then displays only the nodes of that type.

For example, you can view only down nodes in the Node/Tasker Table.
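
The node-hours definition of 24hr Node Uptime (%) can be made concrete with a short Python sketch. The function name node_uptime_percent and the input shape (a mapping from node name to operational hours in the window) are assumptions made for illustration, not part of the product.

```python
def node_uptime_percent(up_hours_per_node, window_hours=24.0):
    """Percentage of node-hours that were operational in the window.

    up_hours_per_node: hypothetical mapping of node name -> hours the node
    was operational during the window.
    """
    if not up_hours_per_node:
        raise ValueError("at least one node is required")
    total_node_hours = window_hours * len(up_hours_per_node)
    # Clamp each value to [0, window_hours] so bad input cannot exceed 100%.
    operational = sum(min(max(hours, 0.0), window_hours)
                      for hours in up_hours_per_node.values())
    return 100.0 * operational / total_node_hours
```

With four nodes where three were up for the full 24 hours and one was up for 12 hours, 84 of 96 node-hours were operational, so the sketch returns 87.5. Any value below 100 therefore implies downtime on at least one node.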

Cluster Utilization Report

The Cluster Utilization report displays the overall utilization percentage of the HPC environment, categorized by resources such as nodes, CPU cores, memory, and GPUs. If multiple clusters are registered with InsightPro, the data is presented as a consolidated view across all clusters.
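
One plausible way to build a consolidated view is to sum used and total capacity across clusters before computing the percentage, so that larger clusters carry more weight. The Python sketch below assumes that approach; the source does not state how InsightPro combines clusters, and the field names (nodes, cpu_cores, memory_gb, gpus, used, total) are hypothetical.

```python
RESOURCES = ("nodes", "cpu_cores", "memory_gb", "gpus")

def consolidated_utilization(clusters):
    """Capacity-weighted utilization (%) per resource across clusters.

    clusters: list of dicts such as
        {"cpu_cores": {"used": 80, "total": 100}, ...}
    (a hypothetical shape, one entry per resource in RESOURCES).
    """
    result = {}
    for resource in RESOURCES:
        used = sum(cluster[resource]["used"] for cluster in clusters)
        total = sum(cluster[resource]["total"] for cluster in clusters)
        # A resource with no capacity anywhere has no meaningful percentage.
        result[resource] = 100.0 * used / total if total else None
    return result
```

Note that this differs from averaging the per-cluster percentages: a cluster at 80% of 100 cores and another at 25% of 400 cores combine to 36%, not 52.5%.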

Node State History Report

The Node State History report presents node counts categorized by state over time.

Note: Node states are sampled periodically for this graph, so short-lived changes in node state or short-running jobs may not be accurately reflected.
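
The effect described in the note can be demonstrated with a small Python simulation. The timeline, the sampling periods, and the function name sample_states are all hypothetical; the actual sampling interval used by the report is not stated here.

```python
def sample_states(intervals, sample_period, horizon):
    """Return (time, state) pairs observed by sampling every sample_period.

    intervals: list of (start, end, state) tuples in minutes that do not
    overlap and together cover [0, horizon).
    """
    samples = []
    t = 0
    while t < horizon:
        for start, end, state in intervals:
            if start <= t < end:
                samples.append((t, state))
                break
        t += sample_period
    return samples

# A node that is down for only 3 minutes (minutes 12 to 15) within an hour.
timeline = [(0, 12, "up"), (12, 15, "down"), (15, 60, "up")]
coarse = sample_states(timeline, sample_period=10, horizon=60)
fine = sample_states(timeline, sample_period=1, horizon=60)
```

With a 10-minute period the samples fall at minutes 0, 10, 20, and so on, so the 3-minute outage is never observed, while a 1-minute period captures it.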

Current Jobs Report

The Current Jobs report provides a breakdown of the number of jobs in each queue, categorized by job state.
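
The breakdown by queue and job state amounts to a two-level count. A minimal Python sketch, assuming hypothetical job records with queue and state fields:

```python
from collections import Counter, defaultdict

def jobs_by_queue_and_state(jobs):
    """Return {queue: {state: count}} for hypothetical job records."""
    breakdown = defaultdict(Counter)
    for job in jobs:
        breakdown[job["queue"]][job["state"]] += 1
    # Convert to plain dicts so the result is easy to print or serialize.
    return {queue: dict(counts) for queue, counts in breakdown.items()}
```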

Avg. Queue Wait Times (hr) Report

The Avg. Queue Wait Times (hr) report provides a breakdown of the average wait time for jobs in the queue over the past 24 hours, categorized by queue.
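
A per-queue average wait time could be computed as in the Python sketch below. It assumes that wait time is the gap between submission and start, and that only jobs that started within the past 24 hours are included; both are assumptions, since the exact definition used by the report is not given here. The field names (queue, submit_time, start_time) are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def avg_queue_wait_hours(jobs, now):
    """Average wait in hours per queue for jobs started in the last 24 hours."""
    window_start = now - timedelta(hours=24)
    waits = defaultdict(list)
    for job in jobs:
        if job["start_time"] >= window_start:
            # Assumption: wait time = start time minus submit time.
            wait = job["start_time"] - job["submit_time"]
            waits[job["queue"]].append(wait.total_seconds() / 3600.0)
    return {queue: sum(values) / len(values) for queue, values in waits.items()}
```

Queues with no jobs started in the window are simply absent from the result, which avoids reporting a misleading zero.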