Multi-GPU Scaling

Simulation Setup

  • FZG case (Technical University of Munich)
  • Single phase
  • dx = 0.3 mm → ≈50 million particles
  • 2180 RPM
  • One MPI process per GPU (see the launch sketch below)
  • CUDA-aware OpenMPI library (shipped with nanoFluidX)
All simulations ran for an equal number of time steps, which fixes the physical time simulated in each case and makes the measured wall times reflect the realistic performance difference between hardware configurations.
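
The rank-per-GPU launch pattern used here can be sketched with standard MPI and CUDA runtime calls. The snippet below is an illustration only, not nanoFluidX source: it assumes a CUDA-aware MPI build (such as the OpenMPI shipped with nanoFluidX), binds each rank to one local GPU, and exchanges a device-resident buffer directly through MPI. Buffer names and sizes are placeholders.

```c
/* Illustrative sketch only (not nanoFluidX source): the "one MPI rank
 * per GPU" pattern with a CUDA-aware MPI library. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Find this rank's position on its node so each rank can bind to
     * a distinct local GPU. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        rank, MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    cudaSetDevice(local_rank % num_gpus);     /* one rank <-> one GPU */

    /* Allocate halo buffers directly in device memory. With a
     * CUDA-aware MPI, the device pointers are passed straight to MPI
     * without explicit host staging. */
    const int halo_count = 1 << 20;           /* placeholder size */
    double *d_send, *d_recv;
    cudaMalloc((void **)&d_send, halo_count * sizeof(double));
    cudaMalloc((void **)&d_recv, halo_count * sizeof(double));
    cudaMemset(d_send, 0, halo_count * sizeof(double));

    /* Simple ring exchange between neighbouring ranks. */
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    MPI_Sendrecv(d_send, halo_count, MPI_DOUBLE, right, 0,
                 d_recv, halo_count, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d of %d bound to GPU %d\n", rank, size,
           local_rank % num_gpus);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```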


Figure 1. Official performance numbers by NVIDIA

Systems

K20 (GPU scaling solution)
  • Dual-socket Intel® Xeon® E5-2630 v2 @ 2.60 GHz (12 cores)
  • Two NVIDIA Tesla K20 cards, one per socket
  • Two Mellanox Connect-IB FDR adapters, one per socket

K80 (dense GPU solution)
  • Dual-socket Intel® Xeon® E5-2670 v3 @ 2.30 GHz (24 cores)
  • Four NVIDIA Tesla K80 cards, two attached to each socket
  • No InfiniBand considered

P100+FDR (balanced intra-node/inter-node GPU solution; see the peer-access sketch after this list)
  • Single-socket Intel® Xeon® E5-2650 v4 @ 2.20 GHz (12 cores)
  • Four NVIDIA Tesla P100 16 GB PCIe cards, all attached to the single socket via a PLX switch
  • One Mellanox Connect-IB FDR adapter (upgrade to EDR planned)
  • Note: Cambridge studies used nanoFluidX 2.09

NVIDIA DGX-2
  • Dual-socket Intel® Xeon® Platinum 8168 @ 2.70 GHz (48 cores total)
  • 16× NVIDIA Tesla V100
  • Note: Using nanoFluidX 2019.1.1
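
The intra-node GPU attachment described above (PCIe pairs per socket for the K80 node, a PLX switch for the P100 node, NVLink in the DGX-2) determines whether GPUs can reach each other directly. The short sketch below is illustrative only and is not part of nanoFluidX; it uses the standard CUDA runtime API to list the visible devices and report which pairs support direct peer-to-peer access.

```c
/* Illustrative only: enumerate the GPUs on a node and report which
 * pairs can use direct peer-to-peer (P2P) access. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("GPUs on this node: %d\n", n);

    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("  GPU %d: %s\n", i, prop.name);
    }

    /* Check every ordered pair for P2P capability. */
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("  P2P %d -> %d: %s\n", i, j, can ? "yes" : "no");
        }
    }
    return 0;
}
```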

Results

Excellent scaling is observed in all cases.


Figure 2. Wall time of the simulation (s). Circled numbers indicate 12.4 times faster performance
Figure 3.


Conclusions

Scaling is demonstrated for all four GPU models (K20, K80, P100 and V100), despite the limited number of test points for the P100 and V100 cards. Strong scaling efficiency depends on the size of the case: each GPU must carry a sufficient computational load to maintain efficiency. For nanoFluidX, this load has been observed to be more than two million particles per GPU.
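
As a simple illustration of the load guideline above, the sketch below computes strong scaling efficiency, E(N) = T(1) / (N × T(N)), and the particles-per-GPU count for the 50-million-particle FZG case. It is not based on the measured data; the wall times in the example are hypothetical placeholders.

```c
/* Illustrative helper, not nanoFluidX code. Strong scaling efficiency
 * is E(N) = T(1) / (N * T(N)); 1.0 means ideal scaling. The load check
 * encodes the > 2 million particles per GPU guideline from the text. */
#include <stdio.h>

static double strong_scaling_efficiency(double t1, double tn, int n_gpus)
{
    return t1 / ((double)n_gpus * tn);
}

static int load_is_sufficient(double total_particles, int n_gpus)
{
    return total_particles / (double)n_gpus > 2.0e6;
}

int main(void)
{
    const double total_particles = 50.0e6;  /* FZG case, dx = 0.3 mm */

    /* Hypothetical wall times, for illustration only: 1 GPU at 100 h
     * and 8 GPUs at 14 h give E(8) = 100 / (8 * 14) ~ 0.89. */
    printf("E(8) = %.2f\n", strong_scaling_efficiency(100.0, 14.0, 8));

    for (int n = 1; n <= 16; n *= 2)
        printf("%2d GPUs -> %5.1f M particles/GPU (load %s)\n",
               n, total_particles / n / 1.0e6,
               load_is_sufficient(total_particles, n) ? "ok" : "too low");
    return 0;
}
```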