Comparison of Common GPU Models: NVIDIA V100, NVIDIA A100 and NVIDIA A40

Systems

DGX-1
  • 8x NVIDIA V100 (16 GB)
  • 2x Intel Xeon E5-2698 v4
  • 512 GB RAM
  • DGX OS 5.2.0
  • NVIDIA Driver 470.141.03

Oracle BM.GPU4.8
  • 8x NVIDIA A100 (40 GB)
  • 2x AMD EPYC 7542
  • 2 TB RAM
  • Oracle Linux 7.9
  • NVIDIA Driver 510.85.02

Supermicro AS-4124GS-TNR
  • 8x NVIDIA A40 (48 GB)
  • 2x AMD EPYC 7302
  • 1 TB RAM
  • RHEL 7.9
  • NVIDIA Driver 460.73.01
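
The GPU configurations listed above are easy to double-check on each host. As a minimal illustration (not part of the benchmark setup), the CUDA Runtime API reports the name and memory size of every visible device; the file name and output wording below are purely illustrative.

/*
 * Minimal sketch: list the GPUs visible on a host together with their
 * memory size, e.g. to confirm the 8x V100 (16 GB) / A100 (40 GB) /
 * A40 (48 GB) configurations above. Build e.g. with: nvcc gpu_list.c
 */
#include <stdio.h>
#include <cuda_runtime_api.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        struct cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
            continue;
        printf("GPU %d: %s, %.1f GB, compute capability %d.%d\n",
               i, prop.name,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
               prop.major, prop.minor);
    }
    return 0;
}
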
nanoFluidX Software Stack
  • nanoFluidX 2022.3 with single-precision floating-point arithmetic
  • CUDA 11.6.2
  • Open MPI 4.1.4 (with CUDA support)
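
Since direct GPU-to-GPU communication matters for multi-GPU runs, a standard way to confirm that an Open MPI build actually has CUDA support is the MPIX_Query_cuda_support() extension that Open MPI ships in mpi-ext.h. The small standalone check below is a sketch, independent of nanoFluidX:

/*
 * Standalone check that the MPI library was built with CUDA support.
 * Open MPI only: mpi-ext.h provides MPIX_CUDA_AWARE_SUPPORT and
 * MPIX_Query_cuda_support(). Build and run e.g. with:
 *   mpicc cuda_check.c -o cuda_check && mpirun -np 1 ./cuda_check
 */
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("compile time: CUDA-aware support present\n");
#else
    printf("compile time: CUDA-aware support absent or unknown\n");
#endif

#if defined(MPIX_CUDA_AWARE_SUPPORT)
    /* run-time check: returns 1 when CUDA support is actually enabled */
    printf("run time:     %s\n",
           MPIX_Query_cuda_support() ? "CUDA-aware" : "not CUDA-aware");
#endif

    MPI_Finalize();
    return 0;
}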

Results

Minimal Cube
A simple cube of static fluid particles at rest
A minimal case to estimate the raw performance of the solver core
Two sizes:
  • Small: ~7m fluid particles (the size of a relatively small production case, slightly smaller than what is recommended for four GPUs)
  • Big: ~57m fluid particles (the size of a larger production case)
All runs cover 1,000 timesteps


Figure 1.
Dambreak
A collapsing water column under gravity inside a domain (indicated by lines)
A good case to evaluate performance for challenging, uneven particle distributions
It therefore indicates whether load balancing works (see the sketch after this list)
Two sizes:
  • Small: ~7m fluid particles and ~9m total (slightly smaller than what is recommended for four GPUs)
  • Big: ~54m fluid particles and ~64m total (size of a bigger production case)
All runs cover 10,000 timesteps
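
To make the load-balancing remark concrete: one simple, solver-independent diagnostic is the ratio between the most heavily loaded GPU's particle count and the average per-GPU count, where values close to 1.0 mean an even split. This is only an illustrative metric, not the balancing scheme used inside nanoFluidX, and the particle counts below are placeholders, not measured data.

/*
 * Illustrative load-imbalance factor: max per-GPU particle count divided
 * by the average per-GPU count. 1.0 = perfectly even; larger values mean
 * one GPU dominates the step time. Placeholder numbers only.
 */
#include <stdio.h>

static double imbalance_factor(const long counts[], int n_gpus)
{
    long max = counts[0], sum = 0;
    for (int i = 0; i < n_gpus; ++i) {
        if (counts[i] > max)
            max = counts[i];
        sum += counts[i];
    }
    return (double)max / ((double)sum / n_gpus);
}

int main(void)
{
    /* hypothetical particle counts on 4 GPUs while the column collapses */
    long counts[4] = { 2300000, 2100000, 1900000, 1700000 };
    printf("imbalance factor: %.2f\n", imbalance_factor(counts, 4));
    return 0;
}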


Figure 2.
Altair E-Gearbox


Figure 3.
A showcase by Altair for e-mobility applications
Size: ~6.5m fluid particles and ~12m total (slightly smaller than what is recommended for four GPUs)
Run covers 10,000 timesteps


Figure 4.
Aerospace Gearbox


Figure 5.
Another showcase for aerospace gearbox applications
Chosen as a somewhat larger case than the previous one so that four or more GPUs can be fully utilized
Size: ~21m fluid particles and ~26.7m total
Run covers 10,000 timesteps


Figure 6.

Additional Notes

  • Performance data in the graphs is always relative to one V100 on the DGX-1 (see the sketch after this list for the normalization).
  • All cases were run with the WEIGHTED particle interaction scheme.
  • All cases were run in single precision.
  • All solver output was deactivated to focus on pure solver performance; in general, this does not change the results significantly.
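
The relative numbers in the graphs follow the usual normalization: the wall-clock time of the one-V100 DGX-1 reference run divided by the wall-clock time of the run under test, for the same case and timestep count (equivalently, the ratio of timesteps per second). The sketch below shows only this arithmetic; the timings in it are placeholders, not measured results.

/*
 * Normalization used for the graphs: relative performance of a run is the
 * reference wall-clock time (1x V100 on the DGX-1, same case and timestep
 * count) divided by the run's wall-clock time. Placeholder timings only.
 */
#include <stdio.h>

static double relative_performance(double t_ref_s, double t_run_s)
{
    return t_ref_s / t_run_s;   /* > 1.0 means faster than 1x V100 */
}

int main(void)
{
    double t_v100_1gpu = 1000.0;   /* hypothetical reference runtime [s] */
    double t_under_test = 250.0;   /* hypothetical runtime of another setup [s] */
    printf("relative performance: %.2fx\n",
           relative_performance(t_v100_1gpu, t_under_test));
    return 0;
}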