NVIDIA V100 versus NVIDIA Quadro RTX 8000

Systems

DGX-1
8x NVIDIA V100 (16 GB each; at most 4 GPUs used)
2x Intel Xeon E5-2698 v4
512 GB RAM
DGX OS 4.0.7 + NVIDIA Driver 410.129

RTX Server
4x NVIDIA Quadro RTX 8000 (48 GB each)
2x Intel Xeon Gold 6126
384 GB RAM
CentOS 7.5 + NVIDIA Driver 440.44

nanoFluidX Software Stack
nanoFluidX with single-precision floating-point arithmetic
CUDA 8.0 GA2 (8.0.61)
Open MPI 2.1.6 (with CUDA support)

Results

Minimal Cube
Simple cube of static fluid particles at rest
Minimal case to estimate raw performance of solver core
Two sizes:
  • Small: ~7 million fluid particles (a relatively small production case, slightly smaller than what we usually recommend for four GPUs)
  • Big: ~57 million fluid particles (a bigger production case)
All runs cover 1,000 timesteps
Figure 1.
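
Because the two cube sizes differ by roughly a factor of eight, wall times are easier to compare when normalized by the amount of work done. The following is a minimal sketch of one possible normalization (particle updates per second); the function name and the wall time are hypothetical and are not taken from the measurements in this report.

```python
def particle_updates_per_second(n_particles, n_timesteps, wall_time_s):
    """Return the number of particle updates per second of wall-clock time.

    n_particles : number of particles in the case
    n_timesteps : number of timesteps that were run
    wall_time_s : measured wall-clock time in seconds
    """
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return n_particles * n_timesteps / wall_time_s


# Hypothetical wall time, for illustration only (not a measured result).
small_cube = particle_updates_per_second(7e6, 1_000, wall_time_s=100.0)
print(f"Small cube: {small_cube:.3e} particle updates/s")
```

A normalized figure like this makes it possible to compare the small and the big case, or cases with different timestep counts, on one scale.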


Dambreak
Water column collapsing under gravity inside the domain (domain boundaries indicated by lines in the figure)
Useful case for evaluating performance with challenging particle distributions and for checking whether the load balancing works
Two sizes:
  • Small: ~7 million fluid particles and ~9 million particles in total (slightly smaller than what is recommended for four GPUs)
  • Big: ~54 million fluid particles and ~64 million particles in total (a bigger production case)
All runs cover 10,000 timesteps
Figure 2.


Altair E-Gearbox
Showcase by Altair for e-mobility applications
Size: ~6.5 million fluid particles and ~12 million particles in total (slightly smaller than what is recommended for four GPUs)
Run covers 10,000 timesteps
Figure 3.


Aerospace Gearbox
Another showcase, for aerospace gearbox applications
Chosen as a somewhat bigger case that can fully benefit from four or more GPUs
Size: ~21 million fluid particles and ~26.7 million particles in total
Run covers 10,000 timesteps
Figure 4.


Conclusions

Results on the RTX server are generally comparable to those on the DGX-1 when the same number of GPUs is used. Based on Figure 1, the raw performance of the solver core is approximately 10 percent lower on the RTX server. This difference does not appear in the production cases. In general, performance differences of up to 10 percent are not critical: they are rarely noticeable in practice and often depend on minor differences. It is difficult to trace the difference to a specific cause because the individual operations and features of the code have different performance characteristics.
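
To make the 10 percent figure concrete, the sketch below computes a relative difference between two wall times and checks it against a tolerance. It assumes that "slower" refers to wall-clock time relative to the DGX-1; the function names and the timings are hypothetical and are not measured values from this report.

```python
def relative_slowdown(t_reference_s, t_candidate_s):
    """Relative wall-time difference of the candidate with respect to the reference.

    A positive value means the candidate is slower than the reference.
    """
    if t_reference_s <= 0:
        raise ValueError("t_reference_s must be positive")
    return (t_candidate_s - t_reference_s) / t_reference_s


def is_within_tolerance(t_reference_s, t_candidate_s, tolerance=0.10):
    """True if the two wall times differ by no more than the given fraction."""
    return abs(relative_slowdown(t_reference_s, t_candidate_s)) <= tolerance


# Hypothetical wall times in seconds, for illustration only.
t_dgx1, t_rtx = 100.0, 108.0
print(f"Relative slowdown: {relative_slowdown(t_dgx1, t_rtx):+.1%}")
print("Within 10 percent:", is_within_tolerance(t_dgx1, t_rtx))
```

The 10 percent default mirrors the threshold named above; a different tolerance may suit other workloads.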

Additional Notes

  • All cases were run with the WEIGHTED particle interaction scheme.
  • All solver output was deactivated to focus on solver performance; in general, enabling output does not change the results significantly.
  • The CUDA version introduces a slight performance uncertainty, because newer versions might apply optimizations tailored to these GPUs. CUDA 8.0 cannot compile for the compute capabilities of the V100 (7.0) or the RTX 8000 (7.5).
  • Scalability between one and two GPUs is usually slightly impaired because some parts of the code related to multi-GPU execution may be skipped entirely in single-GPU runs.
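
The scalability remark can be quantified with the usual strong-scaling measures, speedup and parallel efficiency. The following is a minimal sketch under the assumption that one wall time per GPU count is available; the function name and all timings are hypothetical and do not reproduce the results in the figures.

```python
def scaling_table(wall_times_s):
    """Compute strong-scaling speedup and efficiency.

    wall_times_s maps a GPU count to the wall time in seconds for the same case.
    Returns a list of (gpus, speedup, efficiency) tuples, relative to the
    smallest GPU count in the input.
    """
    base_gpus = min(wall_times_s)
    base_time = wall_times_s[base_gpus]
    rows = []
    for gpus in sorted(wall_times_s):
        speedup = base_time / wall_times_s[gpus]
        # Ideal speedup is gpus / base_gpus; efficiency is the fraction achieved.
        efficiency = speedup * base_gpus / gpus
        rows.append((gpus, speedup, efficiency))
    return rows


# Hypothetical wall times in seconds, for illustration only.
timings = {1: 400.0, 2: 220.0, 4: 115.0}
for gpus, speedup, efficiency in scaling_table(timings):
    print(f"{gpus} GPU(s): speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

With timings of this shape, an efficiency below 100 percent at two GPUs reflects the multi-GPU overhead that a single-GPU run skips, so using the two-GPU run as the baseline can give a fairer picture of scaling to four GPUs.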