MHM2
MHM2 (MetaHipMer2) is written in UPC++.
Intro
- Multiple UPC++ network backends (GASNet conduits): ibv, mpi, smp, udp.
- When built on the mpi conduit, the UPC++ backend still uses InfiniBand by default.
- There is no significant performance difference between the mpi and ibv conduits.
- Performance degrades with increasing node count more than expected: more compute nodes give better DHT performance but also more network overhead.
- This is discussed in the next few slides.
- Profiling is somewhat difficult.
\[
\begin{array}{llrrrr}
\text{Conduit} & \text{Build Type} & \text{Report} & \text{System CPU} & \text{User CPU} & \text{Nodes} \\ \hline
\textcolor{red}{\text{mpi}} & \textcolor{red}{\text{Release}} & \textcolor{red}{37.36} & \textcolor{red}{02{:}54.9} & \textcolor{red}{1{:}35{:}15} & \textcolor{red}{4} \\
\text{mpi} & \text{Release} & 60.74 & 01{:}37.4 & 1{:}19{:}27 & 2 \\
\textcolor{red}{\text{ibv}} & \textcolor{red}{\text{Release}} & \textcolor{red}{37.27} & \textcolor{red}{02{:}57.3} & \textcolor{red}{1{:}36{:}37} & \textcolor{red}{4} \\
\text{ibv} & \text{Release} & 61.69 & 01{:}36.6 & 1{:}19{:}33 & 2 \\
\text{ibv} & \text{Debug} & 112.3 & 03{:}44.6 & 4{:}54{:}57 & 4 \\
\text{mpi} & \text{Debug} & 134.4 & 06{:}11.6 & 5{:}57{:}13 & 4 \\
\text{mpi} & \text{Release} & 37.79 & 07{:}31.1 & 1{:}39{:}17 & 4 \\
\text{mpi} & \text{Release} & 545.35 & 1{:}18{:}27 & 18{:}15{:}26 & 4 \\
\text{mpi} & \text{Release} & 104.88 & 02{:}54.6 & 1{:}08{:}33 & 1
\end{array}
\]
Profiling
- Profiler: Intel VTune Amplifier/Profiler, version 2019.6. UPC++ can run over MPI, but InfiniBand has to be disabled in order to profile the MPI mode.
- CPU utilization is about 80% when hyperthreading is disabled.
- Overall overhead is basically insignificant for the small dataset (800 MB).
- For the large dataset (40 GB), the overhead is not negligible.
- Not I/O-bound; the network is the bottleneck.
- A lot of data exchange between nodes
- We examine the following two aspects: the k-mer analysis and the DHT phases (a simple phase-timing workaround is sketched below).
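Because attaching VTune to UPC++ over InfiniBand is awkward, a lightweight alternative is to bracket each phase with barriers and a wall-clock timer. The sketch below is only an illustration under our assumptions; the `timed_phase` helper and the phase names are hypothetical, not MHM2's actual instrumentation.

```cpp
#include <upcxx/upcxx.hpp>
#include <chrono>
#include <iostream>

// Minimal phase-timing sketch (hypothetical helper, not MHM2 code).
// The barriers align all ranks at the phase boundaries, so the reported time
// includes the waiting caused by slow remote transfers.
template <typename Fn>
double timed_phase(const char *name, Fn &&fn) {
  upcxx::barrier();                       // all ranks start the phase together
  auto t0 = std::chrono::steady_clock::now();
  fn();                                   // run this rank's share of the phase
  upcxx::barrier();                       // wait for the slowest rank to finish
  double secs = std::chrono::duration<double>(
      std::chrono::steady_clock::now() - t0).count();
  if (upcxx::rank_me() == 0) std::cout << name << ": " << secs << " s\n";
  return secs;
}

int main() {
  upcxx::init();
  timed_phase("kmer-analysis", [] { /* hypothetical phase body */ });
  timed_phase("dht-read-only", [] { /* hypothetical phase body */ });
  upcxx::finalize();
}
```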
DHT Analysis
- Three phases: write-only, read&write, and read-only (a minimal DHT sketch follows this list).
- Write-only phase: data storage is localized.
- Very large, all-to-all data transmission during the read-only phase.
- Bottleneck: transmission limits cause functions to wait (await) on remote data. This is corroborated by how sharply performance degrades as the number of nodes increases, which raises the question: how do we improve efficiency on larger clusters?
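For reference, the sketch below follows the standard UPC++ `dist_object` hash-table pattern (the tutorial-style design, not MHM2's actual k-mer DHT). Every key has a single owner rank, so reads in the read-only phase necessarily fan out into all-to-all RPC traffic.

```cpp
#include <upcxx/upcxx.hpp>
#include <string>
#include <unordered_map>

// Tutorial-style UPC++ distributed hash table (a sketch, not MHM2's code).
// Each key is owned by exactly one rank; both writes and reads are RPCs to
// that owner, which is why the read-only phase becomes all-to-all traffic.
class DistrMap {
  using map_t = std::unordered_map<std::string, std::string>;
  upcxx::dist_object<map_t> local_map{map_t{}};

  static int owner(const std::string &key) {
    return std::hash<std::string>{}(key) % upcxx::rank_n();
  }

public:
  // Write-only phase: push the value to the owning rank.
  upcxx::future<> insert(const std::string &key, const std::string &val) {
    return upcxx::rpc(owner(key),
        [](upcxx::dist_object<map_t> &m, const std::string &k, const std::string &v) {
          (*m)[k] = v;
        }, local_map, key, val);
  }

  // Read-only phase: fetch the value from the owning rank.
  upcxx::future<std::string> find(const std::string &key) {
    return upcxx::rpc(owner(key),
        [](upcxx::dist_object<map_t> &m, const std::string &k) {
          auto it = m->find(k);
          return it == m->end() ? std::string() : it->second;
        }, local_map, key);
  }
};
```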
Innovation
- Highly redundant distributed hash table:
- Reduce the order of the complete (all-to-all) communication graph, as far as memory allows.
- Transfer the extra data during the write-only phase, when network I/O is not significant, to create the redundant copies.
- For clusters with more memory: multiple levels of redundancy.
- Reduces both the compute-alns phase and the read-only phase (see the sketch below).
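The following sketch only illustrates the redundancy idea under our own assumptions; the replica count, the placement by `hash + i`, and the local-read shortcut are hypothetical and not taken from MHM2. Keys are written to several ranks during the cheap write-only phase so that later reads can often be served without crossing the network.

```cpp
#include <upcxx/upcxx.hpp>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

constexpr int REPLICAS = 2;  // hypothetical redundancy factor, memory permitting

// Redundant DHT sketch: every key is stored on REPLICAS ranks.
class RedundantMap {
  using map_t = std::unordered_map<std::string, std::string>;
  upcxx::dist_object<map_t> local_map{map_t{}};

  static std::vector<int> owners(const std::string &key) {
    std::vector<int> ranks;
    std::size_t h = std::hash<std::string>{}(key);
    for (int i = 0; i < REPLICAS; i++) ranks.push_back((h + i) % upcxx::rank_n());
    return ranks;
  }

public:
  // Write-only phase: pay the extra network I/O once, on every replica.
  upcxx::future<> insert(const std::string &key, const std::string &val) {
    upcxx::future<> done = upcxx::make_future();
    for (int r : owners(key))
      done = upcxx::when_all(done, upcxx::rpc(r,
          [](upcxx::dist_object<map_t> &m, const std::string &k, const std::string &v) {
            (*m)[k] = v;
          }, local_map, key, val));
    return done;
  }

  // Read-only phase: prefer a local replica, otherwise ask the first owner.
  upcxx::future<std::string> find(const std::string &key) {
    auto ranks = owners(key);
    for (int r : ranks)
      if (r == upcxx::rank_me()) {
        auto it = local_map->find(key);
        return upcxx::make_future(it == local_map->end() ? std::string() : it->second);
      }
    return upcxx::rpc(ranks[0],
        [](upcxx::dist_object<map_t> &m, const std::string &k) {
          auto it = m->find(k);
          return it == m->end() ? std::string() : it->second;
        }, local_map, key);
  }
};
```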
- Data reduction:
- RAID 5-like memory model.
- Use XOR parity to reconstruct dropped data (a worked example follows).
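A minimal worked example of the XOR idea in plain C++ (independent of UPC++ and of MHM2's actual data structures): one parity block per group of data blocks, so any single dropped block can be recomputed from the parity and the surviving blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// RAID 5-style parity sketch: data blocks plus one XOR parity block.
using Block = std::vector<uint8_t>;

// Parity = XOR of all data blocks (all blocks assumed to be the same size).
Block xor_parity(const std::vector<Block> &blocks) {
  Block parity(blocks.at(0).size(), 0);
  for (const Block &b : blocks)
    for (std::size_t i = 0; i < parity.size(); i++) parity[i] ^= b[i];
  return parity;
}

// A missing block is the XOR of the parity and every surviving block.
Block rebuild(const std::vector<Block> &surviving, const Block &parity) {
  Block missing = parity;
  for (const Block &b : surviving)
    for (std::size_t i = 0; i < missing.size(); i++) missing[i] ^= b[i];
  return missing;
}

int main() {
  std::vector<Block> data = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
  Block parity = xor_parity(data);
  // Drop data[1] to save memory, then recover it on demand.
  Block recovered = rebuild({data[0], data[2]}, parity);
  assert(recovered == data[1]);
  return 0;
}
```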
- Hyperparameter configuration:
- Adjust the k values used in the k-mer analysis.
- Better results and lower time consumption can be achieved by tuning the k parameter.
Lessons Learned
- Setting up the environment on the cluster:
- Use Spack and Environment Modules to manage user-mode packages.
- Learn how to use PBS and Slurm.
- Balance the number of cores occupied against queue waiting time.
- Any optimization of a parallel program is very difficult.
- The network, I/O, memory, and core scheduling all need to be considered thoroughly.
- Profiling in UPC++ can be hard:
- Try to use other parallelization methods.