
Introduction

First of all, welcome to GeekPie_HPC, a technology-neutral competition team that aspires to stand shoulder to shoulder with Tsinghua. While we do care about winning, what we emphasize even more is building everyone's practical skills (proficiency with all kinds of tools) and communication skills (pitching ideas, picking up responsibilities, and bringing new members up to speed). If you need to spend the great majority of your time grinding GPA, this club is not for you: almost all competitions land right around midterm and final exams, so what we want is a YOLO spirit.

On being a good technical communicator: whether in academia or engineering, value comes from brainstorming with others. Your work only matters if it creates value for someone else; being a technically strong loner achieves very little. We hope you treasure the opportunity to work with excellent people, watch how others do things at the weekly Slack meeting, and contribute whatever you can.

This is GeekPie_HPC's third wiki, hosted on GitHub Pages. Part of the content lives in GeekPie's wiki.js instance, and a small part is on the Confluence on the on-campus GeekPie server; to guard against the day an admin deletes everything and runs away, we keep this copy on GitHub.

This wiki is published as both static and dynamic pages:

  • Static: generated with GitHub Actions + mdBook. https://hpc.geekpie.club/wiki/
  • Dynamic: served by wiki.js, supports live editing, requires the campus network. https://wiki.geekpie.club/

Adding files

Commit Markdown files directly to the main branch; the wiki is updated about half a minute later.

If you create a new file, also update the SUMMARY.md file in the repository root.

File names follow the little-endian capitalization convention.

Requesting access

  1. Ask murez for edit access to the Git repository; you can find him at the GeekPie maker space or on Slack.

  2. On the campus network, you can register for wiki.js directly with your ShanghaiTech email. If you have trouble, ask for help in the Slack #general channel.

Joining and leaving

Recruitment announcements: Slack can be registered with a ShanghaiTech email; information about summer internships and research can be found in the Slack workspace. We also occasionally invite students and collaborators from other schools to give talks on campus.

The generated GitHub Pages site is powered by mdBook.

Algorithm

This section collects algorithms that commonly appear in the applications and benchmarks.

DGemm

A core computational problem that appears widely in convolution, HPL, and HPCG.

https://zhuanlan.zhihu.com/p/464740681

SPMV

Numerical linear algebra is a basic building block of scientific computing. Solving linear systems, linear least-squares problems, eigenvalue problems, and singular value problems is the most computationally intensive part of many scientific codes. As numerical programming developed, it became very effective to solve such problems with well-engineered subroutine libraries. When we write code that involves linear algebra, we usually decompose the computation into basic kernels such as dot products or matrix-vector products. Structured programming grew out of this: basic building blocks were specified and identified by unique mnemonic names, and the names and argument lists of these basic operations were standardized so that higher-level algebra codes could use them efficiently.

From 1973 to 1977, the first "level" of the Basic Linear Algebra Subprograms (BLAS) identified a set of kernel operations, mainly Fortran specifications and implementations of scalar and vector routines [1]. With the advent of vector processors, hierarchical memory, and shared-memory parallel machines, the second-level BLAS for matrix-vector operations and the third-level BLAS for matrix-matrix operations were specified between 1984 and 1988 [2,3]. The three "levels" of BLAS mark not only stages of its development but also the computational complexity of the operations [4]. To develop BLAS further, a BLAS Technical Forum meeting was started at a University of Tennessee symposium in 1995 to discuss the overall functionality of BLAS: sparse BLAS, dense BLAS for distributed memory, extended-precision and mixed-precision BLAS, interval BLAS, and extensions to the existing BLAS.

With the continued development of the BLAS libraries, they have been ported to many hardware platforms and serve numerical programs in many industries. Among the BLAS operations, General Matrix-Matrix Multiplication (GEMM) is the basic operation of scientific computing (high-performance computing, machine learning) and of engineering and data applications, and for every new computing platform people keep looking for optimization methods that make it run faster.

Problem description

For decades, General Matrix-Matrix Multiplication (GEMM) has been a standard benchmark for computing performance. GEMM is the most commonly used computational kernel in high-performance computing. Whether in the HPC field (FFT, convolution, correlation, filtering, and so on) or in deep learning (convolutional layers, fully connected layers, and so on), the core algorithms can be converted directly or indirectly into matrix multiplication. The GEMM computation is defined as follows:

$$C \leftarrow \alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C$$

where $\mathrm{op}(X)$ denotes the matrix $X$ itself, its transpose $X^T$, or its conjugate transpose $X^H$; $\alpha$ and $\beta$ are scalars; matrix $A$ has $m$ rows and $k$ columns, matrix $B$ has $k$ rows and $n$ columns, and matrix $C$ has $m$ rows and $n$ columns.
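As a concrete reading of the formula, here is a minimal, unoptimized reference sketch (our own illustration, not the benchmark code), assuming column-major storage and $\mathrm{op}(X) = X$:

```cpp
// Naive reference GEMM: C <- alpha * A * B + beta * C
// Assumes column-major storage (BLAS convention) and op(X) = X (no transpose).
// A is m x k with leading dimension lda, B is k x n (ldb), C is m x n (ldc).
void sgemm_reference(int m, int n, int k, float alpha,
                     const float* A, int lda,
                     const float* B, int ldb,
                     float beta, float* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```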

There are two possible families of type and precision combinations in mixed precision:

  1. All scalar parameters and output parameters (scalar or array) are double precision, and at least one array is single precision. The combinations are as follows (S = single real, D = double real, C = single complex, Z = double complex):

| α | A | B | β | C |
|---|---|---|---|---|
| D | S | S | D | D |
| D | S | D | D | D |
| D | D | S | D | D |
| Z | C | C | Z | Z |
| Z | C | Z | Z | Z |
| Z | Z | C | Z | Z |
  2. The precision of all floating-point parameters must be all single or all double. All scalar parameters and output parameters (scalars or arrays) are complex (unless mathematics requires all scalar parameters to be real, such as α and β in HERK). The combinations are as follows:

| α | A | B | β | C |
|---|---|---|---|---|
| C | S | S | C | C |
| C | S | C | C | C |
| C | C | S | C | C |
| Z | D | D | Z | Z |
| Z | D | Z | Z | Z |
| Z | Z | D | Z | Z |

BLAS implementations are usually optimized for computation speed on a specific machine, so using them can bring significant performance advantages. This competition focuses on the computational performance of single-precision real matrix multiplication (SGEMM) on a domestic advanced computing platform. Competitors can refer to the rocBLAS library [5, 6] to understand the relevant content; the API function implemented by batched SGEMM in the rocBLAS library is rocblas_sgemm_strided_batched.

As the amount of data continues to increase, so do the matrix sizes that need to be computed. Multi-batch matrix multiplication has been proposed to accelerate these computations, because it makes better use of the computing resources of hardware accelerators. The sub-matrices in each batch of the computation are separated by a fixed stride offset and have the same size. The computation is as follows:

$$C[i \cdot \mathrm{stride}_c] \leftarrow \alpha\,\mathrm{op}(A[i \cdot \mathrm{stride}_a])\,\mathrm{op}(B[i \cdot \mathrm{stride}_b]) + \beta\,C[i \cdot \mathrm{stride}_c], \qquad i \in [0,\ \mathrm{batch\_count} - 1]$$

To further improve the efficiency of the matrix computation, batch and strided strategies are introduced on top of the original matrix multiplication. To make full use of the GPU-like heterogeneous accelerators in the cluster under this computation pattern, the function implementation needs to be optimized further.

Test Methods

The example function implementing strided batched matrix-matrix operations is shown below. To help competitors optimize it, the function and its parameters are explained as follows:

```c
sgemm_strided_batched(sgemm_operation trans_a,
                      sgemm_operation trans_b,
                      int m, int n, int k,
                      const float* alpha,
                      const float* A, int lda, int stride_a,
                      const float* B, int ldb, int stride_b,
                      const float* beta,
                      float* C, int ldc, int stride_c,
                      int batch_count);

typedef enum sgemm_operation_ {
    operation_none = 0,
    operation_transpose = 1,
    operation_conjugate_transpose = 2
} sgemm_operation;
```

Input parameters:

Parameter trans_a: of type sgemm_operation. Specifies the form of op(A) used in the matrix multiplication:

If trans_a = operation_none, then op(A) = A;

If trans_a = operation_transpose, then op(A) = A^T;

If trans_a = operation_conjugate_transpose, then op(A) = conjg(A^T).

Parameter trans_b: of type sgemm_operation. Defined the same way as trans_a;

Parameter m: the number of rows of matrix A, m > 0;

Parameter n: the number of columns of matrix B, n > 0;

Parameter k: the number of columns of matrix A and the number of rows of matrix B, k > 0;

Parameter alpha: a single-precision real number, the scalar coefficient of matrix A;

Parameter A: pointer to the single-precision real matrix A stored on the GPU;

Parameter lda: the leading dimension of matrix A as actually stored, i.e. if the matrix is stored row-major then lda ≥ K, and if it is stored column-major then lda ≥ M;

Parameter stride_a: the offset from the start of one A matrix to the start of the next one in the batch;

Parameter B: pointer to the single-precision real matrix B stored on the GPU;

Parameter ldb: the leading dimension of matrix B in storage, with the same meaning as lda; stride_b is the offset from the start of one B matrix to the next;

Parameter beta: a single-precision real number, the scalar coefficient of matrix C. If beta = 0, matrix C does not need to be initialized;

Parameter C: pointer to the single-precision real matrix C stored on the GPU;

Parameter ldc: the leading dimension of matrix C in storage, with the same meaning as lda;

Parameter stride_c: the offset from the start of one C matrix to the next;

Parameter batch_count: the number of SGEMM operations in the same batch.

Output parameters

Parameter C: the matrix C, overwriting the input.
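Before tuning anything, it helps to keep a trivially correct baseline around for checking results. The sketch below is not the contest code and not the rocBLAS API, just a naive CPU implementation (hypothetical name sgemm_strided_batched_reference) of the semantics described above, assuming column-major storage and the operation_none case only:

```cpp
// Naive CPU baseline for the strided batched SGEMM semantics described above:
//   C[i*stride_c] <- alpha * op(A[i*stride_a]) * op(B[i*stride_b]) + beta * C[i*stride_c]
// Only the no-transpose case is handled; column-major storage is assumed.
// This is a hypothetical reference for correctness checks, not the contest API.
void sgemm_strided_batched_reference(int m, int n, int k, float alpha,
                                     const float* A, int lda, long long stride_a,
                                     const float* B, int ldb, long long stride_b,
                                     float beta, float* C, int ldc, long long stride_c,
                                     int batch_count) {
    for (int b = 0; b < batch_count; ++b) {
        const float* Ab = A + b * stride_a;
        const float* Bb = B + b * stride_b;
        float*       Cb = C + b * stride_c;
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)
                    acc += Ab[i + p * lda] * Bb[p + j * ldb];
                Cb[i + j * ldc] = alpha * acc + beta * Cb[i + j * ldc];
            }
    }
}
```

For example, for test case 1 below (M = N = 64, K = 32, batch = 30), a column-major packing could use lda = ldc = 64, ldb = 32, stride_a = lda·K, stride_b = ldb·N and stride_c = ldc·N; the actual packing is up to the competitor.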

Requirements:

  1. Competitors perform matrix multiplication performance optimization based on the given interface function. The API interface in the given test code cannot be changed; the non-fixed parameters involved can be tuned by the competitors themselves according to the matrix sizes used in the computation;

  2. In the test sample section, the code provides dense matrices generated from pseudo-random numbers as the test data used to verify the performance of the implementation. The main test cases are the following three sizes:

| # | M | N | K | Batch |
|---|-----|-----|-----|-------|
| 1 | 64 | 64 | 32 | 30 |
| 2 | 128 | 128 | 64 | 20 |
| 3 | 128 | 512 | 256 | 10 |

  3. Competitors need to submit the code implementing the function, the build method for the function library, the generated dynamic link library, the test samples, the test procedure, and the encrypted sequence of the run results.

Note: Contestants should use the test script provided by the organizing committee as the basis and improve the function implementation part. To avoid affecting the contestants' scores, please compile and run the code that computes the corresponding matrices, and upload the generated encrypted timing sequence to the designated location on the web page.

SGEMM Kernel Optimization on VEGA

https://github.com/victoryang00/SGEMM_on_VEGA

Reference

  1. C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh. Basic Linear Algebra Subprograms for FORTRAN usage. ACM Trans.Math.Software,5:308-323,1979.
  2. J. J. Dongarra, J. Du Croz, I. S. Duff, and D. Hammarling. A set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math.Software,16:1-28,1990.
  3. J. J. Dongarra, J. Du Croz, D. Hammarling, R. J. Hanson. An extended set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Software, 14:1-32, 399, 1988.
  4. https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
  5. https://i-techx.github.io/iTechX/courses?course_code=CS121

FFT

An algorithm widely used in PDE solvers and numerical simulation.

Single thread

Parallel FFT

ScalaFFT

DFT
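For reference, the transform that all of the FFT variants above compute is the discrete Fourier transform of a length-$N$ sequence:

$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i k n / N}, \qquad k = 0, 1, \ldots, N-1$$

A radix-2 FFT evaluates this in $O(N \log N)$ operations instead of the $O(N^2)$ of the naive sum.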

References

  1. https://i-techx.github.io/iTechX/courses?course_code=CS121

MHM2 Adjusting k-mers

A method of visualizing k-mers, the k-mer spectrum, shows the multiplicity of each k-mer in a sequence versus the number of k-mers with that multiplicity. It requires a DHT to store the sequence.
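As a small illustration of what a k-mer spectrum is (our own single-node sketch; MHM2 itself computes the counts across a distributed hash table), the code below counts k-mer multiplicities in one sequence and then histograms them:

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

// Count k-mer multiplicities in one sequence, then histogram them:
// spectrum[m] = number of distinct k-mers that occur exactly m times.
std::map<uint64_t, uint64_t> kmer_spectrum(const std::string& seq, int k) {
    std::unordered_map<std::string, uint64_t> counts;
    for (size_t i = 0; i + k <= seq.size(); ++i)
        ++counts[seq.substr(i, k)];
    std::map<uint64_t, uint64_t> spectrum;
    for (const auto& kv : counts)
        ++spectrum[kv.second];
    return spectrum;
}

int main() {
    for (const auto& kv : kmer_spectrum("ACGTACGTAC", 3))
        std::cout << "multiplicity " << kv.first << ": " << kv.second << " k-mers\n";
}
```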


The default parameters are good enough for the dataset in the competition.

  • Modifying those parameters will influence accuracy.

    • Adding an iteration slightly increases the number of long sequences; the result stays within the acceptable range, but the run is about 1/7 slower than the original.
    • Removing an iteration greatly increases speed (about 1/7 faster than the original), but the result differs dramatically from the reference.
    • Adjusting the values of k will not make MHM2 much faster/slower, and the result would still be acceptable if k is not changed dramatically.
  • From the paper, we learn that the preset k is good enough for most of the cases

    • Too large k is not fair to low-coverage genomes
    • Too small k may not be able to detect errors produced by the sequencer.

Applications

This section collects our experience with the applications from the various competitions GeekPie_HPC has taken part in.

CESM

Build & Running

OneKeyConf

```bash
./create_newcase -res 0.47x0.63_gx1v6 -compset B -case ../EXP2 -mach pleiades-ivy
mkdir nobackup
ln -s /home/cesm/data/inputdata_EXP1/ nobackup/inputdata
# EXP1:
./xmlchange -file env_run.xml -id DOCN_SOM_FILENAME -val pop_frc.gx1v6.091112.nc
./xmlchange -file env_build.xml -id CESMSCRATCHROOT -val `pwd`'/nobackup/$USER'
./xmlchange -file env_build.xml -id EXEROOT -val `pwd`'/nobackup/$CCSMUSER/$CASE/bld'
./xmlchange -file env_run.xml -id RUNDIR -val `pwd`'/nobackup/$CCSMUSER/$CASE/run'
./xmlchange -file env_run.xml -id DIN_LOC_ROOT -val `pwd`'/nobackup/inputdata'
./xmlchange -file env_run.xml -id DIN_LOC_ROOT_CLMFORC -val `pwd`'/nobackup/inputdata/atm/datm7'
./xmlchange -file env_run.xml -id DOUT_S_ROOT -val `pwd`'/nobackup/$CCSMUSER/archive/$CASE'
./xmlchange -file env_run.xml -id RUN_STARTDATE -val 2000-01-01
./xmlchange -file env_build.xml -id BUILD_THREADED -val TRUE
# edit Macro: SLIBS -lnetcdff
# edit env_mach_specific
./cesm_setup
```

ybs.sh

```bash
./EXP2.clean_build all
./cesm_setup -clean
rm -rf $build_dir
./cesm_setup
./EXP2.build
```

PBS

```bash
##PBS -N dappur
##PBS -q pub_blad_2
##PBS -j oe
##PBS -l walltime=00:01:00
##PBS -l nodes=1:ppn=28
```

Performance Tuning

Trouble Shooting

High sys percentage in top (>20%)

This is apparently a communication problem. Try switching to Intel MPI, which gives a very low sys percentage (<1%).

ERROR: remap transport: bad departure points

```
Warning: Departure points out of bounds in remap
my_task, i, j = 182 4 8
dpx, dpy = -5925130.21408796 -0.368922055964299
HTN(i,j), HTN(i+1,j) = 72848.1354852604 72848.1354852604
HTE(i,j), HTE(i,j+1) = 59395.4550164223 59395.4550164223
istep1, my_task, iblk = 1095001 182 1
Global block: 205
Global i and j: 35 47
(shr_sys_abort) ERROR: remap transport: bad departure points
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 182
```

This error may be due to multiple reasons.

One significant cause is bad grid division. We were using one PE per processor core, so the total number of PEs was not a power of 2. After switching to 128 (and later 256) PEs the error went away, until it showed up again after 6 months of simulated time...

Another relevant factor is the parameter xndt_dyn, see link. This parameter had already been set to 2 after solving the previous problem (it was originally 1). We tried increasing it again; the run got past the 6-month mark but crashed after another 3 months. Continuing to increase the value made it crash sooner. We stopped at about 20 months of simulated time and switched to the GNU compiler build with Intel MPI.

However, this does not mean it is the Intel compiler's fault. A direct comparison between the Intel and GNU compilers is unfair, because the combination of the Intel compiler, xndt_dyn=1, and, most importantly, the correct number of PEs has not been tried. Maybe try xndt_dyn=1 from the beginning next time with the Intel compiler.

OpenMP failed

Still not solved, but very promising for improving performance.

fixed in WRF

quest analysis

program goal analysis

What the code actually does is simulate quantum computing.

Different bit states: qubits

Three possible states: 1, 0, or a superposition of 0 and 1.

Each qubit amplitude is stored as a pair of qreal values forming a complex number a+bi (the squared magnitudes of a state's amplitudes sum to 1), e.g. (0.1231240, 0.876876). Also note that quad-precision qreal (QuEST_PREC=4, long double) is not compatible with GPUs, so it is not supported in GPU simulation.

```c
/* Single precision, which uses 4 bytes per amplitude component */
# if QuEST_PREC==1
    # define qreal float
    // \cond HIDDEN_SYMBOLS
    # define MPI_QuEST_REAL MPI_FLOAT
    # define MPI_MAX_AMPS_IN_MSG (1LL<<29) // must be 2^int
    # define REAL_STRING_FORMAT "%.8f"
    # define REAL_QASM_FORMAT "%.8g"
    # define REAL_EPS 1e-5
    # define absReal(X) fabs(X) // not fabsf(X) - better to return doubles where possible
    // \endcond
/* Double precision, which uses 8 bytes per amplitude component */
# elif QuEST_PREC==2
    # define qreal double
    // \cond HIDDEN_SYMBOLS
    # define MPI_QuEST_REAL MPI_DOUBLE
    # define MPI_MAX_AMPS_IN_MSG (1LL<<28) // must be 2^int
    # define REAL_STRING_FORMAT "%.14f"
    # define REAL_QASM_FORMAT "%.14g"
    # define REAL_EPS 1e-13
    # define absReal(X) fabs(X)
    // \endcond
/*
 * Quad precision, which uses 16 bytes per amplitude component.
 * This is not compatible with most GPUs.
 */
# elif QuEST_PREC==4
    # define qreal long double
    // \cond HIDDEN_SYMBOLS
    # define MPI_QuEST_REAL MPI_LONG_DOUBLE
    # define MPI_MAX_AMPS_IN_MSG (1LL<<27) // must be 2^int
    # define REAL_STRING_FORMAT "%.17Lf"
    # define REAL_QASM_FORMAT "%.17Lg"
    # define REAL_EPS 1e-14
    # define absReal(X) fabsl(X)
    // \endcond
# endif
```

many matrices computation


Each of the gates corresponds to one manipulation of the qubits.

Basic operations on a and b: https://arxiv.org/pdf/quant-ph/0207118.pdf

Random variables = density matrix $\rho$:

hermitian: $\rho = \rho^\dagger$

positive semidefinite: eigenvalues $\geq 0$

trace: $\sum$ (diagonal elements) $= 1$

Dirac notation: ket $v_\phi = |\phi\rangle = \begin{pmatrix}\phi_0 \\ \phi_1\end{pmatrix}$

bra $v_\phi^\dagger = \langle\phi| = (\phi_0^*\ \ \phi_1^*)$

$\langle\phi|\psi\rangle$ = inner product of bra $\langle\phi|$ and ket $|\psi\rangle$. Notice: $\langle\phi|\phi\rangle = 1$.

$|\phi\rangle|\psi\rangle$ = tensor product of ket $|\phi\rangle$ and ket $|\psi\rangle$.

Two special notations: $u_0 = |0\rangle = \begin{pmatrix}1 \\ 0\end{pmatrix}$, $v_1 = |1\rangle = \begin{pmatrix}0 \\ 1\end{pmatrix}$

The density matrix $\rho = \begin{pmatrix}q_0 & 0 \\ 0 & q_1\end{pmatrix}$ (with $q_0 + q_1 = 1$; the purpose of the equation is to illustrate the complex amplitudes) can be written as $\rho = q_0|0\rangle\langle 0| + q_1|1\rangle\langle 1|$,

so $\rho|0\rangle = (q_0|0\rangle\langle 0| + q_1|1\rangle\langle 1|)\,|0\rangle = q_0|0\rangle$.

From classical bits to qubits: $|ab\rangle = |a\rangle|b\rangle$; a general two-qubit state is $v_{00}|00\rangle + v_{01}|01\rangle + v_{10}|10\rangle + v_{11}|11\rangle \mapsto (v_{00}\ v_{01}\ v_{10}\ v_{11})^T$.

For example, in bits $5 = 101_2$, while in qubits $|5\rangle_3 = |101\rangle = |1\rangle\otimes|0\rangle\otimes|1\rangle = \begin{pmatrix}0\\1\end{pmatrix}\otimes\begin{pmatrix}1\\0\end{pmatrix}\otimes\begin{pmatrix}0\\1\end{pmatrix} = (0\ 0\ 0\ 0\ 0\ 1\ 0\ 0)^T$
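A tiny sketch of the same indexing convention in code (our own illustration, not QuEST's API): for an $n$-qubit register the simulator stores $2^n$ complex amplitudes, and a computational basis state is just a one-hot amplitude vector, so $|101\rangle$ puts the single non-zero amplitude at index 5:

```cpp
#include <complex>
#include <iostream>
#include <vector>

// Build the 2^n amplitude vector for a computational basis state |b_{n-1} ... b_1 b_0>.
// For |101> (decimal 5) the only non-zero amplitude sits at index 5, matching the
// Kronecker-product expansion above.
std::vector<std::complex<double>> basis_state(unsigned n, unsigned value) {
    std::vector<std::complex<double>> amps(1ULL << n, {0.0, 0.0});
    amps[value] = {1.0, 0.0};
    return amps;
}

int main() {
    auto psi = basis_state(3, 5);   // |5>_3 = |101>
    for (size_t i = 0; i < psi.size(); ++i)
        std::cout << "amp[" << i << "] = " << psi[i].real() << "\n";
}
```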

Hadamard gate operations

$H(|0\rangle) = \frac{1}{\sqrt 2}|0\rangle + \frac{1}{\sqrt 2}|1\rangle =: |+\rangle$

$H(|1\rangle) = \frac{1}{\sqrt 2}|0\rangle - \frac{1}{\sqrt 2}|1\rangle =: |-\rangle$

$H\!\left(\frac{1}{\sqrt 2}|0\rangle + \frac{1}{\sqrt 2}|1\rangle\right) = \frac{1}{2}(|0\rangle + |1\rangle) + \frac{1}{2}(|0\rangle - |1\rangle) = |0\rangle$

$H\!\left(\frac{1}{\sqrt 2}|0\rangle - \frac{1}{\sqrt 2}|1\rangle\right) = \frac{1}{2}(|0\rangle + |1\rangle) - \frac{1}{2}(|0\rangle - |1\rangle) = |1\rangle$

The corresponding matrix in Dirac notation: $H_1 = \frac{1}{\sqrt 2}\begin{pmatrix}1 & 1 \\ 1 & -1\end{pmatrix}$

some specialty:

  1. $H = \frac{|0\rangle + |1\rangle}{\sqrt 2}\langle 0| + \frac{|0\rangle - |1\rangle}{\sqrt 2}\langle 1|$
  2. Since $HH^\dagger = I$ where $I$ is the identity matrix, $H$ is a unitary matrix (like all other quantum logic gates). Also, it is its own unitary inverse, $H = H^\dagger$.

One application of the Hadamard gate to either a 0 or 1 qubit will produce a quantum state that, if observed, will be a 0 or 1 with equal probability (as seen in the first two operations). This is exactly like flipping a fair coin in the standard probabilistic model of computation. However, if the Hadamard gate is applied twice in succession (as is effectively being done in the last two operations), then the final state is always the same as the initial state.
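Before reading the GPU kernel below, it may help to see the index pairing it uses in a simplified serial form (our own sketch; QuEST itself keeps real and imaginary parts in separate arrays, while std::complex is used here only for brevity): for target qubit $t$, every amplitude whose index has bit $t$ clear is paired with the amplitude at offset $2^t$, and each pair is mixed by $\frac{1}{\sqrt 2}\begin{pmatrix}1 & 1\\ 1 & -1\end{pmatrix}$.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Serial sketch of the pairing used by the Hadamard kernel: for target qubit t,
// amplitude index i (with bit t clear) is paired with i + 2^t, and the pair is
// mixed by H = 1/sqrt(2) * [[1, 1], [1, -1]].
void apply_hadamard(std::vector<std::complex<double>>& amps, int targetQubit) {
    const long long half = 1LL << targetQubit;
    const double r = 1.0 / std::sqrt(2.0);
    for (long long block = 0; block < (long long)amps.size(); block += 2 * half) {
        for (long long i = block; i < block + half; ++i) {
            const auto up = amps[i];
            const auto lo = amps[i + half];
            amps[i]        = r * (up + lo);
            amps[i + half] = r * (up - lo);
        }
    }
}
```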

__global__ void statevec_hadamardKernel (Qureg qureg, const int targetQubit){ // ----- sizes long long int sizeBlock, // size of blocks sizeHalfBlock; // size of blocks halved // ----- indices long long int thisBlock, // current block indexUp,indexLo; // current index and corresponding index in lower half block // ----- temp variables qreal stateRealUp,stateRealLo, // storage for previous state values stateImagUp,stateImagLo; // (used in updates) // ----- temp variables long long int thisTask; // task based approach for expose loop with small granularity const long long int numTasks=qureg.numAmpsPerChunk>>1; sizeHalfBlock = 1LL << targetQubit; // size of blocks halved sizeBlock = 2LL * sizeHalfBlock; // size of blocks // ---------------------------------------------------------------- // // rotate // // ---------------------------------------------------------------- // //! fix -- no necessary for GPU version qreal *stateVecReal = qureg.deviceStateVec.real; qreal *stateVecImag = qureg.deviceStateVec.imag; qreal recRoot2 = 1.0/sqrt(2.0); thisTask = blockIdx.x*blockDim.x + threadIdx.x; if (thisTask>=numTasks) return; thisBlock = thisTask / sizeHalfBlock; indexUp = thisBlock*sizeBlock + thisTask%sizeHalfBlock; indexLo = indexUp + sizeHalfBlock; // store current state vector values in temp variables stateRealUp = stateVecReal[indexUp]; stateImagUp = stateVecImag[indexUp]; stateRealLo = stateVecReal[indexLo]; stateImagLo = stateVecImag[indexLo]; stateVecReal[indexUp] = recRoot2*(stateRealUp + stateRealLo); stateVecImag[indexUp] = recRoot2*(stateImagUp + stateImagLo); stateVecReal[indexLo] = recRoot2*(stateRealUp - stateRealLo); stateVecImag[indexLo] = recRoot2*(stateImagUp - stateImagLo); } void statevec_hadamard(Qureg qureg, const int targetQubit) { int threadsPerCUDABlock, CUDABlocks; threadsPerCUDABlock = 128; CUDABlocks = ceil((qreal)(qureg.numAmpsPerChunk>>1)/threadsPerCUDABlock); statevec_hadamardKernel<<<CUDABlocks, threadsPerCUDABlock>>>(qureg, targetQubit); }

Pauli-X/Y/Z gate

The Pauli-X gate acts on a single qubit. It is the quantum equivalent of the classical NOT gate, with matrix $X = \begin{pmatrix}0 & 1 \\ 1 & 0\end{pmatrix}$.

```c
void pauliX(Qureg qureg, const int targetQubit) {
    validateTarget(qureg, targetQubit, __func__);

    statevec_pauliX(qureg, targetQubit);
    if (qureg.isDensityMatrix) {
        statevec_pauliX(qureg, targetQubit+qureg.numQubitsRepresented);
    }

    qasm_recordGate(qureg, GATE_SIGMA_X, targetQubit);
}
```

the real computing part

void statevec_pauliXLocal(Qureg qureg, const int targetQubit) { long long int sizeBlock, sizeHalfBlock; long long int thisBlock, // current block indexUp,indexLo; // current index and corresponding index in lower half block qreal stateRealUp,stateImagUp; long long int thisTask; const long long int numTasks=qureg.numAmpsPerChunk>>1; // set dimensions sizeHalfBlock = 1LL << targetQubit; sizeBlock = 2LL * sizeHalfBlock; // Can't use qureg.stateVec as a private OMP var qreal *stateVecReal = qureg.stateVec.real; qreal *stateVecImag = qureg.stateVec.imag; # ifdef _OPENMP # pragma omp parallel \ default (none) \ shared (sizeBlock,sizeHalfBlock, stateVecReal,stateVecImag) \ private (thisTask,thisBlock ,indexUp,indexLo, stateRealUp,stateImagUp) # endif { # ifdef _OPENMP # pragma omp for schedule (static) # endif for (thisTask=0; thisTask<numTasks; thisTask++) { thisBlock = thisTask / sizeHalfBlock; indexUp = thisBlock*sizeBlock + thisTask%sizeHalfBlock; indexLo = indexUp + sizeHalfBlock; stateRealUp = stateVecReal[indexUp]; stateImagUp = stateVecImag[indexUp]; stateVecReal[indexUp] = stateVecReal[indexLo]; stateVecImag[indexUp] = stateVecImag[indexLo]; stateVecReal[indexLo] = stateRealUp; stateVecImag[indexLo] = stateImagUp; } } } void statevec_pauliXDistributed (Qureg qureg, ComplexArray stateVecIn, ComplexArray stateVecOut) { long long int thisTask; const long long int numTasks=qureg.numAmpsPerChunk; qreal *stateVecRealIn=stateVecIn.real, *stateVecImagIn=stateVecIn.imag; qreal *stateVecRealOut=stateVecOut.real, *stateVecImagOut=stateVecOut.imag; # ifdef _OPENMP # pragma omp parallel \ default (none) \ shared (stateVecRealIn,stateVecImagIn,stateVecRealOut,stateVecImagOut) \ private (thisTask) # endif { # ifdef _OPENMP # pragma omp for schedule (static) # endif for (thisTask=0; thisTask<numTasks; thisTask++) { stateVecRealOut[thisTask] = stateVecRealIn[thisTask]; stateVecImagOut[thisTask] = stateVecImagIn[thisTask]; } } }
__global__ void statevec_pauliXKernel(Qureg qureg, const int targetQubit){ // ----- sizes long long int sizeBlock, // size of blocks sizeHalfBlock; // size of blocks halved // ----- indices long long int thisBlock, // current block indexUp,indexLo; // current index and corresponding index in lower half block // ----- temp variables qreal stateRealUp, // storage for previous state values stateImagUp; // (used in updates) // ----- temp variables long long int thisTask; // task based approach for expose loop with small granularity const long long int numTasks=qureg.numAmpsPerChunk>>1; sizeHalfBlock = 1LL << targetQubit; // size of blocks halved sizeBlock = 2LL * sizeHalfBlock; // size of blocks // ---------------------------------------------------------------- // // rotate // // ---------------------------------------------------------------- // //! fix -- no necessary for GPU version qreal *stateVecReal = qureg.deviceStateVec.real; qreal *stateVecImag = qureg.deviceStateVec.imag; thisTask = blockIdx.x*blockDim.x + threadIdx.x; if (thisTask>=numTasks) return; thisBlock = thisTask / sizeHalfBlock; indexUp = thisBlock*sizeBlock + thisTask%sizeHalfBlock; indexLo = indexUp + sizeHalfBlock; // store current state vector values in temp variables stateRealUp = stateVecReal[indexUp]; stateImagUp = stateVecImag[indexUp]; stateVecReal[indexUp] = stateVecReal[indexLo]; stateVecImag[indexUp] = stateVecImag[indexLo]; stateVecReal[indexLo] = stateRealUp; stateVecImag[indexLo] = stateImagUp; } void statevec_pauliX(Qureg qureg, const int targetQubit) { int threadsPerCUDABlock, CUDABlocks; threadsPerCUDABlock = 128; CUDABlocks = ceil((qreal)(qureg.numAmpsPerChunk>>1)/threadsPerCUDABlock); statevec_pauliXKernel<<<CUDABlocks, threadsPerCUDABlock>>>(qureg, targetQubit); }

source code analysis

tree

```
.
├── CMakeLists.txt
├── include
│   ├── QuEST_complex.h    // determine whether to use native C++ complex or C complex support
│   ├── QuEST.h            // main function declarations
│   └── QuEST_precision.h  // define the precision
└── src
    ├── CMakeLists.txt
    ├── CPU
    │   ├── CMakeLists.txt
    │   ├── QuEST_cpu.c
    │   ├── QuEST_cpu_distributed.c  // distributed activator and implementation
    │   ├── QuEST_cpu_internal.h     // other CPU-related headers
    │   └── QuEST_cpu_local.c        // CPU-only implementation
    ├── GPU
    │   ├── CMakeLists.txt
    │   └── QuEST_gpu.cu             // GPU counterpart
    ├── mt19937ar.c                  // Mersenne Twister pseudo-random number generation
    ├── mt19937ar.h
    ├── QuEST.c                      // main function definitions
    ├── QuEST_common.c               // function activators defined here
    ├── QuEST_debug.h                // debug information
    ├── QuEST_internal.h
    ├── QuEST_qasm.c                 // QASM is a quantum circuit record standard; QASM assertions defined here
    ├── QuEST_qasm.h
    ├── QuEST_validation.c           // assert number of qubits here
    └── QuEST_validation.h
```

https://www.quantum-inspire.com/kbase/introduction-to-quantum-computing

testcase analysis

mytimer.hpp

```cpp
#include <time.h>
#include <sys/time.h>

double get_wall_time() {
    /* A time value that is accurate to the nearest
       microsecond but also has a range of years. */
    struct timeval time;
    // __time_t      tv_sec;  /* Seconds. */
    // __suseconds_t tv_usec; /* Microseconds. */
    if (gettimeofday(&time, NULL)) {
        // Handle error
        return 0;
    }
    return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

double get_cpu_time() {
    return (double)clock() / CLOCKS_PER_SEC; // directly read the clock from the CPU, divided by clocks per second
}
```

random.c - random manipulation

// total number of qubit: 30 // total number of qubit operatations: 667 // estimated time: 3783.9266747315614 second. #include "QuEST.h" #include "mytimer.hpp" #include "stdio.h" int main(int narg, char *argv[]) { QuESTEnv Env = createQuESTEnv(); double t1 = get_wall_time();//define starting time FILE *fp = fopen("probs.dat", "w");//open file for result if (fp == NULL) { printf(" open probs.dat failed, Bye!"); return 0; } FILE *fvec = fopen("stateVector.dat", "w"); if (fp == NULL) { printf(" open stateVector.dat failed, Bye!"); return 0; } Qureg q = createQureg(30, Env);//define qubits registers float q_measure[30];// defined q's size // possible execution. tGate(q, 25); controlledNot(q, 28, 21); controlledRotateX(q, 17, 5, 0.3293660327520663); tGate(q, 3); rotateX(q, 10, 4.734238389048838); rotateY(q, 8, 4.959946047271496); rotateZ(q, 5, 1.0427019597472071); pauliZ(q, 0); ... printf("\n"); for (long long int i = 0; i < 30; ++i) { q_measure[i] = calcProbOfOutcome(q, i, 1); printf(" probability for q[%2lld]==1 : %lf \n", i, q_measure[i]); fprintf(fp, "Probability for q[%2lld]==1 : %lf \n", i, q_measure[i]); } fprintf(fp, "\n"); printf("\n"); for (int i = 0; i < 10; ++i) { Complex amp = getAmp(q, i); printf("Amplitude of %dth state vector: %12.6f,%12.6f\n", i, amp.real, amp.imag); } double t2 = get_wall_time(); printf("Complete the simulation takes time %12.6f seconds.", t2 - t1); printf("\n"); destroyQureg(q, Env); destroyQuESTEnv(Env); return 0; }

GHZ_QFT.c - only controlled manipulation

/* GHZ quantum circuit */ hadamard(q, 0); controlledNot(q, 0, 1); controlledNot(q, 1, 2); controlledNot(q, 2, 3); controlledNot(q, 3, 4); controlledNot(q, 4, 5); controlledNot(q, 5, 6); controlledNot(q, 6, 7); controlledNot(q, 7, 8); controlledNot(q, 8, 9); controlledNot(q, 9, 10); controlledNot(q, 10, 11); controlledNot(q, 11, 12); controlledNot(q, 12, 13); controlledNot(q, 13, 14); controlledNot(q, 14, 15); controlledNot(q, 15, 16); controlledNot(q, 16, 17); controlledNot(q, 17, 18); controlledNot(q, 18, 19); controlledNot(q, 19, 20); controlledNot(q, 20, 21); controlledNot(q, 21, 22); controlledNot(q, 22, 23); controlledNot(q, 23, 24); controlledNot(q, 24, 25); controlledNot(q, 25, 26); controlledNot(q, 26, 27); controlledNot(q, 27, 28); controlledNot(q, 28, 29); /* end of GHZ circuit */ /* QFT starts */ hadamard(q, 0); controlledRotateZ(q, 0, 1, 1.5708); hadamard(q, 1); controlledRotateZ(q, 0, 2, 0.785398); controlledRotateZ(q, 1, 2, 1.5708); hadamard(q, 2); controlledRotateZ(q, 0, 3, 0.392699); controlledRotateZ(q, 1, 3, 0.785398); controlledRotateZ(q, 2, 3, 1.5708); ...

available test machine

  1. 2 nodes, 16 cores each, MPI:OMP = 2:16

    ```bash
    #!/bin/sh
    module purge
    spack load intel ##openmpi@3.1.5/3.1.2
    export PRECISION=4 ##1/2/4
    CC=icc CXX=icpc cmake -DGPUACCELERATED=0 -DDISTRIBUTED=1 ..
    make
    export OMP_NUM_THREADS=16
    export FI_PROVIDER=tcp
    mpirun -machinefile mac -np 2 ./demo
    ```

    profiling result

    the most time-consuming part is statevec_compactUnitaryLocal

  2. 2 nodes, 16 cores each, MPI:OMP = 1:32

  3. 1 node, 1 Tesla V100

    script

    ```bash
    #!/bin/sh
    module purge
    spack load gcc@6
    spack load cuda@10.1 ## 10.2
    export PATH=$PATH:/usr/local/cuda/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
    export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/cuda/lib64
    export PRECISION=2 ##1/2
    CC=gcc CXX=g++ cmake -DGPUACCELERATED=1 -DGPU_COMPUTE_CAPABILITY=70 ..
    make
    ./demo
    ```

    profiling result

summary


Summarizing the profiling of both the CPU and GPU runs, most of the time is spent computing the actual kernels, so we think the computing power is fully utilized.

Speedup of the single-node OMP+MPI version: 319.799 / 220.807 = 1.448.

Speedup of a single GPU over the single node: 319.799 / 19.328 = 16.546.

Power consumption: over CPU, see the profiling figure (not preserved here); over GPU, 111 W on average.

Our future plans:

  1. Deploy the GPU code on multiple GPUs using NCCL.
  2. Solve the global memory store and load efficiency issues.

misc

Loves from Github

  1. https://github.com/QuEST-Kit/QuEST/issues/220
Hi Jiachen, There are no plans currently to combine distribution with GPU-acceleration. Note there are a few ways this can be done, and I suspect none really align with QuEST's design philosophy, nor are practical due to memory overheads. I've wanted to pen these thoughts for a while, so read on below if interested! :) Firstly, QuEST uses its hardware to accelerate the simulation of a single quantum register at a time. While I think there are good uses of multi-GPU to speedup simultaneous simulation of multiple registers, this would be a totally new pattern to QuEST's simulation style. So let's consider using multi-GPU to accelerate a single register. There are a few ways you can have "multiple GPUs": multiple NVlinked GPUs This is when you have multiple GPUs tightly connected with a high-bandwidth fabric (e.g. this). The bandwidth is enough that you sort of can imagine it as a single big GPU, and hence it would be worthwhile for accelerating single-register simulation. However, this only exists right now as NVLink and NVSwitch, compatible only with IBM's POWER architecture - you could argue this is still esoteric, and not worth a big refactor. Note it wouldn't actually be very hard to refactor QuEST for this platform - indeed QuEST works out-of-the-box with POWER8. But it's not something on our TODO list currently. multiple local GPUs This is when you have multiple GPUs on the same machine, but maybe on different sockets and hence with a much lower bandwidth between them. The most common case is two GPUs - is it worthwhile using two GPUs over one to speedup single register simulation? Often, no! In big QC simulation, having to move memory around is often the big killer, and should be avoided where possible. Unfortunately, simulating unitaries on registers often requires moving memory. If all the memory stays in the GPU (very high "internal bandwidth"), this is ok, but copying memory to the other GPU (across the socket) will introduce a huge per-gate overhead! Hence, using two GPUs to simulate the same register size can be slower than using just one, especially as the simulation size grows and saturates the sockets! There's hardly a benefit from the extra VRAM too, because doubling the memory enables simulation of one additional qubit. This is not worth the slowdown, or the hardware! Even with more than two GPUs, the connections are likely hierarchical and so even more prone to saturation. distributed GPUs This is when you have a GPU(s) on each distributed node of a cluster. In this circumstance, simulating a unitary gate which requires data exchange not only costs us a VRAM to RAM overhead (similar to before), but a networking overhead to talk to the other nodes! This can be somewhat improved by having a direct GPU to network-card connection (and MPI abstraction), but I believe that's pretty cutting-edge. Let's say you have n nodes, each with a GPU and a multicore CPU, and you're resolved to a distributed simulation. When is it worthwhile to pay the extra memory overhead locally copying from RAM to VRAM (and use the GPU), over using just the CPUs? This is now the same trade-off to consider in the previous cases. So may or may not be worthwhile. TL-DR: besides the somewhat esoteric case of having multiple tightly-connected GPUs, multi-GPU simulation introduces a new memory overhead that doesn't exist in single-GPU simulation. This overhead is almost always way longer than the time the GPU spends simulating the gate. 
As to whether the whole simulation is sped up by the use of multi-GPU is system and simulation specific.
  2. https://github.com/NVIDIA/nccl/pull/316 This is a PR for people to review and provide feedback on the p2p branch (issue #212).
Looking forward to applying the P2P function to increase the power of my project!
  3. THU published their modified version as an ICS best paper.
  4. NUDT modified the code to offload memory to main DRAM alongside GPU memory.

ISC

Awards

  • Overall champion: one team, awarded to the team with the highest score across the benchmark runs and the on-site presentation.
  • HPL champion: one team, awarded to the team with the best HPL result.
  • Fan favorite: one team, awarded to the team that receives the most votes from ISC13 attendees during the competition.

Tasks

Benchmarks such as HPL, four other applications, and one mystery application.

ISC 21

Rewind: https://victoryang00.cn/wordpress/2021/06/27/isc-21hui-gu/

AutoTuning is basically a simple competitive-programming problem.

This task was originally a small internal NVIDIA tool for OSU-style testing, handed to us as a problem. It asks for simple tuning based on the data-exchange capability between different ranks.

Task 1-3: Understand MPI_alltoallv calls

Write a program with an input flag selecting the pattern, and run it on the Niagara cluster using 4 nodes with 40 ppn each (fully subscribed), 160 ranks in total, with both balanced and unbalanced patterns.

The program should run 1000 iterations of MPI_Alltoallv with the specified characteristics.
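The exact count characteristics from the task sheet are not reproduced here; purely as an illustration of the shape of such a benchmark, the sketch below runs 1000 iterations of MPI_Alltoallv with a simple made-up unbalanced pattern (rank r sends dest+1 integers to rank dest). A balanced pattern would simply use the same send count for every destination.

```cpp
#include <mpi.h>
#include <numeric>
#include <vector>

// One iteration of MPI_Alltoallv with a simple (hypothetical) unbalanced pattern.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<int> sendcounts(size), recvcounts(size), sdispls(size), rdispls(size);
    for (int dest = 0; dest < size; ++dest) sendcounts[dest] = dest + 1;  // unbalanced
    for (int src = 0; src < size; ++src)  recvcounts[src]  = rank + 1;    // what each source sends to us
    std::partial_sum(sendcounts.begin(), sendcounts.end() - 1, sdispls.begin() + 1);
    std::partial_sum(recvcounts.begin(), recvcounts.end() - 1, rdispls.begin() + 1);

    std::vector<int> sendbuf(sdispls.back() + sendcounts.back(), rank);
    std::vector<int> recvbuf(rdispls.back() + recvcounts.back());

    for (int iter = 0; iter < 1000; ++iter)
        MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_INT,
                      recvbuf.data(), recvcounts.data(), rdispls.data(), MPI_INT,
                      MPI_COMM_WORLD);

    MPI_Finalize();
}
```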

Task 4: Use Go+Front end to visualize the alltoallv pattern

We chose Go message passing because we did not need the performance for the dynamic part, and we drew the graphs with Ant Design charts rather than directly with gnuplot2, which is a legacy way to display them. The downside is that the frontend occupies too much memory and takes a little longer, especially for the WRF case.

Task 5: Write an online algorithm to re-affinitize the pattern to make it faster.

Static calculation using MPI static analysis, plus a DP swap for the red (hot) part of the data heatmap. The code

LAMMPS

Problem Discovery

  • The Intel package by W. Michael Brown speeds up the CPU performance roughly 2x on both the Broadwell and Skylake chassis. The only difference is -xCORE-AVX512.
  • Communication overhead is extremely unbalanced in the Protein case because Comm::Brick::reverse_comm calls MPI_Waitany too many times. This is solvable by defining the grid box.
  • Kokkos (from Sandia) is extremely useful for resource allocation on the GPU. However, the GPU does not bring an aggressive improvement, possibly because of the sparse data.

Result

  • Intel Package Buffer - cache friendly and vectorized
  • FFTW comparison by project-gemmi/: mostly butterfly 3D FFT operations, for which FFTW is the best.

Lesson Learned

  • The environment variable settings may affect whether the application executes correctly; they may also affect its efficiency.
  • Architecture may affect the performance.
    • AVX-512 may reduce the CPU frequency and hence reduce performance.
  • Multiple nodes do not guarantee a performance improvement.
    • Communication overhead may eat up the performance gain.
  • A dedicated package may bring an additional performance gain.
    • Most of the gain comes from the USER-INTEL package (by Intel®).
  • We found CMake was too smart in handling the compiler option that triggers halving the size of the addme array in the half-neighbor computation; once we switched to plain make, the problem was solved.
  • The Protein case still segfaults when using the Intel package on NSCC, so we rolled back to no package for that single case.

GPAW

  • Cython program

    • Pros and Cons of Hybrid MPI/OMP
    • 70% runtime in C, 30% runtime in Python
  • Computation intense program

    • Highly depend on Math library
  • Hybrid MPI/OpenMP program

    • Pros and Cons of Hybrid MPI/OMP
    • Balance of MPI/OpenMP

GPU Accelerated

  • ELPA
    • A highly efficient and highly scalable direct eigensolver for symmetric (Hermitian) matrices.
    • with this math library, the performance can increase 3x-5x.

Profiling

  • According to the IPM profile information, we figured out that MPI_Allreduce is the most time-consuming part.
    • We tried profiling the ratio of MPI to OpenMP since it is a hybrid MPI/OpenMP program, but the results are unstable because different Python programs that use GPAW may follow different computation routines.

Lesson Learned

  • The Python GIL sometimes makes profiling difficult.

  • Cython programs usually have their time-consuming parts in the C code; optimize that part.

  • A general math library (such as MKL) may not help much with a specific program, but a smaller, more specialized library will.

MHM2

The code is written in UPC++

Intro

  • Multiple UPC++ backend: ibv, mpi, smp, udp
    • When based on mpi, UPC++ backend use the infiniband by default.
  • There is no significant performance difference between mpi and ibv.
  • The performance degradation after the increase of nodes is more serious than expected: more # of compute nodes: better DHT performance, but more network overhead.
    • Will be discussed in next few slides.
  • Profiling is a little bit difficult.

| Conduit | Build Type | Report | System CPU | User CPU | Nodes |
|---------|-----------|--------|------------|----------|-------|
| **mpi** | **Release** | **37.36** | **02:54.9** | **1:35:15** | **4** |
| mpi | Release | 60.74 | 01:37.4 | 1:19:27 | 2 |
| **ibv** | **Release** | **37.27** | **02:57.3** | **1:36:37** | **4** |
| ibv | Release | 61.69 | 01:36.6 | 1:19:33 | 2 |
| ibv | Debug | 112.3 | 03:44.6 | 4:54:57 | 4 |
| mpi | Debug | 134.4 | 06:11.6 | 5:57:13 | 4 |
| mpi | Release | 37.79 | 07:31.1 | 1:39:17 | 4 |
| mpi | Release | 545.35 | 1:18:27 | 18:15:26 | 4 |
| mpi | Release | 104.88 | 02:54.6 | 1:08:33 | 1 |

(The bold rows were highlighted in red in the original table.)

Profiling

  • Profiler: Intel VTune Amplifier/Profiler, version 2019.6. UPC++ can rely on MPI, but InfiniBand has to be disabled to profile the MPI model.

CPU utilization will be 80% if hyperthreading is disabled.

  • Basically the overall overhead is insignificant for the small dataset (800MB).
  • For the large dataset (40GB), the overhead is not negligible.
    • Not I/O bound; the network is the bottleneck.
    • A lot of data exchange between nodes.
  • We examine the following two aspects: k-mers and the DHT period.

DHT Analysis

  • Three periods: write only, read & write, read only.
  • Write-only part: data will be stored locally.
  • Hyperscale data transmission when read-only: all-to-all.
  • Bottleneck: transmission restrictions cause functions to wait. This is corroborated by the rate of performance degradation as the number of nodes increases: how can we improve efficiency on larger clusters?

Innovation

  • Highly redundant distributed hash table:

    • Reduce the order of the complete communication graph: as far as memory allows.
    • Transfer data during the write-only period: network IO is not significant then, so a redundant copy can be generated.
    • For clusters with more memory: multiple redundant copies.
    • Reduces both the compute-alns part and the read-only part.
  • Data reduction

    • RAID5-like memory model.
    • Using XOR to compute the parity data (see the sketch after this list).
  • Hyperparameter configuration

    • Adjust the k values in the k-mers analysis.
    • We can achieve better results and less time consumption by tuning the k parameter.
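To make the RAID5-like idea above concrete, here is a toy XOR-parity sketch (our own illustration, not MHM2 code): one parity block is kept per group of data blocks, and any single lost block can be rebuilt from the parity plus the surviving blocks.

```cpp
#include <cstdint>
#include <vector>

// Toy illustration of the RAID5-like idea: keep one XOR parity block so that
// any single lost data block can be reconstructed from the others.
std::vector<uint8_t> xor_parity(const std::vector<std::vector<uint8_t>>& blocks) {
    std::vector<uint8_t> parity(blocks.at(0).size(), 0);
    for (const auto& blk : blocks)
        for (size_t i = 0; i < parity.size(); ++i)
            parity[i] ^= blk[i];
    return parity;
}

// Recover block `lost` by XOR-ing the parity with all surviving blocks.
std::vector<uint8_t> recover(const std::vector<std::vector<uint8_t>>& blocks,
                             const std::vector<uint8_t>& parity, size_t lost) {
    std::vector<uint8_t> out = parity;
    for (size_t b = 0; b < blocks.size(); ++b)
        if (b != lost)
            for (size_t i = 0; i < out.size(); ++i)
                out[i] ^= blocks[b][i];
    return out;
}
```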

Lesson Learned

  • Setting up environment in the cluster
    • Use Spack and Module to manage user-mode packages.
  • Learn how to use PBS and Slurm
    • Need balance between core occupied and waiting time.
  • Any optimization in parallel program is very difficult.
    • Need to thoroughly consider Network, IO, Memory and core scheduling.
  • Profiling in UPC++ can be hard:
    • Try to use other parallelization methods.

WRF

Damn Fortran. It's 2021 and people are still using it.

It is best to find someone who works in meteorology and ask about the parameter settings; unfortunately I could not find anyone.

This is a weather simulation system for the earth sciences; any other application involving earth science or Fortran parallelization can use this section as a reference.

Practice case for ISC21 SCC

3 Domain Problem for ISC21 SCC

Install

required libs

HDF5, NetCDF-C, NetCDF-Fortran (manual installation may be better; MPI is required)

HDF5

```bash
./configure --prefix=<your install path>/hdf5 --enable-fortran --enable-fortran2003 --enable-parallel
make -j 48
make install
```

```bash
# vi ~/.bashrc
export HDF5=<your install path>/hdf5
export PATH=$HDF5/bin:$PATH
export LD_LIBRARY_PATH=$HDF5/lib:$LD_LIBRARY_PATH
export INCLUDE=$HDF5/include:$INCLUDE
# source ~/.bashrc
```

NetCDF-C

```bash
./configure --prefix=<your install path>/netcdf LDFLAGS="-L$HDF5/lib" CPPFLAGS="-I$HDF5/include" CC=mpiicc --disable-dap
make -j 48
make install
```

```bash
# vi ~/.bashrc
export NETCDF=/usr/local/netcdf
export PATH=$NETCDF/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$LD_LIBRARY_PATH
export INCLUDE=$NETCDF/include:$INCLUDE
# source ~/.bashrc
```

NetCDF-Fortran

```bash
# install into the same prefix as NetCDF-C
./configure --prefix=<your install path>/netcdf CPPFLAGS="-I$HDF5/include -I$NETCDF/include" LDFLAGS="-L$HDF5/lib -L$NETCDF/lib" CC=mpiicc FC=mpiif90 F77=mpiif90
make -j 48
make install
```

Advanced lib

PNetCDF A Parallel I/O Library for NetCDF File Access

It has a negative effect with 4 nodes; you need 8 nodes or more before it differs from NetCDF.

pnetcdf.png

See the official website for installation instructions.

Main Program

In our tests, Intel MPI produced segmentation faults while OpenMPI did not, and Intel MPI did not seem to bring much improvement anyway. The problem can probably be fixed by adjusting the stack size.

env setting

intel openmpi hdf5 netcdf

config and build

./configure
checking for perl5... no checking for perl... found /usr/bin/perl (perl) Will use NETCDF in dir: /global/software/centos-7.x86_64/modules/intel/2020.1.217/netcdf/4.7.4 HDF5 not set in environment. Will configure WRF for use without. PHDF5 not set in environment. Will configure WRF for use without. Will use 'time' to report timing information $JASPERLIB or $JASPERINC not found in environment, configuring to build without grib2 I/O... ------------------------------------------------------------------------ Please select from among the following Linux x86_64 options: 1. (serial) 2. (smpar) 3. (dmpar) 4. (dm+sm) PGI (pgf90/gcc) 5. (serial) 6. (smpar) 7. (dmpar) 8. (dm+sm) PGI (pgf90/pgcc): SGI MPT 9. (serial) 10. (smpar) 11. (dmpar) 12. (dm+sm) PGI (pgf90/gcc): PGI accelerator 13. (serial) 14. (smpar) 15. (dmpar) 16. (dm+sm) INTEL (ifort/icc) 17. (dm+sm) INTEL (ifort/icc): Xeon Phi (MIC architecture) 18. (serial) 19. (smpar) 20. (dmpar) 21. (dm+sm) INTEL (ifort/icc): Xeon (SNB with AVX mods) 22. (serial) 23. (smpar) 24. (dmpar) 25. (dm+sm) INTEL (ifort/icc): SGI MPT 26. (serial) 27. (smpar) 28. (dmpar) 29. (dm+sm) INTEL (ifort/icc): IBM POE 30. (serial) 31. (dmpar) PATHSCALE (pathf90/pathcc) 32. (serial) 33. (smpar) 34. (dmpar) 35. (dm+sm) GNU (gfortran/gcc) 36. (serial) 37. (smpar) 38. (dmpar) 39. (dm+sm) IBM (xlf90_r/cc_r) 40. (serial) 41. (smpar) 42. (dmpar) 43. (dm+sm) PGI (ftn/gcc): Cray XC CLE 44. (serial) 45. (smpar) 46. (dmpar) 47. (dm+sm) CRAY CCE (ftn $(NOOMP)/cc): Cray XE and XC 48. (serial) 49. (smpar) 50. (dmpar) 51. (dm+sm) INTEL (ftn/icc): Cray XC 52. (serial) 53. (smpar) 54. (dmpar) 55. (dm+sm) PGI (pgf90/pgcc) 56. (serial) 57. (smpar) 58. (dmpar) 59. (dm+sm) PGI (pgf90/gcc): -f90=pgf90 60. (serial) 61. (smpar) 62. (dmpar) 63. (dm+sm) PGI (pgf90/pgcc): -f90=pgf90 64. (serial) 65. (smpar) 66. (dmpar) 67. (dm+sm) INTEL (ifort/icc): HSW/BDW 68. (serial) 69. (smpar) 70. (dmpar) 71. (dm+sm) INTEL (ifort/icc): KNL MIC 72. (serial) 73. (smpar) 74. (dmpar) 75. (dm+sm) FUJITSU (frtpx/fccpx): FX10/FX100 SPARC64 IXfx/Xlfx Enter selection [1-75] :

dm+sm: OMP+MPI

```bash
./compile -j 6 em_real >& build_wrf.log
tail -15 build_wrf.log
```

finish

All executables are placed in the run folder.

Run

```bash
for i in ../WRF/run/* ; do ln -sf $i <directory containing the data> ; done
```

namelist.input is the input file; it has many parameters to set, see the WRF NAMELIST.INPUT FILE DESCRIPTION.

slurm script

```bash
#!/bin/bash -l
#SBATCH -N 4
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=2
#SBATCH --ntasks=80
#SBATCH -J wrf3Dom_mpi_80_omp_2
#SBATCH -p compute
#SBATCH -t 2:00:00
#SBATCH -o wrf3Dom-%j.out
sleep 300
module load NiaEnv/2019b
module load intel/2019u4 openmpi/4.0.1 #hdf5/1.10.5
#module load netcdf/4.6.3
ulimit -c unlimited
ulimit -s unlimited
module list
export HDF5=/home/l/lcl_uotiscscc/lcl_uotiscsccs1034/scratch/nonspack/hdf5
export PATH=$HDF5/bin:$PATH
export LD_LIBRARY_PATH=$HDF5/lib:$LD_LIBRARY_PATH
export INCLUDE=$HDF5/include:$INCLUDE
export NETCDF=/home/l/lcl_uotiscscc/lcl_uotiscsccs1034/scratch/nonspack/netcdf
export PATH=$NETCDF/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$LD_LIBRARY_PATH
export INCLUDE=$NETCDF/include:$INCLUDE
export KMP_STACKSIZE=20480000000
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
cd ~/scratch/pl/orifiles
mpirun -np 80 -cpus-per-rank $SLURM_CPUS_PER_TASK ./wrf.exe
```

Important Notice

stack size and segment fault

ulimit sets the OS limits for the program. KMP_STACKSIZE tells the OpenMP implementation about how much stack to actually allocate for each of the stacks. So, depending on your OS defaults you might need both. BTW, you should rather use OMP_STACKSIZE instead, as KMP_STACKSIZE is the environment variable used by the Intel and clang compilers. OMP_STACKSIZE is the standard way of setting the stack size of the OpenMP threads. Note, that this problem is usually more exposed, as Fortran tends to keep more data on the stack, esp. arrays. Some compilers can move such arrays to the heap automatically, see for instance -heap-arrays for the Intel compiler.

Fortran OpenMP threads stuff a lot of data onto the stack and often overflow it, so applications using Fortran with OpenMP need export KMP_STACKSIZE=20480000000. Also note that gcc uses OMP_STACKSIZE while icc uses KMP_STACKSIZE.

Fortran and MPI

Not sure whether this is a Slurm issue or a Fortran issue, but Slurm cannot automatically assign CPU cores to Fortran MPI programs, so it must be set manually:

mpirun -np 16 -cpus-per-rank $SLURM_CPUS_PER_TASK ./wrf.exe

This tells MPI how many CPU cores each MPI rank should get for OpenMP.

IPM Report env setting

IPM is a profiler that monitors MPI usage. To use IPM you only need to preload IPM's library. However, to generate the full report figures, the following variables must be set:

```bash
export IPM_REPORT=full
export IPM_LOG=full
```

When using IPM, set the environment variables above to make sure you get the right XML for visualization, or use https://files.slack.com/files-pri/TAXMW9014-F02586VN27L/download/ipm.ipynb to visualize the results.

Others

Training is still taking off.

Incompact3D

Incompact3D solves the incompressible Navier–Stokes equations using sixth-order compact schemes for the spatial discretization. Time stepping basically integrates an ODE with numerical methods known as multistep methods.

The Poisson equation is fully solved in spectral space using Fast Fourier Transform (FFT) routines.
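As a minimal illustration of the spectral approach (our own 1D periodic example using the FFTW3 C API; Incompact3D does this in 3D with its own pencil decomposition), solving $u'' = f$ amounts to one forward FFT, a division by $-k^2$, and one inverse FFT. Compile with something like g++ poisson1d.cpp -lfftw3.

```cpp
#include <cmath>
#include <cstdio>
#include <fftw3.h>

// Solve u'' = f with periodic BCs on [0, 2*pi) spectrally:
// forward FFT f -> f_hat, multiply by -1/k^2 (k = 0 mode set to 0), inverse FFT back.
int main() {
    const int n = 64;
    fftw_complex* f = fftw_alloc_complex(n);
    fftw_complex* u = fftw_alloc_complex(n);
    fftw_plan fwd = fftw_plan_dft_1d(n, f, f, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_plan bwd = fftw_plan_dft_1d(n, f, u, FFTW_BACKWARD, FFTW_ESTIMATE);

    for (int i = 0; i < n; ++i) {            // f(x) = -sin(x)  =>  u(x) = sin(x)
        f[i][0] = -std::sin(2.0 * M_PI * i / n);
        f[i][1] = 0.0;
    }

    fftw_execute(fwd);
    for (int i = 0; i < n; ++i) {
        int k = (i <= n / 2) ? i : i - n;    // signed wavenumber
        double scale = (k == 0) ? 0.0 : -1.0 / (double(k) * k);
        f[i][0] *= scale / n;                // also undo FFTW's unnormalized transform
        f[i][1] *= scale / n;
    }
    fftw_execute(bwd);

    std::printf("u at x = pi/2: %f (expect ~1)\n", u[n / 4][0]);
    fftw_destroy_plan(fwd); fftw_destroy_plan(bwd);
    fftw_free(f); fftw_free(u);
    return 0;
}
```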

incopmact_3d_verstility

Intro to the algorithm and implementation

mpi_affinity

2d_mpi_affinity

Test Case Taylor

Build for MKL/FFTW3

Reminder:

  1. Enable MKL speedup on the AMD platform:

```c
int mkl_serv_intel_cpu_true() { return 1; }
```

  2. Migrate from FFTW3 to cuFFT.

Build libnpc with spack

I don't know why...it fails to build when MPICXX is set...

Here is a quick hack

```python
class Libnbc(AutotoolsPackage):
    """LibNBC is a prototypic implementation of a nonblocking
    interface for MPI collective operations. Based on ANSI C and
    MPI-1, it supports all MPI-1 collective operations in a
    nonblocking manner. LibNBC is distributed under the BSD license.
    """
    homepage = "http://unixer.de/research/nbcoll/libnbc/"
    url      = "http://unixer.de/research/nbcoll/libnbc/libNBC-1.1.1.tar.gz"

    version('1.1.1', sha256='63aa5f75f84c191da0688cb551ebd0e9e46928edfba350b2a534eb0c704dd9c3')

    depends_on("mpi")

+   def configure_args(self):
+       args = []
+       args.append("MPICXX=")
+       return args
```

Reference

  1. Incompact3d: A powerful tool to tackle turbulence problems with up to $O(10^5)$ computational cores

NWChem

ICON

Prepare

```bash
git clone https://gitlab.dkrz.de/icon-scc-isc22/icon-scc
cd /path/to/icon-scc
git submodule init
git submodule update
```

How to run

spack compile

```bash
spack install -j $(nproc) -vvvv icon%gcc@6.4.0
```

There are some variants:

  • debug
  • cuda
  • openmp

run copy scripts

```bash
cd {ICON_BUILD_DIR}
export ICON_DIR={ICON_DIR}
# Copy runscript-related files when building out-of-source:
if test $(pwd) != $(cd "${ICON_DIR}"; pwd); then
    echo "Copying runscript input files from the source directory..."
    rsync -uavz ${ICON_DIR}/run . --exclude='*.in' --exclude='.*' --exclude='standard_*'
    ln -sf -t run/ ${ICON_DIR}/run/standard_*
    ln -sf set-up.info run/SETUP.config
    rsync -uavz ${ICON_DIR}/externals . --exclude='.git' --exclude='*.f90' --exclude='*.F90' --exclude='*.c' --exclude='*.h' --exclude='*.Po' --exclude='tests' --exclude='rrtmgp*.nc' --exclude='*.mod' --exclude='*.o'
    rsync -uavz ${ICON_DIR}/make_runscripts .
    ln -sf ${ICON_DIR}/data
    ln -sf ${ICON_DIR}/vertical_coord_tables
fi
```

Gen sbatch

```bash
cd {ICON_BUILD_DIR}
export ICON_DIR={ICON_DIR}
cd {ICON_BUILD_DIR}/run
$ICON_DIR/utils/mkexp/mkexp standard_experiments/scc.config CO2=2850
```

The setup is OK if you see something like:

```
Script directory: '/mnt/nfs4/node1/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/experiments/exp_scc2850/scripts'
Data directory: '/mnt/nfs4/node1/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/experiments/exp_scc2850/outdata'
Work directory: '/mnt/nfs4/node1/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/experiments/exp_scc2850/work'
```

Modify sbatch

In experiments/exp_scc2850/scripts/exp_scc2850.run_start

  • FIX SLURM args
  • FIX path
    • no /build/
    • no /home/qinfr
```diff
- BUILD_DIR=/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/BUILD
+ BUILD_DIR=/mnt/nfs4/node1/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/
+ export PATH={cdo-1.9.10_BUILD_DIR}/bin:$PATH
...
```
  • Substitute every /home/qinfr with /mnt/nfs4/node1/home/qinfr/.

Run

sbatch exp_scc2850.run_start

Tips

How to check if compiled code uses SSE and AVX instructions?

https://stackoverflow.com/questions/47878352/how-to-check-if-compiled-code-uses-sse-and-avx-instructions

objdump -d cgribexlib.o | awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmu lsd|vaddsd|vmosd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vf nmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub2 13pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpb roadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherd d|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsll vq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpc mpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcomp ressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2 ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64 x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb |vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvt tpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vc vttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd| vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vf ixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrs qrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|v pminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatter qq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcast mw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgath erpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0q ps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpcl assss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b| vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|v p4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vps hrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec |vaesdeclast|vaesenc|vaesenclast)[ \t]/'

MiniVite

Overview

Materials

ghosh2018.pdf minivite-indyscc.pdf

Algorithm

For each community one can compute a modularity; the modularity of the whole graph is the sum of the modularity over all communities. Changing the community assignment changes the modularity.

Goal: maximize the modularity.

The Louvain method is iterative; initially every vertex is its own community.

For a vertex $u$, consider each neighbor $v$ (vertices connected to $u$ by an edge). Moving $u$ into $v$'s community changes the modularity by some amount $\Delta Q$, which can be computed quickly. After scanning all neighbors we take the largest $\Delta Q$; if $\Delta Q>0$, move $u$ into that community.

One iteration considers all vertices $u$ once; the algorithm stops when $\Delta\text{Modularity}<\text{threshold}$.

Parallelization: partition the vertex set of the graph, assign each compute node a subset of the vertices, and process the vertices in parallel.
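For reference, the quantity being maximized is the modularity $Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$. The sketch below (our own serial illustration, not miniVite's distributed code) evaluates $Q$ for a given partition of a small undirected graph.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Evaluate modularity Q of a partition `comm` for an undirected, unweighted graph
// given as an edge list: Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j).
double modularity(int n, const std::vector<std::pair<int, int>>& edges,
                  const std::vector<int>& comm) {
    std::vector<double> degree(n, 0.0);
    for (const auto& e : edges) { degree[e.first] += 1.0; degree[e.second] += 1.0; }
    const double two_m = 2.0 * edges.size();

    // Sum of A_ij over pairs in the same community (each undirected edge counted twice).
    double intra = 0.0;
    for (const auto& e : edges)
        if (comm[e.first] == comm[e.second]) intra += 2.0;

    // Sum of k_i*k_j/(2m) over same-community pairs = sum_c (K_c^2)/(2m).
    std::vector<double> comm_degree(n, 0.0);
    for (int v = 0; v < n; ++v) comm_degree[comm[v]] += degree[v];
    double expected = 0.0;
    for (double K : comm_degree) expected += K * K / two_m;

    return (intra - expected) / two_m;
}

int main() {
    // Two triangles joined by one edge; each triangle is its own community.
    std::vector<std::pair<int, int>> edges = {{0,1},{1,2},{0,2},{3,4},{4,5},{3,5},{2,3}};
    std::vector<int> comm = {0,0,0,1,1,1};
    std::printf("Q = %f\n", modularity(6, edges, comm));  // ~0.357
}
```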

The code is fairly short and worth reading.

Building

You can use the miniVite package that ships with Spack, but it is fairly old and its package.py needs to be modified.

You can also build directly from the GitHub source; with gcc you may need to replace -xHost in the Makefile with -march=native and -qopenmp with -fopenmp.

https://github.com/ECP-ExaGraph/miniVite

Running

Excerpted from the README:

```
mpiexec -n 2 bin/./minivite -f karate.bin
mpiexec -n 2 bin/./minivite -l -n 100
mpiexec -n 2 bin/./minivite -n 100
mpiexec -n 2 bin/./minivite -p 2 -n 100

[On Cray systems, pass MPICH_MAX_THREAD_SAFETY=multiple or pass
-DDISABLE_THREAD_MULTIPLE_CHECK while building miniVite.]

Possible options (can be combined):

1. -f <bin-file>  : Specify input binary file after this argument.
2. -b             : Only valid for real-world inputs. Attempts to distribute approximately
                    equal number of edges among processes. Irregular number of vertices
                    owned by a particular process. Increases the distributed graph creation
                    time due to serial overheads, but may improve overall execution time.
3. -n <vertices>  : Only valid for synthetically generated inputs. Pass total number of
                    vertices of the generated graph.
4. -l             : Use distributed LCG for randomly choosing edges. If this option is not
                    used, we will use C++ random number generator (using
                    std::default_random_engine).
5. -p <percent>   : Only valid for synthetically generated inputs. Specify percent of overall
                    edges to be randomly generated between processes.
6. -t <threshold> : Specify threshold quantity (default: 1.0E-06) used to determine the exit
                    criteria in an iteration of Louvain method.
7. -w             : Only valid for synthetically generated inputs. Use Euclidean distance as
                    edge weight. If this option is not used, edge weights are considered as
                    1.0. Generate edge weight uniformly between (0,1) if Euclidean distance
                    is not available.
8. -r <nranks>    : This is used to control the number of aggregators in MPI I/O and is
                    meaningful when an input binary graph file is passed with option "-f".
                    naggr := (nranks > 1) ? (nprocs/nranks) : nranks;
9. -s             : Print graph data (edge list along with weights).
```

Homework

Problem statement

Access the following server and download the two graph inputs (they are in a binary format). Server: "sftp indyscc@N/A" Password: "N/A"

The homework consists of two parts, and each part has two/three questions (checking the appropriate documents from the code repository can save time):

  1. Establishing baseline performance: Download and build the default/main/master branch of miniVite (https://github.com/ECP-ExaGraph/miniVite), run it using the provided com-orkut and webbase-2001 input graphs on 1-20 nodes (to perform strong scaling experiments). Answer the following questions: How are these two input graphs different? What arguments did you choose to run miniVite? Does increasing the number of OpenMP threads help the performance (try 2-3 combinations of threads-per-process, keeping the “processes*threads-per-process” quantity the same)? Why or why not?
  2. Performing further optimizations: Find a combination of miniVite arguments and/or macros (arguments are discussed in the README, but for macros, you may need to look elsewhere), in addition to the baseline arguments/options that you ran miniVite with in the previous step, that improves the overall performance and scalability. Compare baseline performance with the improved version – plot it (X-axis: #Processes(nodes) and Y-axis: “Average total time (in s)” as reported by miniVite), and discuss. Does your set of options affect the output quality (expressed via modularity and MODS) in any way? If so, discuss.

Submission Instructions: The assignment is assigned to all students. However, a single submission per team is sufficient. One member of the team can submit the assignment. The report can be a PDF file (preferred method) or a link to a google doc (we will check the timestamp for when it was last edited). Please include your team name and the university in the report.

Modifying Spack's package.py

Some extra compile options are needed, so the package script has to be modified:

```python
# Copyright 2013-2022 Lawrence Livermore National Security, LLC and other
# Spack Project Developers. See the top-level COPYRIGHT file for details.
#
# SPDX-License-Identifier: (Apache-2.0 OR MIT)

from spack.package import *


class Minivite(MakefilePackage):
    """miniVite is a proxy application that implements a single phase of
    Louvain method in distributed memory for graph community detection.
    """

    tags = ["proxy-app", "ecp-proxy-app"]

    homepage = "https://hpc.pnl.gov/people/hala/grappolo.html"
    git = "https://github.com/Exa-Graph/miniVite.git"

    version("develop", branch="master")
    version("1.0", tag="v1.0")
    version("1.1", tag="v1.1")

    variant("openmp", default=True, description="Build with OpenMP support")
    variant("opt", default=True, description="Optimization flags")
    variant("mode", default='default', description="mode",
            values=('collective', 'sendrecv', 'rma', 'default', 'rma_accu'))
    variant("omp_schedule", default=False, description="Enable OMP schedule")
    variant("use_32_bit_graph", default=False, description="Use 32bit graph")

    depends_on("mpi")

    @property
    def build_targets(self):
        targets = []
        cxxflags = ["-std=c++11 -g -DCHECK_NUM_EDGES -DPRINT_EXTRA_NEDGES"]
        ldflags = []

        if "+openmp" in self.spec:
            cxxflags.append(self.compiler.openmp_flag)
            ldflags.append(self.compiler.openmp_flag)
        if "+opt" in self.spec:
            cxxflags.append(" -O3 ")
        if self.spec.variants['mode'].value == 'collective':
            cxxflags.append("-DUSE_MPI_COLLECTIVES")
        elif self.spec.variants['mode'].value == 'sendrecv':
            cxxflags.append("-DUSE_MPI_SENDRECV")
        elif self.spec.variants['mode'].value == 'rma':
            cxxflags.append("-DUSE_MPI_RMA")
        elif self.spec.variants['mode'].value == 'rma_accu':
            cxxflags.append("-DUSE_MPI_RMA -DUSE_MPI_ACCUMULATE ")
        if "+omp_schedule" in self.spec:
            cxxflags.append("-DOMP_SCHEDULE_RUNTIME")
        if "+use_32_bit_graph" in self.spec:
            cxxflags.append("-DUSE_32_BIT_GRAPH")

        targets.append("CXXFLAGS={0}".format(" ".join(cxxflags)))
        targets.append("OPTFLAGS={0}".format(" ".join(ldflags)))
        targets.append("CXX={0}".format(self.spec["mpi"].mpicxx))

        return targets

    # The rest is omitted.
```
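
With this package file, the communication variant can be chosen at install time, e.g. something like `spack install minivite mode=rma_accu` (variant names as defined in the script above); each variant simply injects the corresponding `-DUSE_MPI_*` macro into CXXFLAGS.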

Enabling USE_MPI_RMA gave a large performance improvement for this assignment.

Report

minivite.pdf

Feedback

  • As a response to the first question, why do you think orkut's running time is longer even though it is smaller in size compared to webbase?
  • How many OpenMP threads per process for the baseline strong scaling experiments?
  • In part 1, you provide a brief discussion on observed load imbalance. But, you do not mention how you mitigated it in part 2.
  • It would have been interesting to modulate the threshold and measure the effect on performance, and check the impact on MODS

3/5 + 5/5 + 30/40 + 25/25 + 18/25 = 81/100

Final

Problem

This assignment has two parts, strong scale and weak scale. Like in homework #1, you will download and build miniVite: https://github.com/ECP-ExaGraph/miniVite

Strong scale

Use the com-friendster graph as input to miniVite, and the optimization arguments that you learned about during the last homework to perform strong scaling experiments (any option that improves the performance is acceptable, even if quality in terms of modularity is affected somewhat).

For this input, there will be startup issues (out-of-memory related crash or slowness) if you use a relatively small #nodes to begin or limited optimization arguments.

The goal of this exercise is to find a set of arguments and options (which may differ among process configurations) that maximizes strong scalability for this input, without compromising quality/modularity too much (rounding off final modularity to the first decimal place should yield similar values no matter your choice of optimizations). (Don’t try to use -DDUSE_32BIT_GRAPH, it won’t work)

i. Pick x where x is the startup node, and then scale the #nodes by incrementing x by a fixed stride to get the next process/node configuration, continue until x == 20. (pick any combination of processes-per-node and threads-per-process that yields better performance) You can vary processes-per-node as you see fit. How did you pick the base x?

ii. Report graph loading/construction times, #iterations to convergence, the time to perform the Louvain graph clustering as reported by miniVite running on the nodes as per 1.a.i.

Also, mention the arguments that you passed to miniVite and options you build it with.

Weak scale

Use the miniVite options to generate a distributed input graph (see FAQs and README) that scales with the #processes. Pick a reasonable number of vertices (this is governed by a formula – see FAQ, if miniVite complains, just adjust the #vertices or #processes)

i. Start with 1 node (any #processes-per-node and #threads-per-process configuration that makes sense to you) and end at 20 nodes. Plot the time to generate the graph, time to perform graph clustering (using data returned by miniVite) on 1-20 nodes.

ii. How large is the graph you generated on 20 nodes vs. 1 node? (Larger is better, but too large will take too much time in graph generation).

For submission, Create two directories called weak_scale and strong_scale and put the documents that answers the questions for each category in their respective directories.

Problem analysis

  1. Strong scaling: fix the problem size and increase the degree of parallelism to reduce the running time. Ideally $\text{time with }n\text{ nodes}=\frac{\text{time with 1 node}}{n}$.
  2. Weak scaling: fix the amount of work per node and increase the degree of parallelism (so the problem size grows with it). Ideally the running time stays constant (no parallel overhead at all). The usual definitions are given after this list.
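
As a reference for the plots, the quantities usually reported are

$$S(n)=\frac{T_1}{T_n},\qquad E_{\text{strong}}(n)=\frac{T_1}{n\,T_n},\qquad E_{\text{weak}}(n)=\frac{T_1}{T_n},$$

where $T_n$ is the wall time on $n$ nodes and, for the weak-scaling efficiency, the per-node problem size is held fixed (these are standard definitions, not part of the assignment text).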

Tuning OpenMPI & OpenMP

Each machine has two E5-2660 v3 CPUs, 20 cores in total.

After some experiments, one OpenMP thread per process with 20 MPI ranks per node works best.

| PPN | OMP_NUM_THREADS | Clustering time (s) |
|-----|-----------------|---------------------|
| 20  | 1               | 100.445             |
| 20  | 2               | 102.023             |
| 20  | 4               | 108.56              |
| 12  | 3               | 132.041             |
| 10  | 2               | 146.281             |

This seems to suggest that OpenMP parallelism is less effective than MPI here, possibly because the OpenMP atomics are too expensive, but we did not profile it so we cannot be sure. Each MPI process also allocates a data structure holding information for every vertex of the whole graph, so the memory overhead is large.

mpirun --hostfile ./hostfile -n 400 -map-by core --bind-to core miniVite -f com-friendster.ungraph.bin -b -t 0.0015

Profile

The original program uses std::set, std::map, std::unordered_set, and std::unordered_map, which are too slow; replacing them with a fast third-party hash table implementation gives a several-fold speedup. The original algorithm also allocates an unnecessary vector that can be optimized away.

profiling_1.png profiling_2.png profiling_3.png

Weak scale

We used miniVite's built-in generator to create the graph. Generation is very slow, and the number of processes must be a power of two ($2^n$), so we could only run with $1,2,4,8,16,32,64,128,256$ processes. We did not use oversubscription because it does not really match the spirit of weak scaling, and the resulting data would likely look bad (i.e., fluctuate a lot).

Submission

strong-scale-report.pdf weak-scale-report.pdf 0001-modify.patch

HPL & HPL Hero Run

The HPL benchmark in this IndySCC competition consists of two assignments: one is tuning HPL on a single node, the other is running HPL across the whole cluster (about 300 nodes).

HPL

Hi teams

Here is the assignment on HPL!

1. Compute the theoretical peak FLOPs for the processor type available on Chameleon cloud for the IndySCC competition (see the formula below). (10 points)
2. Build and install the HPL benchmark using your choice of linear algebra library and MPI library. Which linear algebra library did you choose and why? Which MPI library did you choose and why? (20 points)
3. Run the HPL benchmark on a node using a fixed problem size (N) and by varying the number of cores from 2 to 20, doubling the cores at each trial. Which parameters did you need to change? Plot the GFLOPs number for each run vs. no. of cores. (30 points)
4. Run the HPL benchmark using all 20 cores and tune the HPL.dat file to achieve the best GFLOPs number. Which parameters did you tune? What were your results using the unoptimized parameter(s) vs. the optimized parameter(s)? (40 points)
5. (Bonus) Run HPL on 2-nodes. Was the GFLOPs number exactly twice that of your single-node performance? Why or why not? (10 points)

Deliverables: Submit a report answering the questions in the assignment. While describing your steps, be brief and to the point. You should also include a description of your cluster and how you created the cluster. For each of the steps 2-5, include the scripts that you used to build and run HPL. Provide sufficient documentation for your codes. Please note that your environment and scripts should be reproducible, i.e., the judges should be able to run them on an empty cluster. For the tuning step 4, include the final HPL.dat file that gave you the best performance.

Submission Instructions: The assignment is assigned to all students. However, a single submission per team is sufficient. One member of the team can submit the assignment. The report can be a PDF file (preferred method) or a link to a google doc (we will check the timestamp for when it was last edited). Please include your team name and the university in the report.
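
For the first question, the theoretical peak is normally estimated as

$$P_{\text{peak}} = n_{\text{cores}} \times f_{\text{clock}} \times \text{(FLOPs per cycle per core)},$$

where the last factor is 16 for double precision on cores with 256-bit AVX2 and two FMA units (a common rule of thumb; check the actual processor model of the Chameleon nodes before quoting a number).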

HPL Hero Run

Between now and Oct 20 you are allowed to burst up to 30 compute nodes to test scaling your node deployment processes.

Hero HPL Runs

On Oct 20 we will begin hero HPL runs. Each team will get a 23 hour window where they will be allowed to use up 300 compute nodes to complete the best HPL score they can get.

Ground Rules

During this time, the other teams will only be allowed to use their 1 head node to continue testing. At the beginning of this phase we will shut down any compute nodes still running. Resource usage during this time will be closely monitored - any interference, accidental or otherwise, with the competing team will be penalized per the IndySCC rules and at the discretion of the committee.

Schedule

Each window will begin at 9 am Eastern Time and will end at 8 am Eastern Time the following day. We will need an hour to make sure your nodes are shut down and ready to go for the next team. The schedule was randomly generated and assigned and is listed below. If you would like to swap dates, you are responsible for finding another team to swap with. You may use the Google classroom to ask if anyone else needs to swap and is willing. Once the first team begins, the schedule will be locked in place. Otherwise, we cannot change any dates unless mutually agreed upon and there isn’t enough time left in the month to have alternative days.

Thu Oct 20 - Durham University
Fri Oct 21 - CUHK
Sat Oct 22 - Georgia Tech
Sun Oct 23 - SUSTech
Mon Oct 24 - ShanghaiTech
Tue Oct 25 - CSC/Finland
Wed Oct 26 - Monash University
Thu Oct 27 - Clemson
Fri Oct 28 - U Indonesia
Sat Oct 29 - UTEP
Sun Oct 30 - TAMUCC

Helpful Hints

A successful team may strategize to run HPL in increasing increments of nodes, rather than try for the full 300 nodes at once. It is OK if you don’t use the full number of nodes with your final score, getting hundreds of nodes to work nicely together in real-life benchmarking exercises takes time and may be difficult to do in the short period you have them. It would be advantageous to have a very good score on a smaller set of nodes than to struggle to get the full 300 running, run out of time, and not have a score at all.

Also keep in mind that this hardware is fairly old and there are likely a handful of bad/slow nodes. This is where slowly increasing the number of nodes you are using can come in handy.

Run Into Hardware Problems or Outages?

If you identify a slow node, we will not be able to fix it as the only spare parts are in the form of the other nodes, and it’s unlikely we will be able to fix it by the end of your window. You should exclude the slow node and move onto another node. If you scale up to the full 300 nodes, you may deploy extras, just make note of the slow nodes and document that your final submission uses no more than 300.

If there is any disruption of resource availability during your window, such as a Chameleon outage or power/networking outage at the Purdue site, you may receive some make-up time in certain situations, as described in the next paragraph.

If the resources are cumulatively unavailable for more than 4 hours, at the discretion of the committee, you may receive an equivalent time slot (ie, if you were unable to access for 5 hours, you would get another 5 hours) plus an extra hour for node spin-up at the end to try again. This time would be disjoint and come at the end of the above schedule as we can’t shift the rest of the schedule for the remaining teams.

If you are the last team, we will ask that you shut down at your designated time, as we need time to determine the outage length adjustments for everyone. You’ll be able to restart at a later time, and that will keep things fair as the other teams wouldn’t have the ability to just extend their window.

Outages of less than 4 hours will unfortunately be considered lost time. These shorter blips are part of real life challenges, and it would not be practical to allow make-up time for shorter times as you’d spend more time just spinning nodes back up.

We also cannot accommodate internet outages at your locale as we can’t verify those outages, so you may want to have a plan to find another connection if that happens.

Once all teams are done running, we will open the nodes back up for you to configure for the final 48 hour competition.

Submitting results

Final score submissions are due 1 hour after your window ends, right as the next team is starting. We will follow up with more details on this.

Report

Single-node, multi-node, and hero-run related files

References

Official Documentation, AMD HPL Benchmark, Run HPL on Threadripper, 基于 HPL 测试的集群系统性能分析与优化 (cluster performance analysis and optimization based on the HPL benchmark)

Approach

Tuning HPL.dat

See the report for the detailed process.

Multiple parameter sets can be listed in HPL.dat, so a single HPL run produces several results, which is more efficient.
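
For example, the relevant fragment of HPL.dat might look like this (placeholder values; only the N/NB lines are shown and the rest of the file is unchanged), which makes one run sweep every N x NB combination:

```
2            # of problems sizes (N)
40000 80000  Ns
3            # of NBs
192 224 256  NBs
```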

Tuning MPI parameters

Processes or threads

The math library underneath HPL can use multiple threads, so one HPL process can be given a whole socket. Which layout performs better has to be measured.

Intel's math library offers three threading modes: sequential, OpenMP, and Intel TBB; the latter two can use multiple cores.

See the report for the detailed tuning results.

Core binding

numactl can be used to bind the memory HPL uses to a particular NUMA node. For core-binding tutorials, see: IBM MPI documentation, Understanding-MPI-map-by-and-bind-to-option

Author

@Zecheng Li

Tutorial

Video: https://drive.google.com/file/d/1c2bD3gZw5ZeJS81i1uXY6eAXf70t8bZo/view

NAMD 2.14 User’s Guide: https://www.ks.uiuc.edu/Research/namd/2.14/ug/

NAMD Tutorial: http://www.ks.uiuc.edu/Training/Tutorials/namd/namd-tutorial-unix-html/index.html

VMD User’s Guide: https://www.ks.uiuc.edu/Research/vmd/current/ug/

VMD Tutorial: https://www.ks.uiuc.edu/Training/Tutorials/vmd/tutorial-html/

Building

We use spack as the package manager. To build a simple version without MPI support:

spack install namd

In our competition we need to support multiple nodes, so we install charmpp with the MPI backend and SMP enabled. The Tcl interface is included for parsing the input file. The command below depends on OpenMPI with UCX support.

spack install -v namd%gcc interface=tcl ^charmpp backend=mpi ^openmpi fabrics=ucx

We could also use a pure MPI version, but unlike some other applications NAMD is optimized for multi-threading, so the SMP version is usually faster than pure MPI when a single node has many cores. When running larger-scale jobs, we should use more communication threads (one per NUMA domain).

Tuning performance

We could try different compilers to build NAMD. The performance critical part of NAMD is the force calculation implemented in the source of NAMD itself (instead of in some math libraries), so compiler optimization is crucial for the performance.

We could try different compilers: (oneapi is not supported by charm++ that NAMD 2.14 depends on)

spack install -v namd%intel interface=tcl ^charmpp%intel backend=mpi ^openmpi fabrics=ucx spack install -v namd%nvhpc interface=tcl ^charmpp%nvhpc backend=mpi ^openmpi fabrics=ucx spack install -v namd%clang interface=tcl ^charmpp%clang backend=mpi ^openmpi fabrics=ucx

As you may find, Intel compiler might yield better performance than gcc, and NVHPC and clang are extremely slow. Don't worry. Have a look at the build scripts of NAMD.

spack edit namd

```python
if self.spec.satisfies("^charmpp@:6.10.1"):
    optims_opts = {
        "gcc": m64 + "-O3 -fexpensive-optimizations \
                      -ffast-math -lpthread " + archopt,
        "intel": "-O2 -ip -qopenmp-simd" + archopt,
        "aocc": m64 + "-O3 -ffp-contract=fast -ffast-math \
                       -fopenmp " + archopt,
    }
else:
    optims_opts = {
        "gcc": m64 + "-O3 -fexpensive-optimizations \
                      -ffast-math -lpthread " + archopt,
        "intel": "-O2 -ip " + archopt,
        "aocc": m64 + "-O3 -ffp-contract=fast \
                       -ffast-math " + archopt,
    }
```

It did not set the optimization flags for clang and NVHPC. We could add them by ourselves. Below is an example; you should try different flags based on your architecture.

"clang": m64 + "-O3 -ffp-contract=fast -ffast-math -mprefer-vector-width=256 " + archopt, "nvhpc": m64 + "-O3 -fast " + archopt,

Then we could build NAMD with clang and NVHPC. Surprisingly, clang is faster than Intel compiler on our machine (Haswell).

From the charm++ documentation we can see that different load balancer modules can be compiled into NAMD with different flags, but the default spack build script does not include them. Having a load balancer suited to your architecture and interconnect is important. Some load balancers may cause a compile error because of multiple definitions.

-module TreeLB -module RecBipartLB ... links the listed LB modules into an application, which can then be used at runtime via the +balancer option.

spack edit namd

```python
# in function def edit(self, spec, prefix): add this line
opts.extend(["-module CentralLB -module DistributedLB"])
```

From our experience, the default load balancer, the CentralLB, and the DistributedLB should be chosen based on the input and the architecture. It will bring about a 5-10% performance difference. You can also experiment with other load balancers or even write your own load balancer (not that hard!).

There are also some other flags that could be tuned. Since our machine does not support AVX512, we did not try the Intel-optimized AVX512 blocking version of NAMD.

Assignment

  1. Quick Start & Homework: https://gitlab.msu.edu/vermaaslab/indysccnamd/-/tree/main

  2. Running Command:

    ```bash
    # One per node with SMP
    mpirun -np 20 -hostfile ~/work/host20 -bind-to core -map-by ppr:1:node -x PATH \
        namd2 +ppn 19 +pemap 1-19 +commap 0 run.namd

    # One per NUMA with SMP (You should check your NUMA topology first)
    mpirun -np 40 -hostfile ~/work/host20 --bind-to core --map-by ppr:2:node -x PATH \
        namd2 +ppn 9 +pemap 1-4,10-14,5-9,15-19 +commap 0,5 ./run.namd

    # Run with 19 replicas
    mpirun -np 19 -hostfile ~/work/host20 --bind-to core --map-by ppr:1:node -x PATH \
        namd2 +replicas 19 +balancer DistributedLB +ppn 20 +pemap 0-19 +commap 0 \
        +stdout output/out.%d.log ./replicaconfig.namd
    ```

    Here we oversubscribe the cores. Since core 0 is lightly loaded when communication is not heavy, we can also assign it to computation. Note that in above commands, we always put core 0 or core 5 to communication. This is because we have set most of the interrupt affinity to core 0 and core 5, using them could get better performance on both communication and computation. (it will be up to 5% slower if you use other cores)

  3. MPI Reference: https://www.ks.uiuc.edu/Research/namd/2.9/ug/node87.html

  4. Our submission: https://drive.google.com/file/d/1HqxWP6YJIr06wz6ANMHog3v59HnhV7T2/view?usp=share_link

Final

  1. Problem Set: https://drive.google.com/file/d/1zyWpv-bfN2uzke7RqnpS8PI6Q-AQFtKb/view?usp=share_link

  2. Our Submission: https://drive.google.com/drive/folders/1dpVS6027vJTsbxlOjfMEGdftOfQyCmlO?usp=share_link

Experience

Introduction to NAMD configuration file parameters:

  1. NAMD configuration parameters: https://www.ks.uiuc.edu/Research/namd/2.9/ug/node12.html
  2. Non-bonded Interaction & Parameters: https://www.ks.uiuc.edu/Research/namd/2.10b2/ug/node23.html

Tuning approach:

  1. Overall idea: within the range where the simulation stays stable, make the products timestep × nonbondedFreq and timestep × fullElectFrequency as large as possible (see the example config fragment after this list).

  2. The output frequency also affects performance; writing output less often gains roughly 5-10%.

  3. Reference:

    1. https://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdPerformanceTuning
    2. https://www.ks.uiuc.edu/Research/namd/cvs/ug/node95.html
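
A hypothetical fragment of the corresponding .namd settings (the parameter names are real NAMD options, but the values are placeholders and must be validated against a stable, physically sensible run):

```
timestep            2.0     ;# fs per MD step
nonbondedFreq       1       ;# nonbonded forces every step
fullElectFrequency  2       ;# full electrostatics (PME) every 2 steps
outputEnergies      5000    ;# steps between energy output lines
dcdfreq             5000    ;# steps between trajectory frames
```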

SC21

Website: https://sc21.supercomputing.org/program/studentssc/student-cluster-competition/ Rewind: https://victoryang00.cn/wordpress/2021/11/18/sc21-shi-bai-hui-gu/

Quantum Espresso

https://github.com/QEF/q-e

compile

Could not find MPI (Missing MPI_FORTRAN_FOUND)

Solution: -DMPIEXEC_EXECUTABLE=${MPI_HOME}/bin/mpiexec

The compiled version does not support OpenMP and only supports up to 4 MPI processes.

Add the options:

-DQE_ENABLE_OPENMP=ON -DCMAKE_Fortran_COMPILER=${MPI_HOME}/bin/mpifort -DOpenMP_C_FLAGS=-fopenmp=lomp -DOpenMP_CXX_FLAGS=-fopenmp=lomp -DOpenMP_C_LIB_NAMES=libomp -DOpenMP_CXX_LIB_NAMES=libomp -DOpenMP_libomp_LIBRARY=/usr/lib/x86_64-linux-gnu/libomp.so.5

Change Toolchain to System.

Add -g to CMakeLists.txt to get additional debug information.

```cmake
set(CMAKE_CXX_FLAGS -g)
set(CMAKE_C_FLAGS -g)
set(CMAKE_Fortran_FLAGS -g)
```

https://www.quantum-espresso.org/Doc/user_guide/

library configure: https://www.quantum-espresso.org/Doc/user_guide/node11.html

test

In directory /q-e/test-suite/, use make run-tests to test the correctness of basic functionalities.

run

spack load ucx/gji

/home/qe/q-e/bin/pw.x

To control the number of processors in each group, command line switches: -nimage, -npools, -nband, -ntg, -ndiag or -northo (shorthands, respectively: -ni, -nk, -nb, -nt, -nd) are used. As an example consider the following command line: mpirun -np 4096 ./neb.x -ni 8 -nk 2 -nt 4 -nd 144 -i my.input This executes a NEB calculation on 4096 processors, 8 images (points in the configuration space in this case) at the same time, each of which is distributed across 512 processors. k-points are distributed across 2 pools of 256 processors each, 3D FFT is performed using 4 task groups (64 processors each, so the 3D real-space grid is cut into 64 slices), and the diagonalization of the subspace Hamiltonian is distributed to a square grid of 144 processors (12x12).

mpirun -np 24 -x PATH --oversubscribe -x OMP_NUM_THREADS=4 -x LD_LIBRARY_PATH=/opt/nonspack/ucx-1.10.0-gcc/lib --allow-run-as-root /home/qe/q-e/bin/pw.x < ./ausurf.in

First run with 24 processes and 4 threads each:

Problem: OMP threads can only use up to 200% CPU per process even with 256 threads per process.

Analyze

Static Analysis

Using lizard

Fortran:

| Total nloc | Avg. NLOC | Avg. CCN | Avg. token | Fun Cnt | Warning cnt | Fun Rt | nloc Rt |
|-----------:|----------:|---------:|-----------:|--------:|------------:|-------:|--------:|
| 599949     | 54.1      | 10.6     | 569.7      | 9939    | 1693        | 0.17   | 0.58    |

C:

| Total nloc | Avg. NLOC | Avg. CCN | Avg. token | Fun Cnt | Warning cnt | Fun Rt | nloc Rt |
|-----------:|----------:|---------:|-----------:|--------:|------------:|-------:|--------:|
| 52039      | 152.5     | 3.0      | 1050.3     | 323     | 19          | 0.06   | 0.53    |

Python:

| Total nloc | Avg. NLOC | Avg. CCN | Avg. token | Fun Cnt | Warning cnt | Fun Rt | nloc Rt |
|-----------:|----------:|---------:|-----------:|--------:|------------:|-------:|--------:|
| 8864       | 18.3      | 5.0      | 146.0      | 298     | 21          | 0.07   | 0.26    |

Profiling result

All the GPU-version test cases seem to hit an IEEE underflow triggered by the FFT library; this should eventually be fixed, since the development team is still actively adapting the application to GPUs.

We chose a case called si.scf.david.in to profile on a single GPU. Here is the profiling result.

=117847== Profiling application: /home/qe/bin/pw.x -i ./si.scf.david.in ==117847== Profiling result: Type Time(%) Time Calls Avg Min Max Name GPU activities: 8.72% 22.118ms 140 157.98us 157.82us 159.81us usnldiag_collinear_79_gpu 6.46% 16.390ms 1360 12.051us 11.840us 189.41us init_us_2_base_gpu_216_gpu 5.29% 13.411ms 10 1.3411ms 1.3407ms 1.3417ms rotate_wfc_k_gpu_146_gpu 4.24% 10.763ms 370 29.090us 28.704us 32.928us ylmr2_gpum_ylmr2_gpu_kernel_ 3.71% 9.4250ms 1127 8.3620us 6.5280us 17.664us volta_zgemm_32x32_nn 3.23% 8.1880ms 1224 6.6890us 6.5920us 7.1040us init_us_2_base_gpu_220_gpu 2.68% 6.8026ms 680 10.003us 9.8560us 10.784us init_us_2_base_gpu_185_gpu 2.67% 6.7818ms 340 19.946us 19.744us 21.280us init_us_2_base_gpu_206_gpu 2.61% 6.6090ms 340 19.438us 19.295us 21.504us init_us_2_base_gpu_158_gpu 2.46% 6.2396ms 689 9.0560us 7.2000us 14.432us void zgemm_largek_warp<bool=1, bool=0, bool=1, bool=0, int=3, int=3, int=4, int=3, int=2, int=2, int=9>(double2*, double2 const *, double2 const *, int, int, int, int, int, int, double2 const *, double2 const *, double2, double2, int, int, int*, int*) 2.28% 5.7953ms 159 36.448us 19.392us 43.200us cegterg_gpu_493_gpu 2.20% 5.5704ms 1104 5.0450us 4.1600us 11.488us void composite_2way_fft<unsigned int=20, unsigned int=4, unsigned int=32, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=5, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>) 2.17% 5.4956ms 478 11.497us 11.359us 12.864us add_vuspsi_k_gpu_242_gpu 1.98% 5.0265ms 239 21.031us 10.208us 40.384us vloc_psi_k_gpu_464_gpu 1.86% 4.7254ms 219 21.577us 12.319us 33.824us void sytd2_upper_cta<double2, double, int=4>(int, double2*, unsigned long, double*, double*, double2*) 1.71% 4.3307ms 219 19.774us 19.743us 20.960us laxlib_cdiaghg_gpu_349_gpu 1.64% 4.1660ms 239 17.430us 17.248us 19.488us vloc_psi_k_gpu_477_gpu 1.48% 3.7585ms 1 3.7585ms 3.7585ms 3.7585ms force_corr_gpu_103_gpu 1.45% 3.6914ms 239 15.444us 15.264us 16.704us vloc_psi_k_gpu_456_gpu 1.40% 3.5579ms 2320 1.5330us 1.4080us 13.056us [CUDA memcpy DtoH] 1.36% 3.4570ms 219 15.785us 15.712us 16.352us laxlib_cdiaghg_gpu_317_gpu 1.34% 3.4099ms 159 21.445us 21.280us 23.136us g_psi_gpu_53_gpu 1.28% 3.2424ms 1979 1.6380us 1.2160us 13.120us [CUDA memcpy HtoD] 1.22% 3.0915ms 552 5.6000us 4.2880us 9.0560us void composite_2way_fft<unsigned int=20, unsigned int=4, unsigned int=16, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=5, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>) 1.19% 3.0239ms 239 12.652us 10.816us 14.240us h_psi__gpu_158_gpu 1.14% 2.8893ms 219 13.193us 9.2160us 20.192us void trsm_ln_up_kernel<double2, unsigned int=32, unsigned int=32, unsigned int=4, bool=0>(int, int, double2 const *, int, double2*, int, double2, double2 const *, int, int*) 1.12% 2.8463ms 1095 2.5990us 2.4960us 3.2640us copy_info_kernel(int, int*) 1.06% 2.6975ms 170 15.867us 15.647us 16.544us init_us_2_base_gpu_119_gpu 1.02% 2.5845ms 40 64.612us 64.320us 72.960us stres_us_k_gpu_702_gpu 1.01% 2.5699ms 159 16.162us 16.096us 16.704us reorder_evals_cevecs_707_gpu 0.99% 2.5005ms 40 62.512us 62.240us 70.656us stres_us_k_gpu_817_gpu 0.97% 2.4644ms 159 15.499us 15.232us 16.576us cegterg_gpu_427_gpu 0.96% 2.4360ms 70 34.799us 34.720us 35.424us cegterg_gpu_265_gpu 0.89% 2.2453ms 40 56.131us 55.840us 63.040us stres_knl_gpu_100_gpu 0.86% 2.1855ms 40 54.636us 54.463us 56.832us stres_us_k_gpu_543_gpu 0.82% 2.0773ms 243 8.5480us 7.2320us 11.904us fft_scalar_cufft_cfft3d_gpu_586_gpu 0.82% 2.0749ms 280 7.4100us 7.3280us 
7.8720us get_rho_gpu_954_gpu 0.80% 2.0350ms 212 9.5990us 9.4080us 10.016us dp_dev_memcpy_c2d_770_gpu 0.71% 1.7922ms 689 2.6010us 2.4960us 3.7440us void scal_kernel<double2, double2, int=1, bool=1, int=5, int=4, int=4, int=4>(cublasTransposeParams<double2>, double2 const *, double2*, double2 const *) 0.70% 1.7640ms 159 11.094us 10.912us 11.744us cegterg_gpu_376_gpu 0.67% 1.7032ms 508 3.3520us 3.1670us 4.4480us void reduce_1Block_kernel<double, int=128, int=7, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>>(double const *, double, double, int, double const *, double, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>, cublasPointerMode_t, cublasLtEpilogue_t, cublasGemvTensorStridedBatched<biasType<cublasGemvTensorStridedBatched<double>::value_type, double>::type const >) 0.67% 1.7000ms 508 3.3460us 3.1680us 4.8640us void dot_kernel<double, int=128, int=0, cublasDotParams<cublasGemvTensor<double const >, cublasGemvTensorStridedBatched<double>>>(double const ) 0.66% 1.6738ms 40 41.843us 41.760us 42.944us stres_us_k_gpu_617_gpu 0.66% 1.6658ms 159 10.476us 10.432us 11.136us reorder_evals_cevecs_700_gpu 0.54% 1.3789ms 219 6.2960us 5.1840us 8.9280us void potrf_alg2_cta_upper<double2, double, int=32>(int, int, double2*, unsigned long, int*) 0.53% 1.3506ms 170 7.9440us 7.8400us 8.6080us init_us_2_base_gpu_134_gpu 0.53% 1.3341ms 438 3.0450us 2.4960us 188.80us void lapack_identity_kernel<double, int=8>(int, int, double*, int) 0.52% 1.3279ms 219 6.0630us 5.0880us 8.6400us void trsm_right_kernel<double2, int=256, int=4, bool=0, bool=0, bool=0, bool=1, bool=0>(cublasTrsmParams<double2>, double2, double2 const *, int) 0.52% 1.3185ms 219 6.0200us 4.3200us 8.2880us void ormql_cta_kernel<double2, int=4, int=1>(int, int, int, double2 const *, unsigned long, double2 const *, double2*, unsigned long, int, int, int, int) 0.52% 1.3185ms 90 14.649us 14.496us 15.072us dylmr2_gpu_78_gpu 0.51% 1.2925ms 209 6.1840us 6.1440us 6.4640us dp_dev_memcpy_r1d_270_gpu 0.50% 1.2803ms 71 18.033us 17.983us 18.687us cegterg_gpu_615_gpu 0.50% 1.2592ms 438 2.8740us 2.7200us 3.8720us void kernel_extract_uplo_A<double2, int=5, int=3>(int, double2 const *, unsigned long, double2*, unsigned long, int) 0.50% 1.2586ms 163 7.7210us 7.5840us 8.0000us dp_dev_memset_c2d_1851_gpu 0.47% 1.1830ms 408 2.8990us 2.4960us 3.7440us __pgi_dev_cumemset_16n 0.47% 1.1818ms 80 14.772us 14.496us 17.216us g2_kin_gpu_40_gpu 0.44% 1.1150ms 169 6.5970us 5.6960us 9.1200us void trsm_left_kernel<double2, int=256, int=4, bool=0, bool=1, bool=1, bool=1, bool=0>(cublasTrsmParams<double2>, double2, double2 const *, int) 0.42% 1.0619ms 52 20.420us 18.944us 27.136us volta_zgemm_32x32_cn 0.42% 1.0610ms 70 15.157us 15.104us 16.032us sum_band_k_gpu_837_gpu 0.40% 1.0224ms 219 4.6680us 4.2240us 5.4720us void lansy_M_stage1<double2, double, int=8>(int, double2 const *, unsigned long, double*, int) 0.40% 1.0046ms 90 11.162us 11.040us 11.488us dylmr2_gpu_90_gpu 0.39% 984.57us 80 12.307us 12.223us 12.928us atomic_wfc___gpu_396_gpu 0.37% 946.72us 80 11.833us 11.744us 12.224us compute_deff_gpu_41_gpu 0.36% 909.82us 689 1.3200us 1.2480us 2.0160us [CUDA memset] 0.34% 856.35us 219 3.9100us 3.8080us 5.6000us void batch_symmetrize_kernel<double2, int=5, int=3>(int, double2*, unsigned long, __int64, int, int) 0.34% 855.00us 30 28.500us 28.352us 29.568us gen_us_dy_gpu_229_gpu 0.33% 842.37us 90 9.3590us 9.2480us 9.8240us dylmr2_gpu_101_gpu 0.33% 827.00us 90 9.1880us 9.0230us 10.048us 
dylmr2_gpu_60_gpu 0.30% 772.22us 219 3.5260us 3.4870us 4.8000us void lansy_M_stage2<double, int=8>(int, double*) 0.29% 745.95us 30 24.865us 24.831us 25.120us gen_us_dy_gpu_198_gpu 0.28% 703.80us 30 23.460us 23.423us 24.128us gen_us_dy_gpu_146_gpu 0.27% 690.78us 219 3.1540us 3.0720us 3.7120us void lapack_lacpy_kernel<double, int=8>(int, int, double const *, int, double*, int, int, int) 0.27% 685.82us 219 3.1310us 3.0390us 3.6480us void laed0_phase1_kernel<double, int=8>(int, double const *, int, int const *, double*, int, int, int) 0.25% 644.64us 219 2.9430us 2.8800us 3.9040us void stedcx_convert_kernel<double2, double, int=8>(int, int, double const *, int, double2*, int) 0.25% 642.30us 219 2.9320us 2.8800us 3.2960us void lacpy_kernel<double2, double2, int=5, int=3>(int, int, double2 const *, unsigned long, double2*, unsigned long, int, int) 0.25% 623.36us 219 2.8460us 2.8150us 3.2000us potrf_alg2_reset_info(int*) 0.24% 598.37us 219 2.7320us 2.6880us 2.8800us dtrsv_init_up(int*, int) 0.24% 596.93us 219 2.7250us 2.6880us 3.2320us potrf_alg2_set_info(int, int, int*) 0.22% 558.62us 30 18.620us 18.432us 18.911us gen_us_dy_gpu_85_gpu 0.21% 525.28us 70 7.5030us 7.4560us 7.6160us diag_bands_k_693_gpu 0.18% 457.21us 30 15.240us 15.136us 15.968us force_us_gpu_104_gpu 0.18% 456.89us 50 9.1370us 8.9910us 14.144us void trsm_lt_up_kernel<double2, unsigned int=32, unsigned int=32, unsigned int=4, bool=0, bool=1>(int, int, double2 const *, int, double2*, int, double2, double2 const *, int, int*) 0.18% 454.24us 30 15.141us 15.040us 17.024us gen_us_dy_gpu_185_gpu 0.18% 453.47us 70 6.4780us 6.4320us 6.7520us dp_dev_memset_r2d_1431_gpu 0.17% 437.12us 20 21.856us 21.632us 23.712us atomic_wfc_gpu_108_gpu 0.17% 427.58us 20 21.379us 20.992us 23.104us interp_atwfc_gpu_30_gpu 0.15% 381.34us 30 12.711us 12.608us 13.184us gen_us_dy_gpu_102_gpu 0.14% 362.69us 60 6.0440us 5.9510us 6.2720us gen_us_dy_gpu_220_gpu 0.13% 334.53us 78 4.2880us 3.9040us 5.5360us void gemv2N_kernel<int, int, double2, double2, double2, double2, int=128, int=16, int=4, int=4, int=1, bool=0, cublasGemvParams<cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>, double2>>(double2 const ) 0.12% 298.91us 1 298.91us 298.91us 298.91us compute_dvloc_gpum_compute_dvloc_gpu_ 0.10% 255.07us 10 25.507us 25.280us 27.392us gen_us_dj_gpu_206_gpu 0.10% 248.74us 10 24.873us 24.800us 25.216us gen_us_dj_gpu_173_gpu 0.10% 243.93us 10 24.393us 24.256us 25.440us gen_us_dj_gpu_119_gpu 0.08% 204.67us 30 6.8220us 6.7520us 6.9760us gen_us_dy_gpu_112_gpu 0.08% 198.24us 52 3.8120us 3.5520us 4.9280us void splitKreduce_kernel<double2, double2, double2, double2>(cublasSplitKParams<double2>, double2 const *, double2 const *, double2*, double2 const *, double2 const *, double2 const *) 0.08% 197.82us 52 3.8040us 3.6480us 4.7040us void gemvNSP_kernel<double2, double2, double2, double2, int=1, int=32, int=4, int=1024, cublasGemvParams<cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>, double2>>(double2 const ) 0.08% 194.37us 10 19.436us 19.072us 20.832us init_wfc_gpu_295_gpu 0.07% 186.46us 10 18.646us 18.592us 18.816us gen_us_dj_gpu_73_gpu 0.07% 182.18us 10 18.217us 18.176us 18.399us stres_knl_gpu_84_gpu 0.07% 173.02us 20 8.6510us 8.6400us 8.8320us cegterg_gpu_288_gpu 0.07% 172.42us 20 8.6200us 8.5120us 9.0560us stres_us_gpu_131_gpu 0.07% 171.01us 10 17.100us 17.024us 17.376us atomic_wfc_gpu_70_gpu 0.06% 152.13us 
10 15.212us 15.071us 16.384us gen_us_dj_gpu_160_gpu 0.05% 137.73us 50 2.7540us 2.7200us 2.9760us dtrsv_init(int*) 0.05% 135.39us 2 67.695us 64.959us 70.432us force_corr_gpu_124_gpu 0.05% 123.78us 20 6.1880us 5.8880us 6.7520us void gemv2T_kernel_val<int, int, double2, double2, double2, double2, int=128, int=16, int=4, int=4, bool=1, bool=0, cublasGemvParams<cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>, double2>>(double2 const , double2, double2) 0.05% 120.93us 20 6.0460us 5.9520us 6.3680us gen_us_dj_gpu_197_gpu 0.04% 103.62us 10 10.361us 10.304us 10.848us stres_us_gpu_91_gpu 0.04% 96.448us 7 13.778us 13.568us 14.176us dfunct_gpum_newd_gpu_311_gpu 0.04% 94.400us 1 94.400us 94.400us 94.400us stres_ewa_gpu_155_gpu 0.03% 72.992us 10 7.2990us 7.1360us 8.4160us init_wfc_gpu_391_gpu 0.03% 72.800us 2 36.400us 34.432us 38.368us force_lc_gpu_119_gpu 0.03% 72.768us 1 72.768us 72.768us 72.768us stres_har_gpu_77_gpu 0.03% 69.888us 10 6.9880us 6.8480us 7.4240us atomic_wfc_gpu_85_gpu 0.03% 67.520us 1 67.520us 67.520us 67.520us stres_loc_gpu_155_gpu 0.02% 59.712us 10 5.9710us 5.8880us 6.2080us rotate_wfc_k_gpu_132_gpu 0.01% 24.384us 6 4.0640us 3.7760us 4.9600us void reduce_1Block_kernel<double2, int=64, int=6, cublasGemvTensorStridedBatched<double2>, cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>>(double2 const *, double2, double2, int, double2 const *, double2, cublasGemvTensorStridedBatched<double2>, double2 const , cublasPointerMode_t, cublasLtEpilogue_t, cublasGemvTensorStridedBatched<biasType<double2 const value_type, double2>::type const >) 0.01% 24.224us 6 4.0370us 3.7760us 4.8960us void dot_kernel<double2, int=64, int=1, cublasDotParams<cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>>>(double2 const ) 0.01% 21.568us 1 21.568us 21.568us 21.568us stres_loc_gpu_98_gpu 0.01% 15.264us 6 2.5440us 2.4640us 2.8160us __pgi_dev_cumemset_4n 0.00% 9.7280us 1 9.7280us 9.7280us 9.7280us dvloc_of_g_gpu_184_gpu API calls: 56.54% 877.99ms 1715 511.95us 489ns 409.99ms cudaFree 19.84% 308.14ms 900 342.37us 1.4400us 295.87ms cudaDeviceSynchronize 7.03% 109.13ms 20152 5.4150us 4.5100us 310.44us cudaLaunchKernel 4.31% 66.931ms 1542 43.405us 4.6000us 3.8148ms cudaMemcpy 2.19% 34.061ms 2479 13.739us 3.8100us 180.48us cudaMemcpyAsync 2.12% 32.959ms 2557 12.889us 4.6510us 239.27us cudaEventSynchronize 1.43% 22.244ms 20 1.1122ms 822.92us 2.3907ms cuDeviceTotalMem 1.11% 17.296ms 6645 2.6020us 749ns 186.38us cudaEventRecord 0.93% 14.380ms 1744 8.2450us 1.8290us 1.3001ms cudaMalloc 0.75% 11.621ms 1977 5.8780us 149ns 1.6835ms cuDeviceGetAttribute 0.57% 8.8800ms 20143 440ns 330ns 287.69us cudaDeviceGetAttribute 0.49% 7.6111ms 1656 4.5960us 4.0700us 31.689us cuLaunchKernel 0.33% 5.1501ms 10579 486ns 330ns 239.62us cudaGetDevice 0.29% 4.4656ms 6 744.27us 448.31us 2.1013ms cudaGetDeviceProperties 0.28% 4.4199ms 10835 407ns 150ns 2.2176ms cudaGetLastError 0.25% 3.8660ms 1384 2.7930us 1.8200us 8.4200us cudaStreamSynchronize 0.20% 3.1513ms 689 4.5730us 3.3890us 20.390us cudaMemsetAsync 0.19% 3.0171ms 2557 1.1790us 1.0100us 11.680us cudaEventElapsedTime 0.15% 2.3771ms 256 9.2850us 1.9900us 152.75us cudaSetDevice 0.15% 2.2786ms 1524 1.4950us 780ns 12.790us cudaEventQuery 0.14% 2.1870ms 145 15.083us 7.2200us 21.080us cudaMemcpy2D 0.11% 1.7847ms 147 12.140us 4.5000us 738.97us cudaMallocHost 0.11% 1.7611ms 2336 753ns 469ns 12.960us 
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags 0.09% 1.3806ms 20 69.028us 41.230us 387.93us cuDeviceGetName 0.09% 1.3584ms 133 10.213us 4.9500us 107.07us cudaMemcpyToSymbol 0.09% 1.3446ms 508 2.6460us 2.2900us 14.350us cudaFuncGetAttributes 0.05% 771.33us 146 5.2830us 3.7500us 20.409us cudaFreeHost 0.04% 625.29us 44 14.211us 1.3800us 205.11us cudaStreamCreate 0.02% 380.08us 552 688ns 510ns 3.6400us cudaStreamIsCapturing 0.02% 359.66us 44 8.1740us 3.8090us 92.571us cudaStreamDestroy 0.01% 195.34us 267 731ns 620ns 15.100us cudaEventCreate 0.01% 170.44us 562 303ns 200ns 1.2400us cuCtxPushCurrent 0.01% 158.23us 562 281ns 200ns 810ns cuCtxPopCurrent 0.01% 116.94us 146 800ns 480ns 2.9910us cudaPointerGetAttributes 0.00% 54.041us 90 600ns 460ns 2.8110us cudaEventCreateWithFlags 0.00% 40.090us 3 13.363us 2.4000us 32.530us cudaStreamCreateWithFlags 0.00% 20.707us 24 862ns 250ns 6.3000us cuDeviceGet 0.00% 18.040us 4 4.5100us 1.8300us 9.0200us cuDeviceGetPCIBusId 0.00% 17.489us 4 4.3720us 2.5690us 9.3200us cuInit 0.00% 16.104us 45 357ns 180ns 1.9900us cudaGetFuncBySymbol 0.00% 13.147us 8 1.6430us 1.3110us 3.2490us cudaEventDestroy 0.00% 5.2070us 20 260ns 150ns 580ns cuDeviceGetUuid 0.00% 3.3580us 7 479ns 230ns 940ns cuDeviceGetCount 0.00% 2.6790us 10 267ns 180ns 360ns cuCtxGetCurrent 0.00% 1.2700us 2 635ns 190ns 1.0800us cudaGetDeviceCount 0.00% 1.1300us 4 282ns 240ns 380ns cuDriverGetVersion 0.00% 920ns 5 184ns 170ns 200ns cuCtxGetDevice 0.00% 309ns 1 309ns 309ns 309ns cudaDriverGetVersion 0.00% 200ns 1 200ns 200ns 200ns cudaRuntimeGetVersion

Compile with ICC

Compile with Intel icc and the FFTW library.

spack load intel-oneapi-compilers@2021.1.2

spack load intel-parallel-studio@cluster-2020.2

spack load netlib-lapack@3.9.1/nbc

spack load openmpi@4.1.1/jip

./configure --prefix=/home/qe/fftw-3.3.9 F77=ifort CC=icc CFLAGS="-O3 -g -march=native" FFLAGS="-O3 -g" -enable-openmp

make -j 128 all

If the option -march=native is added in FFLAGS, ifort will throw an error

ifort: error #10106: Fatal error in /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/compiler/2021.1.2/linux/bin/intel64/../../bin/intel64/fortcom, terminated by segmentation violation

Tuning with different numbers of MPI processes and OpenMP threads on one node, 32 processes with 8 threads each gave the best performance on the AUSURF112 test case.

​ PWSCF : 37m 3.31s CPU 4m46.48s WALL

Compile with AOCC

spack load aocc@3.0.0/46t spack load amdfftw@3.0 spack load openmpi@4.1.1/nqq export F90=flang export F77=flang export FC=flang export CC=clang export CXX=clang++ ./configure --enable-parallel --enable-openmp CFLAGS="-O3 -g -march=znver2" FFLAGS="-O3 -g -march=znver2" FFT_LIBS="/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3.a /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3_omp.a /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3_threads.a" BLAS_LIBS=/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdblis-3.0-avcgn4ja67j4wz5euv6usv4rt2okvytg/lib/libblis-mt.a LAPACK_LIBS=/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdlibflame-3.0-6tev4j6setn6jmojmydlnz3qi4bn5qrs/lib/libflame.a MPI_LIBS="-L/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/openmpi-4.1.1-nqqearshseiwkncy5roqcqij5dieen3p/lib" DFLAGS="-D__FFTW3 -D__MPI" IFLAGS="-I/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/include -I/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdblis-3.0-avcgn4ja67j4wz5euv6usv4rt2okvytg/include -I/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdlibflame-3.0-6tev4j6setn6jmojmydlnz3qi4bn5qrs/include -I/home/qe/q-e/include"

pitfall: qe configure does not recognize flang. Need to change F90=flang in make.inc manually.

This build does not pass the test suite, and the AUSURF112 benchmark does not converge (the errors may come from the libraries).

All done. ERROR: only 166 out of 221 tests passed. Failed tests in: /home/qe/q-e/test-suite/pw_b3lyp/ /home/qe/q-e/test-suite/pw_berry/ /home/qe/q-e/test-suite/pw_cluster/ /home/qe/q-e/test-suite/pw_electric/ /home/qe/q-e/test-suite/pw_lda+U/ /home/qe/q-e/test-suite/pw_lsda/ /home/qe/q-e/test-suite/pw_md/ /home/qe/q-e/test-suite/pw_metaGGA/ /home/qe/q-e/test-suite/pw_metal/ /home/qe/q-e/test-suite/pw_noncolin/ /home/qe/q-e/test-suite/pw_pawatom/ /home/qe/q-e/test-suite/pw_realspace/ /home/qe/q-e/test-suite/pw_relax/ /home/qe/q-e/test-suite/pw_scf/ /home/qe/q-e/test-suite/pw_spinorbit/ /home/qe/q-e/test-suite/pw_uspp/ /home/qe/q-e/test-suite/pw_vc-relax/ /home/qe/q-e/test-suite/pw_vdw/ /home/qe/q-e/test-suite/pw_workflow_relax_relax/ /home/qe/q-e/test-suite/pw_workflow_scf_dos/ /home/qe/q-e/test-suite/pw_workflow_vc-relax_dos/ /home/qe/q-e/test-suite/pw_workflow_vc-relax_scf/ starting charge 1230.69946, renormalised to 1232.00000 negative rho (up, down): 3.043E+00 0.000E+00 Starting wfcs are 1008 randomized atomic wfcs [epyc.node1:216922] 127 more processes have sent help message help-btl-vader.txt / xpmem-make-failed [epyc.node1:216922] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages [epyc.node1:216922] 127 more processes have sent help message help-btl-vader.txt / knem permission denied total cpu time spent up to now is 22.9 secs Self-consistent Calculation iteration # 1 ecut= 25.00 Ry beta= 0.70 Davidson diagonalization with overlap ethr = 1.00E-02, avg # of iterations = 5.0 Threshold (ethr) on eigenvalues was too large: Diagonalizing with lowered threshold Davidson diagonalization with overlap ethr = 4.37E-04, avg # of iterations = 18.5 negative rho (up, down): 2.992E+00 0.000E+00 total cpu time spent up to now is 430.1 secs total energy = -11423.48971757 Ry estimated scf accuracy < 6.31636318 Ry iteration # 2 ecut= 25.00 Ry beta= 0.70 Davidson diagonalization with overlap ethr = 5.13E-04, avg # of iterations = 15.5 negative rho (up, down): 2.993E+00 0.000E+00 total cpu time spent up to now is 795.7 secs total energy = -11408.37987998 Ry estimated scf accuracy < 196.19698446 Ry End of self-consistent calculation convergence NOT achieved after 2 iterations: stopping Writing output data file ./ausurf.save/ [epyc:216930:0:216930] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fc7000) ==== backtrace (tid: 216930) ==== 0 /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.3.0/ucx-1.10.1-xby34b5gbwxi5cknbevj4wlbs34hyri6/lib/libucs.so.0(ucs_handle_error+0x254) [0x7fd0b3b587d4] 1 /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.3.0/ucx-1.10.1-xby34b5gbwxi5cknbevj4wlbs34hyri6/lib/libucs.so.0(+0x269b7) [0x7fd0b3b589b7] 2 /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.3.0/ucx-1.10.1-xby34b5gbwxi5cknbevj4wlbs34hyri6/lib/libucs.so.0(+0x26c8e) [0x7fd0b3b58c8e] 3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7fd0b4180730] 4 /home/qe/q-e/bin/pw.x() [0x11e3890] 5 /home/qe/q-e/bin/pw.x() [0x11e3e47] 6 /home/qe/q-e/bin/pw.x() [0x11ef0ce] 7 /home/qe/q-e/bin/pw.x() [0x117a124] 8 /home/qe/q-e/bin/pw.x() [0x9087e0] 9 /home/qe/q-e/bin/pw.x() [0x9085c7] 10 /home/qe/q-e/bin/pw.x() [0x9084f7] 11 /home/qe/q-e/bin/pw.x() [0x906c58] 12 /home/qe/q-e/bin/pw.x() [0x920797] 13 /home/qe/q-e/bin/pw.x() [0x682772] 14 /home/qe/q-e/bin/pw.x() [0x67ca67] 15 /home/qe/q-e/bin/pw.x() [0x6a889f] 16 /home/qe/q-e/bin/pw.x() [0x4c8406] 17 /home/qe/q-e/bin/pw.x() [0x18baa23] 18 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fd0b3fd109b] 19 
/home/qe/q-e/bin/pw.x() [0x4c81da] ================================= -------------------------------------------------------------------------- Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. -------------------------------------------------------------------------- -------------------------------------------------------------------------- mpirun noticed that process rank 0 with PID 0 on node epyc exited on signal 11 (Segmentation fault). --------------------------------------------------------------------------

Compile with GCC

Specify the mkl libraries manually.

spack load gcc@10.2.0/3xz spack load openmpi@4.1.1/n46 ./configure --enable-parallel --with-scalapack=yes --enable-openmp CFLAGS="-O3 -g -march=znver2" FFLAGS="-O3 -g -march=znver2 -fallow-argument-mismatch" FFT_LIBS="/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3.a \ /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3_omp.a \ /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3_threads.a" \ BLAS_LIBS="/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_gf_lp64.a \ /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_sequential.a \ /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_core.a" \ LAPACK_LIBS=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_lapack95_lp64.a \ SCALAPACK_LIBS="/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_scalapack_ilp64.a \ /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_blacs_openmpi_lp64.a" \ MPI_LIBS="-L/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/openmpi-4.1.1-n46i3ctamj3tnmnd7qfzhabdweajbgsn/lib" \ DFLAGS="-D__FFTW3 -D__MPI -D__SCALAPACK" \ IFLAGS="-I/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/include -I/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdblis-3.0-avcgn4ja67j4wz5euv6usv4rt2okvytg/include -I/home/qe/q-e/include"

Error to be fixed:

/usr/bin/ld: /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_core.a(mkl_memory_patched.o): undefined reference to symbol 'dlclose@@GLIBC_2.2.5' /usr/bin/ld: //lib/x86_64-linux-gnu/libdl.so.2: error adding symbols: DSO missing from command line collect2: error: ld returned 1 exit status

Misc

The library used in Q-E compiled by intel compiler:

BLAS_LIBS= -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

SCALAPACK_LIBS=-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64

FFT_LIBS= fftw-3.3.9

init_run : 158.19s CPU 21.00s WALL ( 1 calls) electrons : 2063.54s CPU 264.73s WALL ( 1 calls) Called by init_run: wfcinit : 148.40s CPU 19.08s WALL ( 1 calls) potinit : 1.84s CPU 0.24s WALL ( 1 calls) hinit0 : 2.63s CPU 0.50s WALL ( 1 calls) Called by electrons: c_bands : 1937.22s CPU 247.62s WALL ( 3 calls) sum_band : 116.01s CPU 15.64s WALL ( 3 calls) v_of_rho : 2.32s CPU 0.30s WALL ( 3 calls) newd : 12.90s CPU 1.87s WALL ( 3 calls) mix_rho : 0.29s CPU 0.04s WALL ( 3 calls) Called by c_bands: init_us_2 : 1.41s CPU 0.29s WALL ( 14 calls) cegterg : 1931.14s CPU 246.85s WALL ( 6 calls) Called by *egterg: cdiaghg : 304.65s CPU 38.94s WALL ( 81 calls) h_psi : 656.99s CPU 84.10s WALL ( 85 calls) s_psi : 145.97s CPU 18.38s WALL ( 85 calls) g_psi : 0.31s CPU 0.05s WALL ( 77 calls) Called by h_psi: h_psi:calbec : 183.87s CPU 23.70s WALL ( 85 calls) vloc_psi : 321.07s CPU 41.10s WALL ( 85 calls) add_vuspsi : 150.67s CPU 19.07s WALL ( 85 calls) General routines calbec : 232.51s CPU 30.03s WALL ( 91 calls) fft : 3.38s CPU 0.44s WALL ( 40 calls) ffts : 0.93s CPU 0.15s WALL ( 6 calls) fftw : 348.65s CPU 44.30s WALL ( 37782 calls) interpolate : 0.26s CPU 0.03s WALL ( 3 calls) davcio : 0.04s CPU 0.27s WALL ( 6 calls)

The compiler option -march=native has no significant effect on speed.

We tried to run on two nodes, but it failed.

spack load intel-parallel-studio@cluster-2020.2

spack load openmpi@4.1.1/jip

spack load ucx/gji

mpirun --prefix /opt/spack/opt/spack/linux-debian10-zen2/intel-2021.1.2/openmpi-4.1.1-jipfb67ngxddcblg4rcsjuu47pskabrs/ -np 64 -hostfile ./hostfile -mca pml ucx -x UCX_TLS=rc_x,sm,self -x UCX_NET_DEVICES=mlx5_0:1 -x PATH -x LD_LIBRARY_PATH --oversubscribe /home/qe/q-e/bin/pw.x < ./ausurf.in

Set up the remote node for non-interactive logins

Add the following to .bashrc:

```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nonspack/ucx-1.10.0-gcc/lib
. /opt/spack/share/spack/setup-env.sh
spack load intel-parallel-studio@cluster-2020.2
spack load openmpi@4.1.1/jip
spack load ucx/gji
```

A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find.

Host: epyc.node2 Framework: pml Component: ucx

Arm Forge MAP Result

Original code compiled by the Intel compiler with MKL, test case AUSURF112.

Profiling : /home/qe/q-e/bin/pw.x -i ./ausurf.in Allinea sampler : preload MPI implementation : Auto-Detect (Open MPI) * MPI arguments * number of processes : 32 * number of nodes : 1 * Allinea MPI wrapper : preload (precompiled) Input file : <stdin> Working directory : /home/qe/benchmarks/sb/AUSURF112 Number of OpenMP threads : 8 Queue enabled : No System config file : /home/qe/.allinea/system.config OMP_NUM_THREADS (env var) : 8 Full target path : /home/qe/q-e/PW/src/pw.x Launched from host : epyc.node1 Run started : Sat Aug 28 07:04:24 2021 Sampling started : Sat Aug 28 07:04:24 2021 Sampling stopped : Sat Aug 28 07:09:39 2021 Runtime : 354s Sampled runtime : 315s

CPU floating-point: 38.2%

CPU memory access: 15.9%

CPU fp vector: 38.0%

CPU branch: 7.4%

Memory usage: 676MB

The pcegterg_IP_ functions spend a lot of time in mpi_barrier synchronization, even more than the actual computation time.

Compile Option

NVHPC

```bash
# LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/nvhpc-21.5-qrsvxrpkmqhxy2coxes2qzcfhirsy5uv/Linux_x86_64/21.5/comm_libs/openmpi4/openmpi-4.0.5/lib
spack load nvhpc@21.5/djb
spack load /tyv  # hdf5
```

OneAPI

LD_LIBRARY_PATH=/opt/intel/oneapi/vpl/2021.4.0/lib:/opt/intel/oneapi/tbb/2021.3.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.3.1//libfabric/lib:/opt/intel/oneapi/mpi/2021.3.1//lib/release:/opt/intel/oneapi/mpi/2021.3.1//lib:/opt/intel/oneapi/mkl/2021.3.0/lib/intel64:/opt/intel/oneapi/itac/2021.3.0/slib:/opt/intel/oneapi/ipp/2021.3.0/lib/intel64:/opt/intel/oneapi/ippcp/2021.3.0/lib/intel64:/opt/intel/oneapi/ipp/2021.3.0/lib/intel64:/opt/intel/oneapi/dnnl/2021.3.0/cpu_dpcpp_gpu_dpcpp/lib:/opt/intel/oneapi/debugger/10.1.2/gdb/intel64/lib:/opt/intel/oneapi/debugger/10.1.2/libipt/intel64/lib:/opt/intel/oneapi/debugger/10.1.2/dep/lib:/opt/intel/oneapi/dal/2021.3.0/lib/intel64:/opt/intel/oneapi/compiler/2021.3.0/linux/lib:/opt/intel/oneapi/compiler/2021.3.0/linux/lib/x64:/opt/intel/oneapi/compiler/2021.3.0/linux/lib/emu:/opt/intel/oneapi/compiler/2021.3.0/linux/lib/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2021.3.0/linux/lib/oclfpga/linux64/lib:/opt/intel/oneapi/compiler/2021.3.0/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/ccl/2021.3.0/lib/cpu_gpu_dpcpp:/media/victoryang/NetDisk/Documents/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/nvhpc-21.5-qrsvxrpkmqhxy2coxes2qzcfhirsy5uv/Linux_x86_64/21.5/compilers/lib:/media/victoryang/NetDisk/Documents/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/openssl-1.1.1k-v735mywfwhu5wwrc6rcppju7lxvoxegh/lib:/media/victoryang/NetDisk/Documents/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/zlib-1.2.11-aim3z46oucbopx4jmsvi6rj23psecql5/lib:/media/victoryang/NetDisk/Documents/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/ncurses-6.2-zdp3gdfsnlvphj7kpsgsfk3jvtxvuvz7/lib:/opt/intel/oneapi/mpi/2021.3.1//lib/release/

pitfalls

  1. https://github.com/MPAS-Dev/MPAS-Model/issues/554
  2. https://forums.developer.nvidia.com/t/problem-with-nvfortran-and-r/155366
  3. LibGOMP not IMPLEMENTED: fftw/scalapack/hdf5/elpa is not dependent on the compiler's lib.

performance

Compiler version:

```
nvfortran 21.2-0 LLVM 64-bit target on x86-64 Linux -tp zen
NVIDIA Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
```

1. GPU, single thread: real 1m51.316s, user 51m9.972s, sys 4m59.190s
2. GPU, 4 threads: real 1m34.486s, user 2m12.550s
3. 4 GPUs, 4 threads: real 6m26.432s, user 4h20m2.947s, sys 4h24.789s
4. 8 GPUs on 2 nodes, 4 threads: real 4m42.563s, user 1h24m6.227s, sys 2h0m4.267s

MPI + CUDA seems to call different routines of the GPU implementation, in which communication is always the limiting factor.

```c
#pragma acc host_data use_device(s_buf)
MPI_Send(s_buf, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

...

#pragma acc update host(s_buf[0:size])
MPI_Send(s_buf, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
```

So we are going to try GPU direct MPI.

```fortran
#if defined(__GPU_MPI)
   ierr = cudaDeviceSynchronize()   ! This syncs __GPU_MPI case
   CALL bcast_integer_gpu( msg_d, msglen, source, group )
   RETURN                           ! Sync done by MPI call (or inside bcast_xxx_gpu)
```

But CUBLAS and other GPU code is just fine for one thread.

#if defined(__CUDA) USE cudafor USE cublas #endif IMPLICIT NONE SAVE PRIVATE REAL(DP) :: one, zero, two, minus_one, minus_two PARAMETER ( one = 1.0d0, zero = 0.0d0, two = 2.0d0, minus_one = -1.0d0 ) PARAMETER ( minus_two = -2.0d0 ) COMPLEX(DP) :: cone, czero, mcone PARAMETER ( cone = (1.0d0, 0.0d0), czero = (0.0d0, 0.0d0) ) PARAMETER ( mcone = (-1.0d0, 0.0d0) ) REAL(DP) :: small = 1.0d-14 LOGICAL :: use_parallel_diag PUBLIC :: sigset PUBLIC :: tauset PUBLIC :: rhoset PUBLIC :: ortho_iterate PUBLIC :: updatc, calphi_bgrp PUBLIC :: mesure_diag_perf, mesure_mmul_perf PUBLIC :: use_parallel_diag PUBLIC :: bec_bgrp2ortho REAL(DP), ALLOCATABLE DEVICEATTR :: tmp1(:,:), tmp2(:,:), dd(:,:), tr1(:,:), tr2(:,:) REAL(DP), ALLOCATABLE DEVICEATTR :: con(:,:), x1(:,:) CONTAINS SUBROUTINE allocate_local_arrays(ldx) INTEGER, INTENT(IN) :: ldx IF( ALLOCATED( tr1 ) ) THEN IF( SIZE( tr1, 1 ) /= ldx ) THEN DEALLOCATE( tmp1, tmp2, dd, x1, con ) DEALLOCATE( tr1, tr2 ) END IF END IF IF( .NOT. ALLOCATED( tr1 ) ) THEN ALLOCATE( tr1(ldx,ldx), tr2(ldx,ldx) ) ALLOCATE( tmp1(ldx,ldx), tmp2(ldx,ldx), dd(ldx,ldx), x1(ldx,ldx), con(ldx,ldx) ) END IF END SUBROUTINE allocate_local_arrays SUBROUTINE deallocate_local_arrays() IF( ALLOCATED( tr1 ) ) DEALLOCATE( tr1 ) IF( ALLOCATED( tr2 ) ) DEALLOCATE( tr2 ) IF( ALLOCATED( tmp1 ) ) DEALLOCATE( tmp1 ) IF( ALLOCATED( tmp2 ) ) DEALLOCATE( tmp2 ) IF( ALLOCATED( dd ) ) DEALLOCATE( dd ) IF( ALLOCATED( x1 ) ) DEALLOCATE( x1 ) IF( ALLOCATED( con ) ) DEALLOCATE( con ) END SUBROUTINE deallocate_local_arrays SUBROUTINE clear_unused_elements( x, idesc ) ! ! Clear elements not involved in the orthogonalization ! IMPLICIT NONE REAL(DP) DEVICEATTR :: x(:,:) INTEGER, INTENT(IN) :: idesc(:) INTEGER :: nr, nc, i, j INCLUDE 'laxlib.fh' IF( idesc(LAX_DESC_ACTIVE_NODE) < 0 ) then x = 0.0d0 ELSE nr = idesc(LAX_DESC_NR) nc = idesc(LAX_DESC_NC) !$cuf kernel do(2) <<<*,*>>> do j = nc + 1, SIZE( x, 2 ) do i = 1, SIZE( x, 1 ) x( i, j ) = 0.0d0 end do end do !$cuf kernel do(2) <<<*,*>>> do j = 1, SIZE( x, 2 ) do i = nr + 1, SIZE( x, 1 ) x( i, j ) = 0.0d0 end do end do END IF END SUBROUTINE

ramBLe

turn off hyperthreading

sudo su
echo off > /sys/devices/system/cpu/smt/control
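To confirm SMT is really off, the standard sysfs and lscpu checks are enough (a small sketch):

```bash
cat /sys/devices/system/cpu/smt/control   # should print "off"
lscpu | grep 'Thread(s) per core'         # should report 1
```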

/home/opc/ramBLe

boost 1.70.0 & mvapich2.3.3


Gdrive

wget https://github.com/prasmussen/gdrive/releases/download/2.1.1/gdrive_2.1.1_linux_amd64.tar.gz
tar -zxvf gdrive_2.1.1_linux_amd64.tar.gz
wget https://forensics.cert.org/cert-forensics-tools-release-el7.rpm
sudo rpm -Uvh cert-forensics-tools-release*rpm
sudo yum --enablerepo=forensics install musl-libc -y

init env values

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/nfs/cluster/boost_1_70_0/stage/lib
source /home/opc/ramBLe/env.sh

mpi

source /opt/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-redhat7.9-x86_64/hpcx-init.sh
hpcx_load
mpirun -np 4 --display-map --map-by node -x MXM_RDMA_PORTS=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1

run

mpirun -np 144 --display-map --hostfile hostfiles -x MXM_RDMA_PORTS=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 ./ramble -f test/coronary.csv -n 6 -m 1841 -d -o test/coronary.dot
[opc@inst-dahrf-splendid-walrus ramBLe]$ cat hostfiles
hpc-node-1 slots=36
hpc-node-2 slots=36
hpc-node-3 slots=36
hpc-node-4 slots=36
The tab separator is written as $'\t' in bash.

Python run experiment

at /nfs/cluster/ramBle_hpcg

python common/scripts/ramble_experiments.py \ -p 16 -r 1 -a gs -d /nfs/scratch/C1_discretized.tsv -s '\t' -v \ --results result\c1.csv
mpirun -np 144 --display-map --hostfile hostfiles -x MXM_RDMA_PORTS=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 ./ramble -f /nfs/scratch/C1_discretized.tsv -m 29150 -n 5164 -s $'\t' -v -i -d -o test/c1.dot
mpirun -np 1 \ --display-map \ --hostfile hostfiles \ -x MXM_RDMA_PORTS=mlx5_0:1 \ -mca btl_openib_if_include mlx5_0:1 \ -x UCX_NET_DEVICES=mlx5_0:1 \ ./ramble -f /nfs/scratch/C1_discretized.tsv -s $'\t' \ -n 29150 -m 5164 \ -c -v -i -d -o test/c1_2.dot >> result/hp_1
mpirun -np 144 \ --hostfile hostfiles \ -x MXM_RDMA_PORTS=mlx5_0:1 \ -mca btl_openib_if_include mlx5_0:1 \ -x UCX_NET_DEVICES=mlx5_0:1 \ ./ramble -f test/coronary.csv -s ',' -n 6 -m 1841 -d -o test/coronary.dot

Auto Run script

murez/SC21_SCript/ramble

Gdrive

gdrive download 1UdrvrUPBQRjQafeOn5gHENz9wCrOrX-F # ramBLe_hpcx.tar.gz
gdrive download 1QmW1RF6mvnepQ3hawMNK46MoDRNq8YGx # boost_1_70_0_compiled.tar.gz

Install

Lib

Boost

Just add the following code to SConstruct to tell scons where the Boost library is.

libPaths.append("/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/boost-1.70.0-m4ttgcfqixwe22z5kz7bpp7mbqdspdbg/lib")
cppPaths.append("/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/boost-1.70.0-m4ttgcfqixwe22z5kz7bpp7mbqdspdbg/include")

Cardioid

Repo: https://github.com/LLNL/cardioid

Compilation

If you use mfem, you may need to specify its path manually.
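A hedged sketch of one generic way to do that with CMake (the mfem prefix below is hypothetical, and CMAKE_PREFIX_PATH is just the standard find_package hint, not necessarily what Cardioid's build expects):

```bash
# /opt/mfem-4.3 is a hypothetical install prefix for a manually built mfem
cmake -S cardioid -B build \
  -DCMAKE_PREFIX_PATH=/opt/mfem-4.3 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```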

Automatic compilation

Because the upstream cardioid package hangs during compilation, the package file needs to be patched manually.

First run spack edit cardioid; spack will open a text editor. Then add the following at the beginning of the class Cardioid(CMakePackage):

patch('https://gist.githubusercontent.com/KiruyaMomochi/cc4dfde7da51c3b11e45ab1079662693/raw/cardioid-cmake.patch', sha256='27e2b01a2a181d7364cf786f9da31193407b1aa9c20d0175965a3c772cc7378b')

Then continue the build with spack -d install -v cardioid.

Manual compilation with Spack

Using the fish shell as an example (a bash equivalent is sketched after the commands).

source /opt/spack/share/spack/setup-env.fish
spack stage cardioid+cuda
spack cd cardioid+cuda
spack build-env cardioid+cuda fish
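For reference, a sketch of the same steps from bash (assuming the same /opt/spack install):

```bash
source /opt/spack/share/spack/setup-env.sh
spack stage cardioid+cuda
spack cd cardioid+cuda
spack build-env cardioid+cuda bash
```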

Fully manual compilation

TODO

Troubleshooting

Seg Fault with jemalloc

Happens when -nd >= 4

SIGTERM after finishing the job with -np >= 60

Some issue in the openmpi@4.1.1/jip package.

Use the Intel MPI

spack load intel-oneapi-compilers@2021.1.2

export F90=ifort
export F77=ifort
export FC=ifort
export CC=icc

export LD_LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mkl/2021.2.0/lib/intel64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/lib:$LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/lib/release_mt:$LD_LIBRARY_PATH
export LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/lib/release_mt:$LIBRARY_PATH
export PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/bin:$PATH
export MPI_LIBS=-L/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/libc

./configure --enable-parallel --with-scalapack=yes --enable-openmp CFLAGS="-march=core-avx2 -fma -ftz -fomit-frame-pointer -g" FFLAGS="-O3 -march=core-avx2 -align array64byte -fma -ftz -fomit-frame-pointer -g" SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -mkl=parallel -lifcore" IFLAGS="-I/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mkl/2021.2.0/include -I/home/qe/q-e/include -I/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/include"

This version is slower than before. Observations about the test case AUSURF112: it cannot utilize hyperthreading efficiently; 128 processes bound to cores with OMP_NUM_THREADS=1 are faster than any combination of process count and OMP_NUM_THREADS that uses all the hyperthreads.

About fftw

When we pass FFT_LIBS to the configure script of quantum-espresso 6.6, the FFT-related macros are not defined. If FFTW_INCLUDE is defined, __FFTW is defined. Changing to amdfftw does not influence the running time.

export FFT_LIBS=-L/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib
export FFTW_INCLUDE=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/include

Debugging a CMake project

To debug a CMake project:

  1. Find the cmake command in spack's log
  2. Create a new directory to hold the build files
  3. cd to that directory and run the cmake command
  4. Append --trace-source=[path to CMakeLists.txt] to the cmake command

Use the message command to print intermediate results. For more information, see the CMake documentation.
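A sketch of what that looks like in practice (the paths are hypothetical; adapt them to the spack stage directory of the package you are debugging):

```bash
# re-run the cmake command captured from spack's log in a scratch build directory,
# tracing only the CMakeLists.txt we care about
mkdir -p /tmp/cardioid-build && cd /tmp/cardioid-build
cmake /path/to/spack-stage/cardioid \
  --trace-source=/path/to/spack-stage/cardioid/elec/CMakeLists.txt \
  2>&1 | tee cmake-trace.log
```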

Spack install takes too long

Use spack -d install -v [package name] to print debug logs.

If the problem occurs during the cmake stage, it may be caused by the function interface_link_libraries. This function recursively collects the include paths of each subproject, so the same dependency gets included multiple times. Because the include paths in a Spack environment are very long, this generates an extremely long (exponentially growing) include path and cmake hangs.

Solution 1

Add a check during the recursive generation:

foreach(lib ${libs})
  list(FIND searched ${lib} lib_has_been_searched)
  #message(SEND_ERROR "+++ ${lib} ${lib_has_been_searched}")
  if (lib_has_been_searched EQUAL -1)
    get_recursive_list(recursive_val ${lib} ${prop} ${searched})
    foreach(val ${retval})
      if(NOT recursive_val)
        list(APPEND val ${recursive_val})
      else()
        if (val IN_LIST recursive_val)
          #message("Duplicate val!")
        else()
          list(APPEND val ${recursive_val})
        endif()
      endif()
    endforeach()
  endif()
endforeach()

Solution 2

Apply the following patch, which removes duplicate entries after the recursion.

diff --git a/elec/CMakeLists.txt b/elec/CMakeLists.txt
index 4a526cb..ca92d2d 100644
--- a/elec/CMakeLists.txt
+++ b/elec/CMakeLists.txt
@@ -271,7 +271,7 @@ function(get_recursive_list retvar target prop)
   list(APPEND searched ${target})
   #message(SEND_ERROR "=== ${target} ${prop} ${searched}")
-  set(${retval} "")
+  set(retval "")
   get_property(propval TARGET ${target} PROPERTY ${prop} SET)
   if (propval)
     get_target_property(propval ${target} ${prop})
@@ -288,6 +288,10 @@
     endif()
   endforeach()
+  if(NOT retval)
+    list(REMOVE_DUPLICATES retval)
+  endif()
+
   set(${retvar} ${retval} PARENT_SCOPE)
   #message(SEND_ERROR "--- ${target} ${prop} ${retval}")
 endfunction()

Cannot reach the international Internet

Set http proxy to 192.168.100.5:1082, or use

proxychains -q [command]

Config proxy for git

Set proxy

git config --global http.proxy http://192.168.100.5:1082

Unset proxy

git config --global --unset http.proxy

Rules

Mystery App

Install

# Load gcc and openmpi:
module load gcc-9.2.0
module load mpi
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Accept license agreements and select the
# right install location, probably not your
# home directory since disk quota is limited.
#
# Activate conda, if you skipped it's auto initialization
source /path/to/your/conda/install/bin/activate
# Turn on the conda base environment
conda activate
# Install pytorch
conda install pytorch cudatoolkit=11.3 -c pytorch
# Install build dependencies for larcv3:
conda install cmake hdf5 scikit-build
# Install Tensorflow:
pip install tensorflow
# NOTE: if you don't install tensorflow, you need to pip install numpy!
# Clone larcv and install it:
git clone https://github.com/DeepLearnPhysics/larcv3.git
cd larcv3
git submodule update --init
python setup.py build -j 64
python setup.py install
# Install mpi4py:
pip install --force-reinstall mpi4py --no-cache-dir
# Install horovod with tensorflow or if you want it with pytorch:
pip install --force-reinstall horovod --no-cache-dir

Parameters

  • mode.optimizer.gradient_accumulation <= 1
  • mode.optimizer.learning_rate = 123.456
  • mode.optimizer.name = "rmsprop" / "adam"
  • mode.weights_location -> load checkpoint
  • mode.no_summary_images
  • run.compute_mode = DPCPP #? Data Parallel C++, Intel MKL-optimized CPU
  • gradient_accumulation.....: 1
  • conf['mode']['optimizer']['learning_rate'] = 10.**random.uniform(-3.5, -2.5)
  • conf['mode']['optimizer']['loss_balance_scheme'] = random.choice(["none", "light", "focal"])
  • checkpoint_iteration........: 500

The key knobs to tune are learning_rate and loss_balance_scheme.

SCC_21.yml

defaults:
  - _self_
  - network: SCC_21
  - framework: torch
  - mode: train
  - data: real

data:
  downsample: 0

run:
  distributed: true
  iterations: 500
  compute_mode: GPU
  aux_minibatch_size: ${run.minibatch_size}
  aux_iterations: 10
  id: ???
  precision: float32
  profile: false
  output_dir: output/${framework.name}/${network.name}/${run.id}/
  minibatch_size: 2

mode:
  optimizer: adam
  loss_balance_scheme: light

iotest

$$\frac{\text{running number}}{\text{iteration}} = \frac{\text{minibatch}}{\text{rank}}$$

$$\text{throughput} = \frac{\text{total running number}}{\text{running time}} = \frac{(\text{running number}/\text{iteration}) \times \text{iteration}}{\text{iteration} \times \text{average running time}} = \frac{\text{minibatch}/\text{rank}}{\text{average running time}}$$

$$\text{average running time} = \frac{\text{minibatch}}{\text{rank}} \times (\text{reading time} + \text{compute time})$$

SC 20

Rewind: https://victoryang00.cn/wordpress/2020/11/12/vscc20-%e6%80%bb%e7%bb%93/

Benchmark

This section holds benchmark-related material from HPC competitions; the documentation part is used to train new members.

.dat Specs

ASC20

HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
384 256 256
60

how to run

export PATH=/opt/nonspack/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu18.04-x86_64/ompi/bin:/opt/nonspack/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu18.04-x86_64/ompi/tests/osu-micro-benchmarks-5.6.2/:$PATH
export LD_LIBRARY_PATH=/opt/nonspack/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu18.04-x86_64/ompi/lib:$LD_LIBRARY_PATH
source /etc/profile.d/modules.sh
module use /opt/nonspack/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu18.04-x86_64/modulefiles
module load hpcx
mpirun --allow-run-as-root --hostfile host2_gpu4 --mca pml_base_verbose 100 --mca btl_base_verbose 100 --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 --mca orte_base_help_aggregate=0 -x xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80

SC21

HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
256 256 512
1800

how to run

see binder.sh

HPL .dat config file

ASC18

The following is the HPL .dat configuration file template from ASC18.

HPLinpack benchmark input file Innovative Computing Laboratory, University of Tennessee HPL.out output file name (if any) 6 device out (6=stdout,7=stderr,file) 1 # of problems sizes (N) 67200 65280 62976 65280 96000 65280 38400 96000 102400 168960 153600 76800 142848 153600 142848 124416 96256 142848 124416 115200 110592 96256 Ns 1 # of NBs 384 768 384 768 1024 768 896 768 1024 512 384 640 768 896 960 1024 1152 1280 384 640 960 768 640 256 960 512 768 1152 NBs 0 PMAP process mapping (0=Row-,1=Column-major) 1 # of process grids (P x Q) 2 1 2 1 Ps 1 2 2 4 Qs 16.0 threshold 1 # of panel fact 0 1 2 PFACTs (0=left, 1=Crout, 2=Right) 1 # of recursive stopping criterium 2 8 NBMINs (>= 1) 1 # of panels in recursion 2 NDIVs 1 # of recursive panel fact. 0 1 2 RFACTs (0=left, 1=Crout, 2=Right) 1 # of broadcast 2 0 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) 1 # of lookahead depth 0 DEPTHs (>=0) 1 SWAP (0=bin-exch,1=long,2=mix) 192 swapping threshold 1 L1 in (0=transposed,1=no-transposed) form 0 U in (0=transposed,1=no-transposed) form 1 Equilibration (0=no,1=yes) 8 memory alignment in double (> 0)

ASC20

The following is the HPL .dat configuration file template from ASC20. Machine Spec : 8 Tesla V100

HPLinpack benchmark input file Innovative Computing Laboratory, University of Tennessee HPL.out output file name (if any) 6 device out (6=stdout,7=stderr,file) 2 # of problems sizes (N) 175104 178176 165888 168960 172032 175104 Ns 2 # of NBs 384 256 128 256 384 192 288 320 384 384 768 1024 768 896 768 1024 512 384 640 768 896 960 1024 1152 1280 384 640 960 768 640 256 960 512 768 1152 NBs 0 PMAP process mapping (0=Row-,1=Column-major) 1 # of process grids (P x Q) 4 2 8 1 2 1 Ps 4 8 2 2 4 Qs 16.0 threshold 1 # of panel fact 0 1 2 PFACTs (0=left, 1=Crout, 2=Right) 1 # of recursive stopping criterium 2 8 NBMINs (>= 1) 1 # of panels in recursion 2 NDIVs 1 # of recursive panel fact. 0 1 2 RFACTs (0=left, 1=Crout, 2=Right) 1 # of broadcast 2 0 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) 1 # of lookahead depth 0 DEPTHs (>=0) 1 SWAP (0=bin-exch,1=long,2=mix) 192 swapping threshold 1 L1 in (0=transposed,1=no-transposed) form 0 U in (0=transposed,1=no-transposed) form 1 Equilibration (0=no,1=yes) 8 memory alignment in double (> 0)

SC21

The following is the HPL .dat configuration file template from SC20. Machine Spec : 8 Tesla A100

HPLinpack benchmark input file Innovative Computing Laboratory, University of Tennessee HPL.out output file name (if any) 6 device out (6=stdout,7=stderr,file) 2 # of problems sizes (N) 346122 348122 352122 Ns 2 # of NBs 384 256 128 NBs 0 PMAP process mapping (0=Row-,1=Column-major) 1 # of process grids (P x Q) 4 2 8 1 2 1 Ps 4 8 2 2 4 Qs 16.0 threshold 1 # of panel fact 0 1 2 PFACTs (0=left, 1=Crout, 2=Right) 1 # of recursive stopping criterium 2 8 NBMINs (>= 1) 1 # of panels in recursion 2 NDIVs 1 # of recursive panel fact. 0 1 2 RFACTs (0=left, 1=Crout, 2=Right) 1 # of broadcast 2 0 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) 1 # of lookahead depth 0 DEPTHs (>=0) 1 SWAP (0=bin-exch,1=long,2=mix) 192 swapping threshold 1 L1 in (0=transposed,1=no-transposed) form 0 U in (0=transposed,1=no-transposed) form 1 Equilibration (0=no,1=yes) 8 memory alignment in double (> 0)

Binder

#!/bin/bash cd $1 # Global settings export UCX_RNDV_SCHEME=put_zcopy export UCX_IB_PCI_RELAXED_ORDERING=on export UCX_MEMTYPE_CACHE=n export UCX_MAX_RNDV_RAILS=1 export UCX_RNDV_THRESH=8192 APP="$2" me=`hostname` lrank=$OMPI_COMM_WORLD_LOCAL_RANK case ${lrank} in 0) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 1) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 2) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 3) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 4) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 5) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 6) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 7) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 8) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 9) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 10) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 
$APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 11) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 12) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 13) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 14) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 15) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 16) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 17) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 18) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 19) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 20) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 21) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 22) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" 
#set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 23) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 24) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 25) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 26) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 27) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 28) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 29) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 30) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 31) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 32) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 33) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl 
--cpunodebind=0 taskset -c 0-23 $APP ;; 34) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 35) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 36) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 37) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 38) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 39) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 40) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 41) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 42) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 43) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 44) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 45) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo 
"export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 46) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 47) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 48) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 49) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 50) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 51) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 52) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 53) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 54) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 55) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 56) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 57) echo 
"host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 58) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 59) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 60) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 61) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 62) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 63) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; esac

DevOps

This section holds material related to maintaining the HPC environment.

BeeGFS

BeeGFS is a hardware-independent POSIX parallel file system developed with a strong focus on performance and designed for ease of use, simple installation, and management.

Please have a look at BeeGFS Architecture overview before continuing.

System Architecture Overview: Parallelism and Scale-Out

ℹ️ Note: For linux kernels 5.x

Currently, the BeeGFS kernel module is not compatible with the Linux kernel 5.x. We need to patch it manually.

Some work has been done by Build kernel module against kernel version 5.8.x and tobydarling/beegfs-7.1.4-kernel-5.6.4.

Installation

Please follow the Quick Start Guide to install.

Here we will only give you additional notes, assuming the operating system is Debian 10.

Step 1: Package Download and Installation

  1. Find the latest version from the BeeGFS Package Repository.
  2. Find the link to the repository file; it should be something like:
    https://www.beegfs.io/release/beegfs_7.2.4/dists/beegfs-deb10.list
    where 7.2.4 is the version number, deb10 is the distribution name & version.
  3. Download and save the file to /etc/apt/sources.list.d/beegfs.list:
    curl -Lo /etc/apt/sources.list.d/beegfs.list <the download link>
  4. Update the package list:
    apt-get update
  5. Install the package from the repository. To avoid errors, you should only install the package you need. For example, you don't need to install beegfs-mgmtd if this machine is only a BeeGFS client.
    # only install the package you need!
    # management service
    apt-get install beegfs-mgmtd
    # metadata service; libbeegfs-ib is only required for RDMA
    apt install beegfs-meta libbeegfs-ib
    # storage service; libbeegfs-ib is only required for RDMA
    apt install beegfs-storage libbeegfs-ib
    # client and command-line utils
    apt install beegfs-client beegfs-helperd beegfs-utils
  6. For your convenience, consider appending the BeeGFS binary path, /opt/beegfs/sbin/, to PATH, for example as shown below.
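For example (assuming a bash login shell):

```bash
# make the BeeGFS admin tools available without typing the full path
echo 'export PATH=$PATH:/opt/beegfs/sbin' >> ~/.bashrc
source ~/.bashrc
```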

Step 2: Client Kernel Module Autobuild

Since we are using RDMA and installed InfiniBand kernel modules from Mellanox OFED, we should use buildArgs like this:

# /etc/beegfs/beegfs-client-autobuild.conf
buildArgs=-j8 BEEGFS_OPENTK_IBVERBS=1 OFED_INCLUDE_PATH=/usr/src/ofa_kernel/default/include

Step 3: Basic Configuration

Please read the official guide carefully first, or you will waste a lot of time.

Assuming we use such configuration:

  • epyc.node1: management + metadata + storage + client
  • epyc.node2: storage + client

We also assume you have appended /opt/beegfs/sbin/ to PATH. Otherwise, you should prepend this path to the commands used below.

Then on node1, the commands are:

# node1
# setup management service
beegfs-setup-mgmtd -p /geekpie/beegfs_mgmtd
# setup metadata service
beegfs-setup-meta -p /geekpie/beegfs_meta -m epyc.node1
# setup storage service
beegfs-setup-storage -p /geekpie/hpc/ -i 101 -m epyc.node1
# setup client
beegfs-setup-client -m epyc.node1

On node2, the commands are:

# node2
# setup storage service
beegfs-setup-storage -p /geekpie/hpc/ -i 201 -m epyc.node2
# setup client
beegfs-setup-client -m epyc.node2

If you have run the setup more than once, please manually check the configuration files since there may be errors.

Step 4: Service Setup

With the same assumption as above, we can start the services on node1 and node2:

# node1: start services
systemctl start beegfs-mgmtd beegfs-meta beegfs-storage beegfs-helperd beegfs-client
# node2: start services
systemctl start beegfs-storage beegfs-helperd beegfs-client

Step 5: Check Connectivity

We can check the connectivity using these commands:

beegfs-ctl --listnodes --nodetype=meta --nicdetails
beegfs-ctl --listnodes --nodetype=storage --nicdetails
beegfs-ctl --listnodes --nodetype=client --nicdetails
beegfs-net            # Displays connections the client is actually using
beegfs-check-servers  # Displays possible connectivity of the services
beegfs-df             # Displays free space and inodes of storage and metadata targets

Check configuration

You can check the configuration by inspecting the config files, which are located in /etc/beegfs/.

Please note that if you have set up BeeGFS twice, you may need to manually fix some configuration files, such as beegfs-storage.conf.

Grafana

Data source: telegraf

PBS

PBS stands for Portable Batch System; it can be used to control jobs on multiple machines.

Common usage is shown below, where qcmd stands for any PBS command:

# Name for the job
qcmd -N ramBLe_128
# Name of destination queue
qcmd -q GeekPie_CPU
# Required resources
qcmd -l nodes=4:ppn=32:amd
qcmd -l walltime=00:10:00
# Redirect stdout/stderr
qcmd -o /public/home/geekpie2/ramble-amd/ramBLe/submit/pbs-com-single-${PBS_JOBID}.out
qcmd -e /public/home/geekpie2/ramble-amd/ramBLe/submit/pbs-com-single-${PBS_JOBID}.err
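Put together, a hedged one-shot submission might look like this (run_ramble.sh and the output paths are placeholders):

```bash
qsub -N ramBLe_128 -q GeekPie_CPU \
     -l nodes=4:ppn=32:amd -l walltime=00:10:00 \
     -o /public/home/geekpie2/ramble.out \
     -e /public/home/geekpie2/ramble.err \
     run_ramble.sh
```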

Common commands

  • qsub: submit a job or start an interactive shell
  • qstat: check job status
    • Use the -f flag to show detailed information
    • Use the -Q flag followed by a queue name to check a queue's status
    • For example: qstat -Qf GeekPie-CPU
  • qdel: delete a job

Common parameters

| Parameter | Description |
| --- | --- |
| -q queue, server, or queue@server | Sets where the job is executed |
| -N job name | Sets the job name |
| -l resource list, comma-separated | Sets the required resources; may be specified multiple times |
| -o output file | stdout is redirected to this file; an absolute path is recommended |
| -e error file | stderr is redirected to this file; an absolute path is recommended |

References

Slurm

This supercomputer uses Slurm; for the detailed configuration, see the post 配合某戏精使用的 slurm 踩坑日记 (a diary of Slurm pitfalls).
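Since this page does not record the exact Slurm configuration, here is only a generic, hedged submission sketch (partition and script names are placeholders):

```bash
sbatch --job-name=ramBLe_128 --partition=GeekPie_CPU \
       --nodes=4 --ntasks-per-node=32 --time=00:10:00 \
       --output=ramble.%j.out --error=ramble.%j.err \
       run_ramble.sh   # placeholder script
```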

Singularity

A Berkeley-produced tool for running Docker-style containers in user space.
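A hedged usage sketch with standard Singularity commands (the image is just an example):

```bash
# pull a Docker image and run a command in it, entirely as an unprivileged user
singularity pull docker://ubuntu:22.04            # produces ubuntu_22.04.sif
singularity exec ubuntu_22.04.sif cat /etc/os-release
```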

Kanidm

Kanidm is an identity management server. We use it to manage users across multiple nodes.

We have two groups: a posix group geekpie-hpc for everyone, and an admin group geekpie_admins.

geekpie_admins is used for managing accounts; it is a subgroup of:

  • idm_people_manage_priv: create new person
  • idm_group_write_priv: add person into a group
  • idm_account_unix_extend_priv: enable posix for a person
  • idm_account_write_priv: add ssh key to person

To begin with, export environment variable KANIDM_URL, and login with your geekpie_admins user.

export KANIDM_URL="https://hpc-idm.geekpie.icu:8443"
kanidm login --name geekpie

Create a user

To create a user called John Smith, and add it to geekpie-hpc group:

kanidm person create jsmith "John Smith"
kanidm person update jsmith --mail "jsmith@shanghaitech.edu.cn" # --legalname
kanidm group add-members geekpie-hpc jsmith

Then enable posix, set ssh key and password.

# In kanidm uid is the same as gid. I recommend you to manually allocate a gid.
# Please see https://github.com/geekpiehpc/AnsiblePlaybook/blob/main/group_vars/epyc.yml for old uids.
kanidm person posix set jsmith --gidnumber 2345 # --shell /usr/bin/bash
kanidm person ssh add-publickey jsmith id_rsa (cat ~/.ssh/id_rsa.pub)
# Not needed if the user does not need sudo
kanidm person posix set-password jsmith

Install

curl -L -o kanidm.deb https://github.com/kanidm/kanidm/releases/download/latest/kanidm_Ubuntu_22.04_1.1.0-beta.13-2023051108041ddac86_x86_64.deb
curl -L -o kanidm_unixd.deb https://github.com/kanidm/kanidm/releases/download/latest/kanidm-unixd_Ubuntu_22.04_1.1.0-beta.13-2023051108091ddac86_x86_64.deb
sudo dpkg -i kanidm.deb kanidm_unixd.deb

/etc/kanidm/unixd:

pam_allowed_login_groups = ["geekpie-hpc"]
default_shell = "/usr/bin/bash"
home_alias = "name"
use_etc_skel = true
uid_attr_map = "name"
gid_attr_map = "name"

/etc/kanidm/config:

uri = "https://hpc-idm.geekpie.icu:8443"
verify_ca = true
verify_hostnames = true

Edit /usr/share/pam-configs/kanidm-unixd. Change the priority to 0, otherwise you will be asked for the sudo password twice!

Restart services

sudo systemctl restart kanidm-unixd
sudo systemctl restart kanidm-unixd-tasks.service

Setup PAM and nsswitch

PAM

# THIS IS A DIRTY HACK AND IS ACTUALLY AN UPSTREAM PACKAGING PROBLEM
sudo mv /etc/pam.d/kanidm-unixd /usr/share/pam-configs/
sudo pam-auth-update # check kanidm

For nsswitch, edit /etc/nsswitch.conf:

passwd: files systemd kanidm
group: files systemd [SUCCESS=merge] kanidm

Then add a sudoers file:

echo '%geekpie-hpc ALL=(ALL:ALL) ALL' | sudo EDITOR='tee -a' visudo /etc/sudoers.d/geekpie

Add ssh config by creating /etc/ssh/sshd_config.d/60-kanidm.conf:

AuthorizedKeysCommand /usr/bin/env kanidm_ssh_authorizedkeys %u
AuthorizedKeysCommandUser nobody

Restart sshd service

sudo systemctl restart sshd.service

The Oracle cluster uses Ansible to manage machine power-on.

We forked the Ansible playbooks mentioned above, modified them ourselves, and put them on GitHub.

The organizers had already deployed a telegraf instance on the Oracle machines.

We deployed a Grafana machine on another port with a separate binary to show real-time machine information, but later found that its latency made it useful only as a historical record. The operators also kept losing SSDs, and Ceph over HDD without replicas is really unreliable, so we dropped it.

The dashboard looked roughly like this:

Machine details used at ISC

ISC21

NSCC (National Supercomputing Centre Singapore), Niagara

ISC22

Niagara Thor Bridges

NSCC

Used in ISC21

The login node is very old: an E5-2690 with CentOS 6. The file system is an old Lustre without flock(), so you have to disable Spack's locking. The scheduling system is OpenPBS.
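One way to disable that check is Spack's own locks setting, which is safe to turn off on filesystems without flock() (a sketch, standard Spack configuration):

```bash
# tell Spack not to use lock files on the old Lustre without flock()
spack config add 'config:locks:false'
# equivalently, set "locks: false" under config: in ~/.spack/config.yaml
```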

$ cat outputfile.o Checking The CPU and Network lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 24 On-line CPU(s) list: 0-23 Thread(s) per core: 1 Core(s) per socket: 12 Socket(s): 2 NUMA node(s): 4 Vendor ID: GenuineIntel CPU family: 6 Model: 63 Model name: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz Stepping: 2 CPU MHz: 1200.000 BogoMIPS: 5187.61 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 15360K NUMA node0 CPU(s): 0-5 NUMA node1 CPU(s): 6-11 NUMA node2 CPU(s): 12-17 NUMA node3 CPU(s): 18-23 lspci | grep Mel 81:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4] ====================================================================================== Resource Usage on 2020-04-25 10:08:30.888043: JobId: 9954616.wlm01 Project: 21120227 Exit Status: 0 NCPUs Requested: 1 NCPUs Used: 1 CPU Time Used: 00:00:00 Memory Requested: 100mb Memory Used: 0kb Vmem Used: 0kb Walltime requested: 00:10:00 Walltime Used: 00:00:00 Execution Nodes Used: (std1708:ncpus=1:mem=102400kb) ======================================================================================

The DGX nodes are good because of the hack that raises the TDP of their V100-16GB GPUs.

https://help.nscc.sg/wp-content/uploads/AI_System_QuickStart.pdf

Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 80 On-line CPU(s) list: 0-79 Thread(s) per core: 2 Core(s) per socket: 20 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 79 Model name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz Stepping: 1 CPU MHz: 2794.907 CPU max MHz: 3600.0000 CPU min MHz: 1200.0000 BogoMIPS: 4390.10 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 51200K NUMA node0 CPU(s): 0-19,40-59 NUMA node1 CPU(s): 20-39,60-79 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr s se sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4 _2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp _l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hl e avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mb m_local dtherm ida arat pln pts md_clear flush_l1d total used free shared buff/cache available Mem: 503 58 340 0 105 442 Swap: 0 0 0 OFED-internal-4.4-2.0.7: Ubuntu 18.04.2 LTS \n \l Linux dgx4105 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux Filesystem Size Used Avail Use% Mounted on udev 252G 0 252G 0% /dev tmpfs 51G 3.2M 51G 1% /run /dev/sda2 440G 395G 22G 95% / tmpfs 252G 12K 252G 1% /dev/shm tmpfs 5.0M 0 5.0M 0% /run/lock tmpfs 252G 0 252G 0% /sys/fs/cgroup /dev/sda1 487M 6.1M 481M 2% /boot/efi /dev/sdb1 7.0T 4.9T 1.8T 74% /raid 192.168.160.101:/home 3.4P 2.1P 1.4P 61% /home [davidcho@nscc03 ~]$ cat !$ cat dgx4105.txt Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 80 On-line CPU(s) list: 0-79 Thread(s) per core: 2 Core(s) per socket: 20 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 79 Model name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz Stepping: 1 CPU MHz: 2794.907 CPU max MHz: 3600.0000 CPU min MHz: 1200.0000 BogoMIPS: 4390.10 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 51200K NUMA node0 CPU(s): 0-19,40-59 NUMA node1 CPU(s): 20-39,60-79 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d total used free shared buff/cache available Mem: 503 58 340 0 105 442 Swap: 0 0 0 OFED-internal-4.4-2.0.7: Ubuntu 18.04.2 LTS \n \l Linux dgx4105 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux Filesystem Size Used Avail Use% Mounted on udev 252G 0 252G 0% /dev tmpfs 51G 3.2M 51G 1% /run /dev/sda2 440G 
395G 22G 95% / tmpfs 252G 12K 252G 1% /dev/shm tmpfs 5.0M 0 5.0M 0% /run/lock tmpfs 252G 0 252G 0% /sys/fs/cgroup /dev/sda1 487M 6.1M 481M 2% /boot/efi /dev/sdb1 7.0T 4.9T 1.8T 74% /raid 192.168.160.101:/home 3.4P 2.1P 1.4P 61% /home 192.168.156.29@o2ib,192.168.156.30@o2ib:/scratch 2.8P 1.8P 993T 65% /scratch tmpfs 51G 0 51G 0% /run/user/0 Sat May 23 06:15:29 2020 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |===============================+======================+======================| | 0 Tesla V100-SXM2... On | 00000000:0B:00.0 Off | 0 | | N/A 35C P0 43W / 300W | 0MiB / 16130MiB | 0% Default | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: GPU Memory | | GPU PID Type Process name Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+ 06:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 07:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 85:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 86:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 89:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 8a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4] 0c:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4] 84:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4] 8b:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4] hca_id: mlx5_1 transport: InfiniBand (0) fw_ver: 12.23.1020 node_guid: ec0d:9a03:00a4:bbde sys_image_guid: ec0d:9a03:00a4:bbde vendor_id: 0x02c9 vendor_part_id: 4115 hw_ver: 0x0 board_id: MT_2180110032 phys_port_cnt: 1 Device ports: port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 251 port_lid: 1417 port_lmc: 0x00 link_layer: InfiniBand hca_id: mlx5_3 transport: InfiniBand (0) fw_ver: 12.23.1020 node_guid: ec0d:9a03:00aa:2960 sys_image_guid: ec0d:9a03:00aa:2960 vendor_id: 0x02c9 vendor_part_id: 4115 hw_ver: 0x0 board_id: MT_2180110032 phys_port_cnt: 1 Device ports: port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 251 port_lid: 1419 port_lmc: 0x00 link_layer: InfiniBand hca_id: mlx5_0 transport: InfiniBand (0) fw_ver: 12.23.1020 node_guid: ec0d:9a03:00aa:29b8 sys_image_guid: ec0d:9a03:00aa:29b8 vendor_id: 0x02c9 vendor_part_id: 4115 hw_ver: 0x0 board_id: MT_2180110032 phys_port_cnt: 1 Device ports: port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 251 port_lid: 1416 port_lmc: 0x00 link_layer: InfiniBand hca_id: mlx5_2 transport: InfiniBand (0) fw_ver: 12.23.1020 node_guid: ec0d:9a03:00aa:2988 sys_image_guid: ec0d:9a03:00aa:2988 vendor_id: 
0x02c9 vendor_part_id: 4115 hw_ver: 0x0 board_id: MT_2180110032 phys_port_cnt: 1 Device ports: port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 251 port_lid: 1422 port_lmc: 0x00 link_layer: InfiniBand

If NSCC is used again in a competition, contact your NTU or NUS students; they have four dedicated login nodes for reaching the DGX nodes.

Bridges

Thor

Reference

  1. Performance Characteristics of the BlueField-2 SmartNIC
  2. https://developer.nvidia.com/blog/offloading-and-isolating-data-center-workloads-with-bluefield-dpu/
  3. https://docs.nvidia.com/networking/display/BlueFieldSWv35011563/Virtual+Switch+on+BlueField+DPU

Niagara

Used in ISC21

The login nodes are the same as the training nodes. Only the Cascade Lake and Ice Lake parts are open; since the HPC/HPCG/HPCC benchmarks require both CPU and GPU, please pin tasks to those nodes, normally the ones after gia1000.

$ ssh -Y lclmaoroph@niagara.scinet.utoronto.ca Warning: Permanently added 'niagara.scinet.utoronto.ca' (RSA) to the list of known hosts. Password: =============================================================================== SciNet welcomes you to the NIAGARA supercomputer. This is a Niagara login node. Use this node to develop and compile code, to run short tests, and to submit computations to the scheduler. Remember that /scratch is never backed-up. Documentation: https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart Support: support@scinet.utoronto.ca or niagara@computecanada.ca =============================================================================== lclmaoroph@nia-login06:~$

The filesystem is GPFS, an IBM-initiated FS that does not have full POSIX support and provides only eventual consistency. But it is really fast for writes and can scale up to 50 PB.

SCRATCH Area is CVFS, a temporary fast cache for SCRATCH scripts.

The node start-up scripts have some quirks: do not put echo "bla" in .bashrc, write echo "bla" 1>&2 instead. PBS relies on bash to figure out who you are, so you might be able to hack the quota or raise root permissions. We attempted to use a middleware FS, mounting it from .bashrc on the allocated nodes, and eventually made it work.

Azure

This section holds information about operating the Azure cloud servers.

CycleCloud

Azure CycleCloud is a tool for deploying HPC clusters in Azure and managing their workloads.

For a detailed introduction to CycleCloud, see CycleCloud Introduction. For a step-by-step guide to using CycleCloud, see Create, customize and manage an HPC cluster in Azure with Azure CycleCloud.

Although I hate to say it... the best approach is still to read the official documentation first, to understand what it can do and how its templates are written. There are quite a few new concepts, so learn while you operate. Note, however, that the documentation is incomplete and parts of it are outdated. If you want to confirm the latest CycleCloud behaviour, the most direct way is to start a machine with it, download everything under /opt/cycle, and read the code inside.

Introduction: ...So what is CycleCloud?

...To explain this clearly, let me start with a bit of history.

CycleCloud originally belonged to Cycle Computing, which was later acquired by Microsoft. Before the acquisition, CycleCloud could be used on many platforms, including Amazon Web Services, Google Compute Engine, and even in-house clusters.

Its job is to help you conveniently manage a pile of HPC resources. For example, suppose I want to start 15 machines on AWS as my HPC cluster; normally I might start them one by one, while a smarter person would write a script and request all the resources at once.

Even so, you probably still have to do some initialization on every machine, such as configuring the network, software, users, and so on. There are higher-level tools such as cloud-init that make this initialization easier, but configuring a pile of software with it is still painful. The more modern solution is Ansible: given a file listing all machines in the cluster, it automatically performs all kinds of odd initialization work for you (writing it feels a bit like GitHub Actions).

Another thing is that you need to monitor whether these machines are healthy, and remove some of them if they are not. You may also want to scale the number of machines dynamically (autoscaling) according to the workload. The cloud provider will not necessarily do this for you; autoscaling may remind you of k8s, but running HPC on k8s is probably still a bit deadly right now, isn't it?

Having said all that, CycleCloud is exactly this kind of tool: it automates the control of HPC resources in the cloud, so that with a few clicks you can build a usable, stable HPC cluster.

...😢 In practice it may not be that pleasant anymore, but at least that should be their vision...

Prerequisites: What do I need to know to learn it?

Template

The most important concept in CycleCloud is the template, a file that contains all the software and hardware requirements of a cluster. CycleCloud creates the cluster from this file, and the schedulers, filesystems, etc. that you see when you click the plus button are all templates. You can even find these templates in the Azure GitHub.

The template format looks like ini but is more advanced than ini. If you find it hard to get started, have a look at toml first.

Cluster-init? Project?

You will notice that the documentation mentions things called cluster-init or project. These two are synonyms. Be careful not to confuse them with cloud-init, which is unrelated to what we are discussing here. To avoid confusion, we will use the word project from now on.

Both cloud-init and project are used for initialization, but cloud-init is lower level and handled by Azure, while project is handled by CycleCloud. In other words, after cloud-init finishes, your machine has already been created; it is then initialized a second time by CycleCloud, and this second round concretely means running a series of scripts and Chef cookbooks.

While learning CycleCloud, you should also learn about Chef Infra: once you peel a project open, it is just Chef cookbooks, and Chef cookbooks are roughly the same as Ansible playbooks, both used to initialize machines. While learning Chef you will find that you are actually learning Ruby... there is no way around that. Ruby's syntax takes some getting used to, especially if you have never touched it before. Also pay special attention to the Chef version CycleCloud uses (you can find its binaries under /opt/cycle/), and avoid using features that its old version does not have.

Be careful: the Ruby version used by CycleCloud's Chef has problems with some SSL sites, see https://bugs.ruby-lang.org/issues/15594. If you need to download something, you may have to use http, or write your own Chef resource that calls Ruby's ::URI.open with ssl_verify_mode set to OpenSSL::SSL::VERIFY_NONE.

Cloud-Init

CycleCloud supports cloud-init, but only halfway.

If you try to use a MIME multi-part archive, it will fail. If you try to use a Jinja template, it will fail. If you change the cloud-init files while the cluster is halfway up, you will find that the changes do not seem to take effect...

Worse, when you load the CycleCloud dashboard or list all clusters with the CLI, it seems to return all the cloud-init files as part of the configuration. As a result, once you use many long cloud-init files, CycleCloud becomes much slower.

For these reasons, we recommend using the include-file format and keeping the real cloud-init files on another server.

For example, our cloud-init can be written like this:

#include-once
https://example.com/kyaru/base.yml
https://example.com/kyaru/sb/head.yml

The cloud-init files behind those URLs can be written however you like, in whatever format you like. Win twice!

CycleCloud CLI

Install the CLI from the About page of the CycleCloud dashboard.

Custom image

There are more images available than CycleCloud's built-in ones. To make the GPU work, we must use an image with Generation 2 support.

Azure HPC VM Images lists all currently available images. Choose the one you need, and use its URN to specify it.
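A hedged way to browse the available URNs with the standard Azure CLI (the publisher/offer below are the ones mentioned in the note that follows):

```bash
# list ubuntu-hpc image URNs from the microsoft-dsvm publisher
az vm image list --publisher microsoft-dsvm --offer ubuntu-hpc --all --output table
```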

Note Some images are incompatible with CycleCloud's built-in templates. For example, you can't use microsoft-dsvm:ubuntu-hpc:2004: with Slurm Template. We have a custom template to solve it.

Note: Install on Windows

The CycleCloud CLI on Windows depends on the cryptography package, so for the install script to finish we need to do the following before running it:

  1. Install OpenSSL choco install openssl -pre -y
  2. Append C:\Program Files\OpenSSL-Win64\lib; to environment variable LIB
  3. Append C:\Program Files\OpenSSL-Win64\include; to environment variable INCLUDE

SGX Explained

  • ifconfig ib0 192.168.*
  • Change ~/.bashrc for head-node / non-head-node conditional setup:
if [ -f /dev/ ]; then # head node: set up eth .xxx and ib .xxx fi
  • NFS across the ~100 machines
  • Slurm starts running jobs as the machines come up one by one; tasks are reused
  • Scripts to allocate, start and stop nodes, so machines can take turns sleeping and taking turns running Slurm
  • MIG needs two sets of commands to bring up / rmmod nvidia*
  • Prometheus + Grafana for all chassis

Performance

    echo 2 > /proc/sys/vm/overcommit_memory
    ulimit -a ulimited
    echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    [ -f "/shared/opt/home/q-e" ]
    sudo mount 10.0.0.8:/mnt/exports/shared/home /shared/home
    ...

GeekPie Machine

Machine overview

Currently, we are using the SuperMicro 4124GS-TNR server.

CPU: Epyc 7742

AMD claims that the theoretical floating-point performance can be calculated as: double-precision theoretical floating-point performance = #real_cores * 8 DP flops/clk * core frequency. For a 2-socket system: 2 * 64 cores * 8 DP flops/clk * 2.2 GHz = 2252.8 GFLOPS. This counts FMA as two flops.
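Written out as a formula (just restating the numbers above, with FMA counted as two flops):

$$
R_{\text{peak}} = N_{\text{sockets}} \times N_{\text{cores}} \times 8\,\tfrac{\text{DP flop}}{\text{cycle}} \times f_{\text{core}} = 2 \times 64 \times 8 \times 2.2\,\text{GHz} = 2252.8\ \text{GFLOPS}
$$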

GPU

RDMA

    a1:00.0 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    a1:00.1 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
  • Official documentation
  • IB card communication protocols: https://www.rdmamojo.com/2013/06/01/which-queue-pair-type-to-use/
  • OpenMPI usage: http://scc.ustc.edu.cn/zlsc/user_doc/html/mpi-application/mpi-application.html
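A quick way to sanity-check the ConnectX-5 cards listed above, assuming MLNX_OFED (or rdma-core) is already installed; ibdev2netdev is MLNX_OFED-specific:

```bash
ibstat          # port state, rate and LID of each HCA
ibv_devinfo     # verbs-level device information
ibdev2netdev    # map IB devices to their network interfaces (MLNX_OFED only)
```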

RAID

e6:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller [1b4b:9230] (rev 11) (prog-if 01 [AHCI 1.0])

Official brief: https://www.marvell.com/content/dam/marvell/en/public-collateral/storage/marvell-storage-88se92xx-product-brief-2012-04.pdf

To configure the RAID controller, the easiest way is to press Ctrl+M during booting.

If you want to boot a system on RAID, please use Legacy mode. If you switched to UEFI only, you can't find the controller even if you change it back later. To solve it, see Supermicro FAQ Entry

Firmware

It's possible to flash firmware, see Marvell 9230 Firmware Updates and such. Our current firmware is 1070 (bios oprom version). If you want to flash another firmware, you might need to make a FreeDOS bootable disk.

Note: Do backup before flashing!

Many links to firmware or utilities are broken. Station Drivers may still work. Also refer to the Marvell 92xx A1 Firmware Image Repository, which has a full collection of firmware images.

You can find Supermicro's firmware on the official site, but you can't download it there. Try downloading from http://members.iinet.net.au/~michaeldd/.

NVMe

Installed with https://www.asus.com/us/Motherboards-Components/Motherboards/Accessories/HYPER-M-2-X16-CARD-V2/.

    21:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
    22:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
    23:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
    24:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]

⚠️ The PCIE socket of the NVME card must be configured as 4x4x4x4 so as to be recognized by the system correctly.

The card may have problems. If you find it doesn't work correctly, ask in Slack.

RAID Controller

MegaRAID

LSI_SAS_EmbMRAID_SWUG.pdf 2006 LSI_SAS_EmbMRAID_SWUG.pdf

ASrock

faq1

faq2

Win-Raid

forum

Help-Problem-to-flash-the-Marvel-SE-card-resolve

Syba-SI-PEX-PCIe-Card-with-Marvell-SATA-Controller

firmware UEFI

firmware DOS

http://members.iinet.net.au/~michaeldd/CDR-A1-UP_1.01_for_Intel_A1_UP_platform.zip

Supermicro superserver bios change cause 960 nvme disappear

https://tinkertry.com/supermicro-superserver-bios-change-can-cause-960-pro-and-evo-to-hide-heres-the-fix

Background knowledge

Software

Currently we are using Ubuntu Server 20.04.3 LTS.

Boot: systemd-boot

We have replaced grub with systemd-boot. For introduction, see systemd-boot - ArchWiki (archlinux.org).

To configure systemd-boot, use bootctl. To change kernel parameters, modify /etc/kernel/postinst.d/zz-update-systemd-boot. GitHub backup: https://gist.github.com/KiruyaMomochi/9df313c2abc55c1736d457d48abc0f54
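A minimal sketch of inspecting the systemd-boot setup with bootctl (read-only commands, nothing here modifies the ESP):

```bash
bootctl status   # show the ESP location, installed loader version and default entry
bootctl list     # list all boot entries systemd-boot knows about
```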

Network: netplan

Since Systemd v197, network interfaces use predictable naming schemes. See systemd.net-naming-scheme (www.freedesktop.org) for detail.

Ubuntu uses netplan to configure the network. It reads network configuration from /etc/netplan/*.yaml, then converts it into a systemd-networkd configuration.

Netplan configuration examples: https://netplan.io/examples/.

Drivers

InfiniBand

  1. Download drivers from Linux InfiniBand Drivers (mellanox.com).
  2. tar -xzf MLNX_OFED_LINUX-5.4-3.1.0.0-ubuntu20.04-x86_64.tgz
  3. cd to the directory and
sudo ./mlnxofedinstall --add-kernel-support

Configure IPoIB

For RHEL/CentOS, see IP over InfiniBand (IPoIB) - MLNX_OFED v5.1-0.6.6.0 - Mellanox Docs.

For Ubuntu, create /etc/netplan/10-infiniband.yaml with:

    network:
      version: 2
      ethernets:
        ibp161s0f0:            # Name of the InfiniBand interface
          addresses:
            - 11.4.3.20/24     # Change to your IP address

You may need to change the interface name and IP address to your own.
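To apply the configuration above, a minimal sketch (netplan try rolls back automatically if connectivity is lost; the interface name follows the example above):

```bash
sudo netplan try           # apply with automatic rollback on lost connectivity
sudo netplan apply         # apply permanently
ip addr show ibp161s0f0    # verify the IPoIB address is configured
```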

Ansible

To manage two servers at the same time, it's easier to use Ansible.
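A minimal ad-hoc Ansible sketch, assuming the two nodes are reachable as node1 and node2 over SSH (the inventory file name and the apt task are only illustrative):

```bash
# Hypothetical inventory listing both servers
cat > hosts.ini <<'EOF'
[cluster]
node1
node2
EOF

ansible -i hosts.ini cluster -m ping                                     # check connectivity
ansible -i hosts.ini cluster -b -m apt -a "name=perftest state=present"  # example ad-hoc task on both nodes
```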

Network File System (NFS)

NFS is exported from node1. Only NFS v4 is supported:

    /srv/nfs4       *(rw,sync,fsid=0,crossmnt,no_subtree_check)
    /srv/nfs4/home  *(rw,sync,no_subtree_check)

/proc/fs/nfsd/versions: -2 -3 +4 +4.1 +4.2

It is mounted on all nodes at /mnt/nfs4 and /mnt/home:

    node1:/      /mnt/nfs4  nfs  rw,noauto,x-systemd.automount  0 0
    node1:/home  /mnt/home  nfs  rw                             0 0

You can use /mnt/home/<user> as your home directory:

    # On node 1
    sudo mkdir /srv/nfs4/home/<user>
    # On all nodes
    sudo usermod -d /mnt/home/<user> <user>

Other tools

systemd-nspawn

See systemd-nspawn

Tuning

Enable / Disable SMT (HyperThreading)

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading.

    # From https://docs.kernelcare.com/how-to/
    # Check the SMT state
    cat /sys/devices/system/cpu/smt/active
    # Enable SMT
    echo on > /sys/devices/system/cpu/smt/control
    # Disable SMT
    echo off > /sys/devices/system/cpu/smt/control

Tick-free CPU

When the kernel is booted with nohz_full=1-127 set, CPUs 1-127 run tickless. Refer to CPU Isolation - Nohz_full - by SUSE Labs (part 3) | SUSE Communities for more details.
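A quick way to confirm the tick-free setup took effect (these are standard sysfs/procfs files):

```bash
cat /proc/cmdline                       # confirm nohz_full=1-127 was passed to the kernel
cat /sys/devices/system/cpu/nohz_full   # CPUs currently running tickless
cat /sys/devices/system/cpu/isolated    # CPUs isolated via isolcpus, if any
```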

Also see:

A full list of kernel parameters is available at https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html.

Set kernel.yama.ptrace_scope to 0

For temporary applying, use the following command

sudo sysctl -w kernel.yama.ptrace_scope=0

For a permanent setting, edit /etc/sysctl.d/10-ptrace.conf.
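A minimal sketch of the permanent change (this overwrites the stock Ubuntu file, which normally sets the value to 1):

```bash
echo 'kernel.yama.ptrace_scope = 0' | sudo tee /etc/sysctl.d/10-ptrace.conf
sudo sysctl --system    # reload all sysctl configuration files
```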

For documentation, see The Linux kernel user’s and administrator’s guide » Linux Security Module Usage - Yama.

Kernel

We use a custom kernel with NOHZ support enabled.

Build Kernel on Debian/Ubuntu

To build kernel, refer to Chapter 4. Common kernel-related tasks (pages.debian.net).

Current kernel config is at /usr/src/linux-headers-$(uname -r)/.config.

Kernel/BuildYourOwnKernel - Ubuntu Wiki and BuildADebianKernelPackage - Debian Wiki are obsolete, do not use them.

If you don't want to use module signing:

    scripts/config --disable MODULE_SIG
    scripts/config --disable SYSTEM_TRUSTED_KEYS

Also consider disabling debug info:

scripts/config --disable DEBUG_INFO
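A rough sketch of the whole build on Debian/Ubuntu, assuming the kernel source tree and build dependencies are already in place (the -j level and config tweaks are up to you):

```bash
cp /usr/src/linux-headers-$(uname -r)/.config .config   # start from the running kernel's config
scripts/config --disable MODULE_SIG
scripts/config --disable SYSTEM_TRUSTED_KEYS
scripts/config --disable DEBUG_INFO
make olddefconfig                # accept defaults for any new options
make -j"$(nproc)" bindeb-pkg     # build installable .deb packages
```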

systemd-nspawn

systemd-nspawn is like the chroot command, but it is a chroot on steroids. See systemd-nspawn - ArchWiki (archlinux.org) and nspawn - Debian Wiki for introduction.

Bootstrap

We can bootstrap a Debian machine using debootstrap, but mkosi is also worth trying.

For example, bootstrap an openSUSE image:

    python3 -m pip install --user git+git://github.com/systemd/mkosi.git
    sudo .local/bin/mkosi -d opensuse -t directory -p systemd-container --checksum --password password -o /var/lib/machines/opensuse-test
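Once the image exists under /var/lib/machines, it can be booted as a container; a minimal sketch (the machine name matches the -o path above):

```bash
sudo systemd-nspawn -M opensuse-test -b    # boot it interactively
# or manage it through machinectl
sudo machinectl start opensuse-test
sudo machinectl shell opensuse-test
```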

RDMA

Install

Although there is no documentation for systemd-nspawn specifically, we can refer to How-to: Deploy RDMA accelerated Docker container over InfiniBand fabric.

Make sure these tools have the same version as the host.

We only need to install userspace tools into nspawn container without updating firmware:

./mlnxofedinstall --user-space-only --without-fw-update

Edit .nspawn file

Edit the container's .nspawn file, which is located at /etc/systemd/nspawn/<machine-name>.nspawn. If such a file does not exist, create one.

Then add the following content:

    [Exec]
    Capability=CAP_IPC_LOCK
    LimitMEMLOCK=infinity

    [Files]
    Bind=/dev/infiniband/
    Bind=/dev/hugepages

Also consider using the host network by adding:

    [Network]
    VirtualEthernet=no

Add DeviceAllow

Create a drop-in file using the command

sudo systemctl edit systemd-nspawn@<machine-name>

with the following content:

    [Service]
    DeviceAllow=/dev/infiniband/uverbs0 rwm
    DeviceAllow=/dev/infiniband/uverbs1 rwm

List all the devices you want to allow there.

Test

Show status with ibstat. Test RDMA with perftest.
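A minimal perftest sketch between two nodes, assuming the HCA shows up as mlx5_0 (check with ibstat) and the nodes can reach each other over IP:

```bash
# on the server node
ib_write_bw -d mlx5_0
# on the client node, pointing at the server's address
ib_write_bw -d mlx5_0 <server_ip>
```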

If tools like perftest do not work, it may be related to:

  • https://gist.github.com/zshi-redhat/c7cfe9e0be63f0330952a28792acff2b
  • Limit on memlock, see below for solution.

Disable memlock limit

IB tools may fail to allocate memory if the memlock limit is too small. To show the current memlock limit, use

sudo systemctl show systemd-nspawn@<machine-name> --property LimitMEMLOCK

To disable limit, use

sudo systemctl edit systemd-nspawn@<machine-name>

And add LimitMEMLOCK=infinity to [Service] section, then restart your container.

Troubleshooting

No color in terminal

See Arch wiki for "broken colors" problem.

Create the file /etc/systemd/system/container-getty@.service.d/term.conf in the container with the following contents:

    [Service]
    Environment=TERM=xterm-256color

Archived

Pages under this path may be outdated and may not reflect the current setup.

Cluster Setup

Warning This is an outdated guide.

Full procedure & notes on pitfalls

Machine information & hardware preparation

  • Nodes: 4 nodes (node1~4, node1 is the head node)
  • Network: Ethernet (192.168.<A>.x) and IB (192.168.<B>.x)
    • Star topology
    • The head node needs external network access during setup
  • Disks: one system disk per node; the head node additionally has an SSD as shared storage
  • 1 Clonezilla image USB stick (the image can simply be unpacked, so the BIOS must be set to UEFI mode for the installation below)
  • 1 clean minimal CentOS 7 image USB stick (same as above)

CentOS installation

Download the CentOS-7 Minimal image onto a USB stick and plug it into the head node.

If the motherboard's BIOS boot mode is not UEFI, remember to change it at boot ;( The head node will also need the external Clonezilla USB stick later, so move USB boot to the front of the boot order as well.

Boot the head node and choose "Install CentOS 7".

If the install triggers a dracut-init... timeout, you will be dropped into the dracut shell afterwards. Run lsblk, find the USB device, note its LABEL=AAA value, then reboot; at the selection screen press e and change the first segment of LABEL=BBB on the second line to AAA, then press ctrl+x. Another way is to change LABEL=CentOS\x207\x20x\86_64 to LABEL=CentOS\x207\x20x\8 https://blog.csdn.net/qq_36937234/article/details/82996998 https://access.redhat.com/solutions/2515741

Items to adjust:

  • Disk partitioning: / + /boot is enough; do not split the subdirectories of / into separate partitions; format as ext4
  • Set the hostname to node1

After the installation finishes, you should be able to log in normally as root.

Disable SELinux: edit /etc/selinux/config and set SELINUX=disabled

Disable the firewall:

    systemctl stop firewalld.service
    systemctl disable firewalld.service

Many problems are caused by these two security services; they are useless in the competition's internal network, so turn them all off first.

Ethernet configuration

First configure the head node's external connection, then connect the nodes on the internal network.

External Ethernet

Plug in the external Ethernet cable (remember the corresponding interface <INTERFACE>, e.g. eno2).

Use the ip command to check the DNS address and related information, then run nmtui to open the network settings UI, set the external connection to DHCP mode and fill in the DNS server address, then use curl to log in to the campus network:

    $ dhclient -v <INTERFACE>
    $ curl -X POST --data “userName=<USERNAME>&password=<PASSWD>&hasValidateCode=false&authLan=zh_CN” https://10.15.44.172:8445/PortalServer//Webauth/webAuthAction\!login.action

At this point the external network should work. Record the machine's IP address for remote access (the DHCP address should not change as long as the machine stays up on the campus network); check external connectivity with curl <URL>.

Internal Ethernet

Likewise, use the ip tool to find the gateway address and other details, then use the nmtui UI to configure the internal interface (e.g. eno1), e.g. 192.168.<A>.1 for the head node.

Driver download & installation

The IB driver and the Nvidia driver are installed in the default locations (the shared disk is not set up yet), so do this before cloning the disk.

IB driver and configuration

IB driver

Nvidia driver

yum install kernel-dev epel-release dkms to add the EPEL source and the other Nvidia driver dependencies.

Disable the default nouveau display driver:

    $ vi /etc/default/grub   # add `nouveau.modeset=0` to `GRUB_CMDLINE_LINUX`
    $ grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
    $ reboot

Do not install the driver from the rpm package on the official site ;(

After rebooting, look up the latest driver version <VER.SUB> for your card on the official site (e.g. currently 410.79 for the V100), fetch the installer script and install:

    $ wget http://us.download.nvidia.com/XFree86/Linux-x86_64/<VER.SUB>/NVIDIA-Linux-x86_64-<VER.SUB>.run
    $ bash NVIDIA-Linux-x86_64-<VER.SUB>.run --kernel-source-path /usr/src/kernels/xxx   # try adding this option if the install fails

Run nvidia-smi to check whether the GPU information can be retrieved.

Creating the compute nodes by cloning

First install the essential basic tools to reduce repeated work: yum -y install <TOOL-NAME>

  • NFS: nfs-utils rpcbind
  • Lmod: environment-modules
  • Others: gcc gcc-c++ perl wget (pre-install gcc via yum for building the parallel libraries and tools)

Shut down the head node, plug in the Clonezilla USB stick, boot from it, and clone the head node's system disk onto each compute node's system disk (do not mix up the Source & Target directions): https://www.tecmint.com/linux-centos-ubuntu-disk-cloning-backup-using-clonezilla

After plugging the cloned system disks into the compute nodes, log in to each node and change the hostname and the static IP address (internal interface) so the nodes can identify each other; make sure the 4 nodes all have distinct IPs and hostnames.

    # e.g. on node2
    $ hostnamectl set-hostname node2
    $ vi /etc/sysconfig/network-scripts/ifcfg-<INTERFACE>   # change IPADDR=192.168.<A>.2

Sharing the data disk over NFS (over IB RDMA)

howto-configure-nfs-over-rdma--roce-x

Maybe useful, according to teacher Zhang:

opensmd openibd

Sharing the data disk over NFS (over TCP), fallback

Plug the disk used as shared storage into the head node. Use lsblk to confirm the new disk is present and note its name; you should see e.g. an sdb1 device (identify the shared disk by its size, do not get it wrong).

Disk formatting procedure, for the record:

    $ fdisk /dev/sdb1
    $ n                      # create a new partition
    $ p 1 [Enter] [Enter]    # make the whole disk one big primary partition

Mount the disk and create the directories to be shared (/home and /opt):

    $ mount /dev/nvme0n1 /mnt/nfs
    $ mkdir /mnt/nfs/home
    $ mkdir /mnt/nfs/opt

Start the NFS server on the head node, edit the export configuration /etc/exports, and add the entries (be careful not to add extra spaces):

    /mnt/nfs/home 192.168.<A>.0/24(rw,no_root_squash,no_all_squash,sync)
    /mnt/nfs/opt 192.168.<A>.0/24(rw,no_root_squash,no_all_squash,sync)

Parameter explanation:

  • rw: read-write
  • no_*_squash: clients accessing as * are not downgraded to an anonymous ordinary user
  • sync: writes from each client are synced to disk

Enable the services and set them to start on boot:

    $ exportfs -r
    $ service rpcbind start
    $ service nfs start
    $ chkconfig rpcbind on
    $ chkconfig nfs on

Allow NFS requests through the head node's firewall:

    $ firewall-cmd --permanent --add-service=mountd
    $ firewall-cmd --permanent --add-service=nfs
    $ firewall-cmd --permanent --add-service=rpc-bind
    $ firewall-cmd --reload

Edit /etc/fstab so the head node bind-mounts the shared directories onto /home and /opt, and the compute nodes mount the head node's directories via NFS:

    # On node1, append to /etc/fstab
    /dev/nvme0n1 /mnt/nfs ext4 rw,user,exec,suid,dev,auto,async
    /mnt/nfs/home /home none rw,user,exec,suid,dev,auto,async,bind
    /mnt/nfs/opt /opt none rw,user,exec,suid,dev,auto,async,bind
    # On node2~4, append to /etc/fstab
    node1:/mnt/nfs/home /home nfs rw,user,exec,suid,dev,auto,async
    node1:/mnt/nfs/opt /opt nfs rw,user,exec,suid,dev,auto,async

After each boot, log in as root on every node; run mount -a on the head node first, then mount -a on each compute node, and the shared directories will be mounted.

Fully manual mount procedure, for the record. After boot, first on the head node:

    $ mount /dev/nvme0n1 /mnt/nfs
    $ mount --bind /mnt/nfs/home /home
    $ mount --bind /mnt/nfs/opt /opt

Then on each compute node:

    $ showmount -e node1   # check that the NFS exports from the head node are visible
    $ mount -t nfs node1:/mnt/nfs/home /home
    $ mount -t nfs node1:/mnt/nfs/opt /opt

If you hit "Stale file handle" or "Access denied" problems, restart NFS on the head node (systemctl restart nfs) and mount again.

Passwordless SSH configuration

First set up passwordless ssh between the root users; this is needed between every pair of nodes. E.g. in /root on the head node:

    $ ssh-keygen          # default location and name
    $ ssh-copy-id node1   # the local node needs the key too
    $ ...
    $ ssh-copy-id node4

Then create an ordinary user on each node, taking care to use the same name & same uid & same group (gid) & same password:

    $ useradd <USERNAME> -m
    $ passwd <USERNAME>
    $ [Type new PASSWORD] [Type again]   # set the password with passwd, not useradd -p, which fails on non-conforming passwords

Set the password with passwd; otherwise the -p option may fail silently when the password does not meet the policy. Create the users in the same order on every node, and check with cat /etc/passwd.

On any node, switch to the ordinary user, generate and copy the key (note the ordinary user's home directory is shared):

    $ su testuser
    $ cd
    $ ssh-keygen [Enter] [Enter] [Enter]
    $ ssh-copy-id localhost

Installing compilers, parallel libraries and the environment

The environment installation tree is placed under /opt.

Required environments and installation procedure: see "Environment Installation"

Environment Modules configuration

Lmod was already installed above. Create mkdir /opt/modulefiles on the shared disk as the modulefile location, then pin the modulefile search path on every node by adding the following line to /etc/environment:

export MODULEPATH=/opt/modulefiles

Don't forget to source /etc/environment

Previously used modulefiles: see "Modulefile Records"

Environment Installation

Warning This is an outdated guide.

Installation methods + location in the directory tree

Installation directory tree

    |- /opt/
       |- openmpi/
          |- 4.0
          |- 3.1
          |- ...
       |- mpich/
       |- intel/        # the whole Intel family
       |- blas/
       |- gcc/
       |- cuda/         # Nvidia CUDA
       |- pgi/          # CUDA PGI Edition
       |- netcdf/
          |- netcdf-c/
          |- netcdf-fort/
          |- pnetcdf

The basic six steps of building from source:

    $ wget [SOURCE_URL]
    $ tar zxvf openmpi-4.0.0.tar.gz
    $ cd openmpi-4.0.0/
    $ ./configure --prefix=‘/opt/mpi/openmpi/4.0’   # plan the prefix carefully
    $ make -j8
    $ make install

Package management

Choose between spack and Environment Modules; spack is basically a higher-level API over system-level modules. Since ASC20 we have used spack. To keep the directory tree shared over NFS, spack is placed under /opt.

    $ git clone https://github.com/spack/spack.git
    $ cd spack/bin
    $ ./spack install libelf   # test
    $ echo "export PATH=$PATH:/opt/spack/bin" >> ~/.bashrc
    $ echo ". /opt/spack/share/spack/setup-env.sh" >> ~/.bashrc
    $ bash

Dependency and version specs

$ spack install intel^gcc@9

At test time use:

$ spack load intel^gcc@9. Alternatively, run module avail to see which environment to load, then module load intel.

Adding a new compiler: once you have built and installed a compiler yourself and it can be found in PATH, run spack compiler find to register it; after that you can use this compiler (e.g. %intel) to build other packages.
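A minimal sketch of that flow (the compiler spec %intel@18.0.3 is only an example and must match what spack compilers reports):

```bash
spack compiler find                   # register compilers found in $PATH into compilers.yaml
spack compilers                       # list the compilers spack now knows about
spack install openmpi%intel@18.0.3    # build a package with a specific registered compiler
```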

Special note: when building and installing MPI/OpenMP libraries, be sure to enable the --with-rdma option so that InfiniBand is supported.

Compilers

  1. gcc (including gfortran) - versions 7.4 + 5.5 + 4.9.4 + 4.4.7
  2. icc & ifort: included in Intel Parallel Studio XE

The Intel Parallel Studio XE bundle

Parallel Studio XE: obtain and install it following This Procedure; the 19-20 license is as follows

  • Serial number S4ZD-MMZJXJ96 (if it has expired, you can apply on the Intel website in the register center below; for a spack install you only need to enter it during installation)
  • URL: parallel_studio_xe_2019_update2_cluster_edition.tgz
  • LICENSE: download from the official Registration Center and upload it to the server

icc, ifort, MKL and IntelMPI are all included in Parallel Studio XE: spack install intel

Because CUDA only supports host compilers whose headers predate the gcc-7 standard, we recommend using intel@18.0.3.

MPI

  1. OpenMPI - Version 4.0 + 3.1 + 3.0 + 2.1
  2. MPICH - Version 3.3 + 3.2.1
  3. IntelMPI: included in Intel Parallel Studio XE

Nvidia CUDA

  • CUDA Toolkit: spack install cuda@10.2, so that different versions can be supported.
  • PGI Edition: spack install pgi@19.10

Math Libraries

  1. MKL: included in Intel Parallel Studio XE
  2. OpenBLAS: spack install openblas

NetCDF I/O

Used in the IO500 problem at ASC19.

Environment Modules

Warning This is an outdated guide.

Environment Modules: modulefile directory tree structure + backup

Modulefile directory tree structure (Deprecated)

    |- /opt/modulefiles   # the layout does not fully follow conflict relations; pay attention to conflicts inside the modulefiles
       |- mpi/
          |- openmpi/
             |- 4.0
             |- 3.1
             |- ...
          |- mpich/
          |- intelmpi/
       |- math/
          |- mkl/
          |- blas/
       |- compilers/
          |- gcc/
          |- icc/
          |- ifort/
       |- cuda/
          |- nvidia/
          |- pgi/
       |- netcdf/
          |- pnetcdf/
          |- netcdf-c/
          |- netcdf-fort/

It-Support Machine

Machine overview

We have four IT-office (图信) accounts in total: GeekPieHPC{1, 2, 3, 4}. All four accounts sit behind the same IP address; check Slack for the machine's IP address.

How to connect

All of these accounts can be reached with ssh, or used with scp for file transfer; the port is 22112.

All four accounts sit behind the same IP address, see https://geekpiehpc.slack.com/archives/C0210BA22QH/p1631708325019000

  • Key: you can log in with the key used for the Epyc machine; an example configuration:

        Host geekpie<N>
            HostName <IP address of the target machine>
            User geekpie<N>
            Port 22112
            IdentityFile ~/.ssh/id_rsa_epyc
  • Password: you can find the password in Slack.

Scheduler

The IT-office machine uses PBS (Portable Batch System) for scheduling. Most of its commands start with q; for example, qstat shows the scheduler state.

The CPU queue is GeekPie_HPC and the GPU queue is GeekPie_GPU.

See DevOps/Scheduler for details on how to use PBS.
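A minimal PBS job script sketch for these queues (the resource request and program name are placeholders; adjust them to the real node shape):

```bash
#!/bin/bash
#PBS -N hello                 # job name
#PBS -q GeekPie_HPC           # CPU queue; use GeekPie_GPU for GPU jobs
#PBS -l select=1:ncpus=16     # hypothetical resource request
#PBS -j oe                    # merge stdout and stderr

cd "$PBS_O_WORKDIR"           # run from the submission directory
./my_program
```

Submit it with qsub job.sh and check it with qstat -u $USER.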

Environment management

On the IT-office machine, module is the recommended way to manage environment variables, but when dealing with compilers we still use spack to install them.

Support and help

You can contact the IT office via:

  • WeChat: Saber in the group is the IT-office contact.
  • Office: the IT office is in H1 304; H1 is the building with the clinic.

BMC fuck

We think it's worthwhile to reverse the BMC for the following reasons:

  1. Fine-grained adjustment of GPU power consumption (super TDP)
  2. It allows 1 machine with 4 cards. (Tsinghua ran 2 cards last year)
  3. PCIe device hot-swapping

Also see:

  1. https://github.com/l4rz/reverse-engineering-dell-idrac-to-get-rid-of-gpu-throttling

Salt Stack

What salt mainly does is push configuration files quickly. But it is not limited to that: together with jinja and LDAP you can build a private-key management system.

Since the previous maintainer ran off, the new students have to take over.

All of Linux's secrets are in PAM

Once you get used to using journalctl -x to debug the various state machines of system user authentication, or sshd, you will keep seeing words like PAM, xsecurity and sssd; these are user authentication protocols. SSSD is a daemon that connects the system's NSS/PAM machinery to LDAP. (The 20.04 GNOME exploit a while back was also the result of this protocol being bypassed.)

Once you are familiar with PAM, you will also understand why ~/.ssh/id_*.pub needs the r.. permission bits; that is hard-coded into the NSS user-directory protocol.

Further reading: https://jia.je/software/2021/02/15/sssd-ldap/

ref

C++

Finally, the language our school actually teaches.

C++ 17 & 20

New features are an evergreen topic; it has been a long time since C++11's l/rvalues. On Epyc the usual compiler performance ranking is AOCC > Intel > GCC >> LLVM, but MKL still has an edge over AMD's optimized libraries, so sometimes Intel with only basic x86 optimization can match AOCC. C++17 added a lot for parallelism, e.g. PSTL, for_each(threading), pmr, Intel's SYCL, and Nvidia's thrust/cub; often you can get a painless speedup just by switching the namespace. The most important C++20 features are ranges, filesystem, and so on. LLVM has always been the slowest to support new standards. icc used to accept some ancient constructs such as VLAs, but newer versions dropped them; this kind of unstable feature churn is why the big companies do not treat it as the reference standard.

This is a problem the author set in a compilers course assignment: give a semantic rule that rejects the line marked ???. It seems that only icc accepts it as standard.

template<class T>class array{ int s; T* elements; public: array(int n); // allocate "n" elements and let "elements" refer to them array(T* p, int n); // make this array refer to p[0..n-1] operator T*(){return elements;} int size()const{return s;} // the usual container operations, such as = and [], much like vector }; void h(array<double>a); //C++ void g(int m,double vla[m]); //C99 void f(int m,double vla1[m],array<double>a1) { array<double> a2(vla1,m); // a2 refers to vla1 double*p=a1; //p refers to a1's elements g(m,vla1); g(a1.size(),a1); // a bit verbose g(a1); //??? }

On black magic

Segfaults can show up with any compiler; very often switching to a different Intel minor version makes them go away.

Thoughts for when a build drives you crazy

For CMake builds, turn on make -n; for configure builds, turn on make VERBOSE=1. If you need to see expanded macros, pass the -E option to the compiler to get the preprocessed, expanded output.
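A minimal sketch of those three knobs (foo.cpp is a placeholder source file):

```bash
make -n                     # print the commands make would run, without running them
make VERBOSE=1              # CMake-generated Makefiles: echo the full compiler command lines
g++ -E foo.cpp -o foo.ii    # stop after preprocessing to inspect the expanded macros
```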

Make extensive use of man and --help.

Compiler options

LTO

To reduce the overhead of calls across different libraries or across languages, this mostly relies on LLVM's libLTO and tblgen. It is enabled automatically; the idea is to turn all libraries into LLVM bitcode and link them together. A parallel LTO is not that hard to implement either; a former team captain wrote a parallel-computing project in Rust that does this, see the source code for details.

PGO

By profiling the program's actual runtime behavior and feeding the results back to the compiler, the compiler can rearrange code to reduce instruction-cache problems and branch mispredictions, improving performance. Profile-guided optimization uses real executions to find the most frequently executed parts of the program, so the compiler can optimize the code more specifically with this information.


  • Stage 1: add -prof-gen=srcpos -prof-dir=/tmp/profdata to the compile flags; -prof-dir is the directory where the profiling files are stored.
  • Stage 2: run the compiled program, then run profmerge -prof_dir /tmp/profdata to generate the merged profile.
  • Stage 3: recompile the program with -prof-use=nomerge -prof-func-groups -prof-dir=/tmp/profdata (a sketch of the full flow follows this list).
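A sketch of the three stages end to end with icc, using only the flags listed above (main.c and app are placeholder names):

```bash
# Stage 1: instrumented build
icc -O2 -prof-gen=srcpos -prof-dir=/tmp/profdata -o app.inst main.c
# Stage 2: run representative workloads, then merge the profiles
./app.inst
profmerge -prof_dir /tmp/profdata
# Stage 3: rebuild using the collected profile
icc -O2 -prof-use=nomerge -prof-func-groups -prof-dir=/tmp/profdata -o app main.c
```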

IPO

Interprocedural optimization and profile-guided optimization can influence each other: PGO usually helps the compiler generate inlined functions, which in turn improves the effectiveness of IPO. PGO helps branch prediction the most; the likelihood of many branches cannot be determined at compile time, but with PGO the compiler can generate efficient code for frequently executed branches (hot code) versus rarely executed branches (cold code).

HLO

Some of these optimizations already appear at the LLVM IR level discussed below; icc actually has an additional high-level IR that sits before the LLVM IR. According to the documentation, it mainly performs:

  • Loop Permutation or Interchange
  • Loop Distribution
  • Loop Fusion
  • Loop Unrolling
  • Data Prefetching
  • Scalar Replacement
  • Unroll and Jam
  • Loop Blocking or Tiling
  • Partial-Sum Optimization
  • Predicate Optimization
  • Loop Reversal
  • Profile-Guided Loop Unrolling
  • Loop Peeling
  • Data Transformation: Malloc Combining and Memset Combining, Memory Layout Change
  • Loop Rerolling
  • Memset and Memcpy Recognition
  • Statement Sinking for Creating Perfect Loopnests
  • Multiversioning: Checks include Dependency of Memory References, and Trip Counts
  • Loop Collapsing

DOP

If you have written programs for game engines like UE, kernel structs that need cache alignment, or followed the recent arms race in database optimization, you will be familiar with this kind of data structure. The core idea is to pack data into AVX-aligned structs and make every operation an add/multiply over those structs. See https://neil3d.github.io/assets/img/ecs/DOD-Cpp.pdf.

LLVM

DC++/AOCC have both started using LLVM as the intermediate layer.

Roughly what icc does differently

Static analysis with the latest 2021.3.0, using saxpy as the reference kernel. SAXPY means Single-Precision A·X Plus Y; it is a level-1 BLAS routine that is often written as a kernel for tuning registers and the memory model. Its C++ version is:

    void saxpy(int n, float a, float *x, float *y) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

The LLVM IR is as follows (full link: https://godbolt.org/z/j5rrxhedG). It mostly hard-codes various pre-optimized assembly, especially the fast instruction-fetch pattern of moving high addresses. It looks as if Intel C/C++ compiler intrinsic equivalents such as VADDSS (__m128 _mm_mask_add_ss (__m128 s, __mmask8 k, __m128 a, __m128 b)) are compiled into the IR together as library functions.

.section .text .LNDBG_TX: # mark_description "Intel(R) C Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.3.0 Build 2021"; # mark_description "0609_000000"; # mark_description "-g -o /app/output.s -masm=intel -S -gxx-name=/opt/compiler-explorer/gcc-10.1.0/bin/g++ -emit-llvm"; .intel_syntax noprefix .file "example.cpp" .text ..TXTST0: .L_2__routine_start_saxpy(int, float, float*, float*)_0: # -- Begin saxpy(int, float, float*, float*) .text # mark_begin; .globl saxpy(int, float, float*, float*) # --- saxpy(int, float, float *, float *) saxpy(int, float, float*, float*): # parameter 1(n): edi # parameter 2(a): xmm0 # parameter 3(x): rsi # parameter 4(y): rdx ..B1.1: # Preds ..B1.0 # Execution count [0.00e+00] .cfi_startproc .cfi_personality 0x3,__gxx_personality_v0 ..___tag_value_saxpy(int, float, float*, float*).2: ..L3: #2.1 ..LN0: .file 1 "/app/example.cpp" .loc 1 2 is_stmt 1 push rbp #2.1 .cfi_def_cfa_offset 16 ..LN1: mov rbp, rsp #2.1 .cfi_def_cfa 6, 16 .cfi_offset 6, -16 ..LN2: sub rsp, 48 #2.1 ..LN3: mov DWORD PTR [-40+rbp], edi #2.1 ..LN4: movss DWORD PTR [-32+rbp], xmm0 #2.1 ..LN5: mov QWORD PTR [-24+rbp], rsi #2.1 ..LN6: mov QWORD PTR [-16+rbp], rdx #2.1 ..LN7: .loc 1 3 prologue_end is_stmt 1 mov DWORD PTR [-48+rbp], 0 #3.14 ..LN8: # LOE rbx rbp rsp r12 r13 r14 r15 rip ..B1.2: # Preds ..B1.3 ..B1.1 # Execution count [0.00e+00] ..LN9: mov eax, DWORD PTR [-48+rbp] #3.19 ..LN10: mov edx, DWORD PTR [-40+rbp] #3.23 ..LN11: cmp eax, edx #3.23 ..LN12: jge ..B1.4 # Prob 50% #3.23 ..LN13: # LOE rbx rbp rsp r12 r13 r14 r15 rip ..B1.3: # Preds ..B1.2 # Execution count [0.00e+00] ..LN14: .loc 1 4 is_stmt 1 movss xmm0, DWORD PTR [-32+rbp] #4.14 ..LN15: mov eax, DWORD PTR [-48+rbp] #4.18 ..LN16: movsxd rax, eax #4.16 ..LN17: imul rax, rax, 4 #4.16 ..LN18: add rax, QWORD PTR [-24+rbp] #4.16 ..LN19: movss xmm1, DWORD PTR [rax] #4.16 ..LN20: mulss xmm0, xmm1 #4.16 ..LN21: mov eax, DWORD PTR [-48+rbp] #4.25 ..LN22: movsxd rax, eax #4.23 ..LN23: imul rax, rax, 4 #4.23 ..LN24: add rax, QWORD PTR [-16+rbp] #4.23 ..LN25: movss xmm1, DWORD PTR [rax] #4.23 ..LN26: addss xmm0, xmm1 #4.23 ..LN27: mov eax, DWORD PTR [-48+rbp] #4.9 ..LN28: movsxd rax, eax #4.7 ..LN29: imul rax, rax, 4 #4.7 ..LN30: add rax, QWORD PTR [-16+rbp] #4.7 ..LN31: movss DWORD PTR [rax], xmm0 #4.7 ..LN32: .loc 1 3 is_stmt 1 mov eax, 1 #3.28 ..LN33: add eax, DWORD PTR [-48+rbp] #3.28 ..LN34: mov DWORD PTR [-48+rbp], eax #3.28 ..LN35: jmp ..B1.2 # Prob 100% #3.28 ..LN36: # LOE rbx rbp rsp r12 r13 r14 r15 rip ..B1.4: # Preds ..B1.2 # Execution count [0.00e+00] ..LN37: .loc 1 5 epilogue_begin is_stmt 1 leave #5.1 .cfi_restore 6 ..LN38: ret #5.1 ..LN39: # LOE ..LN40: .cfi_endproc # mark_end; .type saxpy(int, float, float*, float*),@function .size saxpy(int, float, float*, float*),.-saxpy(int, float, float*, float*) ..LNsaxpy(int, float, float*, float*).41: .LNsaxpy(int, float, float*, float*): .data # -- End saxpy(int, float, float*, float*) .data .section .note.GNU-stack, "" // -- Begin DWARF2 SEGMENT .debug_info .section .debug_info .debug_info_seg: .align 1 .4byte 0x000000be ....

The assembly is as follows; you can see every branch carries a probability prediction, and the loop is auto-vectorized.

saxpy(int, float, float*, float*): mov r9, rsi #2.1 test edi, edi #3.23 jle ..B1.36 # Prob 50% #3.23 cmp edi, 6 #3.3 jle ..B1.30 # Prob 50% #3.3 movsxd r8, edi #1.6 mov rax, rdx #4.16 sub rax, r9 #4.16 lea rcx, QWORD PTR [r8*4] #3.3 cmp rax, rcx #3.3 jge ..B1.5 # Prob 50% #3.3 neg rax #4.23 cmp rax, rcx #3.3 jl ..B1.30 # Prob 50% #3.3 ..B1.5: # Preds ..B1.4 ..B1.3 cmp edi, 8 #3.3 jl ..B1.38 # Prob 10% #3.3 mov r10, rdx #3.3 and r10, 15 #3.3 test r10d, r10d #3.3 je ..B1.9 # Prob 50% #3.3 test r10d, 3 #3.3 jne ..B1.38 # Prob 10% #3.3 neg r10d #3.3 add r10d, 16 #3.3 shr r10d, 2 #3.3 ..B1.9: # Preds ..B1.8 ..B1.6 lea eax, DWORD PTR [8+r10] #3.3 cmp edi, eax #3.3 jl ..B1.38 # Prob 10% #3.3 mov esi, edi #3.3 xor ecx, ecx #3.3 sub esi, r10d #3.3 and esi, 7 #3.3 neg esi #3.3 add esi, edi #3.3 mov eax, r10d #3.3 test r10d, r10d #3.3 jbe ..B1.14 # Prob 9% #3.3 ..B1.12: # Preds ..B1.10 ..B1.12 movss xmm1, DWORD PTR [r9+rcx*4] #4.16 mulss xmm1, xmm0 #4.16 addss xmm1, DWORD PTR [rdx+rcx*4] #4.23 movss DWORD PTR [rdx+rcx*4], xmm1 #4.7 inc rcx #3.3 cmp rcx, rax #3.3 jb ..B1.12 # Prob 82% #3.3 ..B1.14: # Preds ..B1.12 ..B1.10 lea rcx, QWORD PTR [r9+rax*4] #4.16 test rcx, 15 #3.3 je ..B1.18 # Prob 60% #3.3 movaps xmm1, xmm0 #1.6 shufps xmm1, xmm1, 0 #1.6 movsxd rcx, esi #3.3 ..B1.16: # Preds ..B1.16 ..B1.15 movups xmm2, XMMWORD PTR [r9+rax*4] #4.16 movups xmm3, XMMWORD PTR [16+r9+rax*4] #4.16 mulps xmm2, xmm1 #4.16 mulps xmm3, xmm1 #4.16 addps xmm2, XMMWORD PTR [rdx+rax*4] #4.23 addps xmm3, XMMWORD PTR [16+rdx+rax*4] #4.23 movups XMMWORD PTR [rdx+rax*4], xmm2 #4.7 movups XMMWORD PTR [16+rdx+rax*4], xmm3 #4.7 add rax, 8 #3.3 cmp rax, rcx #3.3 jb ..B1.16 # Prob 82% #3.3 jmp ..B1.21 # Prob 100% #3.3 ..B1.18: # Preds ..B1.14 movaps xmm1, xmm0 #1.6 shufps xmm1, xmm1, 0 #1.6 movsxd rcx, esi #3.3 ..B1.19: # Preds ..B1.19 ..B1.18 movups xmm2, XMMWORD PTR [r9+rax*4] #4.16 movups xmm3, XMMWORD PTR [16+r9+rax*4] #4.16 mulps xmm2, xmm1 #4.16 mulps xmm3, xmm1 #4.16 addps xmm2, XMMWORD PTR [rdx+rax*4] #4.23 addps xmm3, XMMWORD PTR [16+rdx+rax*4] #4.23 movups XMMWORD PTR [rdx+rax*4], xmm2 #4.7 movups XMMWORD PTR [16+rdx+rax*4], xmm3 #4.7 add rax, 8 #3.3 cmp rax, rcx #3.3 jb ..B1.19 # Prob 82% #3.3 ..B1.21: # Preds ..B1.19 ..B1.16 lea eax, DWORD PTR [1+rsi] #3.3 cmp eax, edi #3.3 ja ..B1.36 # Prob 50% #3.3 sub r8, rcx #3.3 cmp r8, 4 #3.3 jl ..B1.39 # Prob 10% #3.3 mov eax, r8d #3.3 xor r10d, r10d #3.3 and eax, -4 #3.3 lea rdi, QWORD PTR [rdx+rcx*4] #4.23 movsxd rax, eax #3.3 lea rcx, QWORD PTR [r9+rcx*4] #4.16 ..B1.24: # Preds ..B1.24 ..B1.23 movups xmm2, XMMWORD PTR [rcx+r10*4] #4.16 mulps xmm2, xmm1 #4.16 addps xmm2, XMMWORD PTR [rdi+r10*4] #4.23 movups XMMWORD PTR [rdi+r10*4], xmm2 #4.7 add r10, 4 #3.3 cmp r10, rax #3.3 jb ..B1.24 # Prob 82% #3.3 ..B1.26: # Preds ..B1.24 ..B1.39 cmp rax, r8 #3.3 jae ..B1.36 # Prob 9% #3.3 movsxd rsi, esi #4.7 lea rcx, QWORD PTR [rdx+rsi*4] #4.23 lea rdx, QWORD PTR [r9+rsi*4] #4.16 ..B1.28: # Preds ..B1.28 ..B1.27 movss xmm1, DWORD PTR [rdx+rax*4] #4.16 mulss xmm1, xmm0 #4.16 addss xmm1, DWORD PTR [rcx+rax*4] #4.23 movss DWORD PTR [rcx+rax*4], xmm1 #4.7 inc rax #3.3 cmp rax, r8 #3.3 jb ..B1.28 # Prob 82% #3.3 jmp ..B1.36 # Prob 100% #3.3 ..B1.30: # Preds ..B1.4 ..B1.2 mov eax, edi #3.3 mov esi, 1 #3.3 xor ecx, ecx #3.3 shr eax, 1 #3.3 je ..B1.34 # Prob 9% #3.3 ..B1.32: # Preds ..B1.30 ..B1.32 movss xmm1, DWORD PTR [r9+rcx*8] #4.16 mulss xmm1, xmm0 #4.16 addss xmm1, DWORD PTR [rdx+rcx*8] #4.23 movss DWORD PTR [rdx+rcx*8], xmm1 #4.7 movss xmm2, DWORD PTR [4+r9+rcx*8] #4.16 mulss xmm2, 
xmm0 #4.16 addss xmm2, DWORD PTR [4+rdx+rcx*8] #4.23 movss DWORD PTR [4+rdx+rcx*8], xmm2 #4.7 inc rcx #3.3 cmp rcx, rax #3.3 jb ..B1.32 # Prob 63% #3.3 lea esi, DWORD PTR [1+rcx+rcx] #4.7 ..B1.34: # Preds ..B1.33 ..B1.30 lea eax, DWORD PTR [-1+rsi] #3.3 cmp eax, edi #3.3 jae ..B1.36 # Prob 9% #3.3 movsxd rsi, esi #3.3 movss xmm1, DWORD PTR [-4+r9+rsi*4] #4.16 mulss xmm0, xmm1 #4.16 addss xmm0, DWORD PTR [-4+rdx+rsi*4] #4.23 movss DWORD PTR [-4+rdx+rsi*4], xmm0 #4.7 ..B1.36: # Preds ..B1.28 ..B1.21 ..B1.34 ..B1.38 ..B1.1 ret #5.1 ..B1.38: # Preds ..B1.5 ..B1.7 ..B1.9 xor esi, esi #3.3 cmp edi, 1 #3.3 jb ..B1.36 # Prob 50% #3.3 ..B1.39: # Preds ..B1.22 ..B1.38 xor eax, eax #3.3 jmp ..B1.26 # Prob 100% #3.3

Below is the LLVM IR emitted by AOCC. It does nothing special at the IR level and is basically the same as what clang emits.

; ModuleID = './a.c' source_filename = "./a.c" target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-unknown-linux-gnu" ; Function Attrs: noinline nounwind optnone uwtable define dso_local void @saxpy(i32 %n, float %a, float* %x, float* %y) #0 { entry: %n.addr = alloca i32, align 4 %a.addr = alloca float, align 4 %x.addr = alloca float*, align 8 %y.addr = alloca float*, align 8 %i = alloca i32, align 4 store i32 %n, i32* %n.addr, align 4 store float %a, float* %a.addr, align 4 store float* %x, float** %x.addr, align 8 store float* %y, float** %y.addr, align 8 store i32 0, i32* %i, align 4 br label %for.cond for.cond: ; preds = %for.inc, %entry %0 = load i32, i32* %i, align 4 %1 = load i32, i32* %n.addr, align 4 %cmp = icmp slt i32 %0, %1 br i1 %cmp, label %for.body, label %for.end for.body: ; preds = %for.cond %2 = load float, float* %a.addr, align 4 %3 = load float*, float** %x.addr, align 8 %4 = load i32, i32* %i, align 4 %idxprom = sext i32 %4 to i64 %arrayidx = getelementptr inbounds float, float* %3, i64 %idxprom %5 = load float, float* %arrayidx, align 4 %mul = fmul float %2, %5 %6 = load float*, float** %y.addr, align 8 %7 = load i32, i32* %i, align 4 %idxprom1 = sext i32 %7 to i64 %arrayidx2 = getelementptr inbounds float, float* %6, i64 %idxprom1 %8 = load float, float* %arrayidx2, align 4 %add = fadd float %mul, %8 %9 = load float*, float** %y.addr, align 8 %10 = load i32, i32* %i, align 4 %idxprom3 = sext i32 %10 to i64 %arrayidx4 = getelementptr inbounds float, float* %9, i64 %idxprom3 store float %add, float* %arrayidx4, align 4 br label %for.inc for.inc: ; preds = %for.body %11 = load i32, i32* %i, align 4 %inc = add nsw i32 %11, 1 store i32 %inc, i32* %i, align 4 br label %for.cond for.end: ; preds = %for.cond ret void } attributes #0 = { noinline nounwind optnone uwtable "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="false" "use-soft-float"="false" } !llvm.module.flags = !{!0} !llvm.ident = !{!1} !0 = !{i32 1, !"wchar_size", i32 4} !1 = !{!"AMD clang version 12.0.0 (CLANG: AOCC_3.0.0-Build#78 2020_12_10) (based on LLVM Mirror.Version.12.0.0)"}

The vectorized part is basically the same as icc's, but without the probability model. Unfortunately the cost model behind icc's branch probabilities targets Intel processors, so in the end icc and AOCC come out about even.

.text .file "a.c" .globl saxpy # -- Begin function saxpy .p2align 4, 0x90 .type saxpy,@function saxpy: # @saxpy .cfi_startproc # %bb.0: # %entry testl %edi, %edi jle .LBB0_16 # %bb.1: # %for.body.preheader movl %edi, %r9d cmpl $7, %edi jbe .LBB0_2 # %bb.7: # %vector.memcheck leaq (%rsi,%r9,4), %rax cmpq %rdx, %rax jbe .LBB0_9 # %bb.8: # %vector.memcheck leaq (%rdx,%r9,4), %rax cmpq %rsi, %rax jbe .LBB0_9 .LBB0_2: xorl %ecx, %ecx .LBB0_3: # %for.body.preheader23 movq %rcx, %rax notq %rax testb $1, %r9b je .LBB0_5 # %bb.4: # %for.body.prol movss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero mulss %xmm0, %xmm1 addss (%rdx,%rcx,4), %xmm1 movss %xmm1, (%rdx,%rcx,4) orq $1, %rcx .LBB0_5: # %for.body.prol.loopexit addq %r9, %rax je .LBB0_16 .p2align 4, 0x90 .LBB0_6: # %for.body # =>This Inner Loop Header: Depth=1 movss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero mulss %xmm0, %xmm1 addss (%rdx,%rcx,4), %xmm1 movss %xmm1, (%rdx,%rcx,4) movss 4(%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero mulss %xmm0, %xmm1 addss 4(%rdx,%rcx,4), %xmm1 movss %xmm1, 4(%rdx,%rcx,4) addq $2, %rcx cmpq %rcx, %r9 jne .LBB0_6 jmp .LBB0_16 .LBB0_9: # %vector.ph movl %r9d, %ecx andl $-8, %ecx movaps %xmm0, %xmm1 shufps $0, %xmm0, %xmm1 # xmm1 = xmm1[0,0],xmm0[0,0] leaq -8(%rcx), %rax movq %rax, %r8 shrq $3, %r8 addq $1, %r8 testq %rax, %rax je .LBB0_10 # %bb.11: # %vector.ph.new movq %r8, %rax andq $-2, %rax negq %rax xorl %edi, %edi .p2align 4, 0x90 .LBB0_12: # %vector.body # =>This Inner Loop Header: Depth=1 movups (%rsi,%rdi,4), %xmm2 movups 16(%rsi,%rdi,4), %xmm3 mulps %xmm1, %xmm2 mulps %xmm1, %xmm3 movups (%rdx,%rdi,4), %xmm4 addps %xmm2, %xmm4 movups 16(%rdx,%rdi,4), %xmm2 addps %xmm3, %xmm2 movups 32(%rdx,%rdi,4), %xmm3 movups 48(%rdx,%rdi,4), %xmm5 movups %xmm4, (%rdx,%rdi,4) movups %xmm2, 16(%rdx,%rdi,4) movups 32(%rsi,%rdi,4), %xmm2 movups 48(%rsi,%rdi,4), %xmm4 mulps %xmm1, %xmm2 addps %xmm3, %xmm2 mulps %xmm1, %xmm4 addps %xmm5, %xmm4 movups %xmm2, 32(%rdx,%rdi,4) movups %xmm4, 48(%rdx,%rdi,4) addq $16, %rdi addq $2, %rax jne .LBB0_12 # %bb.13: # %middle.block.unr-lcssa testb $1, %r8b je .LBB0_15 .LBB0_14: # %vector.body.epil movups (%rsi,%rdi,4), %xmm2 movups 16(%rsi,%rdi,4), %xmm3 mulps %xmm1, %xmm2 mulps %xmm1, %xmm3 movups (%rdx,%rdi,4), %xmm1 addps %xmm2, %xmm1 movups 16(%rdx,%rdi,4), %xmm2 addps %xmm3, %xmm2 movups %xmm1, (%rdx,%rdi,4) movups %xmm2, 16(%rdx,%rdi,4) .LBB0_15: # %middle.block cmpq %r9, %rcx jne .LBB0_3 .LBB0_16: # %for.cond.cleanup retq .LBB0_10: xorl %edi, %edi testb $1, %r8b jne .LBB0_14 jmp .LBB0_15 .Lfunc_end0: .size saxpy, .Lfunc_end0-saxpy .cfi_endproc # -- End function .ident "AMD clang version 12.0.0 (CLANG: AOCC_3.0.0-Build#78 2020_12_10) (based on LLVM Mirror.Version.12.0.0)" .section ".note.GNU-stack","",@progbits .addrsig

Another test on NVHPC; you can actually hack the CPU backend part using AOCC with nvc -march=zen2 -Mvect=simd:256 -Mcache_align -fma -S a.c.

; ModuleID = 'a.c' target datalayout = "e-p:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-pc-linux-gnu" define internal void @pgCplus_compiled.() noinline { L.entry: ret void } define void @saxpy(i32 signext %n.arg, float %a.arg, float* %x.arg, float* %y.arg) #0 !dbg !17 { L.entry: %n.addr = alloca i32, align 4 %a.addr = alloca float, align 4 %x.addr = alloca float*, align 8 %y.addr = alloca float*, align 8 %.ndi0002.addr = alloca i32, align 4 %.ndi0003.addr = alloca i32, align 4 %.vv0000.addr = alloca i8*, align 8 %.vv0001.addr = alloca i8*, align 8 %.vv0002.addr = alloca i8*, align 8 %.r1.0148.addr = alloca <8 x float>, align 4 %.lcr010001.addr = alloca i32, align 4 store i32 %n.arg, i32* %n.addr, align 4, !tbaa !29 store float %a.arg, float* %a.addr, align 4, !tbaa !29 store float* %x.arg, float** %x.addr, align 8, !tbaa !30 store float* %y.arg, float** %y.addr, align 8, !tbaa !30 %0 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23 %1 = icmp sle i32 %0, 0, !dbg !23 br i1 %1, label %L.B0005, label %L.B0014, !dbg !23 L.B0014: %2 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23 %3 = bitcast float* %2 to i8*, !dbg !23 %4 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23 %5 = bitcast float* %4 to i8*, !dbg !23 %6 = ptrtoint i8* %5 to i64, !dbg !23 %7 = sub i64 0, %6, !dbg !23 %8 = getelementptr i8, i8* %3, i64 %7, !dbg !23 %9 = icmp ule i8* %8, null, !dbg !23 br i1 %9, label %L.B0008, label %L.B0015, !dbg !23 L.B0015: %10 = bitcast float* %2 to i8*, !dbg !23 %11 = bitcast float* %4 to i8*, !dbg !23 %12 = ptrtoint i8* %11 to i64, !dbg !23 %13 = sub i64 0, %12, !dbg !23 %14 = getelementptr i8, i8* %10, i64 %13, !dbg !23 %15 = inttoptr i64 32 to i8*, !dbg !23 %16 = icmp ult i8* %14, %15, !dbg !23 br i1 %16, label %L.B0007, label %L.B0008, !dbg !23 L.B0008: store i32 0, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 %17 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23 store i32 %17, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %18 = icmp slt i32 %17, 8, !dbg !23 br i1 %18, label %L.B0011, label %L.B0016, !dbg !23 L.B0016: store i8* null, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23 %19 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23 %20 = bitcast float* %19 to i8*, !dbg !23 store i8* %20, i8** %.vv0001.addr, align 8, !tbaa !30, !dbg !23 %21 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23 %22 = bitcast float* %21 to i8*, !dbg !23 store i8* %22, i8** %.vv0002.addr, align 8, !tbaa !30, !dbg !23 %23 = sub i32 %17, 7, !dbg !23 store i32 %23, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %24 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23 %25 = insertelement <8 x float> undef, float %24, i32 0, !dbg !23 %26 = shufflevector <8 x float> %25, <8 x float> undef, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>, !dbg !23 store <8 x float> %26, <8 x float>* %.r1.0148.addr, align 1, !tbaa !29, !dbg !23 br label %L.B0012 L.B0012: %27 = load <8 x float>, <8 x float>* %.r1.0148.addr, align 4, !tbaa !29, !dbg !23 %28 = load i8*, i8** %.vv0002.addr, align 8, !tbaa !30, !dbg !23 %29 = load i8*, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23 %30 = ptrtoint i8* %29 to i64, !dbg !23 %31 = getelementptr i8, i8* %28, i64 %30, !dbg !23 %32 = bitcast i8* %31 to <8 x float>*, !dbg !23 %33 = load <8 x float>, <8 x float>* %32, align 4, !tbaa !29, !dbg !23 %34 = load i8*, i8** %.vv0001.addr, align 8, !tbaa !30, !dbg !23 %35 = getelementptr i8, i8* %34, i64 %30, !dbg !23 %36 = bitcast i8* %35 to <8 x 
float>*, !dbg !23 %37 = load <8 x float>, <8 x float>* %36, align 4, !tbaa !29, !dbg !23 %38 = call <8 x float> @llvm.fma.v8f32 (<8 x float> %27, <8 x float> %33, <8 x float> %37), !dbg !23 store <8 x float> %38, <8 x float>* %36, align 1, !tbaa !29, !dbg !23 %39 = getelementptr i8, i8* %29, i64 32, !dbg !23 store i8* %39, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23 %40 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %41 = sub i32 %40, 8, !dbg !23 store i32 %41, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %42 = icmp sgt i32 %41, 0, !dbg !23 br i1 %42, label %L.B0012, label %L.B0017, !llvm.loop !24, !dbg !23 L.B0017: %43 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %44 = add i32 %43, 7, !dbg !23 store i32 %44, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %45 = icmp eq i32 %44, 0, !dbg !23 br i1 %45, label %L.B0013, label %L.B0018, !dbg !23 L.B0018: %46 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23 %47 = and i32 %46, -8, !dbg !23 store i32 %47, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 br label %L.B0011 L.B0011: %48 = load i32, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 %49 = sext i32 %48 to i64, !dbg !23 %50 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23 %51 = getelementptr float, float* %50, i64 %49, !dbg !23 %52 = load float, float* %51, align 4, !tbaa !29, !dbg !23 %53 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23 %54 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23 %55 = getelementptr float, float* %54, i64 %49, !dbg !23 %56 = load float, float* %55, align 4, !tbaa !29, !dbg !23 %57 = call float @llvm.fma.f32 (float %53, float %56, float %52), !dbg !23 store float %57, float* %51, align 4, !tbaa !29, !dbg !23 %58 = add i32 %48, 1, !dbg !23 store i32 %58, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 %59 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %60 = sub i32 %59, 1, !dbg !23 store i32 %60, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %61 = icmp sgt i32 %60, 0, !dbg !23 br i1 %61, label %L.B0011, label %L.B0013, !llvm.loop !24, !dbg !23 L.B0013: br label %L.B0009, !dbg !23 L.B0007: store i32 0, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 %62 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23 store i32 %62, i32* %.lcr010001.addr, align 4, !tbaa !32, !dbg !23 br label %L.B0010 L.B0010: %63 = load i32, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 %64 = sext i32 %63 to i64, !dbg !23 %65 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23 %66 = getelementptr float, float* %65, i64 %64, !dbg !23 %67 = load float, float* %66, align 4, !tbaa !29, !dbg !23 %68 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23 %69 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23 %70 = getelementptr float, float* %69, i64 %64, !dbg !23 %71 = load float, float* %70, align 4, !tbaa !29, !dbg !23 %72 = call float @llvm.fma.f32 (float %68, float %71, float %67), !dbg !23 store float %72, float* %66, align 4, !tbaa !29, !dbg !23 %73 = add i32 %63, 1, !dbg !23 store i32 %73, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 %74 = load i32, i32* %.lcr010001.addr, align 4, !tbaa !32, !dbg !23 %75 = icmp slt i32 %73, %74, !dbg !23 br i1 %75, label %L.B0010, label %L.B0009, !dbg !23 L.B0009: br label %L.B0005 L.B0005: ret void, !dbg !26 } declare float @llvm.fma.f32(float, float, float) declare <8 x float> @llvm.fma.v8f32(<8 x float>, <8 x float>, <8 x float>) declare i32 @__gxx_personality_v0(...) 
; Named metadata !llvm.module.flags = !{ !1, !2 } !llvm.dbg.cu = !{ !10 } ; Metadata !1 = !{ i32 2, !"Dwarf Version", i32 4 } !2 = !{ i32 2, !"Debug Info Version", i32 3 } !3 = !DIFile(filename: "a.c", directory: "/home/victoryang") ; !4 = !DIFile(tag: DW_TAG_file_type, pair: !3) !4 = !{ i32 41, !3 } !5 = !{ } !6 = !{ } !7 = !{ !17 } !8 = !{ } !9 = !{ } !10 = distinct !DICompileUnit(file: !3, language: DW_LANG_C_plus_plus, producer: " NVC++ 21.5-0", enums: !5, retainedTypes: !6, globals: !8, emissionKind: FullDebug, imports: !9) !11 = !DIBasicType(tag: DW_TAG_base_type, name: "int", size: 32, align: 32, encoding: DW_ATE_signed) !12 = !DIBasicType(tag: DW_TAG_base_type, name: "float", size: 32, align: 32, encoding: DW_ATE_float) !13 = !DIDerivedType(tag: DW_TAG_pointer_type, size: 64, align: 64, baseType: !12) !14 = !{ null, !11, !12, !13, !13 } !15 = !DISubroutineType(types: !14) !16 = !{ } !17 = distinct !DISubprogram(file: !3, scope: !10, name: "saxpy", line: 2, type: !15, spFlags: 8, unit: !10, scopeLine: 2) !18 = !DILocation(line: 2, column: 1, scope: !17) !19 = !DILexicalBlock(file: !3, scope: !17, line: 2, column: 1) !20 = !DILocation(line: 2, column: 1, scope: !19) !21 = !DILexicalBlock(file: !3, scope: !19, line: 2, column: 1) !22 = !DILocation(line: 2, column: 1, scope: !21) !23 = !DILocation(line: 3, column: 1, scope: !21) !24 = !{ !24, !25 } !25 = !{ !"llvm.loop.vectorize.enable", i1 0 } !26 = !DILocation(line: 5, column: 1, scope: !19) !27 = !{ !"PGI C[++] TBAA" } !28 = !{ !"omnipotent char", !27, i64 0 } !29 = !{ !28, !28, i64 0 } !30 = !{ !"<T>*", !28, i64 0 } !31 = !{ !"int", !28, i64 0 } !32 = !{ !31, !31, i64 0 } !33 = !{ !"float", !28, i64 0 } !34 = !{ !33, !33, i64 0 }

and

.text .file "a.ll" .globl saxpy # -- Begin function saxpy .p2align 4, 0x90 .type saxpy,@function saxpy: # @saxpy .Lfunc_begin0: .file 1 "/home/victoryang/a.c" .loc 1 2 0 # a.c:2:0 .cfi_sections .debug_frame .cfi_startproc # %bb.0: # %L.entry .loc 1 3 1 prologue_end # a.c:3:1 testl %edi, %edi jle .LBB0_19 # %bb.1: # %L.B0014 movq %rdx, %rax subq %rsi, %rax je .LBB0_11 # %bb.2: # %L.B0014 cmpq $31, %rax ja .LBB0_11 # %bb.3: # %L.B0010.preheader movl %edi, %eax cmpl $31, %edi jbe .LBB0_4 # %bb.5: # %vector.memcheck leaq (%rsi,%rax,4), %rcx cmpq %rdx, %rcx jbe .LBB0_7 # %bb.6: # %vector.memcheck .loc 1 0 1 is_stmt 0 # a.c:0:1 leaq (%rdx,%rax,4), %rcx .loc 1 3 1 # a.c:3:1 cmpq %rsi, %rcx jbe .LBB0_7 .LBB0_4: .loc 1 0 1 # a.c:0:1 xorl %ecx, %ecx .p2align 4, 0x90 .LBB0_10: # %L.B0010 # =>This Inner Loop Header: Depth=1 .loc 1 3 1 # a.c:3:1 vmovss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero vfmadd213ss (%rdx,%rcx,4), %xmm0, %xmm1 # xmm1 = (xmm0 * xmm1) + mem vmovss %xmm1, (%rdx,%rcx,4) incq %rcx cmpq %rcx, %rax jne .LBB0_10 jmp .LBB0_19 .LBB0_11: # %L.B0008 .loc 1 0 1 # a.c:0:1 xorl %ecx, %ecx .loc 1 3 1 # a.c:3:1 cmpl $8, %edi jge .LBB0_13 # %bb.12: .loc 1 0 1 # a.c:0:1 movl %edi, %eax jmp .LBB0_17 .LBB0_13: # %L.B0016 .loc 1 3 1 # a.c:3:1 vbroadcastss %xmm0, %ymm1 xorl %ecx, %ecx movl %edi, %eax .p2align 4, 0x90 .LBB0_14: # %L.B0012 # =>This Inner Loop Header: Depth=1 vmovups (%rsi,%rcx), %ymm2 movl %eax, %r8d vfmadd213ps (%rdx,%rcx), %ymm1, %ymm2 # ymm2 = (ymm1 * ymm2) + mem leal -8(%r8), %eax addl $-7, %r8d vmovups %ymm2, (%rdx,%rcx) addq $32, %rcx cmpl $8, %r8d jg .LBB0_14 # %bb.15: # %L.B0017 testl %eax, %eax je .LBB0_19 # %bb.16: # %L.B0018 andl $-8, %edi movl %edi, %ecx .LBB0_17: # %L.B0011.preheader incl %eax .p2align 4, 0x90 .LBB0_18: # %L.B0011 # =>This Inner Loop Header: Depth=1 movslq %ecx, %rcx decl %eax vmovss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero vfmadd213ss (%rdx,%rcx,4), %xmm0, %xmm1 # xmm1 = (xmm0 * xmm1) + mem vmovss %xmm1, (%rdx,%rcx,4) incl %ecx cmpl $1, %eax jg .LBB0_18 .Ltmp0: .LBB0_19: # %L.B0005 .loc 1 5 1 is_stmt 1 # a.c:5:1 vzeroupper retq .LBB0_7: # %vector.ph .Ltmp1: .loc 1 3 1 # a.c:3:1 vbroadcastss %xmm0, %ymm1 movl %eax, %ecx xorl %edi, %edi andl $-32, %ecx .p2align 4, 0x90 .LBB0_8: # %vector.body # =>This Inner Loop Header: Depth=1 vmovups (%rsi,%rdi,4), %ymm2 vmovups 32(%rsi,%rdi,4), %ymm3 vmovups 64(%rsi,%rdi,4), %ymm4 vmovups 96(%rsi,%rdi,4), %ymm5 vfmadd213ps (%rdx,%rdi,4), %ymm1, %ymm2 # ymm2 = (ymm1 * ymm2) + mem vfmadd213ps 32(%rdx,%rdi,4), %ymm1, %ymm3 # ymm3 = (ymm1 * ymm3) + mem vfmadd213ps 64(%rdx,%rdi,4), %ymm1, %ymm4 # ymm4 = (ymm1 * ymm4) + mem vfmadd213ps 96(%rdx,%rdi,4), %ymm1, %ymm5 # ymm5 = (ymm1 * ymm5) + mem vmovups %ymm2, (%rdx,%rdi,4) vmovups %ymm3, 32(%rdx,%rdi,4) vmovups %ymm4, 64(%rdx,%rdi,4) vmovups %ymm5, 96(%rdx,%rdi,4) addq $32, %rdi cmpq %rdi, %rcx jne .LBB0_8 # %bb.9: # %middle.block cmpq %rax, %rcx jne .LBB0_10 jmp .LBB0_19 .Ltmp2: .Lfunc_end0: .size saxpy, .Lfunc_end0-saxpy .cfi_endproc # -- End function .section .debug_abbrev,"",@progbits .byte 1 # Abbreviation Code .byte 17 # DW_TAG_compile_unit .byte 1 # DW_CHILDREN_yes .byte 37 # DW_AT_producer .byte 14 # DW_FORM_strp .byte 19 # DW_AT_language .byte 5 # DW_FORM_data2 .byte 3 # DW_AT_name .byte 14 # DW_FORM_strp .byte 16 # DW_AT_stmt_list .byte 23 # DW_FORM_sec_offset .byte 27 # DW_AT_comp_dir .byte 14 # DW_FORM_strp .ascii "\264B" # DW_AT_GNU_pubnames .byte 25 # DW_FORM_flag_present .byte 17 # DW_AT_low_pc .byte 1 # DW_FORM_addr .byte 18 # DW_AT_high_pc 
.byte 6 # DW_FORM_data4 .byte 0 # EOM(1) .byte 0 # EOM(2) .byte 2 # Abbreviation Code .byte 46 # DW_TAG_subprogram .byte 0 # DW_CHILDREN_no .byte 17 # DW_AT_low_pc .byte 1 # DW_FORM_addr .byte 18 # DW_AT_high_pc .byte 6 # DW_FORM_data4 .byte 64 # DW_AT_frame_base .byte 24 # DW_FORM_exprloc .byte 3 # DW_AT_name .byte 14 # DW_FORM_strp .byte 58 # DW_AT_decl_file .byte 11 # DW_FORM_data1 .byte 59 # DW_AT_decl_line .byte 11 # DW_FORM_data1 .byte 63 # DW_AT_external .byte 25 # DW_FORM_flag_present .byte 0 # EOM(1) .byte 0 # EOM(2) .byte 0 # EOM(3) .section .debug_info,"",@progbits .Lcu_begin0: .long .Ldebug_info_end0-.Ldebug_info_start0 # Length of Unit .Ldebug_info_start0: .short 4 # DWARF version number .long .debug_abbrev # Offset Into Abbrev. Section .byte 8 # Address Size (in bytes) .byte 1 # Abbrev [1] 0xb:0x35 DW_TAG_compile_unit .long .Linfo_string0 # DW_AT_producer .short 4 # DW_AT_language .long .Linfo_string1 # DW_AT_name .long .Lline_table_start0 # DW_AT_stmt_list .long .Linfo_string2 # DW_AT_comp_dir # DW_AT_GNU_pubnames .quad .Lfunc_begin0 # DW_AT_low_pc .long .Lfunc_end0-.Lfunc_begin0 # DW_AT_high_pc .byte 2 # Abbrev [2] 0x2a:0x15 DW_TAG_subprogram .quad .Lfunc_begin0 # DW_AT_low_pc .long .Lfunc_end0-.Lfunc_begin0 # DW_AT_high_pc .byte 1 # DW_AT_frame_base .byte 87 .long .Linfo_string3 # DW_AT_name .byte 1 # DW_AT_decl_file .byte 2 # DW_AT_decl_line # DW_AT_external .byte 0 # End Of Children Mark .Ldebug_info_end0: .section .debug_str,"MS",@progbits,1 .Linfo_string0: .asciz " NVC++ 21.5-0" # string offset=0 .Linfo_string1: .asciz "a.c" # string offset=14 .Linfo_string2: .asciz "/home/victoryang" # string offset=18 .Linfo_string3: .asciz "saxpy" # string offset=35 .section .debug_pubnames,"",@progbits .long .LpubNames_end0-.LpubNames_begin0 # Length of Public Names Info .LpubNames_begin0: .short 2 # DWARF Version .long .Lcu_begin0 # Offset of Compilation Unit Info .long 64 # Compilation Unit Length .long 42 # DIE offset .asciz "saxpy" # External Name .long 0 # End Mark .LpubNames_end0: .section .debug_pubtypes,"",@progbits .long .LpubTypes_end0-.LpubTypes_begin0 # Length of Public Types Info .LpubTypes_begin0: .short 2 # DWARF Version .long .Lcu_begin0 # Offset of Compilation Unit Info .long 64 # Compilation Unit Length .long 0 # End Mark .LpubTypes_end0: .section ".note.GNU-stack","",@progbits .section .debug_line,"",@progbits .Lline_table_start0:

GCC arm SVE

For supercomputing we should really be introducing the Arm A64FX, but the author thinks it is better to give a primer on SVE first; maybe it will show up at the next ISC.

saxpy with neon

    // x0 = &x[0], x1 = &y[0], x2 = &a, x3 = &n
    saxpy_:
        ldrsw   x3, [x3]                 // x3 = *n
        mov     x4, #0                   // x4 = i = 0
        ldr     d0, [x2]                 // d0 = *a
        b       .latch
    .loop:
        ldr     d1, [x0, x4, lsl #3]     // d1 = x[i]
        ldr     d2, [x1, x4, lsl #3]     // d2 = y[i]
        fmadd   d2, d1, d0, d2           // d2 += x[i] * a
        str     d2, [x1, x4, lsl #3]     // y[i] = d2
        add     x4, x4, #1               // i += 1
    .latch:
        cmp     x4, x3                   // i < n
        b.lt    .loop                    // more to do?
        ret

saxpy with sve

    // x0 = &x[0], x1 = &y[0], x2 = &a, x3 = &n
    saxpy_:
        ldrsw   x3, [x3]                       // x3 = *n
        mov     x4, #0                         // x4 = i = 0
        whilelt p0.d, x4, x3                   // p0 = while(i++ < n)
        ld1rd   z0.d, p0/z, [x2]               // p0:z0 = bcast(*a)
    .loop:
        ld1d    z1.d, p0/z, [x0, x4, lsl #3]   // p0:z1 = x[i]
        ld1d    z2.d, p0/z, [x1, x4, lsl #3]   // p0:z2 = y[i]
        fmla    z2.d, p0/m, z1.d, z0.d         // p0 ? z2 += x[i] * a
        st1d    z2.d, p0, [x1, x4, lsl #3]     // p0 ? y[i] = z2
        incd    x4                             // i += (VL/64)
    .latch:
        whilelt p0.d, x4, x3                   // p0 = while(i++ < n)
        b.first .loop                          // more to do?
        ret

There is no overhead in instruction count for the SVE version compared to the equivalent scalar code, which allows a compiler to opportunistically vectorize loops with an unknown trip count.

  1. 16 scalable predicate registers (P0-P15): predication of normal memory and arithmetic operations is limited to P0-P7, but instructions that generate predicates (vector compares) and instructions that consume predicates (logical operations) can use all of P0-P15. Analysis of compiled and hand-optimized code showed this allocation scheme is effective and relieves the predicate-register pressure observed on other architectures.

  2. Mixed element-size control: each predicate register allows the minimum granularity to go down to byte level, so each bit corresponds to 8 bits of data width.

  3. Predicate conditions: SVE instructions that produce predicates reuse the NZCV condition flags, with the NZCV bits given a different interpretation.

  4. Implicit order: predicates are interpreted with an implicit lowest-to-highest element order, corresponding to an equivalent sequential order. We refer to the first and last predicate elements, and their associated conditions, with respect to this order.

whilelt p0 predicates the final, not-fully-aligned iteration, which might otherwise drain throughput. An out-of-order core may issue that last partial vector at low occupancy and waste the issue slot; alternatively, it could issue a narrower, (2^n)-aligned instruction instead.

For hazarded execution and speculation, you can easily do gather loads by trapping the fault on the z3 register and reloading.

OpenMP

Compilers support OpenMP out of the box. The OpenMP standard for specifying threading in programming languages like C and Fortran is implemented in the compiler itself, and as such is an integral part of the compiler in question. The OpenMP and POSIX thread libraries underneath can vary, but this is normally hidden from the user. OpenMP makes use of POSIX threads, so both an OpenMP library and a POSIX thread library are needed. The POSIX thread library is normally supplied with the distribution (typically /usr/lib64/libpthread.so).

| Compiler | Flag to select OpenMP | OpenMP version supported |
| --- | --- | --- |
| Intel compilers | -qopenmp | From 17.0 on: 4.5 |
| GNU compilers | -fopenmp | From GCC 6.1 on: 4.5 |
| PGI compilers | -mp | 4.5 |
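To see the flags from the table side by side, a minimal sketch (omp_demo.c is a placeholder OpenMP source file):

```bash
icc  -qopenmp -O2 omp_demo.c -o omp_demo_intel
gcc  -fopenmp -O2 omp_demo.c -o omp_demo_gnu
pgcc -mp      -O2 omp_demo.c -o omp_demo_pgi
OMP_NUM_THREADS=8 ./omp_demo_gnu
```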

You definitely need to watch Fanrui's PPT and understand the implementation of OpenMP in Clang.

Ref

  1. https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/lecture-slides/MIT6_172F18_lec9.pdf
  2. 程序员的自我修养
  3. 不同编译器的编译行为比较
  4. The ARM Scalable Vector Extension
  5. https://www.stonybrook.edu/commcms/ookami/support/_docs/1%20-%20Intro%20to%20A64FX.pdf
  6. https://llvm.org/devmtg/2017-10/slides/Ferguson-Enabling%20Parallel%20Computing%20in%20Chapel.pdf

Chisel

Is it worth learning a hardware description language to implement a CPU or a router? I think it is absolutely necessary: it lets everyone look from the bottom up at how your data and instructions move.

Fortran

This language is downright criminal, yet it was the author's former boss's favorite language, and F77 at that; may Fortran perish. Still, Fortran shows up surprisingly often in supercomputing competitions. When working on the weather application earlier we did not dare touch it; now that we have a program under 150k lines, the author is attempting to modify it.

The PGI compiler is a commercial compiler, later acquired by NV, which added a lot of CUDA DSL usable from Fortran; this undoubtedly extended Fortran's life considerably. nvfortran in NVHPC can print a lot of compiler-optimization logs.

Basic syntax: a module is roughly a struct in C; program (main) / function (a normal function) correspond to defining functions.

real function square(x) implicit none real, intent(in) :: x square = x * x return end function square program main integer :: n, i, errs, argcount real, dimension(:), allocatable :: a, b, r, e n = 1000000 call square(n) end program

A subroutine is roughly like a trait: it needs a generic function to implement it.

OpenACC

A simple addition example

module mpoint type point real :: x, y, z end type type(point) :: base(1000) end module subroutine vecaddgpu( r, n ) use mpoint type(point) :: r(:) integer :: n !$acc parallel loop present(base) copyout(r(:)) do i = 1, n r(i)%x = base(i)%x r(i)%y = sqrt( base(i)%y*base(i)%y + base(i)%z*base(i)%z ) r(i)%z = 0 enddo end subroutine

Remember to use the Makefile to see the optimization info from the compiler. Also note the directives: loop, present and copyout specify what runs on the GPU and which data is copied back.

nvfortran -Minfo -Mbounds

At runtime you can see the symbol, the source file, and which GPU is being used.

NVCOMPILER_ACC_NOTIFY=1 /root/yyw/cmake-openacc/cmake-build-debug-nvhpc/acc_test

Let's compare with the CUDA kernel version. Both are built with -O0 -g.

#include <iostream> #include <cassert> #include <cuda_runtime.h> __global__ void vecaddgpu(int **a, int **b, int **c, int i) { *c[i] = *a[i] + *b[i]; } int main(void) { int n = 1000000000; int *a = static_cast<int *>(malloc(n * sizeof(int))); int *b = static_cast<int *>(malloc(n * sizeof(int))); int *c = static_cast<int *>(malloc(n * sizeof(int))); // host copies of a, b, c int *e = static_cast<int *>(malloc(n * sizeof(int))); // result int **d_a, **d_b, **d_c; // device copies of a, b, c int size = sizeof(int); int err = 0; for (int i = 0; i < n; i++) { a[i] = i; b[i] = 1000 * i; e[i] = a[i] + b[i]; } // Allocate space for device copies of a, b, c cudaMalloc((void **) &d_a, size * n); cudaMalloc((void **) &d_b, size * n); cudaMalloc((void **) &d_c, size * n); // Copy inputs to device cudaMemcpy(d_a, reinterpret_cast<const void *>(a), size * n, cudaMemcpyHostToDevice); cudaMemcpy(d_b, reinterpret_cast<const void *>(b), size * n, cudaMemcpyHostToDevice); // Launch vecaddgpu() kernel on GPU with N blocks vecaddgpu<<<1, 1024>>>(d_a, d_b, d_c, n); // Copy result back to host cudaMemcpy(c, d_c, size * n, cudaMemcpyDeviceToHost); // Cleanup for (int i = 0; i < n; i++) { if (c[i] != e[i]) err++; } free(a); free(b); free(c); cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); return 0; }

Efficiency comparison

The pure CUDA kernel is 1.5x faster.

Compiler options across different compilers

Go

This language is very easy to pick up, and because the big companies use it so heavily, the open-source tooling is very usable. The author learned it during an internship; reimplementing module-style package management on a parallel filesystem with channel + context was very quick, just a few dozen lines of state machine handling parallel queries, and we also wrote some eBPF code in Go to get better real-time file-IO metrics.

Some small tools

Rust

Rust 是好活,自从我校2016年以 Rust 语言开设 CS100(程序语言设计)开始,上科大就成为了宣传Rust的堡垒,中国 Rust 之父张汉东先生及宣发 Rust 的各界人士选择推广 Rust 的最佳地点就会选择上科大,这雨与 riscv 类似。写小的bash工具( Zero Cost Abstraction 的 cffi )、写大型(10w+ line)的系统方向程序必备。由于语言特性有很多的静态检查,会指导大家对于内存管理,异步编程有更深刻的理解。

Learning resources

  1. rCore - a teaching operating system maintained by Tsinghua
  2. Libra - a blockchain database maintained by Facebook
  3. Feishu (Lark) - the Tokio code pool

Sync and Send in asynchronous code

https://kaisery.github.io/trpl-zh-cn/ch16-04-extensible-concurrency-sync-and-send.html

Send

```rust
use std::rc::Rc;
use std::sync::Mutex;
use std::thread;

fn main() {
    let num = Rc::new(Mutex::new(0));
    let mut handlers = vec![];
    for _ in 1..10 {
        let num_copy = num.clone();
        let handle = thread::spawn(move || {
            *num_copy.lock().unwrap() += 1;
        });
        handlers.push(handle);
    }
    for handler in handlers {
        handler.join();
    }
    println!("{}", num.lock().unwrap());
}
```

This code does not compile: when num_copy is moved into the thread, multiple threads could modify the reference count concurrently. That is why Rc does not implement the Send trait in Rust: it is not allowed to transfer ownership across threads.

```
error[E0277]: `Rc<Mutex<i32>>` cannot be sent between threads safely
  --> src/main.rs:10:22
   |
10 |           let handle = thread::spawn(move || {
   |  ______________________^^^^^^^^^^^^^_-
   | |                      |
   | |                      `Rc<Mutex<i32>>` cannot be sent between threads safely
11 | |             *num_copy.lock().unwrap() += 1;
12 | |         });
   | |_________- within this `[closure@src/main.rs:10:36: 12:10]`
   |
  ::: /home/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:624:8
   |
624 |     F: Send + 'static,
   |        ---- required by this bound in `spawn`
   |
   = help: within `[closure@src/main.rs:10:36: 12:10]`, the trait `Send` is not implemented for `Rc<Mutex<i32>>`
   = note: required because it appears within the type `[closure@src/main.rs:10:36: 12:10]`
```

Sync

```rust
use std::cell::Cell;
use std::thread;

fn main() {
    let num = Cell::new(0);
    let mut handlers = vec![];
    for i in 1..10 {
        let handle = thread::spawn(|| {
            num.set(i);
        });
        handlers.push(handle);
    }
    for h in handlers {
        h.join();
    }
    println!("{}", num.get());
}
```

This code does not compile either: when &Cell<T> is shared among multiple threads, several threads could modify the inner value concurrently, which is not safe. Therefore Cell<T> is marked !Sync.

```
error[E0277]: `Cell<i32>` cannot be shared between threads safely
 --> src/main.rs:8:22
  |
8 |         let handle = thread::spawn(|| {
  |                      ^^^^^^^^^^^^^ `Cell<i32>` cannot be shared between threads safely
  |
 ::: /home/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:624:8
  |
624 |     F: Send + 'static,
  |        ---- required by this bound in `spawn`
  |
  = help: the trait `Sync` is not implemented for `Cell<i32>`
  = note: required because of the requirements on the impl of `Send` for `&Cell<i32>`
  = note: required because it appears within the type `[closure@src/main.rs:8:36: 10:10]`
```

Libs

This section collects installation and usage notes for commonly used libraries.

SVML

NUMA on Linux is a pain, but it behaves fine on EPYC.

```
Every 2.0s: numastat                  epyc.node1: Mon Aug 30 07:17:40 2021

                         node0            node1
numa_hit           11605557098      17090418391
numa_miss                    0                0
numa_foreign                 0                0
interleave_hit           83929            83526
local_node         11605248266      17089868634
other_node              308832           549757
```

Boost

website

Spack

```bash
spack info boost
spack install boost
```

Source

```bash
./bootstrap.sh --help
# Select your configuration options and invoke ./bootstrap.sh again without
# the --help option. Unless you have write permission in your system's
# /usr/local/ directory, you'll probably want to at least use
./bootstrap.sh --prefix=path/to/installation/prefix
# to install somewhere else. Also, consider using the --show-libraries and
# --with-libraries=library-name-list options to limit the long wait you'll
# experience if you build everything. Finally,
./b2 install
# will leave Boost binaries in the lib/ subdirectory of your installation
# prefix. You will also find a copy of the Boost headers in the include/
# subdirectory of the installation prefix, so you can henceforth use that
# directory as an #include path in place of the Boost root directory.
# and add to PATH and LD and INCLUDE
```

Version-related issues

This is Version 3 of the Filesystem library. Version 2 is no longer supported; 1.49.0 was the last release of Boost to supply Version 2.
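If you are not sure which Boost release your toolchain actually picks up (useful when chasing version issues such as the Filesystem v2/v3 split above), a quick check like the following minimal sketch can help; the build line is only an example and assumes the Boost headers are already on the include path:

```cpp
// boost_version.cpp -- print the Boost release the compiler actually sees.
// Example build: g++ boost_version.cpp -o boost_version   (add -I<prefix>/include if needed)
#include <boost/version.hpp>
#include <iostream>

int main() {
    // BOOST_VERSION is encoded as XXYYZZ: XX = major, YY = minor, ZZ = patch.
    std::cout << "Boost " << BOOST_VERSION / 100000 << "."
              << BOOST_VERSION / 100 % 1000 << "."
              << BOOST_VERSION % 100
              << " (BOOST_LIB_VERSION " << BOOST_LIB_VERSION << ")" << std::endl;
    return 0;
}
```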

ArmForge

Arm Forge is Arm's tool suite for high-performance programs. Its biggest strength is that it works equally well for CPUs and GPUs, and it bundles Arm DDT and Arm MAP. Arm DDT is an industry-leading parallel debugger supporting MPI, CUDA and OpenMP; Arm MAP is a low-overhead, line-level profiler for MPI, OpenMP and vectorized programs.

uProf

A perf-style tool from AMD. It adds some metrics specific to x86 extensions, but the UI is rather ugly.

x86 subset

You can refer to the CONFIG_X86_AMD_PSTATE tuning notes in ZenStates-Linux. The relevant perf events can be found in the Linux perf tools. Also check /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver to see whether the amd-pstate driver is active. The feature has been available since Linux 5.17 (Ubuntu 22.04).
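As a minimal sketch (written in C++ rather than a shell one-liner, and assuming the usual sysfs layout), the check can be scripted like this:

```cpp
// scaling_driver.cpp -- print the active cpufreq scaling driver for every CPU.
// `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver` from a shell does the same.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    namespace fs = std::filesystem;
    for (const auto &entry : fs::directory_iterator("/sys/devices/system/cpu")) {
        const fs::path driver_file = entry.path() / "cpufreq" / "scaling_driver";
        if (!fs::exists(driver_file))
            continue;                      // skip entries that are not cpuN
        std::ifstream in(driver_file);
        std::string driver;
        std::getline(in, driver);          // e.g. "amd-pstate" or "acpi-cpufreq"
        std::cout << entry.path().filename().string() << ": " << driver << "\n";
    }
    return 0;
}
```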

Reference

  1. https://faculty.sites.uci.edu/zhouli/files/2022/01/oakland22.pdf
  2. https://indico.cern.ch/event/730908/contributions/3153163/attachments/1730954/2810149/epyc.pdf
  3. https://www.nextplatform.com/2019/08/15/a-deep-dive-into-amds-rome-epyc-architecture/
  4. https://github.com/FPSG-UIUC/lotr

Vtune

They say computer-architecture people understand profiling best. A good profiling tool must provide a good abstraction of the CPU's real-time performance; the simplest primitive is rdtsc, and Arm has a similar counter.
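As a minimal sketch of that primitive (x86 only, using the __rdtsc intrinsic; the loop and file name are just for illustration, and raw TSC counts are only meaningful relative to each other):

```cpp
// rdtsc_demo.cpp -- time a loop in time-stamp-counter cycles.
#include <x86intrin.h>
#include <cstdint>
#include <iostream>

int main() {
    volatile double sum = 0.0;             // volatile so the loop is not optimized away
    uint64_t start = __rdtsc();            // read the TSC before the work
    for (int i = 0; i < 1000000; ++i)
        sum += i * 0.5;
    uint64_t end = __rdtsc();              // and after
    std::cout << "cycles: " << (end - start) << ", sum = " << sum << std::endl;
    return 0;
}
```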

ABI support

A profiler needs implementations of the various metrics of Intel processors. While installing VTune, a shared library for the current system's PMU is compiled, i.e. ABI support for the Intel PMU; for the EPYC machines our team uses there is a modified version of the PMU tools that works on EPYC. The architecture-security community currently studies the PMU very intensively, because it leaks part of the CPU's real-time state, from which useful information can be extracted.

The perf events x86 has to support are fairly limited; on Linux, officially supported sampling mechanisms such as kprobes and uprobes are very useful, and they have already been adopted by eBPF.

When optimizing for Broadwell and newer architectures, the Intel compiler mainly does three things that have a large impact on performance:

  1. Aggressive cross-basic-block optimization + vectorization + loop unrolling.
  2. Aggressive reordering of loads and stores within the limits of TSO, together with aggressive data coalescing, store-buffer bypass and non-temporal (movnt) stores (see the sketch after this list). This is also the main source of bugs in the icc backend, and one reason the big companies do not use it much; outside HPC, people generally follow gcc's behavior.
  3. Its own TBB thread pool (very fast), its own malloc_align, and its own related libraries.
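To make the non-temporal (movnt) store point concrete, here is a minimal, hedged sketch of a streaming-store loop written by hand with AVX intrinsics; it only illustrates the instruction class the compiler may emit for large streaming writes, not Intel's actual code generation:

```cpp
// nt_store.cpp -- fill a large buffer with non-temporal (movnt) stores.
// Assumes a 32-byte-aligned buffer whose length is a multiple of 8 floats;
// a real implementation needs a scalar prologue/epilogue. Build with -mavx.
#include <immintrin.h>
#include <cstdlib>

void fill_streaming(float *dst, std::size_t n, float value) {
    const __m256 v = _mm256_set1_ps(value);
    for (std::size_t i = 0; i < n; i += 8)
        _mm256_stream_ps(dst + i, v);   // vmovntps: bypasses the cache hierarchy
    _mm_sfence();                       // make the streaming stores globally visible
}

int main() {
    const std::size_t n = 1 << 24;      // 16M floats (64 MiB)
    float *buf = static_cast<float *>(std::aligned_alloc(32, n * sizeof(float)));
    fill_streaming(buf, n, 1.0f);
    std::free(buf);
    return 0;
}
```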

For how to better target Intel CPUs, see the LAMMPS Intel package, which applies a visitor pattern to the Intel processor's register resources.

As for user-space file systems, VTune provides real-time measurement of persistent-memory (PM) bandwidth, a metric that otherwise seems very hard to obtain.

How to use it on the cluster

```bash
spack load intel-parallel-studio   # choose the right version
amplxe-cl
```

A quick Spack tutorial

```
❯ spack info gcc
AutotoolsPackage:   gcc

Description:
    The GNU Compiler Collection includes front ends for C, C++, Objective-C,
    Fortran, Ada, and Go, as well as libraries for these languages.

Homepage: https://gcc.gnu.org

Maintainers: @michaelkuhn @alalazo

Externally Detectable:
    True (version, variants)

Tags:
    None

Preferred version:
    11.2.0    https://ftpmirror.gnu.org/gcc/gcc-11.2.0/gcc-11.2.0.tar.xz

Safe versions:
    master    [git] git://gcc.gnu.org/git/gcc.git on branch master
    11.2.0    https://ftpmirror.gnu.org/gcc/gcc-11.2.0/gcc-11.2.0.tar.xz
    11.1.0    https://ftpmirror.gnu.org/gcc/gcc-11.1.0/gcc-11.1.0.tar.xz
    10.3.0    https://ftpmirror.gnu.org/gcc/gcc-10.3.0/gcc-10.3.0.tar.xz
    10.2.0    https://ftpmirror.gnu.org/gcc/gcc-10.2.0/gcc-10.2.0.tar.xz
    10.1.0    https://ftpmirror.gnu.org/gcc/gcc-10.1.0/gcc-10.1.0.tar.xz
    9.4.0     https://ftpmirror.gnu.org/gcc/gcc-9.4.0/gcc-9.4.0.tar.xz
    9.3.0     https://ftpmirror.gnu.org/gcc/gcc-9.3.0/gcc-9.3.0.tar.xz
    9.2.0     https://ftpmirror.gnu.org/gcc/gcc-9.2.0/gcc-9.2.0.tar.xz
    9.1.0     https://ftpmirror.gnu.org/gcc/gcc-9.1.0/gcc-9.1.0.tar.xz
    8.5.0     https://ftpmirror.gnu.org/gcc/gcc-8.5.0/gcc-8.5.0.tar.xz
    8.4.0     https://ftpmirror.gnu.org/gcc/gcc-8.4.0/gcc-8.4.0.tar.xz
    8.3.0     https://ftpmirror.gnu.org/gcc/gcc-8.3.0/gcc-8.3.0.tar.xz
    8.2.0     https://ftpmirror.gnu.org/gcc/gcc-8.2.0/gcc-8.2.0.tar.xz
    8.1.0     https://ftpmirror.gnu.org/gcc/gcc-8.1.0/gcc-8.1.0.tar.xz
    7.5.0     https://ftpmirror.gnu.org/gcc/gcc-7.5.0/gcc-7.5.0.tar.xz
    7.4.0     https://ftpmirror.gnu.org/gcc/gcc-7.4.0/gcc-7.4.0.tar.xz
    7.3.0     https://ftpmirror.gnu.org/gcc/gcc-7.3.0/gcc-7.3.0.tar.xz
    7.2.0     https://ftpmirror.gnu.org/gcc/gcc-7.2.0/gcc-7.2.0.tar.xz
    7.1.0     https://ftpmirror.gnu.org/gcc/gcc-7.1.0/gcc-7.1.0.tar.bz2
    6.5.0     https://ftpmirror.gnu.org/gcc/gcc-6.5.0/gcc-6.5.0.tar.bz2
    6.4.0     https://ftpmirror.gnu.org/gcc/gcc-6.4.0/gcc-6.4.0.tar.bz2
    6.3.0     https://ftpmirror.gnu.org/gcc/gcc-6.3.0/gcc-6.3.0.tar.bz2
    6.2.0     https://ftpmirror.gnu.org/gcc/gcc-6.2.0/gcc-6.2.0.tar.bz2
    6.1.0     https://ftpmirror.gnu.org/gcc/gcc-6.1.0/gcc-6.1.0.tar.bz2
    5.5.0     https://ftpmirror.gnu.org/gcc/gcc-5.5.0/gcc-5.5.0.tar.bz2
    5.4.0     https://ftpmirror.gnu.org/gcc/gcc-5.4.0/gcc-5.4.0.tar.bz2
    5.3.0     https://ftpmirror.gnu.org/gcc/gcc-5.3.0/gcc-5.3.0.tar.bz2
    5.2.0     https://ftpmirror.gnu.org/gcc/gcc-5.2.0/gcc-5.2.0.tar.bz2
    5.1.0     https://ftpmirror.gnu.org/gcc/gcc-5.1.0/gcc-5.1.0.tar.bz2
    4.9.4     https://ftpmirror.gnu.org/gcc/gcc-4.9.4/gcc-4.9.4.tar.bz2
    4.9.3     https://ftpmirror.gnu.org/gcc/gcc-4.9.3/gcc-4.9.3.tar.bz2
    4.9.2     https://ftpmirror.gnu.org/gcc/gcc-4.9.2/gcc-4.9.2.tar.bz2
    4.9.1     https://ftpmirror.gnu.org/gcc/gcc-4.9.1/gcc-4.9.1.tar.bz2
    4.8.5     https://ftpmirror.gnu.org/gcc/gcc-4.8.5/gcc-4.8.5.tar.bz2
    4.8.4     https://ftpmirror.gnu.org/gcc/gcc-4.8.4/gcc-4.8.4.tar.bz2
    4.7.4     https://ftpmirror.gnu.org/gcc/gcc-4.7.4/gcc-4.7.4.tar.bz2
    4.6.4     https://ftpmirror.gnu.org/gcc/gcc-4.6.4/gcc-4.6.4.tar.bz2
    4.5.4     https://ftpmirror.gnu.org/gcc/gcc-4.5.4/gcc-4.5.4.tar.bz2

Variants:
    Name [Default]             Allowed values        Description
    =========================  ====================  ===================================================
    binutils [off]             on, off               Build via binutils
    bootstrap [on]             on, off               Enable 3-stage bootstrap
    graphite [off]             on, off               Enable Graphite loop optimizations (requires ISL)
    languages [c,c++,fortran]  ada, brig, c, c++,    Compilers and runtime libraries to build
                               fortran, go, java,
                               jit, lto, objc,
                               obj-c++
    nvptx [off]                on, off               Target nvptx offloading to NVIDIA GPUs
    piclibs [off]              on, off               Build PIC versions of libgfortran.a and libstdc++.a
    strip [off]                on, off               Strip executables to reduce installation size

Installation Phases:
    autoreconf    configure    build    install

Build Dependencies:
    binutils  cuda  diffutils  flex  gmp  gnat  iconv  isl  mpc  mpfr  zip  zlib  zstd

Link Dependencies:
    binutils  cuda  gmp  gnat  iconv  isl  mpc  mpfr  zlib  zstd

Run Dependencies:
    binutils

Virtual Packages:
    gcc@7: languages=go provides golang@:1.8
    gcc@6: languages=go provides golang@:1.6.1
    gcc@5: languages=go provides golang@:1.4
    gcc@4.9: languages=go provides golang@:1.2
    gcc@4.8.2: languages=go provides golang@:1.1.2
    gcc@4.8: languages=go provides golang@:1.1
    gcc@4.7.1: languages=go provides golang@:1
    gcc@4.6: languages=go provides golang
```

To see the dependencies involved, you can check what Spack would install right now:

```
❯ spack spec gcc
Input spec
--------------------------------
gcc

Concretized
--------------------------------
gcc@11.2.0%apple-clang@12.0.5~binutils+bootstrap~graphite~nvptx~piclibs~strip languages=c,c++,fortran patches=ecc5ac43951b34cbc5db15f585b4e704c42e2e487f9ed4c24fadef3f3857930b arch=darwin-bigsur-skylake
    ^diffutils@2.8.1%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^gmp@6.2.1%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^autoconf@2.71%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^automake@1.16.4%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^libtool@2.4.6%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^m4@1.4.6%apple-clang@12.0.5+sigsegv patches=c0a408fbffb7255fcc75e26bd8edab116fc81d216bfd18b473668b7739a4158e arch=darwin-bigsur-skylake
    ^libiconv@1.16%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^mpc@1.1.0%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^mpfr@4.1.0%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^autoconf-archive@2019.01.06%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^texinfo@4.8%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^zlib@1.2.11%apple-clang@12.0.5+optimize+pic+shared arch=darwin-bigsur-skylake
    ^zstd@1.5.0%apple-clang@12.0.5~ipo~legacy~lz4~lzma~multithread+programs+shared+static~zlib build_type=RelWithDebInfo arch=darwin-bigsur-skylake
    ^cmake@3.21.1%apple-clang@12.0.5~doc+ncurses+openssl+ownlibs~qt build_type=Release arch=darwin-bigsur-skylake
```

If you want to use specific dependencies, or depend on packages already installed on the system, they end up in ~/.spack/packages.yaml; see the AMD page for how to use this.

```
❯ spack external find
❯ cat ~/.spack/packages.yaml
packages:
  autoconf:
    externals:
    - spec: autoconf@2.71
      prefix: /usr/local
  automake:
    externals:
    - spec: automake@1.16.4
      prefix: /usr/local
  bash:
    externals:
    - spec: bash@3.2.57
      prefix: /
  bazel:
    externals:
    - spec: bazel@4.1.0
      prefix: /usr/local
  bison:
    externals:
    - spec: bison@2.3
      prefix: /usr
  bzip2:
    externals:
    - spec: bzip2@1.0.6
      prefix: /usr
  cmake:
    externals:
    - spec: cmake@3.21.1
      prefix: /usr/local
  diffutils:
    externals:
    - spec: diffutils@2.8.1
      prefix: /usr
  ...
```

The install options you will need most often: -j N sets the number of jobs, --no-checksum skips checksum verification, and --no-restage continues the build after you have modified the staged sources, which usually live under /tmp/root/spack-stage/spack-stage-amdscalapack-3.0-qwvyrumhsizxiaujwdsppcovijr5k5ri/spack-src/. Some packages accept cflags, cxxflags and fcflags; some accept cuda_arch. When you meet a new piece of software, append whatever parameters it needs.

```
❯ spack install -j 8 --no-checksum llvm+mlir+flang+all_targets+python+shared_libs cflags="-O3" cxxflags="-O3"
[+] /usr/local (external cmake-3.21.1-cdhzbrts4k5ylrvlpspfl75zgeht4swi)
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/libiconv-1.16-ropgshv657ooz7kfzojv4s6srscgimnw
[+] /usr/local (external pkg-config-0.29.2-4nv7fo7lbjybt2u3xzb2vxzvgvaz5xmw)
[+] /usr/local (external xz-5.2.5-p37wr6fna4ysoh2xn2wnmmzttm3bi37o)
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/zlib-1.2.11-lci2s4zd6x77rmexa3uuarbl5cvneskw
[+] /usr (external perl-5.30.2-4zkfgqml35km4ly7xmxn7ooz44dxtgqp)
[+] /usr/local (external python-3.9.6-shbb7dthsqe4lu26jugddyi2k7pl3jbl)
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/pcre-8.44-g4df4jqpudoxhjsrubrqhv3uwxajofet
[+] /usr/local (external z3-4.8.12-hvhfxnxuachtpi524zf55znqn55vanod)
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/ncurses-6.2-xilcz3bhw4otebvysduddyldezxhxvy6
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/libxml2-2.9.10-mlrnjcbnjt3w7635xrietes7terwhko6
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/perl-data-dumper-2.173-cv4kwshixb7tmk6p7icxrqpicppkx5gr
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/py-setuptools-50.3.2-hwyhyijgi3yjokddm67tb6aulefteudx
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/swig-4.0.2-vajpijk4isacr52dzgk2gqbvyunadwkc
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/libedit-3.1-20210216-6h4xokftdnxe2h3o7tie2cnbzbhfrr4h
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/hwloc-2.5.0-z2brjfcvnend5gorjmeqqgirccqerdwd
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/py-six-1.15.0-c63zkkdjpvegqai2f4jjg4mutsuchoov
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/llvm-12.0.1-n6c5z7sqfo7olnaqswu7jqhcdkyyk6nh
```

Take hdf5 built with the nvhpc compiler as an example. The problem I ran into was that, for the given MPI, the nvc and nvfortran behind the wrappers could not be found. A manual build looks for cc on PATH, or directly for CC, FC and CXX, so at that point you need to define an FC yourself.

```python
# Sketch (pseudo-code) of the fix inside the package's package.py:
if spec.compiler.name == "nvhpc":      # i.e. the Fortran compiler is nvfortran
    env.set("FC", "/path/to/mpifort")  # point FC at the MPI Fortran wrapper
    args.append("-DCMAKE_Fortran_COMPILER=/path/to/mpifort")
```

The Spack installation on the supercomputer has two upstreams; if you think a change is important you can open a PR against the original repo directly, and we usually keep a backup on the school's internal GitLab.

Spack and Modules

When the system has Environment Modules installed, Spack automatically adds its module file directory to MODULEPATH, so module load works right away.

Notes on Spack errors

```
$ spack load boost@1.70
==> Error: No compilers for operating system debian10 satisfy spec gcc@10.2.0
```

When this error appears, check whether .spack/linux/compilers.yaml contains all the compilers Spack should know about (spack compiler find can re-detect them).

Package

The original environment usually carries all sorts of environment variables that can interfere when packaging for Spack. This SOP therefore proposes using docker for environment isolation, so that Spack packaging can be run in different environments without affecting the original one.

How to run

```bash
$ cd <Dockerfile-Folder>
$ docker build -t <image-name> .
$ docker run -it -d -v <spack-folder-your-machine>:<spack-folder-your-machine> --name <docker-name> <image-name>
$ docker exec -it <docker-name> bash
```

Dockerfile ( Long-Term Support )

```dockerfile
FROM ubuntu:20.04
RUN apt-get update
RUN apt-get install -y ca-certificates
RUN sed -i "s/archive.ubuntu.com/mirrors.shanghaitech.edu.cn/g" /etc/apt/sources.list
RUN apt-get update
RUN apt-get install -y python python3 gcc build-essential wget nano vim gfortran curl less libnl-nf-3-200
```

Debug a package

```bash
spack cd <spec>
spack build-env <spec> <shell/command>
```

Architecture

This section holds material about computer architecture.

Memory Model

Memory Coherence

Memory coherence: a memory system is coherent if any read of a data item returns the most recently written value of that data item.

Coherent memory system:

  1. For a memory address written by a processor P, subsequent reads by P should return the written value.
  2. For a memory address written by a processor P1, after enough time another processor P2 can read the value written by P1.
  3. Writes to a single memory address are serialized, so if there are two writes to that address by any processors, no processor can observe the two results in different orders.

The coherence model does not define when a value written by P1 becomes visible to P2; that is the job of the memory consistency model.

Memory Consistency

Memory consistency: A memory consistency model for a shared address space specifies constraints on the order in which memory operations must appear to be performed (i.e. to become visible to the processors) with respect to one another.(when a written value will be returned/seen by a read).

The memory consistency model defines the ordering of pairs of operations to different addresses.

Sequential consistency model

  1. In each processor, the read operation should always get the value of the last write operation in program order.
```
# Processor 1
Flag1 = 1
if (Flag2 == 0)
    do sth

# Processor 2
Flag2 = 1
if (Flag1 == 0)
    do sth
```

For P1, SC guarantees that if the value of Flag2 is 0, the write to Flag1 happens before P2's write and read. So at most one processor is in the do sth section (it is also possible that neither processor enters the critical section).

  2. There is only one order visible to all processors. For two write operations W1 and W2 (possibly performed by different processors), every processor must observe the same sequence.
```
# Processor 1
A = 1

# Processor 2
if (A == 1)
    B = 1

# Processor 3
if (B == 1)
    get(A)
```

If P3 sees B == 1, then the value of A it reads must be 1, because every processor observes the same write order A = 1 -> B = 1 (P2 only writes B after it has seen A = 1).

Sequential consistency can still produce non-deterministic results, because the interleaving of operations across processors can differ between runs of the program. All memory operations must happen in program order.

Relaxed memory consistency models

Suppose A->B means for one processor, the operation A is done before operation B.

If the W->R ordering can be violated (a later read may complete before an earlier write to a different address), the model is Total Store Ordering (TSO); it is used by the x86-64 architecture.

If the W->W ordering can also be violated, the model is Partial Store Ordering (PSO).

...

More memory models are described at

https://en.wikipedia.org/wiki/Consistency_model

Synchronizes-with and happens-before

The synchronizes-with relationship exists only between suitably tagged operations (the default, memory_order_seq_cst, is a suitable tag) on atomic types (data structures such as a mutex contain these atomic types). If A writes x and B reads the value that A wrote, there is a synchronizes-with relationship between A and B.

The happens-before relationship specifies which operations see the effects of which other operations. For a single thread, the happens-before relationship can be easily determined by the program order. For multi-threading, if operation A on one thread inter-thread happens-before operation B on another thread, then A happens-before B. The inter-thread happens-before relies on the synchronizes-with relationship. If operation A in one thread synchronizes-with operation B in another thread, then A inter-thread happens-before B. This relationship is transitive.

These rules mean that if you make changes in one thread, you need only one synchronizes-with relationship for the data to be visible to subsequent operations on other threads.

C++ Memory Order

C++ has 6 memory ordering options on atomic types.

memory_order_relaxed, memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel, memory_order_seq_cst.

They represent three memory models:

  • Sequential consistency: memory_order_seq_cst
  • Relaxed: memory_order_relaxed
  • Acquire-release: memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel

For the x86-64 architecture, acquire-release ordering does not require additional instructions, and sequentially consistent ordering only adds a small cost to store operations. All of these orderings also constrain the compiler's instruction reordering, however, so every option except memory_order_relaxed has a potential cost.

In non-sequentially consistent memory orderings, threads don’t have to agree on the order of events on atomic variables. In the absence of other ordering constraints, the only requirement is that all threads agree on the modification order of each individual variable.

std::memory_order_seq_cst

If all operations on instances of atomic types are sequentially consistent, the behavior of a multithreaded program is as if all these operations were performed in some particular sequence by a single thread. This is by far the easiest memory ordering to understand, which is why it’s the default: all threads must see the same order of operations. ... It also means that operations can’t be reordered; if your code has one operation before another in one thread, that ordering must be seen by all other threads.

A sequentially consistent store synchronizes-with a sequentially consistent load of the same variable that reads the value stored.

-- C++ concurrency in action 2nd edition, P124

std::memory_order_relaxed

Operations on atomic types performed with relaxed ordering don’t participate in synchronizes-with relationships. Operations on the same variable within a single thread still obey happens-before relationships, but there’s almost no requirement on ordering relative to other threads.

-- C++ concurrency in action 2nd edition, P127

```cpp
// Processor 1
x.store(true, std::memory_order_relaxed);
y.store(true, std::memory_order_relaxed);

// Processor 2
while (!y.load(std::memory_order_relaxed))
    ;
if (x.load(std::memory_order_relaxed))
    ++z;
```

Here z can be 0, since there is no guarantee about the order in which the stores to x and y become visible to the other thread.

Relaxed ordering still gives well-defined behavior under multi-threading, in contrast to the volatile keyword or a plain variable. The semantics are atomic, and the fetch_add method is atomic too, which means you can use it as a counter. On x86, fetch_add with std::memory_order_relaxed is implemented as lock xadd, the same as with std::memory_order_seq_cst (but the former can be reordered by the compiler, and other architectures may implement it differently).
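A minimal sketch of such a relaxed counter (the lock xadd remark above is x86-specific; the thread and iteration counts here are arbitrary):

```cpp
// relaxed_counter.cpp -- a statistics counter using relaxed atomics.
// Only the final total matters, not its ordering relative to other data,
// so memory_order_relaxed is enough.
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<long> hits{0};

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([] {
            for (int i = 0; i < 1000000; ++i)
                hits.fetch_add(1, std::memory_order_relaxed);  // atomic, no ordering constraint
        });
    for (auto &w : workers)
        w.join();
    std::cout << hits.load(std::memory_order_relaxed) << std::endl;  // always 4000000
    return 0;
}
```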

A problem you may encounter when using relaxed ordering: OOTA (out-of-thin-air values).

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1217r2.html

std::memory_order_acquire && std::memory_order_release

std::memory_order_release is only used in store(), and std::memory_order_acquire only in load(). Such a store/load pair can form a synchronizes-with relationship.

  • Any writes or reads before store() should not be moved after it.
  • Any writes or reads after load() should not be moved before it.

Typical usage:

```cpp
// Processor 1
data = 100;                                     // A
ready.store(true, std::memory_order_release);   // B

// Processor 2
while (!ready.load(std::memory_order_acquire))  // C
    ;
assert(data == 100);  // never fails            // D
```

TODO: std::memory_order_consume

Double-checking locking

```cpp
if (!x_init.load(memory_order_acquire)) {
    lock_guard<mutex> _(x_init_mutex);
    if (!x_init.load(memory_order_relaxed)) {   // <- already holding the lock!
        // initialize x;
        x_init.store(true, memory_order_release);
    }
}
```

Initial load for compare-exchange

```cpp
unsigned long expected = x.load(memory_order_relaxed);  // <- the result does not affect correctness,
                                                        //    since the CAS will check again
while (!x.compare_exchange_weak(expected, f(expected))) {}
```


SVE

  • Scalable vector length increasing parallelism while allowing implementation choice.
  • Rich addressing modes enabling non-linear data accesses.
  • Per-lane predication allowing vectorization of loops containing complex control flow.
  • Predicate-driven loop control and management reduce vectorization overhead relative to scalar code, and a rich set of horizontal operations applies to more types of reducible loop-carried dependencies.
  • Vector partitioning and software-managed speculation enabling vectorization of loops with data-dependent exits.
  • Scalarized intra-vector sub-loops permitting vectorization of loops with more complex loop-carried dependencies.

Predicates are used to mask the lanes of the scalable registers, as sketched below.

This state provides thirty-two new scalable vector registers (Z0–Z31). Their width is implementation dependent within the aforementioned range. The new registers extend the thirty-two 128-bit wide Advanced SIMD registers (V0–V31) to provide scalable containers for 64-, 32-, 16-, and 8-bit data elements.
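A minimal sketch of what per-lane predication and predicate-driven loop control look like with the ACLE SVE intrinsics (the function and array names are just for illustration; this assumes an SVE-enabled toolchain, e.g. compiling with -march=armv8-a+sve):

```cpp
// sve_vadd.cpp -- vector add with per-lane predication: no scalar tail loop is
// needed because svwhilelt masks off the lanes past n in the last iteration.
#include <arm_sve.h>
#include <cstdint>

void vadd(float *c, const float *a, const float *b, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {          // svcntw(): number of 32-bit lanes
        svbool_t pg = svwhilelt_b32_s64(i, n);           // predicate: lanes with i + k < n
        svfloat32_t va = svld1_f32(pg, a + i);           // predicated loads
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, c + i, svadd_f32_x(pg, va, vb));   // predicated add and store
    }
}
```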

Fujitsu A64FX implementation

References

  1. https://github.com/fujitsu/A64FX/tree/master/doc
  2. https://www.youtube.com/watch?v=Qma7UuYifhM
  3. https://www.youtube.com/watch?v=3TYVqodc8w4
  4. https://www.youtube.com/watch?v=H3COrJQxBkQ