
Introduction

First of all, welcome to GeekPie_HPC, a technology-neutral competition team that aspires to stand shoulder to shoulder with Tsinghua. While we do care about winning, what we emphasize even more is building everyone's practical skills (proficiency with all kinds of tools) and communication skills (pitching ideas, picking up responsibilities, and bringing new members up to speed). If you need to spend the great majority of your time grinding GPA, this club is not for you: almost all competitions land right around midterm and final exams, so what we want is a YOLO spirit.

On being a good technical communicator: whether in academia or engineering, value comes from brainstorming with others. Your work only matters if it creates value for someone else; being a technically strong loner achieves very little. We hope you treasure the opportunity to work with excellent people, watch how others do things at the weekly Slack meeting, and contribute whatever you can.

This is GeekPie_HPC's third wiki, hosted on GitHub Pages. Part of the content lives in GeekPie's wiki.js instance, and a small part is on the Confluence on the on-campus GeekPie server; to guard against the day an admin deletes everything and runs away, we keep this copy on GitHub.

This wiki is published as both static and dynamic pages:

  • Static: generated with GitHub Actions + mdBook. https://hpc.geekpie.club/wiki/
  • Dynamic: served by wiki.js, supports live editing, requires the campus network. https://wiki.geekpie.club/

Adding files

Commit Markdown files directly to the main branch; the wiki is updated about half a minute later.

If you create a new file, also update the SUMMARY.md file in the repository root.

File names follow the little-endian capitalization convention.

Requesting access

  1. Ask murez for edit access to the Git repository; you can find him at the GeekPie maker space or on Slack.

  2. On the campus network, you can register for wiki.js directly with your ShanghaiTech email. If you have trouble, ask for help in the Slack #general channel.

Joining and leaving

Recruitment announcements: Slack can be registered with a ShanghaiTech email; information about summer internships and research can be found in the Slack workspace. We also occasionally invite students and collaborators from other schools to give talks on campus.

The generated GitHub Pages site is powered by mdBook.

Algorithm

This section collects algorithms that commonly appear in the applications and benchmarks.

DGemm

A core computational problem that appears widely in convolution, HPL, and HPCG.

https://zhuanlan.zhihu.com/p/464740681

SPMV

Numerical linear algebra is a basic building block of scientific computing. Solving linear systems, linear least-squares problems, eigenvalue problems, and singular value problems is the most computationally intensive part of many scientific codes. As numerical programming developed, it became very effective to solve such problems with well-engineered subroutine libraries. When we write code that involves linear algebra, we usually decompose the computation into basic kernels such as dot products or matrix-vector products. Structured programming grew out of this: basic building blocks were specified and identified by unique mnemonic names, and the names and argument lists of these basic operations were standardized so that higher-level algebra codes could use them efficiently.

From 1973 to 1977, the first "level" of the Basic Linear Algebra Subprograms (BLAS) identified a set of kernel operations, mainly Fortran specifications and implementations of scalar and vector routines [1]. With the advent of vector processors, hierarchical memory, and shared-memory parallel machines, the second-level BLAS for matrix-vector operations and the third-level BLAS for matrix-matrix operations were specified between 1984 and 1988 [2,3]. The three "levels" of BLAS mark not only stages of its development but also the computational complexity of the operations [4]. To develop BLAS further, a BLAS Technical Forum meeting was started at a University of Tennessee symposium in 1995 to discuss the overall functionality of BLAS: sparse BLAS, dense BLAS for distributed memory, extended-precision and mixed-precision BLAS, interval BLAS, and extensions to the existing BLAS.

With the continued development of the BLAS libraries, they have been ported to many hardware platforms and serve numerical programs in many industries. Among the BLAS operations, General Matrix-Matrix Multiplication (GEMM) is the basic operation of scientific computing (high-performance computing, machine learning) and of engineering and data applications, and for every new computing platform people keep looking for optimization methods that make it run faster.

Problem description

For decades, General Matrix-Matrix Multiplication (GEMM) has been a standard benchmark for computing performance. GEMM is the most commonly used computational kernel in high-performance computing. Whether in the HPC field (FFT, convolution, correlation, filtering, and so on) or in deep learning (convolutional layers, fully connected layers, and so on), the core algorithms can be converted directly or indirectly into matrix multiplication. The GEMM computation is defined as follows:

$$C \leftarrow \alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C$$

where $\mathrm{op}(X)$ denotes the matrix $X$ itself, its transpose $X^T$, or its conjugate transpose $X^H$; $\alpha$ and $\beta$ are scalars; matrix $A$ has $m$ rows and $k$ columns, matrix $B$ has $k$ rows and $n$ columns, and matrix $C$ has $m$ rows and $n$ columns.
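As a concrete reading of the formula, here is a minimal, unoptimized reference sketch (our own illustration, not the benchmark code), assuming column-major storage and $\mathrm{op}(X) = X$:

```cpp
// Naive reference GEMM: C <- alpha * A * B + beta * C
// Assumes column-major storage (BLAS convention) and op(X) = X (no transpose).
// A is m x k with leading dimension lda, B is k x n (ldb), C is m x n (ldc).
void sgemm_reference(int m, int n, int k, float alpha,
                     const float* A, int lda,
                     const float* B, int ldb,
                     float beta, float* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```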

There are two possible families of type and precision combinations in mixed precision:

  1. All scalar parameters and output parameters (scalar or array) are double precision, and at least one array is single precision. The combinations are as follows (S = single real, D = double real, C = single complex, Z = double complex):

| α | A | B | β | C |
|---|---|---|---|---|
| D | S | S | D | D |
| D | S | D | D | D |
| D | D | S | D | D |
| Z | C | C | Z | Z |
| Z | C | Z | Z | Z |
| Z | Z | C | Z | Z |
  2. The precision of all floating-point parameters must be all single or all double. All scalar parameters and output parameters (scalars or arrays) are complex (unless mathematics requires all scalar parameters to be real, such as α and β in HERK). The combinations are as follows:

| α | A | B | β | C |
|---|---|---|---|---|
| C | S | S | C | C |
| C | S | C | C | C |
| C | C | S | C | C |
| Z | D | D | Z | Z |
| Z | D | Z | Z | Z |
| Z | Z | D | Z | Z |

BLAS implementations are usually optimized for computation speed on a specific machine, so using them can bring significant performance advantages. This competition focuses on the computational performance of single-precision real matrix multiplication (SGEMM) on a domestic advanced computing platform. Competitors can refer to the rocBLAS library [5, 6] to understand the relevant content; the API function implemented by batched SGEMM in the rocBLAS library is rocblas_sgemm_strided_batched.

As the amount of data continues to increase, so do the matrix sizes that need to be computed. Multi-batch matrix multiplication has been proposed to accelerate these computations, because it makes better use of the computing resources of hardware accelerators. The sub-matrices in each batch of the computation are separated by a fixed stride offset and have the same size. The computation is as follows:

$$C[i \cdot \mathrm{stride}_c] \leftarrow \alpha\,\mathrm{op}(A[i \cdot \mathrm{stride}_a])\,\mathrm{op}(B[i \cdot \mathrm{stride}_b]) + \beta\,C[i \cdot \mathrm{stride}_c], \qquad i \in [0,\ \mathrm{batch\_count} - 1]$$

To further improve the efficiency of the matrix computation, batch and strided strategies are introduced on top of the original matrix multiplication. To make full use of the GPU-like heterogeneous accelerators in the cluster under this computation pattern, the function implementation needs to be optimized further.

Test Methods

The example function implementing strided batched matrix-matrix operations is shown below. To help competitors optimize it, the function and its parameters are explained as follows:

```c
sgemm_strided_batched(sgemm_operation trans_a,
                      sgemm_operation trans_b,
                      int m, int n, int k,
                      const float* alpha,
                      const float* A, int lda, int stride_a,
                      const float* B, int ldb, int stride_b,
                      const float* beta,
                      float* C, int ldc, int stride_c,
                      int batch_count);

typedef enum sgemm_operation_ {
    operation_none = 0,
    operation_transpose = 1,
    operation_conjugate_transpose = 2
} sgemm_operation;
```

Input parameters:

Parameter trans_a: of type sgemm_operation. Specifies the form of op(A) used in the matrix multiplication:

If trans_a = operation_none, then op(A) = A;

If trans_a = operation_transpose, then op(A) = A^T;

If trans_a = operation_conjugate_transpose, then op(A) = conjg(A^T).

Parameter trans_b: of type sgemm_operation. Defined the same way as trans_a;

Parameter m: the number of rows of matrix A, m > 0;

Parameter n: the number of columns of matrix B, n > 0;

Parameter k: the number of columns of matrix A and the number of rows of matrix B, k > 0;

Parameter alpha: a single-precision real number, the scalar coefficient of matrix A;

Parameter A: pointer to the single-precision real matrix A stored on the GPU;

Parameter lda: the leading dimension of matrix A as actually stored, i.e. if the matrix is stored row-major then lda ≥ K, and if it is stored column-major then lda ≥ M;

Parameter stride_a: the offset from the start of one A matrix to the start of the next one in the batch;

Parameter B: pointer to the single-precision real matrix B stored on the GPU;

Parameter ldb: the leading dimension of matrix B in storage, with the same meaning as lda; stride_b is the offset from the start of one B matrix to the next;

Parameter beta: a single-precision real number, the scalar coefficient of matrix C. If beta = 0, matrix C does not need to be initialized;

Parameter C: pointer to the single-precision real matrix C stored on the GPU;

Parameter ldc: the leading dimension of matrix C in storage, with the same meaning as lda;

Parameter stride_c: the offset from the start of one C matrix to the next;

Parameter batch_count: the number of SGEMM operations in the same batch.

Output parameters

Parameter C: the matrix C, overwriting the input.
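Before tuning anything, it helps to keep a trivially correct baseline around for checking results. The sketch below is not the contest code and not the rocBLAS API, just a naive CPU implementation (hypothetical name sgemm_strided_batched_reference) of the semantics described above, assuming column-major storage and the operation_none case only:

```cpp
// Naive CPU baseline for the strided batched SGEMM semantics described above:
//   C[i*stride_c] <- alpha * op(A[i*stride_a]) * op(B[i*stride_b]) + beta * C[i*stride_c]
// Only the no-transpose case is handled; column-major storage is assumed.
// This is a hypothetical reference for correctness checks, not the contest API.
void sgemm_strided_batched_reference(int m, int n, int k, float alpha,
                                     const float* A, int lda, long long stride_a,
                                     const float* B, int ldb, long long stride_b,
                                     float beta, float* C, int ldc, long long stride_c,
                                     int batch_count) {
    for (int b = 0; b < batch_count; ++b) {
        const float* Ab = A + b * stride_a;
        const float* Bb = B + b * stride_b;
        float*       Cb = C + b * stride_c;
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)
                    acc += Ab[i + p * lda] * Bb[p + j * ldb];
                Cb[i + j * ldc] = alpha * acc + beta * Cb[i + j * ldc];
            }
    }
}
```

For example, for test case 1 below (M = N = 64, K = 32, batch = 30), a column-major packing could use lda = ldc = 64, ldb = 32, stride_a = lda·K, stride_b = ldb·N and stride_c = ldc·N; the actual packing is up to the competitor.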

Requirements:

  1. Competitors perform matrix multiplication performance optimization based on the given interface function. The API interface in the given test code cannot be changed; the non-fixed parameters involved can be tuned by the competitors themselves according to the matrix sizes used in the computation;

  2. In the test sample section, the code provides dense matrices generated from pseudo-random numbers as the test data used to verify the performance of the implementation. The main test cases are the following three sizes:

| # | M | N | K | Batch |
|---|-----|-----|-----|-------|
| 1 | 64 | 64 | 32 | 30 |
| 2 | 128 | 128 | 64 | 20 |
| 3 | 128 | 512 | 256 | 10 |

  3. Competitors need to submit the code implementing the function, the build method for the function library, the generated dynamic link library, the test samples, the test procedure, and the encrypted sequence of the run results.

Note: Contestants should use the test script provided by the organizing committee as the basis and improve the function implementation part. To avoid affecting the contestants' scores, please compile and run the code that computes the corresponding matrices, and upload the generated encrypted timing sequence to the designated location on the web page.

SGEMM Kernel Optimization on VEGA

https://github.com/victoryang00/SGEMM_on_VEGA

Reference

  1. C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh. Basic Linear Algebra Subprograms for FORTRAN usage. ACM Trans.Math.Software,5:308-323,1979.
  2. J. J. Dongarra, J. Du Croz, I. S. Duff, and D. Hammarling. A set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math.Software,16:1-28,1990.
  3. J. J. Dongarra, J. Du Croz, D. Hammarling, R. J. Hanson. An extended set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Software, 14:1-32, 399, 1988.
  4. https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
  5. https://i-techx.github.io/iTechX/courses?course_code=CS121

FFT

An algorithm widely used in PDE solvers and numerical simulation.

Single thread

Parallel FFT

ScalaFFT

DFT
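For reference, the transform that all of the FFT variants above compute is the discrete Fourier transform of a length-$N$ sequence:

$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i k n / N}, \qquad k = 0, 1, \ldots, N-1$$

A radix-2 FFT evaluates this in $O(N \log N)$ operations instead of the $O(N^2)$ of the naive sum.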

References

  1. https://i-techx.github.io/iTechX/courses?course_code=CS121

MHM2 Adjusting k-mers

A method of visualizing k-mers, the k-mer spectrum, shows the multiplicity of each k-mer in a sequence versus the number of k-mers with that multiplicity. It requires a DHT to store the sequence.
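As a small illustration of what a k-mer spectrum is (our own single-node sketch; MHM2 itself computes the counts across a distributed hash table), the code below counts k-mer multiplicities in one sequence and then histograms them:

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

// Count k-mer multiplicities in one sequence, then histogram them:
// spectrum[m] = number of distinct k-mers that occur exactly m times.
std::map<uint64_t, uint64_t> kmer_spectrum(const std::string& seq, int k) {
    std::unordered_map<std::string, uint64_t> counts;
    for (size_t i = 0; i + k <= seq.size(); ++i)
        ++counts[seq.substr(i, k)];
    std::map<uint64_t, uint64_t> spectrum;
    for (const auto& kv : counts)
        ++spectrum[kv.second];
    return spectrum;
}

int main() {
    for (const auto& kv : kmer_spectrum("ACGTACGTAC", 3))
        std::cout << "multiplicity " << kv.first << ": " << kv.second << " k-mers\n";
}
```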


The default parameters are good enough for the dataset in the competition.

  • Modifying those parameters will influence accuracy.

    • Adding an iteration slightly increases the number of long sequences; the result stays within the acceptable range, but the run is about 1/7 slower than the original.
    • Removing an iteration greatly increases speed (about 1/7 faster than the original), but the result differs dramatically from the reference.
    • Adjusting the values of k will not make MHM2 much faster/slower, and the result would still be acceptable if k is not changed dramatically.
  • From the paper, we learn that the preset k is good enough for most of the cases

    • Too large k is not fair to low-coverage genomes
    • Too small k may not be able to detect errors produced by the sequencer.

Applications

This section collects our experience with the applications from the various competitions GeekPie_HPC has taken part in.

CESM

Build & Running

OneKeyConf

```bash
./create_newcase -res 0.47x0.63_gx1v6 -compset B -case ../EXP2 -mach pleiades-ivy
mkdir nobackup
ln -s /home/cesm/data/inputdata_EXP1/ nobackup/inputdata
# EXP1:
./xmlchange -file env_run.xml -id DOCN_SOM_FILENAME -val pop_frc.gx1v6.091112.nc
./xmlchange -file env_build.xml -id CESMSCRATCHROOT -val `pwd`'/nobackup/$USER'
./xmlchange -file env_build.xml -id EXEROOT -val `pwd`'/nobackup/$CCSMUSER/$CASE/bld'
./xmlchange -file env_run.xml -id RUNDIR -val `pwd`'/nobackup/$CCSMUSER/$CASE/run'
./xmlchange -file env_run.xml -id DIN_LOC_ROOT -val `pwd`'/nobackup/inputdata'
./xmlchange -file env_run.xml -id DIN_LOC_ROOT_CLMFORC -val `pwd`'/nobackup/inputdata/atm/datm7'
./xmlchange -file env_run.xml -id DOUT_S_ROOT -val `pwd`'/nobackup/$CCSMUSER/archive/$CASE'
./xmlchange -file env_run.xml -id RUN_STARTDATE -val 2000-01-01
./xmlchange -file env_build.xml -id BUILD_THREADED -val TRUE
# edit Macro: SLIBS -lnetcdff
# edit env_mach_specific
./cesm_setup
```

ybs.sh

```bash
./EXP2.clean_build all
./cesm_setup -clean
rm -rf $build_dir
./cesm_setup
./EXP2.build
```

PBS

```bash
##PBS -N dappur
##PBS -q pub_blad_2
##PBS -j oe
##PBS -l walltime=00:01:00
##PBS -l nodes=1:ppn=28
```

Performance Tuning

Trouble Shooting

High sys percentage in top (>20%)

This is apparently a communication problem. Try switching to Intel MPI, which gives a very low sys percentage (<1%).

ERROR: remap transport: bad departure points

```
Warning: Departure points out of bounds in remap
my_task, i, j = 182 4 8
dpx, dpy = -5925130.21408796 -0.368922055964299
HTN(i,j), HTN(i+1,j) = 72848.1354852604 72848.1354852604
HTE(i,j), HTE(i,j+1) = 59395.4550164223 59395.4550164223
istep1, my_task, iblk = 1095001 182 1
Global block: 205
Global i and j: 35 47
(shr_sys_abort) ERROR: remap transport: bad departure points
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 182
```

This error may be due to multiple reasons.

One significant cause is bad grid division. We were using one PE per processor core, so the total number of PEs was not a power of 2. After switching to 128 (and later 256) PEs the error went away, until it showed up again after 6 months of simulated time...

Another relevant factor is the parameter xndt_dyn, see link. This parameter had already been set to 2 after solving the previous problem (it was originally 1). We tried increasing it again; the run got past the 6-month mark but crashed after another 3 months. Continuing to increase the value made it crash sooner. We stopped at about 20 months of simulated time and switched to the GNU compiler build with Intel MPI.

However, this does not mean it is the Intel compiler's fault. A direct comparison between the Intel and GNU compilers is unfair, because the combination of the Intel compiler, xndt_dyn=1, and, most importantly, the correct number of PEs has not been tried. Maybe try xndt_dyn=1 from the beginning next time with the Intel compiler.

OpenMP failed

Still not solved, but very promising for improving performance.

fixed in WRF

quest analysis

program goal analysis

What the code actually does is simulate quantum computing.

Different bit states: qubits

Three possible states: 1, 0, or a superposition of 0 and 1.

Each qubit amplitude is stored as a pair of qreal values forming a complex number a+bi (the squared magnitudes of a state's amplitudes sum to 1), e.g. (0.1231240, 0.876876). Also note that quad-precision qreal (QuEST_PREC=4, long double) is not compatible with GPUs, so it is not supported in GPU simulation.

```c
/* Single precision, which uses 4 bytes per amplitude component */
# if QuEST_PREC==1
    # define qreal float
    // \cond HIDDEN_SYMBOLS
    # define MPI_QuEST_REAL MPI_FLOAT
    # define MPI_MAX_AMPS_IN_MSG (1LL<<29) // must be 2^int
    # define REAL_STRING_FORMAT "%.8f"
    # define REAL_QASM_FORMAT "%.8g"
    # define REAL_EPS 1e-5
    # define absReal(X) fabs(X) // not fabsf(X) - better to return doubles where possible
    // \endcond
/* Double precision, which uses 8 bytes per amplitude component */
# elif QuEST_PREC==2
    # define qreal double
    // \cond HIDDEN_SYMBOLS
    # define MPI_QuEST_REAL MPI_DOUBLE
    # define MPI_MAX_AMPS_IN_MSG (1LL<<28) // must be 2^int
    # define REAL_STRING_FORMAT "%.14f"
    # define REAL_QASM_FORMAT "%.14g"
    # define REAL_EPS 1e-13
    # define absReal(X) fabs(X)
    // \endcond
/*
 * Quad precision, which uses 16 bytes per amplitude component.
 * This is not compatible with most GPUs.
 */
# elif QuEST_PREC==4
    # define qreal long double
    // \cond HIDDEN_SYMBOLS
    # define MPI_QuEST_REAL MPI_LONG_DOUBLE
    # define MPI_MAX_AMPS_IN_MSG (1LL<<27) // must be 2^int
    # define REAL_STRING_FORMAT "%.17Lf"
    # define REAL_QASM_FORMAT "%.17Lg"
    # define REAL_EPS 1e-14
    # define absReal(X) fabsl(X)
    // \endcond
# endif
```

many matrices computation


Each of the gates corresponds to one manipulation of the qubits.

Basic operations on a and b: https://arxiv.org/pdf/quant-ph/0207118.pdf

Random variables = density matrix $\rho$:

hermitian: $\rho = \rho^\dagger$

positive semidefinite: eigenvalues $\geq 0$

trace: $\sum$ (diagonal elements) $= 1$

Dirac notation: ket $v_\phi = |\phi\rangle = \begin{pmatrix}\phi_0 \\ \phi_1\end{pmatrix}$

bra $v_\phi^\dagger = \langle\phi| = (\phi_0^*\ \ \phi_1^*)$

$\langle\phi|\psi\rangle$ = inner product of bra $\langle\phi|$ and ket $|\psi\rangle$. Notice: $\langle\phi|\phi\rangle = 1$.

$|\phi\rangle|\psi\rangle$ = tensor product of ket $|\phi\rangle$ and ket $|\psi\rangle$.

Two special notations: $u_0 = |0\rangle = \begin{pmatrix}1 \\ 0\end{pmatrix}$, $v_1 = |1\rangle = \begin{pmatrix}0 \\ 1\end{pmatrix}$

The density matrix $\rho = \begin{pmatrix}q_0 & 0 \\ 0 & q_1\end{pmatrix}$ (with $q_0 + q_1 = 1$; the purpose of the equation is to illustrate the complex amplitudes) can be written as $\rho = q_0|0\rangle\langle 0| + q_1|1\rangle\langle 1|$,

so $\rho|0\rangle = (q_0|0\rangle\langle 0| + q_1|1\rangle\langle 1|)\,|0\rangle = q_0|0\rangle$.

From classical bits to qubits: $|ab\rangle = |a\rangle|b\rangle$; a general two-qubit state is $v_{00}|00\rangle + v_{01}|01\rangle + v_{10}|10\rangle + v_{11}|11\rangle \mapsto (v_{00}\ v_{01}\ v_{10}\ v_{11})^T$.

For example, in bits $5 = 101_2$, while in qubits $|5\rangle_3 = |101\rangle = |1\rangle\otimes|0\rangle\otimes|1\rangle = \begin{pmatrix}0\\1\end{pmatrix}\otimes\begin{pmatrix}1\\0\end{pmatrix}\otimes\begin{pmatrix}0\\1\end{pmatrix} = (0\ 0\ 0\ 0\ 0\ 1\ 0\ 0)^T$
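A tiny sketch of the same indexing convention in code (our own illustration, not QuEST's API): for an $n$-qubit register the simulator stores $2^n$ complex amplitudes, and a computational basis state is just a one-hot amplitude vector, so $|101\rangle$ puts the single non-zero amplitude at index 5:

```cpp
#include <complex>
#include <iostream>
#include <vector>

// Build the 2^n amplitude vector for a computational basis state |b_{n-1} ... b_1 b_0>.
// For |101> (decimal 5) the only non-zero amplitude sits at index 5, matching the
// Kronecker-product expansion above.
std::vector<std::complex<double>> basis_state(unsigned n, unsigned value) {
    std::vector<std::complex<double>> amps(1ULL << n, {0.0, 0.0});
    amps[value] = {1.0, 0.0};
    return amps;
}

int main() {
    auto psi = basis_state(3, 5);   // |5>_3 = |101>
    for (size_t i = 0; i < psi.size(); ++i)
        std::cout << "amp[" << i << "] = " << psi[i].real() << "\n";
}
```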

Hadamard gate operations

$H(|0\rangle) = \frac{1}{\sqrt 2}|0\rangle + \frac{1}{\sqrt 2}|1\rangle =: |+\rangle$

$H(|1\rangle) = \frac{1}{\sqrt 2}|0\rangle - \frac{1}{\sqrt 2}|1\rangle =: |-\rangle$

$H\!\left(\frac{1}{\sqrt 2}|0\rangle + \frac{1}{\sqrt 2}|1\rangle\right) = \frac{1}{2}(|0\rangle + |1\rangle) + \frac{1}{2}(|0\rangle - |1\rangle) = |0\rangle$

$H\!\left(\frac{1}{\sqrt 2}|0\rangle - \frac{1}{\sqrt 2}|1\rangle\right) = \frac{1}{2}(|0\rangle + |1\rangle) - \frac{1}{2}(|0\rangle - |1\rangle) = |1\rangle$

The corresponding matrix in Dirac notation: $H_1 = \frac{1}{\sqrt 2}\begin{pmatrix}1 & 1 \\ 1 & -1\end{pmatrix}$

some specialty:

  1. $H = \frac{|0\rangle + |1\rangle}{\sqrt 2}\langle 0| + \frac{|0\rangle - |1\rangle}{\sqrt 2}\langle 1|$
  2. Since $HH^\dagger = I$ where $I$ is the identity matrix, $H$ is a unitary matrix (like all other quantum logic gates). Also, it is its own unitary inverse, $H = H^\dagger$.

One application of the Hadamard gate to either a 0 or 1 qubit will produce a quantum state that, if observed, will be a 0 or 1 with equal probability (as seen in the first two operations). This is exactly like flipping a fair coin in the standard probabilistic model of computation. However, if the Hadamard gate is applied twice in succession (as is effectively being done in the last two operations), then the final state is always the same as the initial state.
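Before reading the GPU kernel below, it may help to see the index pairing it uses in a simplified serial form (our own sketch; QuEST itself keeps real and imaginary parts in separate arrays, while std::complex is used here only for brevity): for target qubit $t$, every amplitude whose index has bit $t$ clear is paired with the amplitude at offset $2^t$, and each pair is mixed by $\frac{1}{\sqrt 2}\begin{pmatrix}1 & 1\\ 1 & -1\end{pmatrix}$.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Serial sketch of the pairing used by the Hadamard kernel: for target qubit t,
// amplitude index i (with bit t clear) is paired with i + 2^t, and the pair is
// mixed by H = 1/sqrt(2) * [[1, 1], [1, -1]].
void apply_hadamard(std::vector<std::complex<double>>& amps, int targetQubit) {
    const long long half = 1LL << targetQubit;
    const double r = 1.0 / std::sqrt(2.0);
    for (long long block = 0; block < (long long)amps.size(); block += 2 * half) {
        for (long long i = block; i < block + half; ++i) {
            const auto up = amps[i];
            const auto lo = amps[i + half];
            amps[i]        = r * (up + lo);
            amps[i + half] = r * (up - lo);
        }
    }
}
```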

__global__ void statevec_hadamardKernel (Qureg qureg, const int targetQubit){ // ----- sizes long long int sizeBlock, // size of blocks sizeHalfBlock; // size of blocks halved // ----- indices long long int thisBlock, // current block indexUp,indexLo; // current index and corresponding index in lower half block // ----- temp variables qreal stateRealUp,stateRealLo, // storage for previous state values stateImagUp,stateImagLo; // (used in updates) // ----- temp variables long long int thisTask; // task based approach for expose loop with small granularity const long long int numTasks=qureg.numAmpsPerChunk>>1; sizeHalfBlock = 1LL << targetQubit; // size of blocks halved sizeBlock = 2LL * sizeHalfBlock; // size of blocks // ---------------------------------------------------------------- // // rotate // // ---------------------------------------------------------------- // //! fix -- no necessary for GPU version qreal *stateVecReal = qureg.deviceStateVec.real; qreal *stateVecImag = qureg.deviceStateVec.imag; qreal recRoot2 = 1.0/sqrt(2.0); thisTask = blockIdx.x*blockDim.x + threadIdx.x; if (thisTask>=numTasks) return; thisBlock = thisTask / sizeHalfBlock; indexUp = thisBlock*sizeBlock + thisTask%sizeHalfBlock; indexLo = indexUp + sizeHalfBlock; // store current state vector values in temp variables stateRealUp = stateVecReal[indexUp]; stateImagUp = stateVecImag[indexUp]; stateRealLo = stateVecReal[indexLo]; stateImagLo = stateVecImag[indexLo]; stateVecReal[indexUp] = recRoot2*(stateRealUp + stateRealLo); stateVecImag[indexUp] = recRoot2*(stateImagUp + stateImagLo); stateVecReal[indexLo] = recRoot2*(stateRealUp - stateRealLo); stateVecImag[indexLo] = recRoot2*(stateImagUp - stateImagLo); } void statevec_hadamard(Qureg qureg, const int targetQubit) { int threadsPerCUDABlock, CUDABlocks; threadsPerCUDABlock = 128; CUDABlocks = ceil((qreal)(qureg.numAmpsPerChunk>>1)/threadsPerCUDABlock); statevec_hadamardKernel<<<CUDABlocks, threadsPerCUDABlock>>>(qureg, targetQubit); }

Pauli-X/Y/Z gate

The Pauli-X gate acts on a single qubit. It is the quantum equivalent of the classical NOT gate, with matrix $X = \begin{pmatrix}0 & 1 \\ 1 & 0\end{pmatrix}$.

```c
void pauliX(Qureg qureg, const int targetQubit) {
    validateTarget(qureg, targetQubit, __func__);

    statevec_pauliX(qureg, targetQubit);
    if (qureg.isDensityMatrix) {
        statevec_pauliX(qureg, targetQubit+qureg.numQubitsRepresented);
    }

    qasm_recordGate(qureg, GATE_SIGMA_X, targetQubit);
}
```

the real computing part

void statevec_pauliXLocal(Qureg qureg, const int targetQubit) { long long int sizeBlock, sizeHalfBlock; long long int thisBlock, // current block indexUp,indexLo; // current index and corresponding index in lower half block qreal stateRealUp,stateImagUp; long long int thisTask; const long long int numTasks=qureg.numAmpsPerChunk>>1; // set dimensions sizeHalfBlock = 1LL << targetQubit; sizeBlock = 2LL * sizeHalfBlock; // Can't use qureg.stateVec as a private OMP var qreal *stateVecReal = qureg.stateVec.real; qreal *stateVecImag = qureg.stateVec.imag; # ifdef _OPENMP # pragma omp parallel \ default (none) \ shared (sizeBlock,sizeHalfBlock, stateVecReal,stateVecImag) \ private (thisTask,thisBlock ,indexUp,indexLo, stateRealUp,stateImagUp) # endif { # ifdef _OPENMP # pragma omp for schedule (static) # endif for (thisTask=0; thisTask<numTasks; thisTask++) { thisBlock = thisTask / sizeHalfBlock; indexUp = thisBlock*sizeBlock + thisTask%sizeHalfBlock; indexLo = indexUp + sizeHalfBlock; stateRealUp = stateVecReal[indexUp]; stateImagUp = stateVecImag[indexUp]; stateVecReal[indexUp] = stateVecReal[indexLo]; stateVecImag[indexUp] = stateVecImag[indexLo]; stateVecReal[indexLo] = stateRealUp; stateVecImag[indexLo] = stateImagUp; } } } void statevec_pauliXDistributed (Qureg qureg, ComplexArray stateVecIn, ComplexArray stateVecOut) { long long int thisTask; const long long int numTasks=qureg.numAmpsPerChunk; qreal *stateVecRealIn=stateVecIn.real, *stateVecImagIn=stateVecIn.imag; qreal *stateVecRealOut=stateVecOut.real, *stateVecImagOut=stateVecOut.imag; # ifdef _OPENMP # pragma omp parallel \ default (none) \ shared (stateVecRealIn,stateVecImagIn,stateVecRealOut,stateVecImagOut) \ private (thisTask) # endif { # ifdef _OPENMP # pragma omp for schedule (static) # endif for (thisTask=0; thisTask<numTasks; thisTask++) { stateVecRealOut[thisTask] = stateVecRealIn[thisTask]; stateVecImagOut[thisTask] = stateVecImagIn[thisTask]; } } }
__global__ void statevec_pauliXKernel(Qureg qureg, const int targetQubit){ // ----- sizes long long int sizeBlock, // size of blocks sizeHalfBlock; // size of blocks halved // ----- indices long long int thisBlock, // current block indexUp,indexLo; // current index and corresponding index in lower half block // ----- temp variables qreal stateRealUp, // storage for previous state values stateImagUp; // (used in updates) // ----- temp variables long long int thisTask; // task based approach for expose loop with small granularity const long long int numTasks=qureg.numAmpsPerChunk>>1; sizeHalfBlock = 1LL << targetQubit; // size of blocks halved sizeBlock = 2LL * sizeHalfBlock; // size of blocks // ---------------------------------------------------------------- // // rotate // // ---------------------------------------------------------------- // //! fix -- no necessary for GPU version qreal *stateVecReal = qureg.deviceStateVec.real; qreal *stateVecImag = qureg.deviceStateVec.imag; thisTask = blockIdx.x*blockDim.x + threadIdx.x; if (thisTask>=numTasks) return; thisBlock = thisTask / sizeHalfBlock; indexUp = thisBlock*sizeBlock + thisTask%sizeHalfBlock; indexLo = indexUp + sizeHalfBlock; // store current state vector values in temp variables stateRealUp = stateVecReal[indexUp]; stateImagUp = stateVecImag[indexUp]; stateVecReal[indexUp] = stateVecReal[indexLo]; stateVecImag[indexUp] = stateVecImag[indexLo]; stateVecReal[indexLo] = stateRealUp; stateVecImag[indexLo] = stateImagUp; } void statevec_pauliX(Qureg qureg, const int targetQubit) { int threadsPerCUDABlock, CUDABlocks; threadsPerCUDABlock = 128; CUDABlocks = ceil((qreal)(qureg.numAmpsPerChunk>>1)/threadsPerCUDABlock); statevec_pauliXKernel<<<CUDABlocks, threadsPerCUDABlock>>>(qureg, targetQubit); }

source code analysis

tree

```
.
├── CMakeLists.txt
├── include
│   ├── QuEST_complex.h    // determine whether to use native C++ complex or C complex support
│   ├── QuEST.h            // main function declarations
│   └── QuEST_precision.h  // define the precision
└── src
    ├── CMakeLists.txt
    ├── CPU
    │   ├── CMakeLists.txt
    │   ├── QuEST_cpu.c
    │   ├── QuEST_cpu_distributed.c  // distributed activator and implementation
    │   ├── QuEST_cpu_internal.h     // other CPU-related headers
    │   └── QuEST_cpu_local.c        // CPU-only implementation
    ├── GPU
    │   ├── CMakeLists.txt
    │   └── QuEST_gpu.cu             // GPU counterpart
    ├── mt19937ar.c                  // Mersenne Twister pseudo-random number generation
    ├── mt19937ar.h
    ├── QuEST.c                      // main function definitions
    ├── QuEST_common.c               // function activators defined here
    ├── QuEST_debug.h                // debug information
    ├── QuEST_internal.h
    ├── QuEST_qasm.c                 // QASM is a quantum circuit record standard; QASM assertions defined here
    ├── QuEST_qasm.h
    ├── QuEST_validation.c           // assert number of qubits here
    └── QuEST_validation.h
```

https://www.quantum-inspire.com/kbase/introduction-to-quantum-computing

testcase analysis

mytimer.hpp

```cpp
#include <time.h>
#include <sys/time.h>

double get_wall_time() {
    /* A time value that is accurate to the nearest
       microsecond but also has a range of years. */
    struct timeval time;
    // __time_t      tv_sec;  /* Seconds. */
    // __suseconds_t tv_usec; /* Microseconds. */
    if (gettimeofday(&time, NULL)) {
        // Handle error
        return 0;
    }
    return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

double get_cpu_time() {
    return (double)clock() / CLOCKS_PER_SEC; // directly read the clock from the CPU, divided by clocks per second
}
```

random.c - random manipulation

// total number of qubit: 30 // total number of qubit operatations: 667 // estimated time: 3783.9266747315614 second. #include "QuEST.h" #include "mytimer.hpp" #include "stdio.h" int main(int narg, char *argv[]) { QuESTEnv Env = createQuESTEnv(); double t1 = get_wall_time();//define starting time FILE *fp = fopen("probs.dat", "w");//open file for result if (fp == NULL) { printf(" open probs.dat failed, Bye!"); return 0; } FILE *fvec = fopen("stateVector.dat", "w"); if (fp == NULL) { printf(" open stateVector.dat failed, Bye!"); return 0; } Qureg q = createQureg(30, Env);//define qubits registers float q_measure[30];// defined q's size // possible execution. tGate(q, 25); controlledNot(q, 28, 21); controlledRotateX(q, 17, 5, 0.3293660327520663); tGate(q, 3); rotateX(q, 10, 4.734238389048838); rotateY(q, 8, 4.959946047271496); rotateZ(q, 5, 1.0427019597472071); pauliZ(q, 0); ... printf("\n"); for (long long int i = 0; i < 30; ++i) { q_measure[i] = calcProbOfOutcome(q, i, 1); printf(" probability for q[%2lld]==1 : %lf \n", i, q_measure[i]); fprintf(fp, "Probability for q[%2lld]==1 : %lf \n", i, q_measure[i]); } fprintf(fp, "\n"); printf("\n"); for (int i = 0; i < 10; ++i) { Complex amp = getAmp(q, i); printf("Amplitude of %dth state vector: %12.6f,%12.6f\n", i, amp.real, amp.imag); } double t2 = get_wall_time(); printf("Complete the simulation takes time %12.6f seconds.", t2 - t1); printf("\n"); destroyQureg(q, Env); destroyQuESTEnv(Env); return 0; }

GHZ_QFT.c - only controlled manipulation

/* GHZ quantum circuit */ hadamard(q, 0); controlledNot(q, 0, 1); controlledNot(q, 1, 2); controlledNot(q, 2, 3); controlledNot(q, 3, 4); controlledNot(q, 4, 5); controlledNot(q, 5, 6); controlledNot(q, 6, 7); controlledNot(q, 7, 8); controlledNot(q, 8, 9); controlledNot(q, 9, 10); controlledNot(q, 10, 11); controlledNot(q, 11, 12); controlledNot(q, 12, 13); controlledNot(q, 13, 14); controlledNot(q, 14, 15); controlledNot(q, 15, 16); controlledNot(q, 16, 17); controlledNot(q, 17, 18); controlledNot(q, 18, 19); controlledNot(q, 19, 20); controlledNot(q, 20, 21); controlledNot(q, 21, 22); controlledNot(q, 22, 23); controlledNot(q, 23, 24); controlledNot(q, 24, 25); controlledNot(q, 25, 26); controlledNot(q, 26, 27); controlledNot(q, 27, 28); controlledNot(q, 28, 29); /* end of GHZ circuit */ /* QFT starts */ hadamard(q, 0); controlledRotateZ(q, 0, 1, 1.5708); hadamard(q, 1); controlledRotateZ(q, 0, 2, 0.785398); controlledRotateZ(q, 1, 2, 1.5708); hadamard(q, 2); controlledRotateZ(q, 0, 3, 0.392699); controlledRotateZ(q, 1, 3, 0.785398); controlledRotateZ(q, 2, 3, 1.5708); ...

available test machine

  1. 2 nodes, 16 cores each, MPI:OMP = 2:16

    ```bash
    #!/bin/sh
    module purge
    spack load intel ##openmpi@3.1.5/3.1.2
    export PRECISION=4 ##1/2/4
    CC=icc CXX=icpc cmake -DGPUACCELERATED=0 -DDISTRIBUTED=1 ..
    make
    export OMP_NUM_THREADS=16
    export FI_PROVIDER=tcp
    mpirun -machinefile mac -np 2 ./demo
    ```

    profiling result

    the most time-consuming part is statevec_compactUnitaryLocal

  2. 2 nodes, 16 cores each, MPI:OMP = 1:32

  3. 1 node, 1 Tesla V100

    script

    ```bash
    #!/bin/sh
    module purge
    spack load gcc@6
    spack load cuda@10.1 ## 10.2
    export PATH=$PATH:/usr/local/cuda/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
    export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/cuda/lib64
    export PRECISION=2 ##1/2
    CC=gcc CXX=g++ cmake -DGPUACCELERATED=1 -DGPU_COMPUTE_CAPABILITY=70 ..
    make
    ./demo
    ```

    profiling result

summary


Summarizing the profiling of both the CPU and GPU runs, most of the time is spent computing the actual kernels, so we think the computing power is fully utilized.

Speedup of the single-node OMP+MPI version: 319.799 / 220.807 = 1.448.

Speedup of a single GPU over the single node: 319.799 / 19.328 = 16.546.

Power consumption: over CPU, see the profiling figure (not preserved here); over GPU, 111 W on average.

Our future plans:

  1. Deploy the GPU code on multiple GPUs using NCCL.
  2. Solve the global memory store and load efficiency issues.

misc

Loves from Github

  1. https://github.com/QuEST-Kit/QuEST/issues/220
Hi Jiachen, There are no plans currently to combine distribution with GPU-acceleration. Note there are a few ways this can be done, and I suspect none really align with QuEST's design philosophy, nor are practical due to memory overheads. I've wanted to pen these thoughts for a while, so read on below if interested! :) Firstly, QuEST uses its hardware to accelerate the simulation of a single quantum register at a time. While I think there are good uses of multi-GPU to speedup simultaneous simulation of multiple registers, this would be a totally new pattern to QuEST's simulation style. So let's consider using multi-GPU to accelerate a single register. There are a few ways you can have "multiple GPUs": multiple NVlinked GPUs This is when you have multiple GPUs tightly connected with a high-bandwidth fabric (e.g. this). The bandwidth is enough that you sort of can imagine it as a single big GPU, and hence it would be worthwhile for accelerating single-register simulation. However, this only exists right now as NVLink and NVSwitch, compatible only with IBM's POWER architecture - you could argue this is still esoteric, and not worth a big refactor. Note it wouldn't actually be very hard to refactor QuEST for this platform - indeed QuEST works out-of-the-box with POWER8. But it's not something on our TODO list currently. multiple local GPUs This is when you have multiple GPUs on the same machine, but maybe on different sockets and hence with a much lower bandwidth between them. The most common case is two GPUs - is it worthwhile using two GPUs over one to speedup single register simulation? Often, no! In big QC simulation, having to move memory around is often the big killer, and should be avoided where possible. Unfortunately, simulating unitaries on registers often requires moving memory. If all the memory stays in the GPU (very high "internal bandwidth"), this is ok, but copying memory to the other GPU (across the socket) will introduce a huge per-gate overhead! Hence, using two GPUs to simulate the same register size can be slower than using just one, especially as the simulation size grows and saturates the sockets! There's hardly a benefit from the extra VRAM too, because doubling the memory enables simulation of one additional qubit. This is not worth the slowdown, or the hardware! Even with more than two GPUs, the connections are likely hierarchical and so even more prone to saturation. distributed GPUs This is when you have a GPU(s) on each distributed node of a cluster. In this circumstance, simulating a unitary gate which requires data exchange not only costs us a VRAM to RAM overhead (similar to before), but a networking overhead to talk to the other nodes! This can be somewhat improved by having a direct GPU to network-card connection (and MPI abstraction), but I believe that's pretty cutting-edge. Let's say you have n nodes, each with a GPU and a multicore CPU, and you're resolved to a distributed simulation. When is it worthwhile to pay the extra memory overhead locally copying from RAM to VRAM (and use the GPU), over using just the CPUs? This is now the same trade-off to consider in the previous cases. So may or may not be worthwhile. TL-DR: besides the somewhat esoteric case of having multiple tightly-connected GPUs, multi-GPU simulation introduces a new memory overhead that doesn't exist in single-GPU simulation. This overhead is almost always way longer than the time the GPU spends simulating the gate. 
As to whether the whole simulation is sped up by the use of multi-GPU is system and simulation specific.
  2. https://github.com/NVIDIA/nccl/pull/316 This is a PR for people to review and provide feedback on the p2p branch (issue #212).
Looking forward to applying the P2P function to increase the power of my project!
  3. THU published their modified version as an ICS best paper.
  4. NUDT modified the code to offload memory to main DRAM alongside GPU memory.

ISC

Awards

  • Overall champion: one team, awarded to the team with the highest score across the benchmark runs and the on-site presentation.
  • HPL champion: one team, awarded to the team with the best HPL result.
  • Fan favorite: one team, awarded to the team that receives the most votes from ISC13 attendees during the competition.

Tasks

Benchmarks such as HPL, four other applications, and one mystery application.

ISC 21

Rewind: https://victoryang00.cn/wordpress/2021/06/27/isc-21hui-gu/

AutoTuning is basically a simple competitive-programming problem.

This task was originally a small internal NVIDIA tool for OSU-style testing, handed to us as a problem. It asks for simple tuning based on the data-exchange capability between different ranks.

Task 1-3: Understand MPI_alltoallv calls

Write a program with an input flag selecting the pattern, and run it on the Niagara cluster using 4 nodes with 40 ppn each (fully subscribed), 160 ranks in total, with both balanced and unbalanced patterns.

The program should run 1000 iterations of MPI_Alltoallv with the specified characteristics.
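The exact count characteristics from the task sheet are not reproduced here; purely as an illustration of the shape of such a benchmark, the sketch below runs 1000 iterations of MPI_Alltoallv with a simple made-up unbalanced pattern (rank r sends dest+1 integers to rank dest). A balanced pattern would simply use the same send count for every destination.

```cpp
#include <mpi.h>
#include <numeric>
#include <vector>

// One iteration of MPI_Alltoallv with a simple (hypothetical) unbalanced pattern.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<int> sendcounts(size), recvcounts(size), sdispls(size), rdispls(size);
    for (int dest = 0; dest < size; ++dest) sendcounts[dest] = dest + 1;  // unbalanced
    for (int src = 0; src < size; ++src)  recvcounts[src]  = rank + 1;    // what each source sends to us
    std::partial_sum(sendcounts.begin(), sendcounts.end() - 1, sdispls.begin() + 1);
    std::partial_sum(recvcounts.begin(), recvcounts.end() - 1, rdispls.begin() + 1);

    std::vector<int> sendbuf(sdispls.back() + sendcounts.back(), rank);
    std::vector<int> recvbuf(rdispls.back() + recvcounts.back());

    for (int iter = 0; iter < 1000; ++iter)
        MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_INT,
                      recvbuf.data(), recvcounts.data(), rdispls.data(), MPI_INT,
                      MPI_COMM_WORLD);

    MPI_Finalize();
}
```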

Task 4: Use Go+Front end to visualize the alltoallv pattern

We chose Go message passing because we did not need the performance for the dynamic part, and we drew the graphs with Ant Design charts rather than directly with gnuplot2, which is a legacy way to display them. The downside is that the frontend occupies too much memory and takes a little longer, especially for the WRF case.

Task 5: Write an online algorithm to re-affinitize the pattern to make it faster.

Static calculation using MPI static analysis, plus a DP swap for the red (hot) part of the data heatmap. The code

LAMMPS

Problem Discovery

  • The Intel package by W. Michael Brown speeds up the CPU performance roughly 2x on both the Broadwell and Skylake chassis. The only difference is -xCORE-AVX512.
  • Communication overhead is extremely unbalanced in the Protein case because Comm::Brick::reverse_comm calls MPI_Waitany too many times. This is solvable by defining the grid box.
  • Kokkos (from Sandia) is extremely useful for resource allocation on the GPU. However, the GPU does not bring an aggressive improvement, possibly because of the sparse data.

Result

  • Intel Package Buffer - cache friendly and vectorized
  • FFTW comparison by project-gemmi/: mostly butterfly 3D FFT operations, for which FFTW is the best.

Lesson Learned

  • The environment variable settings may affect whether the application executes correctly; they may also affect its efficiency.
  • Architecture may affect the performance.
    • AVX-512 may reduce the CPU frequency and hence reduce performance.
  • Multiple nodes do not guarantee a performance improvement.
    • Communication overhead may eat up the performance gain.
  • A dedicated package may bring an additional performance gain.
    • Most of the gain comes from the USER-INTEL package (by Intel®).
  • We found CMake was too smart in handling the compiler option that triggers halving the size of the addme array in the half-neighbor computation; once we switched to plain make, the problem was solved.
  • The Protein case still segfaults when using the Intel package on NSCC, so we rolled back to no package for that single case.

GPAW

  • Cython program

    • Pros and Cons of Hybrid MPI/OMP
    • 70% runtime in C, 30% runtime in Python
  • Computation intense program

    • Highly depend on Math library
  • Hybrid MPI/OpenMP program

    • Pros and Cons of Hybrid MPI/OMP
    • Balance of MPI/OpenMP

GPU Accelerated

  • ELPA
    • A highly efficient and highly scalable direct eigensolver for symmetric (Hermitian) matrices.
    • with this math library, the performance can increase 3x-5x.

Profiling

  • According to the IPM profile information, we figured out that MPI_Allreduce is the most time-consuming part.
    • We tried profiling the ratio of MPI to OpenMP since it is a hybrid MPI/OpenMP program, but the results are unstable because different Python programs that use GPAW may follow different computation routines.

Lesson Learned

  • The Python GIL sometimes makes profiling difficult.

  • Cython programs usually have their time-consuming parts in the C code; optimize that part.

  • A general math library (such as MKL) may not help much with a specific program, but a smaller, more specialized library will.

MHM2

The code is written in UPC++

Intro

  • Multiple UPC++ backend: ibv, mpi, smp, udp
    • When based on mpi, UPC++ backend use the infiniband by default.
  • There is no significant performance difference between mpi and ibv.
  • The performance degradation after the increase of nodes is more serious than expected: more # of compute nodes: better DHT performance, but more network overhead.
    • Will be discussed in next few slides.
  • Profiling is a little bit difficult.

| Conduit | Build Type | Report | System CPU | User CPU | Nodes |
|---------|-----------|--------|------------|----------|-------|
| **mpi** | **Release** | **37.36** | **02:54.9** | **1:35:15** | **4** |
| mpi | Release | 60.74 | 01:37.4 | 1:19:27 | 2 |
| **ibv** | **Release** | **37.27** | **02:57.3** | **1:36:37** | **4** |
| ibv | Release | 61.69 | 01:36.6 | 1:19:33 | 2 |
| ibv | Debug | 112.3 | 03:44.6 | 4:54:57 | 4 |
| mpi | Debug | 134.4 | 06:11.6 | 5:57:13 | 4 |
| mpi | Release | 37.79 | 07:31.1 | 1:39:17 | 4 |
| mpi | Release | 545.35 | 1:18:27 | 18:15:26 | 4 |
| mpi | Release | 104.88 | 02:54.6 | 1:08:33 | 1 |

(The bold rows were highlighted in red in the original table.)

Profiling

  • Profiler: Intel VTune Amplifier/Profiler, version 2019.6. UPC++ can rely on MPI, but InfiniBand has to be disabled to profile the MPI model.

CPU utilization will be 80% if hyperthreading is disabled.

  • Basically the overall overhead is insignificant for the small dataset (800MB).
  • For the large dataset (40GB), the overhead is not negligible.
    • Not I/O bound; the network is the bottleneck.
    • A lot of data exchange between nodes.
  • We examine the following two aspects: k-mers and the DHT period.

DHT Analysis

  • Three periods: write only, read & write, read only.
  • Write-only part: data will be stored locally.
  • Hyperscale data transmission when read-only: all-to-all.
  • Bottleneck: transmission restrictions cause functions to wait. This is corroborated by the rate of performance degradation as the number of nodes increases: how can we improve efficiency on larger clusters?

Innovation

  • Highly redundant distributed hash table:

    • Reduce the order of the complete communication graph: as far as memory allows.
    • Transfer data during the write-only period: network IO is not significant then, so a redundant copy can be generated.
    • For clusters with more memory: multiple redundant copies.
    • Reduces both the compute-alns part and the read-only part.
  • Data reduction

    • RAID5-like memory model.
    • Using XOR to compute the parity data (see the sketch after this list).
  • Hyperparameter configuration

    • Adjust the k values in the k-mers analysis.
    • We can achieve better results and less time consumption by tuning the k parameter.
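To make the RAID5-like idea above concrete, here is a toy XOR-parity sketch (our own illustration, not MHM2 code): one parity block is kept per group of data blocks, and any single lost block can be rebuilt from the parity plus the surviving blocks.

```cpp
#include <cstdint>
#include <vector>

// Toy illustration of the RAID5-like idea: keep one XOR parity block so that
// any single lost data block can be reconstructed from the others.
std::vector<uint8_t> xor_parity(const std::vector<std::vector<uint8_t>>& blocks) {
    std::vector<uint8_t> parity(blocks.at(0).size(), 0);
    for (const auto& blk : blocks)
        for (size_t i = 0; i < parity.size(); ++i)
            parity[i] ^= blk[i];
    return parity;
}

// Recover block `lost` by XOR-ing the parity with all surviving blocks.
std::vector<uint8_t> recover(const std::vector<std::vector<uint8_t>>& blocks,
                             const std::vector<uint8_t>& parity, size_t lost) {
    std::vector<uint8_t> out = parity;
    for (size_t b = 0; b < blocks.size(); ++b)
        if (b != lost)
            for (size_t i = 0; i < out.size(); ++i)
                out[i] ^= blocks[b][i];
    return out;
}
```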

Lesson Learned

  • Setting up environment in the cluster
    • Use Spack and Module to manage user-mode packages.
  • Learn how to use PBS and Slurm
    • Need balance between core occupied and waiting time.
  • Any optimization in parallel program is very difficult.
    • Need to thoroughly consider Network, IO, Memory and core scheduling.
  • Profiling in UPC++ can be hard:
    • Try to use other parallelization methods.

WRF

Damn Fortran. It's 2021 and people are still using it.

It is best to find someone who works in meteorology and ask about the parameter settings; unfortunately I could not find anyone.

This is a weather simulation system for the earth sciences; any other application involving earth science or Fortran parallelization can use this section as a reference.

Practice case for ISC21 SCC

3 Domain Problem for ISC21 SCC

Install

required libs

HDF5, NetCDF-C, NetCDF-Fortran (manual installation may be better; MPI is required)

HDF5

```bash
./configure --prefix=<your install path>/hdf5 --enable-fortran --enable-fortran2003 --enable-parallel
make -j 48
make install
```

```bash
# vi ~/.bashrc
export HDF5=<your install path>/hdf5
export PATH=$HDF5/bin:$PATH
export LD_LIBRARY_PATH=$HDF5/lib:$LD_LIBRARY_PATH
export INCLUDE=$HDF5/include:$INCLUDE
# source ~/.bashrc
```

NetCDF-C

```bash
./configure --prefix=<your install path>/netcdf LDFLAGS="-L$HDF5/lib" CPPFLAGS="-I$HDF5/include" CC=mpiicc --disable-dap
make -j 48
make install
```

```bash
# vi ~/.bashrc
export NETCDF=/usr/local/netcdf
export PATH=$NETCDF/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$LD_LIBRARY_PATH
export INCLUDE=$NETCDF/include:$INCLUDE
# source ~/.bashrc
```

NetCDF-Fortran

```bash
# install into the same prefix as NetCDF-C
./configure --prefix=<your install path>/netcdf CPPFLAGS="-I$HDF5/include -I$NETCDF/include" LDFLAGS="-L$HDF5/lib -L$NETCDF/lib" CC=mpiicc FC=mpiif90 F77=mpiif90
make -j 48
make install
```

Advanced lib

PNetCDF A Parallel I/O Library for NetCDF File Access

It has a negative effect with 4 nodes; you need 8 nodes or more before it differs from NetCDF.

pnetcdf.png

See the official website for installation instructions.

Main Program

In our tests, Intel MPI produced segmentation faults while OpenMPI did not, and Intel MPI did not seem to bring much improvement anyway. The problem can probably be fixed by adjusting the stack size.

env setting

intel openmpi hdf5 netcdf

config and build

./configure
checking for perl5... no checking for perl... found /usr/bin/perl (perl) Will use NETCDF in dir: /global/software/centos-7.x86_64/modules/intel/2020.1.217/netcdf/4.7.4 HDF5 not set in environment. Will configure WRF for use without. PHDF5 not set in environment. Will configure WRF for use without. Will use 'time' to report timing information $JASPERLIB or $JASPERINC not found in environment, configuring to build without grib2 I/O... ------------------------------------------------------------------------ Please select from among the following Linux x86_64 options: 1. (serial) 2. (smpar) 3. (dmpar) 4. (dm+sm) PGI (pgf90/gcc) 5. (serial) 6. (smpar) 7. (dmpar) 8. (dm+sm) PGI (pgf90/pgcc): SGI MPT 9. (serial) 10. (smpar) 11. (dmpar) 12. (dm+sm) PGI (pgf90/gcc): PGI accelerator 13. (serial) 14. (smpar) 15. (dmpar) 16. (dm+sm) INTEL (ifort/icc) 17. (dm+sm) INTEL (ifort/icc): Xeon Phi (MIC architecture) 18. (serial) 19. (smpar) 20. (dmpar) 21. (dm+sm) INTEL (ifort/icc): Xeon (SNB with AVX mods) 22. (serial) 23. (smpar) 24. (dmpar) 25. (dm+sm) INTEL (ifort/icc): SGI MPT 26. (serial) 27. (smpar) 28. (dmpar) 29. (dm+sm) INTEL (ifort/icc): IBM POE 30. (serial) 31. (dmpar) PATHSCALE (pathf90/pathcc) 32. (serial) 33. (smpar) 34. (dmpar) 35. (dm+sm) GNU (gfortran/gcc) 36. (serial) 37. (smpar) 38. (dmpar) 39. (dm+sm) IBM (xlf90_r/cc_r) 40. (serial) 41. (smpar) 42. (dmpar) 43. (dm+sm) PGI (ftn/gcc): Cray XC CLE 44. (serial) 45. (smpar) 46. (dmpar) 47. (dm+sm) CRAY CCE (ftn $(NOOMP)/cc): Cray XE and XC 48. (serial) 49. (smpar) 50. (dmpar) 51. (dm+sm) INTEL (ftn/icc): Cray XC 52. (serial) 53. (smpar) 54. (dmpar) 55. (dm+sm) PGI (pgf90/pgcc) 56. (serial) 57. (smpar) 58. (dmpar) 59. (dm+sm) PGI (pgf90/gcc): -f90=pgf90 60. (serial) 61. (smpar) 62. (dmpar) 63. (dm+sm) PGI (pgf90/pgcc): -f90=pgf90 64. (serial) 65. (smpar) 66. (dmpar) 67. (dm+sm) INTEL (ifort/icc): HSW/BDW 68. (serial) 69. (smpar) 70. (dmpar) 71. (dm+sm) INTEL (ifort/icc): KNL MIC 72. (serial) 73. (smpar) 74. (dmpar) 75. (dm+sm) FUJITSU (frtpx/fccpx): FX10/FX100 SPARC64 IXfx/Xlfx Enter selection [1-75] :

dm+sm: OMP+MPI

```bash
./compile -j 6 em_real >& build_wrf.log
tail -15 build_wrf.log
```

finish

All executables are placed in the run folder.

Run

```bash
for i in ../WRF/run/* ; do ln -sf $i <directory containing the data> ; done
```

namelist.input is the input file; it has many parameters to set, see the WRF NAMELIST.INPUT FILE DESCRIPTION.

slurm script

```bash
#!/bin/bash -l
#SBATCH -N 4
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=2
#SBATCH --ntasks=80
#SBATCH -J wrf3Dom_mpi_80_omp_2
#SBATCH -p compute
#SBATCH -t 2:00:00
#SBATCH -o wrf3Dom-%j.out
sleep 300
module load NiaEnv/2019b
module load intel/2019u4 openmpi/4.0.1 #hdf5/1.10.5
#module load netcdf/4.6.3
ulimit -c unlimited
ulimit -s unlimited
module list
export HDF5=/home/l/lcl_uotiscscc/lcl_uotiscsccs1034/scratch/nonspack/hdf5
export PATH=$HDF5/bin:$PATH
export LD_LIBRARY_PATH=$HDF5/lib:$LD_LIBRARY_PATH
export INCLUDE=$HDF5/include:$INCLUDE
export NETCDF=/home/l/lcl_uotiscscc/lcl_uotiscsccs1034/scratch/nonspack/netcdf
export PATH=$NETCDF/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$LD_LIBRARY_PATH
export INCLUDE=$NETCDF/include:$INCLUDE
export KMP_STACKSIZE=20480000000
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
cd ~/scratch/pl/orifiles
mpirun -np 80 -cpus-per-rank $SLURM_CPUS_PER_TASK ./wrf.exe
```

Important Notice

stack size and segment fault

ulimit sets the OS limits for the program. KMP_STACKSIZE tells the OpenMP implementation about how much stack to actually allocate for each of the stacks. So, depending on your OS defaults you might need both. BTW, you should rather use OMP_STACKSIZE instead, as KMP_STACKSIZE is the environment variable used by the Intel and clang compilers. OMP_STACKSIZE is the standard way of setting the stack size of the OpenMP threads. Note, that this problem is usually more exposed, as Fortran tends to keep more data on the stack, esp. arrays. Some compilers can move such arrays to the heap automatically, see for instance -heap-arrays for the Intel compiler.

Fortran OpenMP threads stuff a lot of data onto the stack and often overflow it, so applications using Fortran with OpenMP need export KMP_STACKSIZE=20480000000. Also note that gcc uses OMP_STACKSIZE while icc uses KMP_STACKSIZE.

Fortran and MPI

Not sure whether this is a Slurm issue or a Fortran issue, but Slurm cannot automatically assign CPU cores to Fortran MPI programs, so it must be set manually:

mpirun -np 16 -cpus-per-rank $SLURM_CPUS_PER_TASK ./wrf.exe

This tells MPI how many CPU cores each MPI rank should get for OpenMP.

IPM Report env setting

IPM is a profiler that monitors MPI usage. To use IPM you only need to preload IPM's library. However, to generate the full report figures, the following variables must be set:

```bash
export IPM_REPORT=full
export IPM_LOG=full
```

When using IPM, set the environment variables above to make sure you get the right XML for visualization, or use https://files.slack.com/files-pri/TAXMW9014-F02586VN27L/download/ipm.ipynb to visualize the results.

Others

Training is still taking off.

Incompact3D

Incompact3D solves the incompressible Navier–Stokes equations using sixth-order compact schemes for the spatial discretization. Time stepping basically integrates an ODE with numerical methods known as multistep methods.

The Poisson equation is fully solved in spectral space using Fast Fourier Transform (FFT) routines.
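As a minimal illustration of the spectral approach (our own 1D periodic example using the FFTW3 C API; Incompact3D does this in 3D with its own pencil decomposition), solving $u'' = f$ amounts to one forward FFT, a division by $-k^2$, and one inverse FFT. Compile with something like g++ poisson1d.cpp -lfftw3.

```cpp
#include <cmath>
#include <cstdio>
#include <fftw3.h>

// Solve u'' = f with periodic BCs on [0, 2*pi) spectrally:
// forward FFT f -> f_hat, multiply by -1/k^2 (k = 0 mode set to 0), inverse FFT back.
int main() {
    const int n = 64;
    fftw_complex* f = fftw_alloc_complex(n);
    fftw_complex* u = fftw_alloc_complex(n);
    fftw_plan fwd = fftw_plan_dft_1d(n, f, f, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_plan bwd = fftw_plan_dft_1d(n, f, u, FFTW_BACKWARD, FFTW_ESTIMATE);

    for (int i = 0; i < n; ++i) {            // f(x) = -sin(x)  =>  u(x) = sin(x)
        f[i][0] = -std::sin(2.0 * M_PI * i / n);
        f[i][1] = 0.0;
    }

    fftw_execute(fwd);
    for (int i = 0; i < n; ++i) {
        int k = (i <= n / 2) ? i : i - n;    // signed wavenumber
        double scale = (k == 0) ? 0.0 : -1.0 / (double(k) * k);
        f[i][0] *= scale / n;                // also undo FFTW's unnormalized transform
        f[i][1] *= scale / n;
    }
    fftw_execute(bwd);

    std::printf("u at x = pi/2: %f (expect ~1)\n", u[n / 4][0]);
    fftw_destroy_plan(fwd); fftw_destroy_plan(bwd);
    fftw_free(f); fftw_free(u);
    return 0;
}
```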

incopmact_3d_verstility

Intro to the algorithm and implementation

mpi_affinity

2d_mpi_affinity

Test Case Taylor

Build for MKL/FFTW3

Reminder:

  1. Enable MKL speedup on the AMD platform:

```c
int mkl_serv_intel_cpu_true() { return 1; }
```

  2. Migrate from FFTW3 to cuFFT.

Build libnpc with spack

I don't know why...it fails to build when MPICXX is set...

Here is a quick hack

```python
class Libnbc(AutotoolsPackage):
    """LibNBC is a prototypic implementation of a nonblocking
    interface for MPI collective operations. Based on ANSI C and
    MPI-1, it supports all MPI-1 collective operations in a
    nonblocking manner. LibNBC is distributed under the BSD license.
    """
    homepage = "http://unixer.de/research/nbcoll/libnbc/"
    url      = "http://unixer.de/research/nbcoll/libnbc/libNBC-1.1.1.tar.gz"

    version('1.1.1', sha256='63aa5f75f84c191da0688cb551ebd0e9e46928edfba350b2a534eb0c704dd9c3')

    depends_on("mpi")

+   def configure_args(self):
+       args = []
+       args.append("MPICXX=")
+       return args
```

Reference

  1. Incompact3d: A powerful tool to tackle turbulence problems with up to $O(10^5)$ computational cores

NWChem

ICON

Prepare

```bash
git clone https://gitlab.dkrz.de/icon-scc-isc22/icon-scc
cd /path/to/icon-scc
git submodule init
git submodule update
```

How to run

spack compile

```bash
spack install -j $(nproc) -vvvv icon%gcc@6.4.0
```

There are some variants:

  • debug
  • cuda
  • openmp

run copy scripts

```bash
cd {ICON_BUILD_DIR}
export ICON_DIR={ICON_DIR}
# Copy runscript-related files when building out-of-source:
if test $(pwd) != $(cd "${ICON_DIR}"; pwd); then
    echo "Copying runscript input files from the source directory..."
    rsync -uavz ${ICON_DIR}/run . --exclude='*.in' --exclude='.*' --exclude='standard_*'
    ln -sf -t run/ ${ICON_DIR}/run/standard_*
    ln -sf set-up.info run/SETUP.config
    rsync -uavz ${ICON_DIR}/externals . --exclude='.git' --exclude='*.f90' --exclude='*.F90' --exclude='*.c' --exclude='*.h' --exclude='*.Po' --exclude='tests' --exclude='rrtmgp*.nc' --exclude='*.mod' --exclude='*.o'
    rsync -uavz ${ICON_DIR}/make_runscripts .
    ln -sf ${ICON_DIR}/data
    ln -sf ${ICON_DIR}/vertical_coord_tables
fi
```

Gen sbatch

```bash
cd {ICON_BUILD_DIR}
export ICON_DIR={ICON_DIR}
cd {ICON_BUILD_DIR}/run
$ICON_DIR/utils/mkexp/mkexp standard_experiments/scc.config CO2=2850
```

The setup is OK if you see something like:

```
Script directory: '/mnt/nfs4/node1/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/experiments/exp_scc2850/scripts'
Data directory: '/mnt/nfs4/node1/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/experiments/exp_scc2850/outdata'
Work directory: '/mnt/nfs4/node1/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/experiments/exp_scc2850/work'
```

Modify sbatch

In experiments/exp_scc2850/scripts/exp_scc2850.run_start

  • FIX SLURM args
  • FIX path
    • no /build/
    • no /home/qinfr
```diff
- BUILD_DIR=/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/BUILD
+ BUILD_DIR=/mnt/nfs4/node1/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/
+ export PATH={cdo-1.9.10_BUILD_DIR}/bin:$PATH
...
```
  • Substitute every /home/qinfr with /mnt/nfs4/node1/home/qinfr/.

Run

sbatch exp_scc2850.run_start

Tips

How to check if compiled code uses SSE and AVX instructions?

https://stackoverflow.com/questions/47878352/how-to-check-if-compiled-code-uses-sse-and-avx-instructions

objdump -d cgribexlib.o | awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmu lsd|vaddsd|vmosd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vf nmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub2 13pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpb roadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherd d|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsll vq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpc mpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcomp ressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2 ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64 x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb |vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvt tpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vc vttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd| vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vf ixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrs qrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|v pminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatter qq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcast mw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgath erpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0q ps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpcl assss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b| vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|v p4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vps hrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec |vaesdeclast|vaesenc|vaesenclast)[ \t]/'

MiniVite

Overview

Materials

ghosh2018.pdf minivite-indyscc.pdf

Algorithm

For each community one can compute a modularity; the modularity of the whole graph is the sum of the modularity over all communities. Changing the community assignment changes the modularity.

Goal: maximize the modularity.

The Louvain method is iterative; initially every vertex is its own community.

For a vertex $u$, consider each neighbor $v$ (vertices connected to $u$ by an edge). Moving $u$ into $v$'s community changes the modularity by some amount $\Delta Q$, which can be computed quickly. After scanning all neighbors we take the largest $\Delta Q$; if $\Delta Q>0$, move $u$ into that community.

One iteration considers all vertices $u$ once; the algorithm stops when $\Delta\text{Modularity}<\text{threshold}$.

Parallelization: partition the vertex set of the graph, assign each compute node a subset of the vertices, and process the vertices in parallel.
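For reference, the quantity being maximized is the modularity $Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$. The sketch below (our own serial illustration, not miniVite's distributed code) evaluates $Q$ for a given partition of a small undirected graph.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Evaluate modularity Q of a partition `comm` for an undirected, unweighted graph
// given as an edge list: Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j).
double modularity(int n, const std::vector<std::pair<int, int>>& edges,
                  const std::vector<int>& comm) {
    std::vector<double> degree(n, 0.0);
    for (const auto& e : edges) { degree[e.first] += 1.0; degree[e.second] += 1.0; }
    const double two_m = 2.0 * edges.size();

    // Sum of A_ij over pairs in the same community (each undirected edge counted twice).
    double intra = 0.0;
    for (const auto& e : edges)
        if (comm[e.first] == comm[e.second]) intra += 2.0;

    // Sum of k_i*k_j/(2m) over same-community pairs = sum_c (K_c^2)/(2m).
    std::vector<double> comm_degree(n, 0.0);
    for (int v = 0; v < n; ++v) comm_degree[comm[v]] += degree[v];
    double expected = 0.0;
    for (double K : comm_degree) expected += K * K / two_m;

    return (intra - expected) / two_m;
}

int main() {
    // Two triangles joined by one edge; each triangle is its own community.
    std::vector<std::pair<int, int>> edges = {{0,1},{1,2},{0,2},{3,4},{4,5},{3,5},{2,3}};
    std::vector<int> comm = {0,0,0,1,1,1};
    std::printf("Q = %f\n", modularity(6, edges, comm));  // ~0.357
}
```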

The code is fairly short and worth reading.

Building

You can use the miniVite package that ships with Spack, but it is fairly old and its package.py needs to be modified.

You can also build directly from the GitHub source; with gcc you may need to replace -xHost in the Makefile with -march=native and -qopenmp with -fopenmp.

https://github.com/ECP-ExaGraph/miniVite

Running

Excerpted from the README:

```
mpiexec -n 2 bin/./minivite -f karate.bin
mpiexec -n 2 bin/./minivite -l -n 100
mpiexec -n 2 bin/./minivite -n 100
mpiexec -n 2 bin/./minivite -p 2 -n 100

[On Cray systems, pass MPICH_MAX_THREAD_SAFETY=multiple or pass
-DDISABLE_THREAD_MULTIPLE_CHECK while building miniVite.]

Possible options (can be combined):

1. -f <bin-file>  : Specify input binary file after this argument.
2. -b             : Only valid for real-world inputs. Attempts to distribute approximately
                    equal number of edges among processes. Irregular number of vertices
                    owned by a particular process. Increases the distributed graph creation
                    time due to serial overheads, but may improve overall execution time.
3. -n <vertices>  : Only valid for synthetically generated inputs. Pass total number of
                    vertices of the generated graph.
4. -l             : Use distributed LCG for randomly choosing edges. If this option is not
                    used, we will use C++ random number generator (using
                    std::default_random_engine).
5. -p <percent>   : Only valid for synthetically generated inputs. Specify percent of overall
                    edges to be randomly generated between processes.
6. -t <threshold> : Specify threshold quantity (default: 1.0E-06) used to determine the exit
                    criteria in an iteration of Louvain method.
7. -w             : Only valid for synthetically generated inputs. Use Euclidean distance as
                    edge weight. If this option is not used, edge weights are considered as
                    1.0. Generate edge weight uniformly between (0,1) if Euclidean distance
                    is not available.
8. -r <nranks>    : This is used to control the number of aggregators in MPI I/O and is
                    meaningful when an input binary graph file is passed with option "-f".
                    naggr := (nranks > 1) ? (nprocs/nranks) : nranks;
9. -s             : Print graph data (edge list along with weights).
```

Homework

Problem statement

Access the following server and download the two graph inputs (they are in a binary format). Server: "sftp indyscc@N/A" Password: "N/A"

The homework consists of two parts, and each part has two/three questions (checking the appropriate documents from the code repository can save time):

  1. Establishing baseline performance: Download and build the default/main/master branch of miniVite (https://github.com/ECP-ExaGraph/miniVite), run it using the provided com-orkut and webbase-2001 input graphs on 1-20 nodes (to perform strong scaling experiments). Answer the following questions: How are these two input graphs different? What arguments did you choose to run miniVite? Does increasing the number of OpenMP threads help the performance (try 2-3 combinations of threads-per-process, keeping the “processes*threads-per-process” quantity the same)? Why or why not?
  2. Performing further optimizations: Find a combination of miniVite arguments and/or macros (arguments are discussed in the README, but for macros, you may need to look elsewhere), in addition to the baseline arguments/options that you ran miniVite with in the previous step, that improves the overall performance and scalability. Compare baseline performance with the improved version – plot it (X-axis: #Processes(nodes) and Y-axis: “Average total time (in s)” as reported by miniVite), and discuss. Does your set of options affect the output quality (expressed via modularity and MODS) in any way? If so, discuss.

Submission Instructions: The assignment is assigned to all students. However, a single submission per team is sufficient. One member of the team can submit the assignment. The report can be a PDF file (preferred method) or a link to a google doc (we will check the timestamp for when it was last edited). Please include your team name and the university in the report.

Modifying Spack's package.py

Some extra compile options are needed, so the package script has to be modified:

```python
# Copyright 2013-2022 Lawrence Livermore National Security, LLC and other
# Spack Project Developers. See the top-level COPYRIGHT file for details.
#
# SPDX-License-Identifier: (Apache-2.0 OR MIT)

from spack.package import *


class Minivite(MakefilePackage):
    """miniVite is a proxy application that implements a single phase of
    Louvain method in distributed memory for graph community detection.
    """

    tags = ["proxy-app", "ecp-proxy-app"]

    homepage = "https://hpc.pnl.gov/people/hala/grappolo.html"
    git = "https://github.com/Exa-Graph/miniVite.git"

    version("develop", branch="master")
    version("1.0", tag="v1.0")
    version("1.1", tag="v1.1")

    variant("openmp", default=True, description="Build with OpenMP support")
    variant("opt", default=True, description="Optimization flags")
    variant("mode", default='default', description="mode",
            values=('collective', 'sendrecv', 'rma', 'default', 'rma_accu'))
    variant("omp_schedule", default=False, description="Enable OMP schedule")
    variant("use_32_bit_graph", default=False, description="Use 32bit graph")

    depends_on("mpi")

    @property
    def build_targets(self):
        targets = []
        cxxflags = ["-std=c++11 -g -DCHECK_NUM_EDGES -DPRINT_EXTRA_NEDGES"]
        ldflags = []

        if "+openmp" in self.spec:
            cxxflags.append(self.compiler.openmp_flag)
            ldflags.append(self.compiler.openmp_flag)
        if "+opt" in self.spec:
            cxxflags.append(" -O3 ")
        if self.spec.variants['mode'].value == 'collective':
            cxxflags.append("-DUSE_MPI_COLLECTIVES")
        elif self.spec.variants['mode'].value == 'sendrecv':
            cxxflags.append("-DUSE_MPI_SENDRECV")
        elif self.spec.variants['mode'].value == 'rma':
            cxxflags.append("-DUSE_MPI_RMA")
        elif self.spec.variants['mode'].value == 'rma_accu':
            cxxflags.append("-DUSE_MPI_RMA -DUSE_MPI_ACCUMULATE ")
        if "+omp_schedule" in self.spec:
            cxxflags.append("-DOMP_SCHEDULE_RUNTIME")
        if "+use_32_bit_graph" in self.spec:
            cxxflags.append("-DUSE_32_BIT_GRAPH")

        targets.append("CXXFLAGS={0}".format(" ".join(cxxflags)))
        targets.append("OPTFLAGS={0}".format(" ".join(ldflags)))
        targets.append("CXX={0}".format(self.spec["mpi"].mpicxx))

        return targets

    # The rest is omitted.
```
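
With this package file, the communication variant can be chosen at install time, e.g. something like `spack install minivite mode=rma_accu` (variant names as defined in the script above); each variant simply injects the corresponding `-DUSE_MPI_*` macro into CXXFLAGS.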

Enabling USE_MPI_RMA gave a large performance improvement for this assignment.

Report

minivite.pdf

Feedback

  • As a response to the first question, why do you think orkut's running time is longer even though it is smaller in size compared to webbase?
  • How many OpenMP threads per process for the baseline strong scaling experiments?
  • In part 1, you provide a brief discussion on observed load imbalance. But, you do not mention how you mitigated it in part 2.
  • It would have been interesting to modulate the threshold and measure the effect on performance, and check the impact on MODS

3/5 + 5/5 + 30/40 + 25/25 + 18/25 = 81/100

Final

Problem

This assignment has two parts, strong scale and weak scale. Like in homework #1, you will download and build miniVite: https://github.com/ECP-ExaGraph/miniVite

Strong scale

Use the com-friendster graph as input to miniVite, and the optimization arguments that you learned about during the last homework to perform strong scaling experiments (any option that improves the performance is acceptable, even if quality in terms of modularity is affected somewhat).

For this input, there will be startup issues (out-of-memory related crash or slowness) if you use a relatively small #nodes to begin or limited optimization arguments.

The goal of this exercise is to find a set of arguments and options (which may differ among process configurations) that maximizes strong scalability for this input, without compromising quality/modularity too much (rounding off final modularity to the first decimal place should yield similar values no matter your choice of optimizations). (Don’t try to use -DDUSE_32BIT_GRAPH, it won’t work)

i. Pick x where x is the startup node, and then scale the #nodes by incrementing x by a fixed stride to get the next process/node configuration, continue until x == 20. (pick any combination of processes-per-node and threads-per-process that yields better performance) You can vary processes-per-node as you see fit. How did you pick the base x?

ii. Report graph loading/construction times, #iterations to convergence, the time to perform the Louvain graph clustering as reported by miniVite running on the nodes as per 1.a.i.

Also, mention the arguments that you passed to miniVite and options you build it with.

Weak scale

Use the miniVite options to generate a distributed input graph (see FAQs and README) that scales with the #processes. Pick a reasonable number of vertices (this is governed by a formula – see FAQ, if miniVite complains, just adjust the #vertices or #processes)

i. Start with 1 node (any #processes-per-node and #threads-per-process configuration that makes sense to you) and end at 20 nodes. Plot the time to generate the graph, time to perform graph clustering (using data returned by miniVite) on 1-20 nodes.

ii. How large is the graph you generated on 20 nodes vs. 1 node? (Larger is better, but too large will take too much time in graph generation).

For submission, Create two directories called weak_scale and strong_scale and put the documents that answers the questions for each category in their respective directories.

Problem analysis

  1. Strong scaling: fix the problem size and increase the degree of parallelism to reduce the running time. Ideally $\text{time with }n\text{ nodes}=\frac{\text{time with 1 node}}{n}$.
  2. Weak scaling: fix the amount of work per node and increase the degree of parallelism (so the problem size grows with it). Ideally the running time stays constant (no parallel overhead at all). The usual definitions are given after this list.
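
As a reference for the plots, the quantities usually reported are

$$S(n)=\frac{T_1}{T_n},\qquad E_{\text{strong}}(n)=\frac{T_1}{n\,T_n},\qquad E_{\text{weak}}(n)=\frac{T_1}{T_n},$$

where $T_n$ is the wall time on $n$ nodes and, for the weak-scaling efficiency, the per-node problem size is held fixed (these are standard definitions, not part of the assignment text).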

Tuning OpenMPI & OpenMP

Each machine has two E5-2660 v3 CPUs, 20 cores in total.

After some experiments, one OpenMP thread per process with 20 MPI ranks per node works best.

| PPN | OMP_NUM_THREADS | Clustering time (s) |
|-----|-----------------|---------------------|
| 20  | 1               | 100.445             |
| 20  | 2               | 102.023             |
| 20  | 4               | 108.56              |
| 12  | 3               | 132.041             |
| 10  | 2               | 146.281             |

This seems to suggest that OpenMP parallelism is less effective than MPI here, possibly because the OpenMP atomics are too expensive, but we did not profile it so we cannot be sure. Each MPI process also allocates a data structure holding information for every vertex of the whole graph, so the memory overhead is large.

mpirun --hostfile ./hostfile -n 400 -map-by core --bind-to core miniVite -f com-friendster.ungraph.bin -b -t 0.0015

Profile

The original program uses std::set, std::map, std::unordered_set, and std::unordered_map, which are too slow; replacing them with a fast third-party hash table implementation gives a several-fold speedup. The original algorithm also allocates an unnecessary vector that can be optimized away.

profiling_1.png profiling_2.png profiling_3.png

Weak scale

We used miniVite's built-in generator to create the graph. Generation is very slow, and the number of processes must be a power of two ($2^n$), so we could only run with $1,2,4,8,16,32,64,128,256$ processes. We did not use oversubscription because it does not really match the spirit of weak scaling, and the resulting data would likely look bad (i.e., fluctuate a lot).

Submission

strong-scale-report.pdf weak-scale-report.pdf 0001-modify.patch

HPL & HPL Hero Run

The HPL benchmark in this IndySCC competition consists of two assignments: one is tuning HPL on a single node, the other is running HPL across the whole cluster (about 300 nodes).

HPL

Hi teams

Here is the assignment on HPL!

1. Compute the theoretical peak FLOPs for the processor type available on Chameleon cloud for the IndySCC competition (see the formula below). (10 points)
2. Build and install the HPL benchmark using your choice of linear algebra library and MPI library. Which linear algebra library did you choose and why? Which MPI library did you choose and why? (20 points)
3. Run the HPL benchmark on a node using a fixed problem size (N) and by varying the number of cores from 2 to 20, doubling the cores at each trial. Which parameters did you need to change? Plot the GFLOPs number for each run vs. no. of cores. (30 points)
4. Run the HPL benchmark using all 20 cores and tune the HPL.dat file to achieve the best GFLOPs number. Which parameters did you tune? What were your results using the unoptimized parameter(s) vs. the optimized parameter(s)? (40 points)
5. (Bonus) Run HPL on 2-nodes. Was the GFLOPs number exactly twice that of your single-node performance? Why or why not? (10 points)

Deliverables: Submit a report answering the questions in the assignment. While describing your steps, be brief and to the point. You should also include a description of your cluster and how you created the cluster. For each of the steps 2-5, include the scripts that you used to build and run HPL. Provide sufficient documentation for your codes. Please note that your environment and scripts should be reproducible, i.e., the judges should be able to run them on an empty cluster. For the tuning step 4, include the final HPL.dat file that gave you the best performance.

Submission Instructions: The assignment is assigned to all students. However, a single submission per team is sufficient. One member of the team can submit the assignment. The report can be a PDF file (preferred method) or a link to a google doc (we will check the timestamp for when it was last edited). Please include your team name and the university in the report.
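
For the first question, the theoretical peak is normally estimated as

$$P_{\text{peak}} = n_{\text{cores}} \times f_{\text{clock}} \times \text{(FLOPs per cycle per core)},$$

where the last factor is 16 for double precision on cores with 256-bit AVX2 and two FMA units (a common rule of thumb; check the actual processor model of the Chameleon nodes before quoting a number).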

HPL Hero Run

Between now and Oct 20 you are allowed to burst up to 30 compute nodes to test scaling your node deployment processes.

Hero HPL Runs

On Oct 20 we will begin hero HPL runs. Each team will get a 23 hour window where they will be allowed to use up 300 compute nodes to complete the best HPL score they can get.

Ground Rules

During this time, the other teams will only be allowed to use their 1 head node to continue testing. At the beginning of this phase we will shut down any compute nodes still running. Resource usage during this time will be closely monitored - any interference, accidental or otherwise, with the competing team will be penalized per the IndySCC rules and at the discretion of the committee.

Schedule

Each window will begin at 9 am Eastern Time and will end at 8 am Eastern Time the following day. We will need an hour to make sure your nodes are shut down and ready to go for the next team. The schedule was randomly generated and assigned and is listed below. If you would like to swap dates, you are responsible for finding another team to swap with. You may use the Google classroom to ask if anyone else needs to swap and is willing. Once the first team begins, the schedule will be locked in place. Otherwise, we cannot change any dates unless mutually agreed upon and there isn’t enough time left in the month to have alternative days.

Thu Oct 20 - Durham University
Fri Oct 21 - CUHK
Sat Oct 22 - Georgia Tech
Sun Oct 23 - SUSTech
Mon Oct 24 - ShanghaiTech
Tue Oct 25 - CSC/Finland
Wed Oct 26 - Monash University
Thu Oct 27 - Clemson
Fri Oct 28 - U Indonesia
Sat Oct 29 - UTEP
Sun Oct 30 - TAMUCC

Helpful Hints

A successful team may strategize to run HPL in increasing increments of nodes, rather than try for the full 300 nodes at once. It is OK if you don’t use the full number of nodes with your final score, getting hundreds of nodes to work nicely together in real-life benchmarking exercises takes time and may be difficult to do in the short period you have them. It would be advantageous to have a very good score on a smaller set of nodes than to struggle to get the full 300 running, run out of time, and not have a score at all.

Also keep in mind that this hardware is fairly old and there are likely a handful of bad/slow nodes. This is where slowly increasing the number of nodes you are using can come in handy.

Run Into Hardware Problems or Outages?

If you identify a slow node, we will not be able to fix it as the only spare parts are in the form of the other nodes, and it’s unlikely we will be able to fix it by the end of your window. You should exclude the slow node and move onto another node. If you scale up to the full 300 nodes, you may deploy extras, just make note of the slow nodes and document that your final submission uses no more than 300.

If there is any disruption of resource availability during your window, such as a Chameleon outage or power/networking outage at the Purdue site, you may receive some make-up time in certain situations, as described in the next paragraph.

If the resources are cumulatively unavailable for more than 4 hours, at the discretion of the committee, you may receive an equivalent time slot (ie, if you were unable to access for 5 hours, you would get another 5 hours) plus an extra hour for node spin-up at the end to try again. This time would be disjoint and come at the end of the above schedule as we can’t shift the rest of the schedule for the remaining teams.

If you are the last team, we will ask that you shut down at your designated time, as we need time to determine the outage length adjustments for everyone. You’ll be able to restart at a later time, and that will keep things fair as the other teams wouldn’t have the ability to just extend their window.

Outages of less than 4 hours will unfortunately be considered lost time. These shorter blips are part of real life challenges, and it would not be practical to allow make-up time for shorter times as you’d spend more time just spinning nodes back up.

We also cannot accommodate internet outages at your locale as we can’t verify those outages, so you may want to have a plan to find another connection if that happens.

Once all teams are done running, we will open the nodes back up for you to configure for the final 48 hour competition.

Submitting results

Final score submissions are due 1 hour after your window ends, right as the next team is starting. We will follow up with more details on this.

Report

Single-node, multi-node, and hero-run related files

References

Official Documentation, AMD HPL Benchmark, Run HPL on Threadripper, 基于 HPL 测试的集群系统性能分析与优化 (cluster performance analysis and optimization based on the HPL benchmark)

Approach

Tuning HPL.dat

See the report for the detailed process.

Multiple parameter sets can be listed in HPL.dat, so a single HPL run produces several results, which is more efficient.
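
For example, the relevant fragment of HPL.dat might look like this (placeholder values; only the N/NB lines are shown and the rest of the file is unchanged), which makes one run sweep every N x NB combination:

```
2            # of problems sizes (N)
40000 80000  Ns
3            # of NBs
192 224 256  NBs
```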

Tuning MPI parameters

Processes or threads

The math library underneath HPL can use multiple threads, so one HPL process can be given a whole socket. Which layout performs better has to be measured.

Intel's math library offers three threading modes: sequential, OpenMP, and Intel TBB; the latter two can use multiple cores.

See the report for the detailed tuning results.

Core binding

numactl can be used to bind the memory HPL uses to a particular NUMA node. For core-binding tutorials, see: IBM MPI documentation, Understanding-MPI-map-by-and-bind-to-option

Author

@Zecheng Li

Tutorial

Video: https://drive.google.com/file/d/1c2bD3gZw5ZeJS81i1uXY6eAXf70t8bZo/view

NAMD 2.14 User’s Guide: https://www.ks.uiuc.edu/Research/namd/2.14/ug/

NAMD Tutorial: http://www.ks.uiuc.edu/Training/Tutorials/namd/namd-tutorial-unix-html/index.html

VMD User’s Guide: https://www.ks.uiuc.edu/Research/vmd/current/ug/

VMD Tutorial: https://www.ks.uiuc.edu/Training/Tutorials/vmd/tutorial-html/

Building

We use spack as the package manager. To build a simple version without MPI support:

spack install namd

In our competition we need to support multiple nodes, so we install charmpp with the MPI backend and SMP enabled. The Tcl interface is included for parsing the input file. The command below depends on OpenMPI with UCX support.

spack install -v namd%gcc interface=tcl ^charmpp backend=mpi ^openmpi fabrics=ucx

We could also use a pure MPI version, but unlike some other applications NAMD is optimized for multi-threading, so the SMP version is usually faster than pure MPI when a single node has many cores. When running larger-scale jobs, we should use more communication threads (one per NUMA domain).

Tuning performance

We could try different compilers to build NAMD. The performance critical part of NAMD is the force calculation implemented in the source of NAMD itself (instead of in some math libraries), so compiler optimization is crucial for the performance.

We could try different compilers: (oneapi is not supported by charm++ that NAMD 2.14 depends on)

spack install -v namd%intel interface=tcl ^charmpp%intel backend=mpi ^openmpi fabrics=ucx spack install -v namd%nvhpc interface=tcl ^charmpp%nvhpc backend=mpi ^openmpi fabrics=ucx spack install -v namd%clang interface=tcl ^charmpp%clang backend=mpi ^openmpi fabrics=ucx

As you may find, Intel compiler might yield better performance than gcc, and NVHPC and clang are extremely slow. Don't worry. Have a look at the build scripts of NAMD.

spack edit namd

```python
if self.spec.satisfies("^charmpp@:6.10.1"):
    optims_opts = {
        "gcc": m64 + "-O3 -fexpensive-optimizations \
                      -ffast-math -lpthread " + archopt,
        "intel": "-O2 -ip -qopenmp-simd" + archopt,
        "aocc": m64 + "-O3 -ffp-contract=fast -ffast-math \
                       -fopenmp " + archopt,
    }
else:
    optims_opts = {
        "gcc": m64 + "-O3 -fexpensive-optimizations \
                      -ffast-math -lpthread " + archopt,
        "intel": "-O2 -ip " + archopt,
        "aocc": m64 + "-O3 -ffp-contract=fast \
                       -ffast-math " + archopt,
    }
```

It did not set the optimization flags for clang and NVHPC. We could add them by ourselves. Below is an example; you should try different flags based on your architecture.

"clang": m64 + "-O3 -ffp-contract=fast -ffast-math -mprefer-vector-width=256 " + archopt, "nvhpc": m64 + "-O3 -fast " + archopt,

Then we could build NAMD with clang and NVHPC. Surprisingly, clang is faster than Intel compiler on our machine (Haswell).

From the charm++ documentation we can see that different load balancer modules can be compiled into NAMD with different flags, but the default spack build script does not include them. Having a load balancer suited to your architecture and interconnect is important. Some load balancers may cause a compile error because of multiple definitions.

-module TreeLB -module RecBipartLB ... links the listed LB modules into an application, which can then be used at runtime via the +balancer option.

spack edit namd

```python
# in function def edit(self, spec, prefix): add this line
opts.extend(["-module CentralLB -module DistributedLB"])
```

From our experience, the default load balancer, the CentralLB, and the DistributedLB should be chosen based on the input and the architecture. It will bring about a 5-10% performance difference. You can also experiment with other load balancers or even write your own load balancer (not that hard!).

There are also some other flags that could be tuned. Since our machine does not support AVX512, we did not try the Intel-optimized AVX512 blocking version of NAMD.

Assignment

  1. Quick Start & Homework: https://gitlab.msu.edu/vermaaslab/indysccnamd/-/tree/main

  2. Running Command:

    ```bash
    # One per node with SMP
    mpirun -np 20 -hostfile ~/work/host20 -bind-to core -map-by ppr:1:node -x PATH \
        namd2 +ppn 19 +pemap 1-19 +commap 0 run.namd

    # One per NUMA with SMP (You should check your NUMA topology first)
    mpirun -np 40 -hostfile ~/work/host20 --bind-to core --map-by ppr:2:node -x PATH \
        namd2 +ppn 9 +pemap 1-4,10-14,5-9,15-19 +commap 0,5 ./run.namd

    # Run with 19 replicas
    mpirun -np 19 -hostfile ~/work/host20 --bind-to core --map-by ppr:1:node -x PATH \
        namd2 +replicas 19 +balancer DistributedLB +ppn 20 +pemap 0-19 +commap 0 \
        +stdout output/out.%d.log ./replicaconfig.namd
    ```

    Here we oversubscribe the cores. Since core 0 is lightly loaded when communication is not heavy, we can also assign it to computation. Note that in above commands, we always put core 0 or core 5 to communication. This is because we have set most of the interrupt affinity to core 0 and core 5, using them could get better performance on both communication and computation. (it will be up to 5% slower if you use other cores)

  3. MPI Reference: https://www.ks.uiuc.edu/Research/namd/2.9/ug/node87.html

  4. Our submission: https://drive.google.com/file/d/1HqxWP6YJIr06wz6ANMHog3v59HnhV7T2/view?usp=share_link

Final

  1. Problem Set: https://drive.google.com/file/d/1zyWpv-bfN2uzke7RqnpS8PI6Q-AQFtKb/view?usp=share_link

  2. Our Submission: https://drive.google.com/drive/folders/1dpVS6027vJTsbxlOjfMEGdftOfQyCmlO?usp=share_link

Experience

Introduction to NAMD configuration file parameters:

  1. NAMD configuration parameters: https://www.ks.uiuc.edu/Research/namd/2.9/ug/node12.html
  2. Non-bonded Interaction & Parameters: https://www.ks.uiuc.edu/Research/namd/2.10b2/ug/node23.html

Tuning approach:

  1. Overall idea: within the range where the simulation stays stable, make the products timestep × nonbondedFreq and timestep × fullElectFrequency as large as possible (see the example config fragment after this list).

  2. The output frequency also affects performance; writing output less often gains roughly 5-10%.

  3. Reference:

    1. https://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdPerformanceTuning
    2. https://www.ks.uiuc.edu/Research/namd/cvs/ug/node95.html
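
A hypothetical fragment of the corresponding .namd settings (the parameter names are real NAMD options, but the values are placeholders and must be validated against a stable, physically sensible run):

```
timestep            2.0     ;# fs per MD step
nonbondedFreq       1       ;# nonbonded forces every step
fullElectFrequency  2       ;# full electrostatics (PME) every 2 steps
outputEnergies      5000    ;# steps between energy output lines
dcdfreq             5000    ;# steps between trajectory frames
```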

SC21

Website: https://sc21.supercomputing.org/program/studentssc/student-cluster-competition/ Rewind: https://victoryang00.cn/wordpress/2021/11/18/sc21-shi-bai-hui-gu/

Quantum Espresso

https://github.com/QEF/q-e

compile

Could not find MPI (Missing MPI_FORTRAN_FOUND)

Solution: -DMPIEXEC_EXECUTABLE=${MPI_HOME}/bin/mpiexec

The compiled version does not support OpenMP and only supports up to 4 MPI processes.

Add the options:

-DQE_ENABLE_OPENMP=ON -DCMAKE_Fortran_COMPILER=${MPI_HOME}/bin/mpifort -DOpenMP_C_FLAGS=-fopenmp=lomp -DOpenMP_CXX_FLAGS=-fopenmp=lomp -DOpenMP_C_LIB_NAMES=libomp -DOpenMP_CXX_LIB_NAMES=libomp -DOpenMP_libomp_LIBRARY=/usr/lib/x86_64-linux-gnu/libomp.so.5

Change Toolchain to System.

Add -g to CMakeLists.txt to get additional debug information.

```cmake
set(CMAKE_CXX_FLAGS -g)
set(CMAKE_C_FLAGS -g)
set(CMAKE_Fortran_FLAGS -g)
```

https://www.quantum-espresso.org/Doc/user_guide/

library configure: https://www.quantum-espresso.org/Doc/user_guide/node11.html

test

In directory /q-e/test-suite/, use make run-tests to test the correctness of basic functionalities.

run

spack load ucx/gji

/home/qe/q-e/bin/pw.x

To control the number of processors in each group, command line switches: -nimage, -npools, -nband, -ntg, -ndiag or -northo (shorthands, respectively: -ni, -nk, -nb, -nt, -nd) are used. As an example consider the following command line: mpirun -np 4096 ./neb.x -ni 8 -nk 2 -nt 4 -nd 144 -i my.input This executes a NEB calculation on 4096 processors, 8 images (points in the configuration space in this case) at the same time, each of which is distributed across 512 processors. k-points are distributed across 2 pools of 256 processors each, 3D FFT is performed using 4 task groups (64 processors each, so the 3D real-space grid is cut into 64 slices), and the diagonalization of the subspace Hamiltonian is distributed to a square grid of 144 processors (12x12).

mpirun -np 24 -x PATH --oversubscribe -x OMP_NUM_THREADS=4 -x LD_LIBRARY_PATH=/opt/nonspack/ucx-1.10.0-gcc/lib --allow-run-as-root /home/qe/q-e/bin/pw.x < ./ausurf.in

First run with 24 processes and 4 threads each:

Problem: OMP threads can only use up to 200% CPU per process even with 256 threads per process.

Analyze

Static Analysis

Using lizard

Fortran:

| Total nloc | Avg. NLOC | Avg. CCN | Avg. token | Fun Cnt | Warning cnt | Fun Rt | nloc Rt |
|-----------:|----------:|---------:|-----------:|--------:|------------:|-------:|--------:|
| 599949     | 54.1      | 10.6     | 569.7      | 9939    | 1693        | 0.17   | 0.58    |

C:

| Total nloc | Avg. NLOC | Avg. CCN | Avg. token | Fun Cnt | Warning cnt | Fun Rt | nloc Rt |
|-----------:|----------:|---------:|-----------:|--------:|------------:|-------:|--------:|
| 52039      | 152.5     | 3.0      | 1050.3     | 323     | 19          | 0.06   | 0.53    |

Python:

| Total nloc | Avg. NLOC | Avg. CCN | Avg. token | Fun Cnt | Warning cnt | Fun Rt | nloc Rt |
|-----------:|----------:|---------:|-----------:|--------:|------------:|-------:|--------:|
| 8864       | 18.3      | 5.0      | 146.0      | 298     | 21          | 0.07   | 0.26    |

Profiling result

All the GPU-version test cases seem to hit an IEEE underflow triggered by the FFT library; this should eventually be fixed, since the development team is still actively adapting the application to GPUs.

We chose a case called si.scf.david.in to profile on a single GPU. Here is the profiling result.

=117847== Profiling application: /home/qe/bin/pw.x -i ./si.scf.david.in ==117847== Profiling result: Type Time(%) Time Calls Avg Min Max Name GPU activities: 8.72% 22.118ms 140 157.98us 157.82us 159.81us usnldiag_collinear_79_gpu 6.46% 16.390ms 1360 12.051us 11.840us 189.41us init_us_2_base_gpu_216_gpu 5.29% 13.411ms 10 1.3411ms 1.3407ms 1.3417ms rotate_wfc_k_gpu_146_gpu 4.24% 10.763ms 370 29.090us 28.704us 32.928us ylmr2_gpum_ylmr2_gpu_kernel_ 3.71% 9.4250ms 1127 8.3620us 6.5280us 17.664us volta_zgemm_32x32_nn 3.23% 8.1880ms 1224 6.6890us 6.5920us 7.1040us init_us_2_base_gpu_220_gpu 2.68% 6.8026ms 680 10.003us 9.8560us 10.784us init_us_2_base_gpu_185_gpu 2.67% 6.7818ms 340 19.946us 19.744us 21.280us init_us_2_base_gpu_206_gpu 2.61% 6.6090ms 340 19.438us 19.295us 21.504us init_us_2_base_gpu_158_gpu 2.46% 6.2396ms 689 9.0560us 7.2000us 14.432us void zgemm_largek_warp<bool=1, bool=0, bool=1, bool=0, int=3, int=3, int=4, int=3, int=2, int=2, int=9>(double2*, double2 const *, double2 const *, int, int, int, int, int, int, double2 const *, double2 const *, double2, double2, int, int, int*, int*) 2.28% 5.7953ms 159 36.448us 19.392us 43.200us cegterg_gpu_493_gpu 2.20% 5.5704ms 1104 5.0450us 4.1600us 11.488us void composite_2way_fft<unsigned int=20, unsigned int=4, unsigned int=32, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=5, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>) 2.17% 5.4956ms 478 11.497us 11.359us 12.864us add_vuspsi_k_gpu_242_gpu 1.98% 5.0265ms 239 21.031us 10.208us 40.384us vloc_psi_k_gpu_464_gpu 1.86% 4.7254ms 219 21.577us 12.319us 33.824us void sytd2_upper_cta<double2, double, int=4>(int, double2*, unsigned long, double*, double*, double2*) 1.71% 4.3307ms 219 19.774us 19.743us 20.960us laxlib_cdiaghg_gpu_349_gpu 1.64% 4.1660ms 239 17.430us 17.248us 19.488us vloc_psi_k_gpu_477_gpu 1.48% 3.7585ms 1 3.7585ms 3.7585ms 3.7585ms force_corr_gpu_103_gpu 1.45% 3.6914ms 239 15.444us 15.264us 16.704us vloc_psi_k_gpu_456_gpu 1.40% 3.5579ms 2320 1.5330us 1.4080us 13.056us [CUDA memcpy DtoH] 1.36% 3.4570ms 219 15.785us 15.712us 16.352us laxlib_cdiaghg_gpu_317_gpu 1.34% 3.4099ms 159 21.445us 21.280us 23.136us g_psi_gpu_53_gpu 1.28% 3.2424ms 1979 1.6380us 1.2160us 13.120us [CUDA memcpy HtoD] 1.22% 3.0915ms 552 5.6000us 4.2880us 9.0560us void composite_2way_fft<unsigned int=20, unsigned int=4, unsigned int=16, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=5, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>) 1.19% 3.0239ms 239 12.652us 10.816us 14.240us h_psi__gpu_158_gpu 1.14% 2.8893ms 219 13.193us 9.2160us 20.192us void trsm_ln_up_kernel<double2, unsigned int=32, unsigned int=32, unsigned int=4, bool=0>(int, int, double2 const *, int, double2*, int, double2, double2 const *, int, int*) 1.12% 2.8463ms 1095 2.5990us 2.4960us 3.2640us copy_info_kernel(int, int*) 1.06% 2.6975ms 170 15.867us 15.647us 16.544us init_us_2_base_gpu_119_gpu 1.02% 2.5845ms 40 64.612us 64.320us 72.960us stres_us_k_gpu_702_gpu 1.01% 2.5699ms 159 16.162us 16.096us 16.704us reorder_evals_cevecs_707_gpu 0.99% 2.5005ms 40 62.512us 62.240us 70.656us stres_us_k_gpu_817_gpu 0.97% 2.4644ms 159 15.499us 15.232us 16.576us cegterg_gpu_427_gpu 0.96% 2.4360ms 70 34.799us 34.720us 35.424us cegterg_gpu_265_gpu 0.89% 2.2453ms 40 56.131us 55.840us 63.040us stres_knl_gpu_100_gpu 0.86% 2.1855ms 40 54.636us 54.463us 56.832us stres_us_k_gpu_543_gpu 0.82% 2.0773ms 243 8.5480us 7.2320us 11.904us fft_scalar_cufft_cfft3d_gpu_586_gpu 0.82% 2.0749ms 280 7.4100us 7.3280us 
7.8720us get_rho_gpu_954_gpu 0.80% 2.0350ms 212 9.5990us 9.4080us 10.016us dp_dev_memcpy_c2d_770_gpu 0.71% 1.7922ms 689 2.6010us 2.4960us 3.7440us void scal_kernel<double2, double2, int=1, bool=1, int=5, int=4, int=4, int=4>(cublasTransposeParams<double2>, double2 const *, double2*, double2 const *) 0.70% 1.7640ms 159 11.094us 10.912us 11.744us cegterg_gpu_376_gpu 0.67% 1.7032ms 508 3.3520us 3.1670us 4.4480us void reduce_1Block_kernel<double, int=128, int=7, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>>(double const *, double, double, int, double const *, double, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>, cublasPointerMode_t, cublasLtEpilogue_t, cublasGemvTensorStridedBatched<biasType<cublasGemvTensorStridedBatched<double>::value_type, double>::type const >) 0.67% 1.7000ms 508 3.3460us 3.1680us 4.8640us void dot_kernel<double, int=128, int=0, cublasDotParams<cublasGemvTensor<double const >, cublasGemvTensorStridedBatched<double>>>(double const ) 0.66% 1.6738ms 40 41.843us 41.760us 42.944us stres_us_k_gpu_617_gpu 0.66% 1.6658ms 159 10.476us 10.432us 11.136us reorder_evals_cevecs_700_gpu 0.54% 1.3789ms 219 6.2960us 5.1840us 8.9280us void potrf_alg2_cta_upper<double2, double, int=32>(int, int, double2*, unsigned long, int*) 0.53% 1.3506ms 170 7.9440us 7.8400us 8.6080us init_us_2_base_gpu_134_gpu 0.53% 1.3341ms 438 3.0450us 2.4960us 188.80us void lapack_identity_kernel<double, int=8>(int, int, double*, int) 0.52% 1.3279ms 219 6.0630us 5.0880us 8.6400us void trsm_right_kernel<double2, int=256, int=4, bool=0, bool=0, bool=0, bool=1, bool=0>(cublasTrsmParams<double2>, double2, double2 const *, int) 0.52% 1.3185ms 219 6.0200us 4.3200us 8.2880us void ormql_cta_kernel<double2, int=4, int=1>(int, int, int, double2 const *, unsigned long, double2 const *, double2*, unsigned long, int, int, int, int) 0.52% 1.3185ms 90 14.649us 14.496us 15.072us dylmr2_gpu_78_gpu 0.51% 1.2925ms 209 6.1840us 6.1440us 6.4640us dp_dev_memcpy_r1d_270_gpu 0.50% 1.2803ms 71 18.033us 17.983us 18.687us cegterg_gpu_615_gpu 0.50% 1.2592ms 438 2.8740us 2.7200us 3.8720us void kernel_extract_uplo_A<double2, int=5, int=3>(int, double2 const *, unsigned long, double2*, unsigned long, int) 0.50% 1.2586ms 163 7.7210us 7.5840us 8.0000us dp_dev_memset_c2d_1851_gpu 0.47% 1.1830ms 408 2.8990us 2.4960us 3.7440us __pgi_dev_cumemset_16n 0.47% 1.1818ms 80 14.772us 14.496us 17.216us g2_kin_gpu_40_gpu 0.44% 1.1150ms 169 6.5970us 5.6960us 9.1200us void trsm_left_kernel<double2, int=256, int=4, bool=0, bool=1, bool=1, bool=1, bool=0>(cublasTrsmParams<double2>, double2, double2 const *, int) 0.42% 1.0619ms 52 20.420us 18.944us 27.136us volta_zgemm_32x32_cn 0.42% 1.0610ms 70 15.157us 15.104us 16.032us sum_band_k_gpu_837_gpu 0.40% 1.0224ms 219 4.6680us 4.2240us 5.4720us void lansy_M_stage1<double2, double, int=8>(int, double2 const *, unsigned long, double*, int) 0.40% 1.0046ms 90 11.162us 11.040us 11.488us dylmr2_gpu_90_gpu 0.39% 984.57us 80 12.307us 12.223us 12.928us atomic_wfc___gpu_396_gpu 0.37% 946.72us 80 11.833us 11.744us 12.224us compute_deff_gpu_41_gpu 0.36% 909.82us 689 1.3200us 1.2480us 2.0160us [CUDA memset] 0.34% 856.35us 219 3.9100us 3.8080us 5.6000us void batch_symmetrize_kernel<double2, int=5, int=3>(int, double2*, unsigned long, __int64, int, int) 0.34% 855.00us 30 28.500us 28.352us 29.568us gen_us_dy_gpu_229_gpu 0.33% 842.37us 90 9.3590us 9.2480us 9.8240us dylmr2_gpu_101_gpu 0.33% 827.00us 90 9.1880us 9.0230us 10.048us 
dylmr2_gpu_60_gpu 0.30% 772.22us 219 3.5260us 3.4870us 4.8000us void lansy_M_stage2<double, int=8>(int, double*) 0.29% 745.95us 30 24.865us 24.831us 25.120us gen_us_dy_gpu_198_gpu 0.28% 703.80us 30 23.460us 23.423us 24.128us gen_us_dy_gpu_146_gpu 0.27% 690.78us 219 3.1540us 3.0720us 3.7120us void lapack_lacpy_kernel<double, int=8>(int, int, double const *, int, double*, int, int, int) 0.27% 685.82us 219 3.1310us 3.0390us 3.6480us void laed0_phase1_kernel<double, int=8>(int, double const *, int, int const *, double*, int, int, int) 0.25% 644.64us 219 2.9430us 2.8800us 3.9040us void stedcx_convert_kernel<double2, double, int=8>(int, int, double const *, int, double2*, int) 0.25% 642.30us 219 2.9320us 2.8800us 3.2960us void lacpy_kernel<double2, double2, int=5, int=3>(int, int, double2 const *, unsigned long, double2*, unsigned long, int, int) 0.25% 623.36us 219 2.8460us 2.8150us 3.2000us potrf_alg2_reset_info(int*) 0.24% 598.37us 219 2.7320us 2.6880us 2.8800us dtrsv_init_up(int*, int) 0.24% 596.93us 219 2.7250us 2.6880us 3.2320us potrf_alg2_set_info(int, int, int*) 0.22% 558.62us 30 18.620us 18.432us 18.911us gen_us_dy_gpu_85_gpu 0.21% 525.28us 70 7.5030us 7.4560us 7.6160us diag_bands_k_693_gpu 0.18% 457.21us 30 15.240us 15.136us 15.968us force_us_gpu_104_gpu 0.18% 456.89us 50 9.1370us 8.9910us 14.144us void trsm_lt_up_kernel<double2, unsigned int=32, unsigned int=32, unsigned int=4, bool=0, bool=1>(int, int, double2 const *, int, double2*, int, double2, double2 const *, int, int*) 0.18% 454.24us 30 15.141us 15.040us 17.024us gen_us_dy_gpu_185_gpu 0.18% 453.47us 70 6.4780us 6.4320us 6.7520us dp_dev_memset_r2d_1431_gpu 0.17% 437.12us 20 21.856us 21.632us 23.712us atomic_wfc_gpu_108_gpu 0.17% 427.58us 20 21.379us 20.992us 23.104us interp_atwfc_gpu_30_gpu 0.15% 381.34us 30 12.711us 12.608us 13.184us gen_us_dy_gpu_102_gpu 0.14% 362.69us 60 6.0440us 5.9510us 6.2720us gen_us_dy_gpu_220_gpu 0.13% 334.53us 78 4.2880us 3.9040us 5.5360us void gemv2N_kernel<int, int, double2, double2, double2, double2, int=128, int=16, int=4, int=4, int=1, bool=0, cublasGemvParams<cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>, double2>>(double2 const ) 0.12% 298.91us 1 298.91us 298.91us 298.91us compute_dvloc_gpum_compute_dvloc_gpu_ 0.10% 255.07us 10 25.507us 25.280us 27.392us gen_us_dj_gpu_206_gpu 0.10% 248.74us 10 24.873us 24.800us 25.216us gen_us_dj_gpu_173_gpu 0.10% 243.93us 10 24.393us 24.256us 25.440us gen_us_dj_gpu_119_gpu 0.08% 204.67us 30 6.8220us 6.7520us 6.9760us gen_us_dy_gpu_112_gpu 0.08% 198.24us 52 3.8120us 3.5520us 4.9280us void splitKreduce_kernel<double2, double2, double2, double2>(cublasSplitKParams<double2>, double2 const *, double2 const *, double2*, double2 const *, double2 const *, double2 const *) 0.08% 197.82us 52 3.8040us 3.6480us 4.7040us void gemvNSP_kernel<double2, double2, double2, double2, int=1, int=32, int=4, int=1024, cublasGemvParams<cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>, double2>>(double2 const ) 0.08% 194.37us 10 19.436us 19.072us 20.832us init_wfc_gpu_295_gpu 0.07% 186.46us 10 18.646us 18.592us 18.816us gen_us_dj_gpu_73_gpu 0.07% 182.18us 10 18.217us 18.176us 18.399us stres_knl_gpu_84_gpu 0.07% 173.02us 20 8.6510us 8.6400us 8.8320us cegterg_gpu_288_gpu 0.07% 172.42us 20 8.6200us 8.5120us 9.0560us stres_us_gpu_131_gpu 0.07% 171.01us 10 17.100us 17.024us 17.376us atomic_wfc_gpu_70_gpu 0.06% 152.13us 
10 15.212us 15.071us 16.384us gen_us_dj_gpu_160_gpu 0.05% 137.73us 50 2.7540us 2.7200us 2.9760us dtrsv_init(int*) 0.05% 135.39us 2 67.695us 64.959us 70.432us force_corr_gpu_124_gpu 0.05% 123.78us 20 6.1880us 5.8880us 6.7520us void gemv2T_kernel_val<int, int, double2, double2, double2, double2, int=128, int=16, int=4, int=4, bool=1, bool=0, cublasGemvParams<cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>, double2>>(double2 const , double2, double2) 0.05% 120.93us 20 6.0460us 5.9520us 6.3680us gen_us_dj_gpu_197_gpu 0.04% 103.62us 10 10.361us 10.304us 10.848us stres_us_gpu_91_gpu 0.04% 96.448us 7 13.778us 13.568us 14.176us dfunct_gpum_newd_gpu_311_gpu 0.04% 94.400us 1 94.400us 94.400us 94.400us stres_ewa_gpu_155_gpu 0.03% 72.992us 10 7.2990us 7.1360us 8.4160us init_wfc_gpu_391_gpu 0.03% 72.800us 2 36.400us 34.432us 38.368us force_lc_gpu_119_gpu 0.03% 72.768us 1 72.768us 72.768us 72.768us stres_har_gpu_77_gpu 0.03% 69.888us 10 6.9880us 6.8480us 7.4240us atomic_wfc_gpu_85_gpu 0.03% 67.520us 1 67.520us 67.520us 67.520us stres_loc_gpu_155_gpu 0.02% 59.712us 10 5.9710us 5.8880us 6.2080us rotate_wfc_k_gpu_132_gpu 0.01% 24.384us 6 4.0640us 3.7760us 4.9600us void reduce_1Block_kernel<double2, int=64, int=6, cublasGemvTensorStridedBatched<double2>, cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>>(double2 const *, double2, double2, int, double2 const *, double2, cublasGemvTensorStridedBatched<double2>, double2 const , cublasPointerMode_t, cublasLtEpilogue_t, cublasGemvTensorStridedBatched<biasType<double2 const value_type, double2>::type const >) 0.01% 24.224us 6 4.0370us 3.7760us 4.8960us void dot_kernel<double2, int=64, int=1, cublasDotParams<cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>>>(double2 const ) 0.01% 21.568us 1 21.568us 21.568us 21.568us stres_loc_gpu_98_gpu 0.01% 15.264us 6 2.5440us 2.4640us 2.8160us __pgi_dev_cumemset_4n 0.00% 9.7280us 1 9.7280us 9.7280us 9.7280us dvloc_of_g_gpu_184_gpu API calls: 56.54% 877.99ms 1715 511.95us 489ns 409.99ms cudaFree 19.84% 308.14ms 900 342.37us 1.4400us 295.87ms cudaDeviceSynchronize 7.03% 109.13ms 20152 5.4150us 4.5100us 310.44us cudaLaunchKernel 4.31% 66.931ms 1542 43.405us 4.6000us 3.8148ms cudaMemcpy 2.19% 34.061ms 2479 13.739us 3.8100us 180.48us cudaMemcpyAsync 2.12% 32.959ms 2557 12.889us 4.6510us 239.27us cudaEventSynchronize 1.43% 22.244ms 20 1.1122ms 822.92us 2.3907ms cuDeviceTotalMem 1.11% 17.296ms 6645 2.6020us 749ns 186.38us cudaEventRecord 0.93% 14.380ms 1744 8.2450us 1.8290us 1.3001ms cudaMalloc 0.75% 11.621ms 1977 5.8780us 149ns 1.6835ms cuDeviceGetAttribute 0.57% 8.8800ms 20143 440ns 330ns 287.69us cudaDeviceGetAttribute 0.49% 7.6111ms 1656 4.5960us 4.0700us 31.689us cuLaunchKernel 0.33% 5.1501ms 10579 486ns 330ns 239.62us cudaGetDevice 0.29% 4.4656ms 6 744.27us 448.31us 2.1013ms cudaGetDeviceProperties 0.28% 4.4199ms 10835 407ns 150ns 2.2176ms cudaGetLastError 0.25% 3.8660ms 1384 2.7930us 1.8200us 8.4200us cudaStreamSynchronize 0.20% 3.1513ms 689 4.5730us 3.3890us 20.390us cudaMemsetAsync 0.19% 3.0171ms 2557 1.1790us 1.0100us 11.680us cudaEventElapsedTime 0.15% 2.3771ms 256 9.2850us 1.9900us 152.75us cudaSetDevice 0.15% 2.2786ms 1524 1.4950us 780ns 12.790us cudaEventQuery 0.14% 2.1870ms 145 15.083us 7.2200us 21.080us cudaMemcpy2D 0.11% 1.7847ms 147 12.140us 4.5000us 738.97us cudaMallocHost 0.11% 1.7611ms 2336 753ns 469ns 12.960us 
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags 0.09% 1.3806ms 20 69.028us 41.230us 387.93us cuDeviceGetName 0.09% 1.3584ms 133 10.213us 4.9500us 107.07us cudaMemcpyToSymbol 0.09% 1.3446ms 508 2.6460us 2.2900us 14.350us cudaFuncGetAttributes 0.05% 771.33us 146 5.2830us 3.7500us 20.409us cudaFreeHost 0.04% 625.29us 44 14.211us 1.3800us 205.11us cudaStreamCreate 0.02% 380.08us 552 688ns 510ns 3.6400us cudaStreamIsCapturing 0.02% 359.66us 44 8.1740us 3.8090us 92.571us cudaStreamDestroy 0.01% 195.34us 267 731ns 620ns 15.100us cudaEventCreate 0.01% 170.44us 562 303ns 200ns 1.2400us cuCtxPushCurrent 0.01% 158.23us 562 281ns 200ns 810ns cuCtxPopCurrent 0.01% 116.94us 146 800ns 480ns 2.9910us cudaPointerGetAttributes 0.00% 54.041us 90 600ns 460ns 2.8110us cudaEventCreateWithFlags 0.00% 40.090us 3 13.363us 2.4000us 32.530us cudaStreamCreateWithFlags 0.00% 20.707us 24 862ns 250ns 6.3000us cuDeviceGet 0.00% 18.040us 4 4.5100us 1.8300us 9.0200us cuDeviceGetPCIBusId 0.00% 17.489us 4 4.3720us 2.5690us 9.3200us cuInit 0.00% 16.104us 45 357ns 180ns 1.9900us cudaGetFuncBySymbol 0.00% 13.147us 8 1.6430us 1.3110us 3.2490us cudaEventDestroy 0.00% 5.2070us 20 260ns 150ns 580ns cuDeviceGetUuid 0.00% 3.3580us 7 479ns 230ns 940ns cuDeviceGetCount 0.00% 2.6790us 10 267ns 180ns 360ns cuCtxGetCurrent 0.00% 1.2700us 2 635ns 190ns 1.0800us cudaGetDeviceCount 0.00% 1.1300us 4 282ns 240ns 380ns cuDriverGetVersion 0.00% 920ns 5 184ns 170ns 200ns cuCtxGetDevice 0.00% 309ns 1 309ns 309ns 309ns cudaDriverGetVersion 0.00% 200ns 1 200ns 200ns 200ns cudaRuntimeGetVersion

Compile with ICC

Compile with Intel icc and the FFTW library.

spack load intel-oneapi-compilers@2021.1.2

spack load intel-parallel-studio@cluster-2020.2

spack load netlib-lapack@3.9.1/nbc

spack load openmpi@4.1.1/jip

./configure --prefix=/home/qe/fftw-3.3.9 F77=ifort CC=icc CFLAGS="-O3 -g -march=native" FFLAGS="-O3 -g" -enable-openmp

make -j 128 all

If the option -march=native is added in FFLAGS, ifort will throw an error

ifort: error #10106: Fatal error in /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/compiler/2021.1.2/linux/bin/intel64/../../bin/intel64/fortcom, terminated by segmentation violation

Tuning with different numbers of MPI processes and OpenMP threads on one node, 32 processes with 8 threads each gave the best performance on the AUSURF112 test case.

​ PWSCF : 37m 3.31s CPU 4m46.48s WALL

Compile with AOCC

spack load aocc@3.0.0/46t spack load amdfftw@3.0 spack load openmpi@4.1.1/nqq export F90=flang export F77=flang export FC=flang export CC=clang export CXX=clang++ ./configure --enable-parallel --enable-openmp CFLAGS="-O3 -g -march=znver2" FFLAGS="-O3 -g -march=znver2" FFT_LIBS="/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3.a /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3_omp.a /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3_threads.a" BLAS_LIBS=/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdblis-3.0-avcgn4ja67j4wz5euv6usv4rt2okvytg/lib/libblis-mt.a LAPACK_LIBS=/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdlibflame-3.0-6tev4j6setn6jmojmydlnz3qi4bn5qrs/lib/libflame.a MPI_LIBS="-L/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/openmpi-4.1.1-nqqearshseiwkncy5roqcqij5dieen3p/lib" DFLAGS="-D__FFTW3 -D__MPI" IFLAGS="-I/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/include -I/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdblis-3.0-avcgn4ja67j4wz5euv6usv4rt2okvytg/include -I/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdlibflame-3.0-6tev4j6setn6jmojmydlnz3qi4bn5qrs/include -I/home/qe/q-e/include"

pitfall: qe configure does not recognize flang. Need to change F90=flang in make.inc manually.

This build does not pass the test suite, and the AUSURF112 benchmark does not converge (the errors may come from the libraries).

All done. ERROR: only 166 out of 221 tests passed. Failed tests in: /home/qe/q-e/test-suite/pw_b3lyp/ /home/qe/q-e/test-suite/pw_berry/ /home/qe/q-e/test-suite/pw_cluster/ /home/qe/q-e/test-suite/pw_electric/ /home/qe/q-e/test-suite/pw_lda+U/ /home/qe/q-e/test-suite/pw_lsda/ /home/qe/q-e/test-suite/pw_md/ /home/qe/q-e/test-suite/pw_metaGGA/ /home/qe/q-e/test-suite/pw_metal/ /home/qe/q-e/test-suite/pw_noncolin/ /home/qe/q-e/test-suite/pw_pawatom/ /home/qe/q-e/test-suite/pw_realspace/ /home/qe/q-e/test-suite/pw_relax/ /home/qe/q-e/test-suite/pw_scf/ /home/qe/q-e/test-suite/pw_spinorbit/ /home/qe/q-e/test-suite/pw_uspp/ /home/qe/q-e/test-suite/pw_vc-relax/ /home/qe/q-e/test-suite/pw_vdw/ /home/qe/q-e/test-suite/pw_workflow_relax_relax/ /home/qe/q-e/test-suite/pw_workflow_scf_dos/ /home/qe/q-e/test-suite/pw_workflow_vc-relax_dos/ /home/qe/q-e/test-suite/pw_workflow_vc-relax_scf/ starting charge 1230.69946, renormalised to 1232.00000 negative rho (up, down): 3.043E+00 0.000E+00 Starting wfcs are 1008 randomized atomic wfcs [epyc.node1:216922] 127 more processes have sent help message help-btl-vader.txt / xpmem-make-failed [epyc.node1:216922] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages [epyc.node1:216922] 127 more processes have sent help message help-btl-vader.txt / knem permission denied total cpu time spent up to now is 22.9 secs Self-consistent Calculation iteration # 1 ecut= 25.00 Ry beta= 0.70 Davidson diagonalization with overlap ethr = 1.00E-02, avg # of iterations = 5.0 Threshold (ethr) on eigenvalues was too large: Diagonalizing with lowered threshold Davidson diagonalization with overlap ethr = 4.37E-04, avg # of iterations = 18.5 negative rho (up, down): 2.992E+00 0.000E+00 total cpu time spent up to now is 430.1 secs total energy = -11423.48971757 Ry estimated scf accuracy < 6.31636318 Ry iteration # 2 ecut= 25.00 Ry beta= 0.70 Davidson diagonalization with overlap ethr = 5.13E-04, avg # of iterations = 15.5 negative rho (up, down): 2.993E+00 0.000E+00 total cpu time spent up to now is 795.7 secs total energy = -11408.37987998 Ry estimated scf accuracy < 196.19698446 Ry End of self-consistent calculation convergence NOT achieved after 2 iterations: stopping Writing output data file ./ausurf.save/ [epyc:216930:0:216930] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fc7000) ==== backtrace (tid: 216930) ==== 0 /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.3.0/ucx-1.10.1-xby34b5gbwxi5cknbevj4wlbs34hyri6/lib/libucs.so.0(ucs_handle_error+0x254) [0x7fd0b3b587d4] 1 /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.3.0/ucx-1.10.1-xby34b5gbwxi5cknbevj4wlbs34hyri6/lib/libucs.so.0(+0x269b7) [0x7fd0b3b589b7] 2 /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.3.0/ucx-1.10.1-xby34b5gbwxi5cknbevj4wlbs34hyri6/lib/libucs.so.0(+0x26c8e) [0x7fd0b3b58c8e] 3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7fd0b4180730] 4 /home/qe/q-e/bin/pw.x() [0x11e3890] 5 /home/qe/q-e/bin/pw.x() [0x11e3e47] 6 /home/qe/q-e/bin/pw.x() [0x11ef0ce] 7 /home/qe/q-e/bin/pw.x() [0x117a124] 8 /home/qe/q-e/bin/pw.x() [0x9087e0] 9 /home/qe/q-e/bin/pw.x() [0x9085c7] 10 /home/qe/q-e/bin/pw.x() [0x9084f7] 11 /home/qe/q-e/bin/pw.x() [0x906c58] 12 /home/qe/q-e/bin/pw.x() [0x920797] 13 /home/qe/q-e/bin/pw.x() [0x682772] 14 /home/qe/q-e/bin/pw.x() [0x67ca67] 15 /home/qe/q-e/bin/pw.x() [0x6a889f] 16 /home/qe/q-e/bin/pw.x() [0x4c8406] 17 /home/qe/q-e/bin/pw.x() [0x18baa23] 18 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fd0b3fd109b] 19 
/home/qe/q-e/bin/pw.x() [0x4c81da] ================================= -------------------------------------------------------------------------- Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. -------------------------------------------------------------------------- -------------------------------------------------------------------------- mpirun noticed that process rank 0 with PID 0 on node epyc exited on signal 11 (Segmentation fault). --------------------------------------------------------------------------

Compile with GCC

Specify the mkl libraries manually.

spack load gcc@10.2.0/3xz spack load openmpi@4.1.1/n46 ./configure --enable-parallel --with-scalapack=yes --enable-openmp CFLAGS="-O3 -g -march=znver2" FFLAGS="-O3 -g -march=znver2 -fallow-argument-mismatch" FFT_LIBS="/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3.a \ /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3_omp.a \ /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3_threads.a" \ BLAS_LIBS="/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_gf_lp64.a \ /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_sequential.a \ /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_core.a" \ LAPACK_LIBS=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_lapack95_lp64.a \ SCALAPACK_LIBS="/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_scalapack_ilp64.a \ /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_blacs_openmpi_lp64.a" \ MPI_LIBS="-L/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/openmpi-4.1.1-n46i3ctamj3tnmnd7qfzhabdweajbgsn/lib" \ DFLAGS="-D__FFTW3 -D__MPI -D__SCALAPACK" \ IFLAGS="-I/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/include -I/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdblis-3.0-avcgn4ja67j4wz5euv6usv4rt2okvytg/include -I/home/qe/q-e/include"

Error to be fixed:

/usr/bin/ld: /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_core.a(mkl_memory_patched.o): undefined reference to symbol 'dlclose@@GLIBC_2.2.5' /usr/bin/ld: //lib/x86_64-linux-gnu/libdl.so.2: error adding symbols: DSO missing from command line collect2: error: ld returned 1 exit status

Misc

The library used in Q-E compiled by intel compiler:

BLAS_LIBS= -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

SCALAPACK_LIBS=-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64

FFT_LIBS= fftw-3.3.9

init_run : 158.19s CPU 21.00s WALL ( 1 calls) electrons : 2063.54s CPU 264.73s WALL ( 1 calls) Called by init_run: wfcinit : 148.40s CPU 19.08s WALL ( 1 calls) potinit : 1.84s CPU 0.24s WALL ( 1 calls) hinit0 : 2.63s CPU 0.50s WALL ( 1 calls) Called by electrons: c_bands : 1937.22s CPU 247.62s WALL ( 3 calls) sum_band : 116.01s CPU 15.64s WALL ( 3 calls) v_of_rho : 2.32s CPU 0.30s WALL ( 3 calls) newd : 12.90s CPU 1.87s WALL ( 3 calls) mix_rho : 0.29s CPU 0.04s WALL ( 3 calls) Called by c_bands: init_us_2 : 1.41s CPU 0.29s WALL ( 14 calls) cegterg : 1931.14s CPU 246.85s WALL ( 6 calls) Called by *egterg: cdiaghg : 304.65s CPU 38.94s WALL ( 81 calls) h_psi : 656.99s CPU 84.10s WALL ( 85 calls) s_psi : 145.97s CPU 18.38s WALL ( 85 calls) g_psi : 0.31s CPU 0.05s WALL ( 77 calls) Called by h_psi: h_psi:calbec : 183.87s CPU 23.70s WALL ( 85 calls) vloc_psi : 321.07s CPU 41.10s WALL ( 85 calls) add_vuspsi : 150.67s CPU 19.07s WALL ( 85 calls) General routines calbec : 232.51s CPU 30.03s WALL ( 91 calls) fft : 3.38s CPU 0.44s WALL ( 40 calls) ffts : 0.93s CPU 0.15s WALL ( 6 calls) fftw : 348.65s CPU 44.30s WALL ( 37782 calls) interpolate : 0.26s CPU 0.03s WALL ( 3 calls) davcio : 0.04s CPU 0.27s WALL ( 6 calls)

The compiler option -march=native has no significant effect on speed.

We tried to run on two nodes, but it failed.

spack load intel-parallel-studio@cluster-2020.2

spack load openmpi@4.1.1/jip

spack load ucx/gji

mpirun --prefix /opt/spack/opt/spack/linux-debian10-zen2/intel-2021.1.2/openmpi-4.1.1-jipfb67ngxddcblg4rcsjuu47pskabrs/ -np 64 -hostfile ./hostfile -mca pml ucx -x UCX_TLS=rc_x,sm,self -x UCX_NET_DEVICES=mlx5_0:1 -x PATH -x LD_LIBRARY_PATH --oversubscribe /home/qe/q-e/bin/pw.x < ./ausurf.in

Set up the remote node for non-interactive logins

Add the following to .bashrc:

```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nonspack/ucx-1.10.0-gcc/lib
. /opt/spack/share/spack/setup-env.sh
spack load intel-parallel-studio@cluster-2020.2
spack load openmpi@4.1.1/jip
spack load ucx/gji
```

A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find.

Host: epyc.node2 Framework: pml Component: ucx

Arm Forge MAP Result

Original code compiled by the Intel compiler with MKL, test case AUSURF112.

Profiling : /home/qe/q-e/bin/pw.x -i ./ausurf.in Allinea sampler : preload MPI implementation : Auto-Detect (Open MPI) * MPI arguments * number of processes : 32 * number of nodes : 1 * Allinea MPI wrapper : preload (precompiled) Input file : <stdin> Working directory : /home/qe/benchmarks/sb/AUSURF112 Number of OpenMP threads : 8 Queue enabled : No System config file : /home/qe/.allinea/system.config OMP_NUM_THREADS (env var) : 8 Full target path : /home/qe/q-e/PW/src/pw.x Launched from host : epyc.node1 Run started : Sat Aug 28 07:04:24 2021 Sampling started : Sat Aug 28 07:04:24 2021 Sampling stopped : Sat Aug 28 07:09:39 2021 Runtime : 354s Sampled runtime : 315s

CPU floating-point: 38.2%

CPU memory access: 15.9%

CPU fp vector: 38.0%

CPU branch: 7.4%

Memory usage: 676MB

The pcegterg_IP_ functions spend a lot of time in mpi_barrier synchronization, even more than the actual computation time.

Compile Option

NVHPC

```bash
# LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/nvhpc-21.5-qrsvxrpkmqhxy2coxes2qzcfhirsy5uv/Linux_x86_64/21.5/comm_libs/openmpi4/openmpi-4.0.5/lib
spack load nvhpc@21.5/djb
spack load /tyv  # hdf5
```

OneAPI

LD_LIBRARY_PATH=/opt/intel/oneapi/vpl/2021.4.0/lib:/opt/intel/oneapi/tbb/2021.3.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.3.1//libfabric/lib:/opt/intel/oneapi/mpi/2021.3.1//lib/release:/opt/intel/oneapi/mpi/2021.3.1//lib:/opt/intel/oneapi/mkl/2021.3.0/lib/intel64:/opt/intel/oneapi/itac/2021.3.0/slib:/opt/intel/oneapi/ipp/2021.3.0/lib/intel64:/opt/intel/oneapi/ippcp/2021.3.0/lib/intel64:/opt/intel/oneapi/ipp/2021.3.0/lib/intel64:/opt/intel/oneapi/dnnl/2021.3.0/cpu_dpcpp_gpu_dpcpp/lib:/opt/intel/oneapi/debugger/10.1.2/gdb/intel64/lib:/opt/intel/oneapi/debugger/10.1.2/libipt/intel64/lib:/opt/intel/oneapi/debugger/10.1.2/dep/lib:/opt/intel/oneapi/dal/2021.3.0/lib/intel64:/opt/intel/oneapi/compiler/2021.3.0/linux/lib:/opt/intel/oneapi/compiler/2021.3.0/linux/lib/x64:/opt/intel/oneapi/compiler/2021.3.0/linux/lib/emu:/opt/intel/oneapi/compiler/2021.3.0/linux/lib/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2021.3.0/linux/lib/oclfpga/linux64/lib:/opt/intel/oneapi/compiler/2021.3.0/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/ccl/2021.3.0/lib/cpu_gpu_dpcpp:/media/victoryang/NetDisk/Documents/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/nvhpc-21.5-qrsvxrpkmqhxy2coxes2qzcfhirsy5uv/Linux_x86_64/21.5/compilers/lib:/media/victoryang/NetDisk/Documents/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/openssl-1.1.1k-v735mywfwhu5wwrc6rcppju7lxvoxegh/lib:/media/victoryang/NetDisk/Documents/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/zlib-1.2.11-aim3z46oucbopx4jmsvi6rj23psecql5/lib:/media/victoryang/NetDisk/Documents/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/ncurses-6.2-zdp3gdfsnlvphj7kpsgsfk3jvtxvuvz7/lib:/opt/intel/oneapi/mpi/2021.3.1//lib/release/

pitfalls

  1. https://github.com/MPAS-Dev/MPAS-Model/issues/554
  2. https://forums.developer.nvidia.com/t/problem-with-nvfortran-and-r/155366
  3. LibGOMP not IMPLEMENTED: fftw/scalapack/hdf5/elpa is not dependent on the compiler's lib.

performance

Compiler version:

```
nvfortran 21.2-0 LLVM 64-bit target on x86-64 Linux -tp zen
NVIDIA Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
```

1. GPU, single thread: real 1m51.316s, user 51m9.972s, sys 4m59.190s
2. GPU, 4 threads: real 1m34.486s, user 2m12.550s
3. 4 GPUs, 4 threads: real 6m26.432s, user 4h20m2.947s, sys 4h24.789s
4. 8 GPUs on 2 nodes, 4 threads: real 4m42.563s, user 1h24m6.227s, sys 2h0m4.267s

MPI + CUDA seems to call different routines of the GPU implementation, in which communication is always the limiting factor.

```c
#pragma acc host_data use_device(s_buf)
MPI_Send(s_buf, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

...

#pragma acc update host(s_buf[0:size])
MPI_Send(s_buf, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
```

So we are going to try GPU direct MPI.

```fortran
#if defined(__GPU_MPI)
   ierr = cudaDeviceSynchronize()   ! This syncs __GPU_MPI case
   CALL bcast_integer_gpu( msg_d, msglen, source, group )
   RETURN                           ! Sync done by MPI call (or inside bcast_xxx_gpu)
```

But CUBLAS and other GPU code is just fine for one thread.

#if defined(__CUDA) USE cudafor USE cublas #endif IMPLICIT NONE SAVE PRIVATE REAL(DP) :: one, zero, two, minus_one, minus_two PARAMETER ( one = 1.0d0, zero = 0.0d0, two = 2.0d0, minus_one = -1.0d0 ) PARAMETER ( minus_two = -2.0d0 ) COMPLEX(DP) :: cone, czero, mcone PARAMETER ( cone = (1.0d0, 0.0d0), czero = (0.0d0, 0.0d0) ) PARAMETER ( mcone = (-1.0d0, 0.0d0) ) REAL(DP) :: small = 1.0d-14 LOGICAL :: use_parallel_diag PUBLIC :: sigset PUBLIC :: tauset PUBLIC :: rhoset PUBLIC :: ortho_iterate PUBLIC :: updatc, calphi_bgrp PUBLIC :: mesure_diag_perf, mesure_mmul_perf PUBLIC :: use_parallel_diag PUBLIC :: bec_bgrp2ortho REAL(DP), ALLOCATABLE DEVICEATTR :: tmp1(:,:), tmp2(:,:), dd(:,:), tr1(:,:), tr2(:,:) REAL(DP), ALLOCATABLE DEVICEATTR :: con(:,:), x1(:,:) CONTAINS SUBROUTINE allocate_local_arrays(ldx) INTEGER, INTENT(IN) :: ldx IF( ALLOCATED( tr1 ) ) THEN IF( SIZE( tr1, 1 ) /= ldx ) THEN DEALLOCATE( tmp1, tmp2, dd, x1, con ) DEALLOCATE( tr1, tr2 ) END IF END IF IF( .NOT. ALLOCATED( tr1 ) ) THEN ALLOCATE( tr1(ldx,ldx), tr2(ldx,ldx) ) ALLOCATE( tmp1(ldx,ldx), tmp2(ldx,ldx), dd(ldx,ldx), x1(ldx,ldx), con(ldx,ldx) ) END IF END SUBROUTINE allocate_local_arrays SUBROUTINE deallocate_local_arrays() IF( ALLOCATED( tr1 ) ) DEALLOCATE( tr1 ) IF( ALLOCATED( tr2 ) ) DEALLOCATE( tr2 ) IF( ALLOCATED( tmp1 ) ) DEALLOCATE( tmp1 ) IF( ALLOCATED( tmp2 ) ) DEALLOCATE( tmp2 ) IF( ALLOCATED( dd ) ) DEALLOCATE( dd ) IF( ALLOCATED( x1 ) ) DEALLOCATE( x1 ) IF( ALLOCATED( con ) ) DEALLOCATE( con ) END SUBROUTINE deallocate_local_arrays SUBROUTINE clear_unused_elements( x, idesc ) ! ! Clear elements not involved in the orthogonalization ! IMPLICIT NONE REAL(DP) DEVICEATTR :: x(:,:) INTEGER, INTENT(IN) :: idesc(:) INTEGER :: nr, nc, i, j INCLUDE 'laxlib.fh' IF( idesc(LAX_DESC_ACTIVE_NODE) < 0 ) then x = 0.0d0 ELSE nr = idesc(LAX_DESC_NR) nc = idesc(LAX_DESC_NC) !$cuf kernel do(2) <<<*,*>>> do j = nc + 1, SIZE( x, 2 ) do i = 1, SIZE( x, 1 ) x( i, j ) = 0.0d0 end do end do !$cuf kernel do(2) <<<*,*>>> do j = 1, SIZE( x, 2 ) do i = nr + 1, SIZE( x, 1 ) x( i, j ) = 0.0d0 end do end do END IF END SUBROUTINE

ramBLe

turn off hyperthreading

sudo su
echo off > /sys/devices/system/cpu/smt/control
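To confirm SMT is really off, the standard sysfs and lscpu checks are enough (a small sketch):

```bash
cat /sys/devices/system/cpu/smt/control   # should print "off"
lscpu | grep 'Thread(s) per core'         # should report 1
```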

/home/opc/ramBLe

boost 1.70.0 & mvapich2.3.3


Gdrive

wget https://github.com/prasmussen/gdrive/releases/download/2.1.1/gdrive_2.1.1_linux_amd64.tar.gz
tar -zxvf gdrive_2.1.1_linux_amd64.tar.gz
wget https://forensics.cert.org/cert-forensics-tools-release-el7.rpm
sudo rpm -Uvh cert-forensics-tools-release*rpm
sudo yum --enablerepo=forensics install musl-libc -y

init env values

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/nfs/cluster/boost_1_70_0/stage/lib
source /home/opc/ramBLe/env.sh

mpi

source /opt/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-redhat7.9-x86_64/hpcx-init.sh
hpcx_load
mpirun -np 4 --display-map --map-by node -x MXM_RDMA_PORTS=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1

run

mpirun -np 144 --display-map --hostfile hostfiles -x MXM_RDMA_PORTS=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 ./ramble -f test/coronary.csv -n 6 -m 1841 -d -o test/coronary.dot
[opc@inst-dahrf-splendid-walrus ramBLe]$ cat hostfiles
hpc-node-1 slots=36
hpc-node-2 slots=36
hpc-node-3 slots=36
hpc-node-4 slots=36
The tab separator is written as $'\t' in bash.

Python run experiment

at /nfs/cluster/ramBle_hpcg

python common/scripts/ramble_experiments.py \ -p 16 -r 1 -a gs -d /nfs/scratch/C1_discretized.tsv -s '\t' -v \ --results result\c1.csv
mpirun -np 144 --display-map --hostfile hostfiles -x MXM_RDMA_PORTS=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 ./ramble -f /nfs/scratch/C1_discretized.tsv -m 29150 -n 5164 -s $'\t' -v -i -d -o test/c1.dot
mpirun -np 1 \ --display-map \ --hostfile hostfiles \ -x MXM_RDMA_PORTS=mlx5_0:1 \ -mca btl_openib_if_include mlx5_0:1 \ -x UCX_NET_DEVICES=mlx5_0:1 \ ./ramble -f /nfs/scratch/C1_discretized.tsv -s $'\t' \ -n 29150 -m 5164 \ -c -v -i -d -o test/c1_2.dot >> result/hp_1
mpirun -np 144 \ --hostfile hostfiles \ -x MXM_RDMA_PORTS=mlx5_0:1 \ -mca btl_openib_if_include mlx5_0:1 \ -x UCX_NET_DEVICES=mlx5_0:1 \ ./ramble -f test/coronary.csv -s ',' -n 6 -m 1841 -d -o test/coronary.dot

Auto Run script

murez/SC21_SCript/ramble

Gdrive

gdrive download 1UdrvrUPBQRjQafeOn5gHENz9wCrOrX-F # ramBLe_hpcx.tar.gz
gdrive download 1QmW1RF6mvnepQ3hawMNK46MoDRNq8YGx # boost_1_70_0_compiled.tar.gz

Install

Lib

Boost

Just add the following code to SConstruct to tell scons where the Boost library is.

libPaths.append("/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/boost-1.70.0-m4ttgcfqixwe22z5kz7bpp7mbqdspdbg/lib")
cppPaths.append("/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/boost-1.70.0-m4ttgcfqixwe22z5kz7bpp7mbqdspdbg/include")

Cardioid

Repo: https://github.com/LLNL/cardioid

Compilation

If you use mfem, you may need to specify its path manually.
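A hedged sketch of one generic way to do that with CMake (the mfem prefix below is hypothetical, and CMAKE_PREFIX_PATH is just the standard find_package hint, not necessarily what Cardioid's build expects):

```bash
# /opt/mfem-4.3 is a hypothetical install prefix for a manually built mfem
cmake -S cardioid -B build \
  -DCMAKE_PREFIX_PATH=/opt/mfem-4.3 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```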

Automatic compilation

Because the upstream cardioid package hangs during compilation, the package file needs to be patched manually.

First run spack edit cardioid; spack will open a text editor. Then add the following at the beginning of the class Cardioid(CMakePackage):

patch('https://gist.githubusercontent.com/KiruyaMomochi/cc4dfde7da51c3b11e45ab1079662693/raw/cardioid-cmake.patch', sha256='27e2b01a2a181d7364cf786f9da31193407b1aa9c20d0175965a3c772cc7378b')

Then continue the build with spack -d install -v cardioid.

Manual compilation with Spack

Using the fish shell as an example (a bash equivalent is sketched after the commands).

source /opt/spack/share/spack/setup-env.fish
spack stage cardioid+cuda
spack cd cardioid+cuda
spack build-env cardioid+cuda fish
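For reference, a sketch of the same steps from bash (assuming the same /opt/spack install):

```bash
source /opt/spack/share/spack/setup-env.sh
spack stage cardioid+cuda
spack cd cardioid+cuda
spack build-env cardioid+cuda bash
```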

Fully manual compilation

TODO

Troubleshooting

Seg Fault with jemalloc

Happens when -nd >= 4

SIGTERM after finishing the job with -np >= 60

Some issue in the openmpi@4.1.1/jip package.

Use the Intel MPI

spack load intel-oneapi-compilers@2021.1.2

export F90=ifort
export F77=ifort
export FC=ifort
export CC=icc

export LD_LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mkl/2021.2.0/lib/intel64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/lib:$LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/lib/release_mt:$LD_LIBRARY_PATH
export LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/lib/release_mt:$LIBRARY_PATH
export PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/bin:$PATH
export MPI_LIBS=-L/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/libc

./configure --enable-parallel --with-scalapack=yes --enable-openmp CFLAGS="-march=core-avx2 -fma -ftz -fomit-frame-pointer -g" FFLAGS="-O3 -march=core-avx2 -align array64byte -fma -ftz -fomit-frame-pointer -g" SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -mkl=parallel -lifcore" IFLAGS="-I/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mkl/2021.2.0/include -I/home/qe/q-e/include -I/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/include"

This version is slower than before. Observations about the test case AUSURF112: it cannot utilize hyperthreading efficiently; 128 processes bound to cores with OMP_NUM_THREADS=1 are faster than any combination of process count and OMP_NUM_THREADS that uses all the hyperthreads.

About fftw

When we pass FFT_LIBS to the configure script of quantum-espresso 6.6, the FFT-related macros are not defined. If FFTW_INCLUDE is defined, __FFTW is defined. Changing to amdfftw does not influence the running time.

export FFT_LIBS=-L/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib
export FFTW_INCLUDE=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/include

Debugging a CMake project

To debug a CMake project:

  1. Find the cmake command in spack's log
  2. Create a new directory to hold the build files
  3. cd to that directory and run the cmake command
  4. Append --trace-source=[path to CMakeLists.txt] to the cmake command

Use the message command to print intermediate results. For more information, see the CMake documentation.
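A sketch of what that looks like in practice (the paths are hypothetical; adapt them to the spack stage directory of the package you are debugging):

```bash
# re-run the cmake command captured from spack's log in a scratch build directory,
# tracing only the CMakeLists.txt we care about
mkdir -p /tmp/cardioid-build && cd /tmp/cardioid-build
cmake /path/to/spack-stage/cardioid \
  --trace-source=/path/to/spack-stage/cardioid/elec/CMakeLists.txt \
  2>&1 | tee cmake-trace.log
```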

Spack install takes too long

Use spack -d install -v [package name] to print debug logs.

If the problem occurs during the cmake stage, it may be caused by the function interface_link_libraries. This function recursively collects the include paths of each subproject, so the same dependency gets included multiple times. Because the include paths in a Spack environment are very long, this generates an extremely long (exponentially growing) include path and cmake hangs.

Solution 1

Add a check during the recursive generation:

foreach(lib ${libs})
  list(FIND searched ${lib} lib_has_been_searched)
  #message(SEND_ERROR "+++ ${lib} ${lib_has_been_searched}")
  if (lib_has_been_searched EQUAL -1)
    get_recursive_list(recursive_val ${lib} ${prop} ${searched})
    foreach(val ${retval})
      if(NOT recursive_val)
        list(APPEND val ${recursive_val})
      else()
        if (val IN_LIST recursive_val)
          #message("Duplicate val!")
        else()
          list(APPEND val ${recursive_val})
        endif()
      endif()
    endforeach()
  endif()
endforeach()

Solution 2

Apply the following patch, which removes duplicate entries after the recursion.

diff --git a/elec/CMakeLists.txt b/elec/CMakeLists.txt
index 4a526cb..ca92d2d 100644
--- a/elec/CMakeLists.txt
+++ b/elec/CMakeLists.txt
@@ -271,7 +271,7 @@ function(get_recursive_list retvar target prop)
   list(APPEND searched ${target})
   #message(SEND_ERROR "=== ${target} ${prop} ${searched}")
-  set(${retval} "")
+  set(retval "")
   get_property(propval TARGET ${target} PROPERTY ${prop} SET)
   if (propval)
     get_target_property(propval ${target} ${prop})
@@ -288,6 +288,10 @@
     endif()
   endforeach()
+  if(NOT retval)
+    list(REMOVE_DUPLICATES retval)
+  endif()
+
   set(${retvar} ${retval} PARENT_SCOPE)
   #message(SEND_ERROR "--- ${target} ${prop} ${retval}")
 endfunction()

Cannot reach the international Internet

Set http proxy to 192.168.100.5:1082, or use

proxychains -q [command]

Config proxy for git

Set proxy

git config --global http.proxy http://192.168.100.5:1082

Unset proxy

git config --global --unset http.proxy

Rules

Mystery App

Install

# Load gcc and openmpi:
module load gcc-9.2.0
module load mpi
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Accept license agreements and select the
# right install location, probably not your
# home directory since disk quota is limited.
#
# Activate conda, if you skipped it's auto initialization
source /path/to/your/conda/install/bin/activate
# Turn on the conda base environment
conda activate
# Install pytorch
conda install pytorch cudatoolkit=11.3 -c pytorch
# Install build dependencies for larcv3:
conda install cmake hdf5 scikit-build
# Install Tensorflow:
pip install tensorflow
# NOTE: if you don't install tensorflow, you need to pip install numpy!
# Clone larcv and install it:
git clone https://github.com/DeepLearnPhysics/larcv3.git
cd larcv3
git submodule update --init
python setup.py build -j 64
python setup.py install
# Install mpi4py:
pip install --force-reinstall mpi4py --no-cache-dir
# Install horovod with tensorflow or if you want it with pytorch:
pip install --force-reinstall horovod --no-cache-dir

Parameters

  • mode.optimizer.gradient_accumulation <= 1
  • mode.optimizer.learning_rate = 123.456
  • mode.optimizer.name = "rmsprop" / "adam"
  • mode.weights_location -> load checkpoint
  • mode.no_summary_images
  • run.compute_mode = DPCPP #? Data Parallel C++, Intel MKL-optimized CPU
  • gradient_accumulation.....: 1
  • conf['mode']['optimizer']['learning_rate'] = 10.**random.uniform(-3.5, -2.5)
  • conf['mode']['optimizer']['loss_balance_scheme'] = random.choice(["none", "light", "focal"])
  • checkpoint_iteration........: 500

The key knobs to tune are learning_rate and loss_balance_scheme.

SCC_21.yml

defaults:
  - _self_
  - network: SCC_21
  - framework: torch
  - mode: train
  - data: real

data:
  downsample: 0

run:
  distributed: true
  iterations: 500
  compute_mode: GPU
  aux_minibatch_size: ${run.minibatch_size}
  aux_iterations: 10
  id: ???
  precision: float32
  profile: false
  output_dir: output/${framework.name}/${network.name}/${run.id}/
  minibatch_size: 2

mode:
  optimizer: adam
  loss_balance_scheme: light

iotest

$$\frac{\text{running number}}{\text{iteration}} = \frac{\text{minibatch}}{\text{rank}}$$

$$\text{throughput} = \frac{\text{total running number}}{\text{running time}} = \frac{(\text{running number}/\text{iteration}) \times \text{iteration}}{\text{iteration} \times \text{average running time}} = \frac{\text{minibatch}/\text{rank}}{\text{average running time}}$$

$$\text{average running time} = \frac{\text{minibatch}}{\text{rank}} \times (\text{reading time} + \text{compute time})$$

SC 20

Rewind: https://victoryang00.cn/wordpress/2020/11/12/vscc20-%e6%80%bb%e7%bb%93/

Benchmark

This section holds benchmark-related material from HPC competitions; the documentation part is used to train new members.

.dat Specs

ASC20

HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
384 256 256
60

how to run

export PATH=/opt/nonspack/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu18.04-x86_64/ompi/bin:/opt/nonspack/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu18.04-x86_64/ompi/tests/osu-micro-benchmarks-5.6.2/:$PATH
export LD_LIBRARY_PATH=/opt/nonspack/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu18.04-x86_64/ompi/lib:$LD_LIBRARY_PATH
source /etc/profile.d/modules.sh
module use /opt/nonspack/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu18.04-x86_64/modulefiles
module load hpcx
mpirun --allow-run-as-root --hostfile host2_gpu4 --mca pml_base_verbose 100 --mca btl_base_verbose 100 --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 --mca orte_base_help_aggregate=0 -x xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80

SC21

HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
256 256 512
1800

how to run

see binder.sh

HPL .dat config file

ASC18

The following is the HPL .dat configuration file template from ASC18.

HPLinpack benchmark input file Innovative Computing Laboratory, University of Tennessee HPL.out output file name (if any) 6 device out (6=stdout,7=stderr,file) 1 # of problems sizes (N) 67200 65280 62976 65280 96000 65280 38400 96000 102400 168960 153600 76800 142848 153600 142848 124416 96256 142848 124416 115200 110592 96256 Ns 1 # of NBs 384 768 384 768 1024 768 896 768 1024 512 384 640 768 896 960 1024 1152 1280 384 640 960 768 640 256 960 512 768 1152 NBs 0 PMAP process mapping (0=Row-,1=Column-major) 1 # of process grids (P x Q) 2 1 2 1 Ps 1 2 2 4 Qs 16.0 threshold 1 # of panel fact 0 1 2 PFACTs (0=left, 1=Crout, 2=Right) 1 # of recursive stopping criterium 2 8 NBMINs (>= 1) 1 # of panels in recursion 2 NDIVs 1 # of recursive panel fact. 0 1 2 RFACTs (0=left, 1=Crout, 2=Right) 1 # of broadcast 2 0 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) 1 # of lookahead depth 0 DEPTHs (>=0) 1 SWAP (0=bin-exch,1=long,2=mix) 192 swapping threshold 1 L1 in (0=transposed,1=no-transposed) form 0 U in (0=transposed,1=no-transposed) form 1 Equilibration (0=no,1=yes) 8 memory alignment in double (> 0)

ASC20

The following is the HPL .dat configuration file template from ASC20. Machine Spec : 8 Tesla V100

HPLinpack benchmark input file Innovative Computing Laboratory, University of Tennessee HPL.out output file name (if any) 6 device out (6=stdout,7=stderr,file) 2 # of problems sizes (N) 175104 178176 165888 168960 172032 175104 Ns 2 # of NBs 384 256 128 256 384 192 288 320 384 384 768 1024 768 896 768 1024 512 384 640 768 896 960 1024 1152 1280 384 640 960 768 640 256 960 512 768 1152 NBs 0 PMAP process mapping (0=Row-,1=Column-major) 1 # of process grids (P x Q) 4 2 8 1 2 1 Ps 4 8 2 2 4 Qs 16.0 threshold 1 # of panel fact 0 1 2 PFACTs (0=left, 1=Crout, 2=Right) 1 # of recursive stopping criterium 2 8 NBMINs (>= 1) 1 # of panels in recursion 2 NDIVs 1 # of recursive panel fact. 0 1 2 RFACTs (0=left, 1=Crout, 2=Right) 1 # of broadcast 2 0 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) 1 # of lookahead depth 0 DEPTHs (>=0) 1 SWAP (0=bin-exch,1=long,2=mix) 192 swapping threshold 1 L1 in (0=transposed,1=no-transposed) form 0 U in (0=transposed,1=no-transposed) form 1 Equilibration (0=no,1=yes) 8 memory alignment in double (> 0)

SC21

The following is the HPL .dat configuration file template from SC20. Machine Spec : 8 Tesla A100

HPLinpack benchmark input file Innovative Computing Laboratory, University of Tennessee HPL.out output file name (if any) 6 device out (6=stdout,7=stderr,file) 2 # of problems sizes (N) 346122 348122 352122 Ns 2 # of NBs 384 256 128 NBs 0 PMAP process mapping (0=Row-,1=Column-major) 1 # of process grids (P x Q) 4 2 8 1 2 1 Ps 4 8 2 2 4 Qs 16.0 threshold 1 # of panel fact 0 1 2 PFACTs (0=left, 1=Crout, 2=Right) 1 # of recursive stopping criterium 2 8 NBMINs (>= 1) 1 # of panels in recursion 2 NDIVs 1 # of recursive panel fact. 0 1 2 RFACTs (0=left, 1=Crout, 2=Right) 1 # of broadcast 2 0 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) 1 # of lookahead depth 0 DEPTHs (>=0) 1 SWAP (0=bin-exch,1=long,2=mix) 192 swapping threshold 1 L1 in (0=transposed,1=no-transposed) form 0 U in (0=transposed,1=no-transposed) form 1 Equilibration (0=no,1=yes) 8 memory alignment in double (> 0)

Binder

#!/bin/bash cd $1 # Global settings export UCX_RNDV_SCHEME=put_zcopy export UCX_IB_PCI_RELAXED_ORDERING=on export UCX_MEMTYPE_CACHE=n export UCX_MAX_RNDV_RAILS=1 export UCX_RNDV_THRESH=8192 APP="$2" me=`hostname` lrank=$OMPI_COMM_WORLD_LOCAL_RANK case ${lrank} in 0) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 1) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 2) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 3) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 4) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 5) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 6) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 7) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 8) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 9) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 10) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 
$APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 11) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 12) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 13) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 14) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 15) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 16) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 17) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 18) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 19) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 20) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 21) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 22) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" 
#set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 23) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 24) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 25) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 26) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 27) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 28) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 29) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 30) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 31) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 32) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 33) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl 
--cpunodebind=0 taskset -c 0-23 $APP ;; 34) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 35) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 36) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 37) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 38) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 39) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 40) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 41) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 42) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 43) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 44) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 45) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo 
"export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 46) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 47) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 48) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 49) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 50) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 51) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 52) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 53) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 54) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 55) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 56) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP" source ../source.sh export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 57) echo 
"host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP" export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP ;; 58) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 59) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP" export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP ;; 60) #ldd $APP echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 61) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP" export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP ;; 62) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; 63) echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP" #set GPU and CPU affinity of local rank echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP" export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP ;; esac

DevOps

This section holds material related to maintaining the HPC environment.

BeeGFS

BeeGFS is a hardware-independent POSIX parallel file system developed with a strong focus on performance and designed for ease of use, simple installation, and management.

Please have a look at BeeGFS Architecture overview before continuing.

System Architecture Overview: Parallelism and Scale-Out

ℹ️ Note: For linux kernels 5.x

Currently, the BeeGFS kernel module is not compatible with the Linux kernel 5.x. We need to patch it manually.

Some work has been done by Build kernel module against kernel version 5.8.x and tobydarling/beegfs-7.1.4-kernel-5.6.4.

Installation

Please follow the Quick Start Guide to install.

Here we will only give you additional notes, assuming the operating system is Debian 10.

Step 1: Package Download and Installation

  1. Find the latest version from the BeeGFS Package Repository.
  2. Find the link to the repository file; it should be something like:
    https://www.beegfs.io/release/beegfs_7.2.4/dists/beegfs-deb10.list
    where 7.2.4 is the version number, deb10 is the distribution name & version.
  3. Download and save the file to /etc/apt/sources.list.d/beegfs.list:
    curl -Lo /etc/apt/sources.list.d/beegfs.list <the download link>
  4. Update the package list:
    apt-get update
  5. Install the package from the repository. To avoid errors, you should only install the package you need. For example, you don't need to install beegfs-mgmtd if this machine is only a BeeGFS client.
    # only install the package you need!
    # management service
    apt-get install beegfs-mgmtd
    # metadata service; libbeegfs-ib is only required for RDMA
    apt install beegfs-meta libbeegfs-ib
    # storage service; libbeegfs-ib is only required for RDMA
    apt install beegfs-storage libbeegfs-ib
    # client and command-line utils
    apt install beegfs-client beegfs-helperd beegfs-utils
  6. For your convenience, consider appending the BeeGFS binary path, /opt/beegfs/sbin/, to PATH, for example as shown below.
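For example (assuming a bash login shell):

```bash
# make the BeeGFS admin tools available without typing the full path
echo 'export PATH=$PATH:/opt/beegfs/sbin' >> ~/.bashrc
source ~/.bashrc
```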

Step 2: Client Kernel Module Autobuild

Since we are using RDMA and installed InfiniBand kernel modules from Mellanox OFED, we should use buildArgs like this:

# /etc/beegfs/beegfs-client-autobuild.conf
buildArgs=-j8 BEEGFS_OPENTK_IBVERBS=1 OFED_INCLUDE_PATH=/usr/src/ofa_kernel/default/include

Step 3: Basic Configuration

Please read the official guide carefully first, or you will waste a lot of time.

Assuming we use such configuration:

  • epyc.node1: management + metadata + storage + client
  • epyc.node2: storage + client

We also assume you have appended /opt/beegfs/sbin/ to PATH. Otherwise, you should prepend this path to the commands used below.

Then on node1, the commands are:

# node1
# setup management service
beegfs-setup-mgmtd -p /geekpie/beegfs_mgmtd
# setup metadata service
beegfs-setup-meta -p /geekpie/beegfs_meta -m epyc.node1
# setup storage service
beegfs-setup-storage -p /geekpie/hpc/ -i 101 -m epyc.node1
# setup client
beegfs-setup-client -m epyc.node1

On node2, the commands are:

# node2
# setup storage service
beegfs-setup-storage -p /geekpie/hpc/ -i 201 -m epyc.node2
# setup client
beegfs-setup-client -m epyc.node2

If you have run the setup more than once, please manually check the configuration files since there may be errors.

Step 4: Service Setup

With the same assumption as above, we can start the services on node1 and node2:

# node1: start services
systemctl start beegfs-mgmtd beegfs-meta beegfs-storage beegfs-helperd beegfs-client
# node2: start services
systemctl start beegfs-storage beegfs-helperd beegfs-client

Step 5: Check Connectivity

We can check the connectivity using these commands:

beegfs-ctl --listnodes --nodetype=meta --nicdetails
beegfs-ctl --listnodes --nodetype=storage --nicdetails
beegfs-ctl --listnodes --nodetype=client --nicdetails
beegfs-net            # Displays connections the client is actually using
beegfs-check-servers  # Displays possible connectivity of the services
beegfs-df             # Displays free space and inodes of storage and metadata targets

Check configuration

You can check the configuration by inspecting the config files, which are located in /etc/beegfs/.

Please note that if you have set up BeeGFS twice, you may need to manually fix some configuration files, such as beegfs-storage.conf.

Grafana

Data source: telegraf

PBS

PBS stands for Portable Batch System; it can be used to control jobs on multiple machines.

Common usage is shown below, where qcmd stands for any PBS command:

# Name for the job
qcmd -N ramBLe_128
# Name of destination queue
qcmd -q GeekPie_CPU
# Required resources
qcmd -l nodes=4:ppn=32:amd
qcmd -l walltime=00:10:00
# Redirect stdout/stderr
qcmd -o /public/home/geekpie2/ramble-amd/ramBLe/submit/pbs-com-single-${PBS_JOBID}.out
qcmd -e /public/home/geekpie2/ramble-amd/ramBLe/submit/pbs-com-single-${PBS_JOBID}.err
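Put together, a hedged one-shot submission might look like this (run_ramble.sh and the output paths are placeholders):

```bash
qsub -N ramBLe_128 -q GeekPie_CPU \
     -l nodes=4:ppn=32:amd -l walltime=00:10:00 \
     -o /public/home/geekpie2/ramble.out \
     -e /public/home/geekpie2/ramble.err \
     run_ramble.sh
```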

Common commands

  • qsub: submit a job or start an interactive shell
  • qstat: check job status
    • Use the -f flag to show detailed information
    • Use the -Q flag followed by a queue name to check a queue's status
    • For example: qstat -Qf GeekPie-CPU
  • qdel: delete a job

Common parameters

| Parameter | Description |
| --- | --- |
| -q queue, server, or queue@server | Sets where the job is executed |
| -N job name | Sets the job name |
| -l resource list, comma-separated | Sets the required resources; may be specified multiple times |
| -o output file | stdout is redirected to this file; an absolute path is recommended |
| -e error file | stderr is redirected to this file; an absolute path is recommended |

References

Slurm

This supercomputer uses Slurm; for the detailed configuration, see the post 配合某戏精使用的 slurm 踩坑日记 (a diary of Slurm pitfalls).
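Since this page does not record the exact Slurm configuration, here is only a generic, hedged submission sketch (partition and script names are placeholders):

```bash
sbatch --job-name=ramBLe_128 --partition=GeekPie_CPU \
       --nodes=4 --ntasks-per-node=32 --time=00:10:00 \
       --output=ramble.%j.out --error=ramble.%j.err \
       run_ramble.sh   # placeholder script
```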

Singularity

A Berkeley-produced tool for running Docker-style containers in user space.
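A hedged usage sketch with standard Singularity commands (the image is just an example):

```bash
# pull a Docker image and run a command in it, entirely as an unprivileged user
singularity pull docker://ubuntu:22.04            # produces ubuntu_22.04.sif
singularity exec ubuntu_22.04.sif cat /etc/os-release
```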

Kanidm

Kanidm is an identity management server. We use it to manage users across multiple nodes.

We have two groups: a posix group geekpie-hpc for everyone, and an admin group geekpie_admins.

geekpie_admins is used for managing accounts; it is a subgroup of:

  • idm_people_manage_priv: create new person
  • idm_group_write_priv: add person into a group
  • idm_account_unix_extend_priv: enable posix for a person
  • idm_account_write_priv: add ssh key to person

To begin with, export environment variable KANIDM_URL, and login with your geekpie_admins user.

export KANIDM_URL="https://hpc-idm.geekpie.icu:8443"
kanidm login --name geekpie

Create a user

To create a user called John Smith, and add it to geekpie-hpc group:

kanidm person create jsmith "John Smith"
kanidm person update jsmith --mail "jsmith@shanghaitech.edu.cn" # --legalname
kanidm group add-members geekpie-hpc jsmith

Then enable posix, set ssh key and password.

# In kanidm uid is the same as gid. I recommend you to manually allocate a gid.
# Please see https://github.com/geekpiehpc/AnsiblePlaybook/blob/main/group_vars/epyc.yml for old uids.
kanidm person posix set jsmith --gidnumber 2345 # --shell /usr/bin/bash
kanidm person ssh add-publickey jsmith id_rsa (cat ~/.ssh/id_rsa.pub)
# Not needed if the user does not need sudo
kanidm person posix set-password jsmith

Install

curl -L -o kanidm.deb https://github.com/kanidm/kanidm/releases/download/latest/kanidm_Ubuntu_22.04_1.1.0-beta.13-2023051108041ddac86_x86_64.deb
curl -L -o kanidm_unixd.deb https://github.com/kanidm/kanidm/releases/download/latest/kanidm-unixd_Ubuntu_22.04_1.1.0-beta.13-2023051108091ddac86_x86_64.deb
sudo dpkg -i kanidm.deb kanidm_unixd.deb

/etc/kanidm/unixd:

pam_allowed_login_groups = ["geekpie-hpc"]
default_shell = "/usr/bin/bash"
home_alias = "name"
use_etc_skel = true
uid_attr_map = "name"
gid_attr_map = "name"

/etc/kanidm/config:

uri = "https://hpc-idm.geekpie.icu:8443"
verify_ca = true
verify_hostnames = true

Edit /usr/share/pam-configs/kanidm-unixd. Change the priority to 0, otherwise you will be asked for the sudo password twice!

Restart services

sudo systemctl restart kanidm-unixd
sudo systemctl restart kanidm-unixd-tasks.service

Setup PAM and nsswitch

PAM

# THIS IS A DIRTY HACK AND IS ACTUALLY AN UPSTREAM PACKAGING PROBLEM
sudo mv /etc/pam.d/kanidm-unixd /usr/share/pam-configs/
sudo pam-auth-update # check kanidm

For nsswitch, edit /etc/nsswitch.conf:

passwd: files systemd kanidm
group: files systemd [SUCCESS=merge] kanidm

Then add a sudoers file:

echo '%geekpie-hpc ALL=(ALL:ALL) ALL' | sudo EDITOR='tee -a' visudo /etc/sudoers.d/geekpie

Add ssh config by creating /etc/ssh/sshd_config.d/60-kanidm.conf:

AuthorizedKeysCommand /usr/bin/env kanidm_ssh_authorizedkeys %u
AuthorizedKeysCommandUser nobody

Restart sshd service

sudo systemctl restart sshd.service

The Oracle cluster uses Ansible to manage machine power-on.

We forked the Ansible playbooks mentioned above, modified them ourselves, and put them on GitHub.

The organizers had already deployed a telegraf instance on the Oracle machines.

We deployed a Grafana machine on another port with a separate binary to show real-time machine information, but later found that its latency made it useful only as a historical record. The operators also kept losing SSDs, and Ceph over HDD without replicas is really unreliable, so we dropped it.

The dashboard looked roughly like this:

Machine details used at ISC

ISC21

NSCC (National Supercomputing Centre Singapore), Niagara

ISC22

Niagara Thor Bridges

NSCC

Used in ISC21

The login node is very old: an E5-2690 with CentOS 6. The file system is an old Lustre without flock(), so you have to disable Spack's locking. The scheduling system is OpenPBS.
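One way to disable that check is Spack's own locks setting, which is safe to turn off on filesystems without flock() (a sketch, standard Spack configuration):

```bash
# tell Spack not to use lock files on the old Lustre without flock()
spack config add 'config:locks:false'
# equivalently, set "locks: false" under config: in ~/.spack/config.yaml
```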

$ cat outputfile.o Checking The CPU and Network lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 24 On-line CPU(s) list: 0-23 Thread(s) per core: 1 Core(s) per socket: 12 Socket(s): 2 NUMA node(s): 4 Vendor ID: GenuineIntel CPU family: 6 Model: 63 Model name: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz Stepping: 2 CPU MHz: 1200.000 BogoMIPS: 5187.61 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 15360K NUMA node0 CPU(s): 0-5 NUMA node1 CPU(s): 6-11 NUMA node2 CPU(s): 12-17 NUMA node3 CPU(s): 18-23 lspci | grep Mel 81:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4] ====================================================================================== Resource Usage on 2020-04-25 10:08:30.888043: JobId: 9954616.wlm01 Project: 21120227 Exit Status: 0 NCPUs Requested: 1 NCPUs Used: 1 CPU Time Used: 00:00:00 Memory Requested: 100mb Memory Used: 0kb Vmem Used: 0kb Walltime requested: 00:10:00 Walltime Used: 00:00:00 Execution Nodes Used: (std1708:ncpus=1:mem=102400kb) ======================================================================================

The DGX nodes are good because of the hack that raises the TDP of their V100-16GB GPUs.

https://help.nscc.sg/wp-content/uploads/AI_System_QuickStart.pdf

Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 80 On-line CPU(s) list: 0-79 Thread(s) per core: 2 Core(s) per socket: 20 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 79 Model name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz Stepping: 1 CPU MHz: 2794.907 CPU max MHz: 3600.0000 CPU min MHz: 1200.0000 BogoMIPS: 4390.10 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 51200K NUMA node0 CPU(s): 0-19,40-59 NUMA node1 CPU(s): 20-39,60-79 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr s se sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4 _2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp _l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hl e avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mb m_local dtherm ida arat pln pts md_clear flush_l1d total used free shared buff/cache available Mem: 503 58 340 0 105 442 Swap: 0 0 0 OFED-internal-4.4-2.0.7: Ubuntu 18.04.2 LTS \n \l Linux dgx4105 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux Filesystem Size Used Avail Use% Mounted on udev 252G 0 252G 0% /dev tmpfs 51G 3.2M 51G 1% /run /dev/sda2 440G 395G 22G 95% / tmpfs 252G 12K 252G 1% /dev/shm tmpfs 5.0M 0 5.0M 0% /run/lock tmpfs 252G 0 252G 0% /sys/fs/cgroup /dev/sda1 487M 6.1M 481M 2% /boot/efi /dev/sdb1 7.0T 4.9T 1.8T 74% /raid 192.168.160.101:/home 3.4P 2.1P 1.4P 61% /home [davidcho@nscc03 ~]$ cat !$ cat dgx4105.txt Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 80 On-line CPU(s) list: 0-79 Thread(s) per core: 2 Core(s) per socket: 20 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 79 Model name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz Stepping: 1 CPU MHz: 2794.907 CPU max MHz: 3600.0000 CPU min MHz: 1200.0000 BogoMIPS: 4390.10 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 51200K NUMA node0 CPU(s): 0-19,40-59 NUMA node1 CPU(s): 20-39,60-79 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d total used free shared buff/cache available Mem: 503 58 340 0 105 442 Swap: 0 0 0 OFED-internal-4.4-2.0.7: Ubuntu 18.04.2 LTS \n \l Linux dgx4105 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux Filesystem Size Used Avail Use% Mounted on udev 252G 0 252G 0% /dev tmpfs 51G 3.2M 51G 1% /run /dev/sda2 440G 
395G 22G 95% / tmpfs 252G 12K 252G 1% /dev/shm tmpfs 5.0M 0 5.0M 0% /run/lock tmpfs 252G 0 252G 0% /sys/fs/cgroup /dev/sda1 487M 6.1M 481M 2% /boot/efi /dev/sdb1 7.0T 4.9T 1.8T 74% /raid 192.168.160.101:/home 3.4P 2.1P 1.4P 61% /home 192.168.156.29@o2ib,192.168.156.30@o2ib:/scratch 2.8P 1.8P 993T 65% /scratch tmpfs 51G 0 51G 0% /run/user/0 Sat May 23 06:15:29 2020 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |===============================+======================+======================| | 0 Tesla V100-SXM2... On | 00000000:0B:00.0 Off | 0 | | N/A 35C P0 43W / 300W | 0MiB / 16130MiB | 0% Default | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: GPU Memory | | GPU PID Type Process name Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+ 06:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 07:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 85:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 86:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 89:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 8a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1) 05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4] 0c:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4] 84:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4] 8b:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4] hca_id: mlx5_1 transport: InfiniBand (0) fw_ver: 12.23.1020 node_guid: ec0d:9a03:00a4:bbde sys_image_guid: ec0d:9a03:00a4:bbde vendor_id: 0x02c9 vendor_part_id: 4115 hw_ver: 0x0 board_id: MT_2180110032 phys_port_cnt: 1 Device ports: port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 251 port_lid: 1417 port_lmc: 0x00 link_layer: InfiniBand hca_id: mlx5_3 transport: InfiniBand (0) fw_ver: 12.23.1020 node_guid: ec0d:9a03:00aa:2960 sys_image_guid: ec0d:9a03:00aa:2960 vendor_id: 0x02c9 vendor_part_id: 4115 hw_ver: 0x0 board_id: MT_2180110032 phys_port_cnt: 1 Device ports: port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 251 port_lid: 1419 port_lmc: 0x00 link_layer: InfiniBand hca_id: mlx5_0 transport: InfiniBand (0) fw_ver: 12.23.1020 node_guid: ec0d:9a03:00aa:29b8 sys_image_guid: ec0d:9a03:00aa:29b8 vendor_id: 0x02c9 vendor_part_id: 4115 hw_ver: 0x0 board_id: MT_2180110032 phys_port_cnt: 1 Device ports: port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 251 port_lid: 1416 port_lmc: 0x00 link_layer: InfiniBand hca_id: mlx5_2 transport: InfiniBand (0) fw_ver: 12.23.1020 node_guid: ec0d:9a03:00aa:2988 sys_image_guid: ec0d:9a03:00aa:2988 vendor_id: 
0x02c9 vendor_part_id: 4115 hw_ver: 0x0 board_id: MT_2180110032 phys_port_cnt: 1 Device ports: port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 251 port_lid: 1422 port_lmc: 0x00 link_layer: InfiniBand

If NSCC is used again in a competition, contact your NTU or NUS students; they have four dedicated login nodes for reaching the DGX nodes.

Bridges

Thor

Reference

  1. Performance Characteristics of the BlueField-2 SmartNIC
  2. https://developer.nvidia.com/blog/offloading-and-isolating-data-center-workloads-with-bluefield-dpu/
  3. https://docs.nvidia.com/networking/display/BlueFieldSWv35011563/Virtual+Switch+on+BlueField+DPU

Niagara

Used in ISC21

The login nodes are the same as the training nodes. Only the Cascade Lake and Ice Lake parts are open; since the HPC/HPCG/HPCC benchmarks require both CPU and GPU, please pin tasks to those nodes, normally the ones after gia1000.

$ ssh -Y lclmaoroph@niagara.scinet.utoronto.ca Warning: Permanently added 'niagara.scinet.utoronto.ca' (RSA) to the list of known hosts. Password: =============================================================================== SciNet welcomes you to the NIAGARA supercomputer. This is a Niagara login node. Use this node to develop and compile code, to run short tests, and to submit computations to the scheduler. Remember that /scratch is never backed-up. Documentation: https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart Support: support@scinet.utoronto.ca or niagara@computecanada.ca =============================================================================== lclmaoroph@nia-login06:~$

The filesystem is GPFS, an IBM-initiated FS that does not have full POSIX support and provides only eventual consistency. But it is really fast for writes and can scale up to 50 PB.

SCRATCH Area is CVFS, a temporary fast cache for SCRATCH scripts.

The node start-up scripts have some quirks: do not put echo "bla" in .bashrc, write echo "bla" 1>&2 instead. PBS relies on bash to figure out who you are, so you might be able to hack the quota or raise root permissions. We attempted to use a middleware FS, mounting it from .bashrc on the allocated nodes, and eventually made it work.

Azure

This section holds information about operating the Azure cloud servers.

CycleCloud

Azure CycleCloud is a tool for deploying HPC clusters in Azure and managing their workloads.

For a detailed introduction to CycleCloud, see CycleCloud Introduction. For a step-by-step guide to using CycleCloud, see Create, customize and manage an HPC cluster in Azure with Azure CycleCloud.

Although I hate to say it... the best approach is still to read the official documentation first, to understand what it can do and how its templates are written. There are quite a few new concepts, so learn while you operate. Note, however, that the documentation is incomplete and parts of it are outdated. If you want to confirm the latest CycleCloud behaviour, the most direct way is to start a machine with it, download everything under /opt/cycle, and read the code inside.

Introduction: ...So what is CycleCloud?

...To explain this clearly, let me start with a bit of history.

CycleCloud originally belonged to Cycle Computing, which was later acquired by Microsoft. Before the acquisition, CycleCloud could be used on many platforms, including Amazon Web Services, Google Compute Engine, and even in-house clusters.

Its job is to help you conveniently manage a pile of HPC resources. For example, suppose I want to start 15 machines on AWS as my HPC cluster; normally I might start them one by one, while a smarter person would write a script and request all the resources at once.

Even so, you probably still have to do some initialization on every machine, such as configuring the network, software, users, and so on. There are higher-level tools such as cloud-init that make this initialization easier, but configuring a pile of software with it is still painful. The more modern solution is Ansible: given a file listing all machines in the cluster, it automatically performs all kinds of odd initialization work for you (writing it feels a bit like GitHub Actions).

Another thing is that you need to monitor whether these machines are healthy, and remove some of them if they are not. You may also want to scale the number of machines dynamically (autoscaling) according to the workload. The cloud provider will not necessarily do this for you; autoscaling may remind you of k8s, but running HPC on k8s is probably still a bit deadly right now, isn't it?

Having said all that, CycleCloud is exactly this kind of tool: it automates the control of HPC resources in the cloud, so that with a few clicks you can build a usable, stable HPC cluster.

...😢 In practice it may not be that pleasant anymore, but at least that should be their vision...

Prerequisites: What do I need to know to learn it?

Template

The most important concept in CycleCloud is the template, a file that contains all the software and hardware requirements of a cluster. CycleCloud creates the cluster from this file, and the schedulers, filesystems, etc. that you see when you click the plus button are all templates. You can even find these templates in the Azure GitHub.

The template format looks like ini but is more advanced than ini. If you find it hard to get started, have a look at toml first.

Cluster-init? Project?

You will notice that the documentation mentions things called cluster-init or project. These two are synonyms. Be careful not to confuse them with cloud-init, which is unrelated to what we are discussing here. To avoid confusion, we will use the word project from now on.

Both cloud-init and project are used for initialization, but cloud-init is lower level and handled by Azure, while project is handled by CycleCloud. In other words, after cloud-init finishes, your machine has already been created; it is then initialized a second time by CycleCloud, and this second round concretely means running a series of scripts and Chef cookbooks.

While learning CycleCloud, you should also learn about Chef Infra: once you peel a project open, it is just Chef cookbooks, and Chef cookbooks are roughly the same as Ansible playbooks, both used to initialize machines. While learning Chef you will find that you are actually learning Ruby... there is no way around that. Ruby's syntax takes some getting used to, especially if you have never touched it before. Also pay special attention to the Chef version CycleCloud uses (you can find its binaries under /opt/cycle/), and avoid using features that its old version does not have.

Be careful: the Ruby version used by CycleCloud's Chef has problems with some SSL sites, see https://bugs.ruby-lang.org/issues/15594. If you need to download something, you may have to use http, or write your own Chef resource that calls Ruby's ::URI.open with ssl_verify_mode set to OpenSSL::SSL::VERIFY_NONE.

Cloud-Init

CycleCloud supports cloud-init, but only halfway.

If you try to use a MIME multi-part archive, it will fail. If you try to use a Jinja template, it will fail. If you change the cloud-init files while the cluster is halfway up, you will find that the changes do not seem to take effect...

Worse, when you load the CycleCloud dashboard or list all clusters with the CLI, it seems to return all the cloud-init files as part of the configuration. As a result, once you use many long cloud-init files, CycleCloud becomes much slower.

For these reasons, we recommend using the include-file format and keeping the real cloud-init files on another server.

For example, our cloud-init can be written like this:

#include-once
https://example.com/kyaru/base.yml
https://example.com/kyaru/sb/head.yml

The cloud-init files behind those URLs can be written however you like, in whatever format you like. Win twice!

CycleCloud CLI

Install the CLI from the About page of the CycleCloud dashboard.

Custom image

There are more images available than CycleCloud's built-in ones. To make the GPU work, we must use an image with Generation 2 support.

Azure HPC VM Images lists all currently available images. Choose the one you need, and use its URN to specify it.
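A hedged way to browse the available URNs with the standard Azure CLI (the publisher/offer below are the ones mentioned in the note that follows):

```bash
# list ubuntu-hpc image URNs from the microsoft-dsvm publisher
az vm image list --publisher microsoft-dsvm --offer ubuntu-hpc --all --output table
```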

Note Some images are incompatible with CycleCloud's built-in templates. For example, you can't use microsoft-dsvm:ubuntu-hpc:2004: with Slurm Template. We have a custom template to solve it.

Note: Install on Windows

The CycleCloud CLI on Windows depends on the cryptography package, so for the install script to finish we need to do the following before running it:

  1. Install OpenSSL choco install openssl -pre -y
  2. Append C:\Program Files\OpenSSL-Win64\lib; to environment variable LIB
  3. Append C:\Program Files\OpenSSL-Win64\include; to environment variable INCLUDE

SGX Explained

  • ifconfig ib0 192.168.*
  • Change ~/.bashrc for head-node / non-head-node conditional setup:
if [ -f /dev/ ]; then # head node: set up eth .xxx and ib .xxx fi
  • NFS across the ~100 machines
  • Slurm starts running jobs as the machines come up one by one; tasks are reused
  • Scripts to allocate, start and stop nodes, so machines can take turns sleeping and taking turns running Slurm
  • MIG needs two sets of commands to bring up / rmmod nvidia*
  • Prometheus + Grafana for all chassis

Performance

    echo 2 > /proc/sys/vm/overcommit_memory
    ulimit -a ulimited
    echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    [ -f "/shared/opt/home/q-e" ]
    sudo mount 10.0.0.8:/mnt/exports/shared/home /shared/home
    ...

GeekPie Machine

Machine overview

Currently, we are using the SuperMicro 4124GS-TNR server.

CPU: Epyc 7742

AMD claims that the theoretical floating-point performance can be calculated as: double-precision theoretical floating-point performance = #real_cores * 8 DP flops/clk * core frequency. For a 2-socket system: 2 * 64 cores * 8 DP flops/clk * 2.2 GHz = 2252.8 GFLOPS. This counts FMA as two flops.
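Written out as a formula (just restating the numbers above, with FMA counted as two flops):

$$
R_{\text{peak}} = N_{\text{sockets}} \times N_{\text{cores}} \times 8\,\tfrac{\text{DP flop}}{\text{cycle}} \times f_{\text{core}} = 2 \times 64 \times 8 \times 2.2\,\text{GHz} = 2252.8\ \text{GFLOPS}
$$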

GPU

RDMA

    a1:00.0 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    a1:00.1 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
  • Official documentation
  • IB card communication protocols: https://www.rdmamojo.com/2013/06/01/which-queue-pair-type-to-use/
  • OpenMPI usage: http://scc.ustc.edu.cn/zlsc/user_doc/html/mpi-application/mpi-application.html
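A quick way to sanity-check the ConnectX-5 cards listed above, assuming MLNX_OFED (or rdma-core) is already installed; ibdev2netdev is MLNX_OFED-specific:

```bash
ibstat          # port state, rate and LID of each HCA
ibv_devinfo     # verbs-level device information
ibdev2netdev    # map IB devices to their network interfaces (MLNX_OFED only)
```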

RAID

e6:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller [1b4b:9230] (rev 11) (prog-if 01 [AHCI 1.0])

Official brief: https://www.marvell.com/content/dam/marvell/en/public-collateral/storage/marvell-storage-88se92xx-product-brief-2012-04.pdf

To configure the RAID controller, the easiest way is to press Ctrl+M during booting.

If you want to boot a system on RAID, please use Legacy mode. If you switched to UEFI only, you can't find the controller even if you change it back later. To solve it, see Supermicro FAQ Entry

Firmware

It's possible to flash firmware, see Marvell 9230 Firmware Updates and such. Our current firmware is 1070 (bios oprom version). If you want to flash another firmware, you might need to make a FreeDOS bootable disk.

Note: Do backup before flashing!

Many links to firmware or utilities are broken. Station Drivers may still work. Also refer to the Marvell 92xx A1 Firmware Image Repository, which has a full collection of firmware images.

You can find Supermicro's firmware on the official site, but you can't download it there. Try downloading from http://members.iinet.net.au/~michaeldd/.

NVMe

Installed with https://www.asus.com/us/Motherboards-Components/Motherboards/Accessories/HYPER-M-2-X16-CARD-V2/.

    21:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
    22:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
    23:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
    24:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]

⚠️ The PCIE socket of the NVME card must be configured as 4x4x4x4 so as to be recognized by the system correctly.

The card may have problems. If you find it doesn't work correctly, ask in Slack.

RAID Controller

MegaRAID

LSI_SAS_EmbMRAID_SWUG.pdf 2006 LSI_SAS_EmbMRAID_SWUG.pdf

ASrock

faq1

faq2

Win-Raid

forum

Help-Problem-to-flash-the-Marvel-SE-card-resolve

Syba-SI-PEX-PCIe-Card-with-Marvell-SATA-Controller

firmware UEFI

firmware DOS

http://members.iinet.net.au/~michaeldd/CDR-A1-UP_1.01_for_Intel_A1_UP_platform.zip

Supermicro superserver bios change cause 960 nvme disappear

https://tinkertry.com/supermicro-superserver-bios-change-can-cause-960-pro-and-evo-to-hide-heres-the-fix

Background knowledge

Software

Currently we are using Ubuntu Server 20.04.3 LTS.

Boot: systemd-boot

We have replaced grub with systemd-boot. For introduction, see systemd-boot - ArchWiki (archlinux.org).

To configure systemd-boot, use bootctl. To change kernel parameters, modify /etc/kernel/postinst.d/zz-update-systemd-boot. GitHub backup: https://gist.github.com/KiruyaMomochi/9df313c2abc55c1736d457d48abc0f54
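A minimal sketch of inspecting the systemd-boot setup with bootctl (read-only commands, nothing here modifies the ESP):

```bash
bootctl status   # show the ESP location, installed loader version and default entry
bootctl list     # list all boot entries systemd-boot knows about
```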

Network: netplan

Since Systemd v197, network interfaces use predictable naming schemes. See systemd.net-naming-scheme (www.freedesktop.org) for detail.

Ubuntu uses netplan to configure the network. It reads network configuration from /etc/netplan/*.yaml, then converts it into a systemd-networkd configuration.

Netplan configuration examples: https://netplan.io/examples/.

Drivers

InfiniBand

  1. Download drivers from Linux InfiniBand Drivers (mellanox.com).
  2. tar -xzf MLNX_OFED_LINUX-5.4-3.1.0.0-ubuntu20.04-x86_64.tgz
  3. cd to the directory and
sudo ./mlnxofedinstall --add-kernel-support

Configure IPoIB

For RHEL/CentOS, see IP over InfiniBand (IPoIB) - MLNX_OFED v5.1-0.6.6.0 - Mellanox Docs.

For Ubuntu, create /etc/netplan/10-infiniband.yaml with:

    network:
      version: 2
      ethernets:
        ibp161s0f0:            # Name of the InfiniBand interface
          addresses:
            - 11.4.3.20/24     # Change to your IP address

You may need to change the interface name and IP address to your own.
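To apply the configuration above, a minimal sketch (netplan try rolls back automatically if connectivity is lost; the interface name follows the example above):

```bash
sudo netplan try           # apply with automatic rollback on lost connectivity
sudo netplan apply         # apply permanently
ip addr show ibp161s0f0    # verify the IPoIB address is configured
```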

Ansible

To manage two servers at the same time, it's easier to use Ansible.
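A minimal ad-hoc Ansible sketch, assuming the two nodes are reachable as node1 and node2 over SSH (the inventory file name and the apt task are only illustrative):

```bash
# Hypothetical inventory listing both servers
cat > hosts.ini <<'EOF'
[cluster]
node1
node2
EOF

ansible -i hosts.ini cluster -m ping                                     # check connectivity
ansible -i hosts.ini cluster -b -m apt -a "name=perftest state=present"  # example ad-hoc task on both nodes
```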

Network File System (NFS)

NFS is exported from node1. Only NFS v4 is supported:

    /srv/nfs4       *(rw,sync,fsid=0,crossmnt,no_subtree_check)
    /srv/nfs4/home  *(rw,sync,no_subtree_check)

/proc/fs/nfsd/versions: -2 -3 +4 +4.1 +4.2

It is mounted on all nodes at /mnt/nfs4 and /mnt/home:

    node1:/      /mnt/nfs4  nfs  rw,noauto,x-systemd.automount  0 0
    node1:/home  /mnt/home  nfs  rw                             0 0

You can use /mnt/home/<user> as your home directory:

    # On node 1
    sudo mkdir /srv/nfs4/home/<user>
    # On all nodes
    sudo usermod -d /mnt/home/<user> <user>

Other tools

systemd-nspawn

See systemd-nspawn

Tuning

Enable / Disable SMT (HyperThreading)

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading.

    # From https://docs.kernelcare.com/how-to/
    # Check the SMT state
    cat /sys/devices/system/cpu/smt/active
    # Enable SMT
    echo on > /sys/devices/system/cpu/smt/control
    # Disable SMT
    echo off > /sys/devices/system/cpu/smt/control

Tick-free CPU

When the kernel is booted with nohz_full=1-127 set, CPUs 1-127 run tickless. Refer to CPU Isolation - Nohz_full - by SUSE Labs (part 3) | SUSE Communities for more details.
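A quick way to confirm the tick-free setup took effect (these are standard sysfs/procfs files):

```bash
cat /proc/cmdline                       # confirm nohz_full=1-127 was passed to the kernel
cat /sys/devices/system/cpu/nohz_full   # CPUs currently running tickless
cat /sys/devices/system/cpu/isolated    # CPUs isolated via isolcpus, if any
```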

Also see:

A full list of kernel parameters is available at https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html.

Set kernel.yama.ptrace_scope to 0

For temporary applying, use the following command

sudo sysctl -w kernel.yama.ptrace_scope=0

For a permanent setting, edit /etc/sysctl.d/10-ptrace.conf.
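A minimal sketch of the permanent change (this overwrites the stock Ubuntu file, which normally sets the value to 1):

```bash
echo 'kernel.yama.ptrace_scope = 0' | sudo tee /etc/sysctl.d/10-ptrace.conf
sudo sysctl --system    # reload all sysctl configuration files
```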

For documentation, see The Linux kernel user’s and administrator’s guide » Linux Security Module Usage - Yama.

Kernel

We use a custom kernel with NOHZ support enabled.

Build Kernel on Debian/Ubuntu

To build kernel, refer to Chapter 4. Common kernel-related tasks (pages.debian.net).

Current kernel config is at /usr/src/linux-headers-$(uname -r)/.config.

Kernel/BuildYourOwnKernel - Ubuntu Wiki and BuildADebianKernelPackage - Debian Wiki are obsolete, do not use them.

If you don't want to use module signing:

    scripts/config --disable MODULE_SIG
    scripts/config --disable SYSTEM_TRUSTED_KEYS

Also consider disabling debug info:

scripts/config --disable DEBUG_INFO
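A rough sketch of the whole build on Debian/Ubuntu, assuming the kernel source tree and build dependencies are already in place (the -j level and config tweaks are up to you):

```bash
cp /usr/src/linux-headers-$(uname -r)/.config .config   # start from the running kernel's config
scripts/config --disable MODULE_SIG
scripts/config --disable SYSTEM_TRUSTED_KEYS
scripts/config --disable DEBUG_INFO
make olddefconfig                # accept defaults for any new options
make -j"$(nproc)" bindeb-pkg     # build installable .deb packages
```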

systemd-nspawn

systemd-nspawn is like the chroot command, but it is a chroot on steroids. See systemd-nspawn - ArchWiki (archlinux.org) and nspawn - Debian Wiki for introduction.

Bootstrap

We can bootstrap a Debian machine using debootstrap, but mkosi is also worth trying.

For example, bootstrap an openSUSE image:

    python3 -m pip install --user git+git://github.com/systemd/mkosi.git
    sudo .local/bin/mkosi -d opensuse -t directory -p systemd-container --checksum --password password -o /var/lib/machines/opensuse-test
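Once the image exists under /var/lib/machines, it can be booted as a container; a minimal sketch (the machine name matches the -o path above):

```bash
sudo systemd-nspawn -M opensuse-test -b    # boot it interactively
# or manage it through machinectl
sudo machinectl start opensuse-test
sudo machinectl shell opensuse-test
```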

RDMA

Install

Although there is no documentation for systemd-nspawn specifically, we can refer to How-to: Deploy RDMA accelerated Docker container over InfiniBand fabric.

Make sure these tools have the same version as the host.

We only need to install userspace tools into nspawn container without updating firmware:

./mlnxofedinstall --user-space-only --without-fw-update

Edit .nspawn file

Edit the container's .nspawn file, which is located at /etc/systemd/nspawn/<machine-name>.nspawn. If such a file does not exist, create one.

Then add the following content:

    [Exec]
    Capability=CAP_IPC_LOCK
    LimitMEMLOCK=infinity

    [Files]
    Bind=/dev/infiniband/
    Bind=/dev/hugepages

Also consider using the host network by adding:

    [Network]
    VirtualEthernet=no

Add DeviceAllow

Create a drop-in file using the command

sudo systemctl edit systemd-nspawn@<machine-name>

with the following content:

    [Service]
    DeviceAllow=/dev/infiniband/uverbs0 rwm
    DeviceAllow=/dev/infiniband/uverbs1 rwm

List all the devices you want to allow there.

Test

Show status with ibstat. Test RDMA with perftest.
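A minimal perftest sketch between two nodes, assuming the HCA shows up as mlx5_0 (check with ibstat) and the nodes can reach each other over IP:

```bash
# on the server node
ib_write_bw -d mlx5_0
# on the client node, pointing at the server's address
ib_write_bw -d mlx5_0 <server_ip>
```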

If tools like perftest do not work, it may be related to:

  • https://gist.github.com/zshi-redhat/c7cfe9e0be63f0330952a28792acff2b
  • Limit on memlock, see below for solution.

Disable memlock limit

IB tools may fail to allocate memory if the memlock limit is too small. To show the current memlock limit, use

sudo systemctl show systemd-nspawn@<machine-name> --property LimitMEMLOCK

To disable limit, use

sudo systemctl edit systemd-nspawn@<machine-name>

And add LimitMEMLOCK=infinity to [Service] section, then restart your container.

Troubleshooting

No color in terminal

See Arch wiki for "broken colors" problem.

Create the file /etc/systemd/system/container-getty@.service.d/term.conf in the container with the following contents:

    [Service]
    Environment=TERM=xterm-256color

Archived

Pages under this path may be outdated and may not reflect the current setup.

Cluster Setup

Warning This is an outdated guide.

Full procedure & notes on pitfalls

Machine information & hardware preparation

  • Nodes: 4 nodes (node1~4, node1 is the head node)
  • Network: Ethernet (192.168.<A>.x) and IB (192.168.<B>.x)
    • Star topology
    • The head node needs external network access during setup
  • Disks: one system disk per node; the head node additionally has an SSD as shared storage
  • 1 Clonezilla image USB stick (the image can simply be unpacked, so the BIOS must be set to UEFI mode for the installation below)
  • 1 clean minimal CentOS 7 image USB stick (same as above)

CentOS installation

Download the CentOS-7 Minimal image onto a USB stick and plug it into the head node.

If the motherboard's BIOS boot mode is not UEFI, remember to change it at boot ;( The head node will also need the external Clonezilla USB stick later, so move USB boot to the front of the boot order as well.

Boot the head node and choose "Install CentOS 7".

If the install triggers a dracut-init... timeout, you will be dropped into the dracut shell afterwards. Run lsblk, find the USB device, note its LABEL=AAA value, then reboot; at the selection screen press e and change the first segment of LABEL=BBB on the second line to AAA, then press ctrl+x. Another way is to change LABEL=CentOS\x207\x20x\86_64 to LABEL=CentOS\x207\x20x\8 https://blog.csdn.net/qq_36937234/article/details/82996998 https://access.redhat.com/solutions/2515741

Items to adjust:

  • Disk partitioning: / + /boot is enough; do not split the subdirectories of / into separate partitions; format as ext4
  • Set the hostname to node1

After the installation finishes, you should be able to log in normally as root.

Disable SELinux: edit /etc/selinux/config and set SELINUX=disabled

Disable the firewall:

    systemctl stop firewalld.service
    systemctl disable firewalld.service

Many problems are caused by these two security services; they are useless in the competition's internal network, so turn them all off first.

Ethernet configuration

First configure the head node's external connection, then connect the nodes on the internal network.

External Ethernet

Plug in the external Ethernet cable (remember the corresponding interface <INTERFACE>, e.g. eno2).

Use the ip command to check the DNS address and related information, then run nmtui to open the network settings UI, set the external connection to DHCP mode and fill in the DNS server address, then use curl to log in to the campus network:

    $ dhclient -v <INTERFACE>
    $ curl -X POST --data “userName=<USERNAME>&password=<PASSWD>&hasValidateCode=false&authLan=zh_CN” https://10.15.44.172:8445/PortalServer//Webauth/webAuthAction\!login.action

At this point the external network should work. Record the machine's IP address for remote access (the DHCP address should not change as long as the machine stays up on the campus network); check external connectivity with curl <URL>.

Internal Ethernet

Likewise, use the ip tool to find the gateway address and other details, then use the nmtui UI to configure the internal interface (e.g. eno1), e.g. 192.168.<A>.1 for the head node.

Driver download & installation

The IB driver and the Nvidia driver are installed in the default locations (the shared disk is not set up yet), so do this before cloning the disk.

IB driver and configuration

IB driver

Nvidia driver

yum install kernel-dev epel-release dkms to add the EPEL source and the other Nvidia driver dependencies.

Disable the default nouveau display driver:

    $ vi /etc/default/grub   # add `nouveau.modeset=0` to `GRUB_CMDLINE_LINUX`
    $ grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
    $ reboot

Do not install the driver from the rpm package on the official site ;(

After rebooting, look up the latest driver version <VER.SUB> for your card on the official site (e.g. currently 410.79 for the V100), fetch the installer script and install:

    $ wget http://us.download.nvidia.com/XFree86/Linux-x86_64/<VER.SUB>/NVIDIA-Linux-x86_64-<VER.SUB>.run
    $ bash NVIDIA-Linux-x86_64-<VER.SUB>.run --kernel-source-path /usr/src/kernels/xxx   # try adding this option if the install fails

Run nvidia-smi to check whether the GPU information can be retrieved.

Creating the compute nodes by cloning

First install the essential basic tools to reduce repeated work: yum -y install <TOOL-NAME>

  • NFS: nfs-utils rpcbind
  • Lmod: environment-modules
  • Others: gcc gcc-c++ perl wget (pre-install gcc via yum for building the parallel libraries and tools)

Shut down the head node, plug in the Clonezilla USB stick, boot from it, and clone the head node's system disk onto each compute node's system disk (do not mix up the Source & Target directions): https://www.tecmint.com/linux-centos-ubuntu-disk-cloning-backup-using-clonezilla

After plugging the cloned system disks into the compute nodes, log in to each node and change the hostname and the static IP address (internal interface) so the nodes can identify each other; make sure the 4 nodes all have distinct IPs and hostnames.

    # e.g. on node2
    $ hostnamectl set-hostname node2
    $ vi /etc/sysconfig/network-scripts/ifcfg-<INTERFACE>   # change IPADDR=192.168.<A>.2

Sharing the data disk over NFS (over IB RDMA)

howto-configure-nfs-over-rdma--roce-x

Maybe useful, according to teacher Zhang:

opensmd openibd

Sharing the data disk over NFS (over TCP), fallback

Plug the disk used as shared storage into the head node. Use lsblk to confirm the new disk is present and note its name; you should see e.g. an sdb1 device (identify the shared disk by its size, do not get it wrong).

Disk formatting procedure, for the record:

    $ fdisk /dev/sdb1
    $ n                      # create a new partition
    $ p 1 [Enter] [Enter]    # make the whole disk one big primary partition

Mount the disk and create the directories to be shared (/home and /opt):

    $ mount /dev/nvme0n1 /mnt/nfs
    $ mkdir /mnt/nfs/home
    $ mkdir /mnt/nfs/opt

Start the NFS server on the head node, edit the export configuration /etc/exports, and add the entries (be careful not to add extra spaces):

    /mnt/nfs/home 192.168.<A>.0/24(rw,no_root_squash,no_all_squash,sync)
    /mnt/nfs/opt 192.168.<A>.0/24(rw,no_root_squash,no_all_squash,sync)

Parameter explanation:

  • rw: read-write
  • no_*_squash: clients accessing as * are not downgraded to an anonymous ordinary user
  • sync: writes from each client are synced to disk

Enable the services and set them to start on boot:

    $ exportfs -r
    $ service rpcbind start
    $ service nfs start
    $ chkconfig rpcbind on
    $ chkconfig nfs on

Allow NFS requests through the head node's firewall:

    $ firewall-cmd --permanent --add-service=mountd
    $ firewall-cmd --permanent --add-service=nfs
    $ firewall-cmd --permanent --add-service=rpc-bind
    $ firewall-cmd --reload

Edit /etc/fstab so the head node bind-mounts the shared directories onto /home and /opt, and the compute nodes mount the head node's directories via NFS:

    # On node1, append to /etc/fstab
    /dev/nvme0n1 /mnt/nfs ext4 rw,user,exec,suid,dev,auto,async
    /mnt/nfs/home /home none rw,user,exec,suid,dev,auto,async,bind
    /mnt/nfs/opt /opt none rw,user,exec,suid,dev,auto,async,bind
    # On node2~4, append to /etc/fstab
    node1:/mnt/nfs/home /home nfs rw,user,exec,suid,dev,auto,async
    node1:/mnt/nfs/opt /opt nfs rw,user,exec,suid,dev,auto,async

After each boot, log in as root on every node; run mount -a on the head node first, then mount -a on each compute node, and the shared directories will be mounted.

Fully manual mount procedure, for the record. After boot, first on the head node:

    $ mount /dev/nvme0n1 /mnt/nfs
    $ mount --bind /mnt/nfs/home /home
    $ mount --bind /mnt/nfs/opt /opt

Then on each compute node:

    $ showmount -e node1   # check that the NFS exports from the head node are visible
    $ mount -t nfs node1:/mnt/nfs/home /home
    $ mount -t nfs node1:/mnt/nfs/opt /opt

If you hit "Stale file handle" or "Access denied" problems, restart NFS on the head node (systemctl restart nfs) and mount again.

Passwordless SSH configuration

First set up passwordless ssh between the root users; this is needed between every pair of nodes. E.g. in /root on the head node:

    $ ssh-keygen          # default location and name
    $ ssh-copy-id node1   # the local node needs the key too
    $ ...
    $ ssh-copy-id node4

Then create an ordinary user on each node, taking care to use the same name & same uid & same group (gid) & same password:

    $ useradd <USERNAME> -m
    $ passwd <USERNAME>
    $ [Type new PASSWORD] [Type again]   # set the password with passwd, not useradd -p, which fails on non-conforming passwords

Set the password with passwd; otherwise the -p option may fail silently when the password does not meet the policy. Create the users in the same order on every node, and check with cat /etc/passwd.

On any node, switch to the ordinary user, generate and copy the key (note the ordinary user's home directory is shared):

    $ su testuser
    $ cd
    $ ssh-keygen [Enter] [Enter] [Enter]
    $ ssh-copy-id localhost

Installing compilers, parallel libraries and the environment

The environment installation tree is placed under /opt.

Required environments and installation procedure: see "Environment Installation"

Environment Modules configuration

Lmod was already installed above. Create mkdir /opt/modulefiles on the shared disk as the modulefile location, then pin the modulefile search path on every node by adding the following line to /etc/environment:

export MODULEPATH=/opt/modulefiles

Don't forget to source /etc/environment

Previously used modulefiles: see "Modulefile Records"

Environment Installation

Warning This is an outdated guide.

Installation methods + location in the directory tree

Installation directory tree

    |- /opt/
       |- openmpi/
          |- 4.0
          |- 3.1
          |- ...
       |- mpich/
       |- intel/        # the whole Intel family
       |- blas/
       |- gcc/
       |- cuda/         # Nvidia CUDA
       |- pgi/          # CUDA PGI Edition
       |- netcdf/
          |- netcdf-c/
          |- netcdf-fort/
          |- pnetcdf

The basic six steps of building from source:

    $ wget [SOURCE_URL]
    $ tar zxvf openmpi-4.0.0.tar.gz
    $ cd openmpi-4.0.0/
    $ ./configure --prefix=‘/opt/mpi/openmpi/4.0’   # plan the prefix carefully
    $ make -j8
    $ make install

Package management

Choose between spack and Environment Modules; spack is basically a higher-level API over system-level modules. Since ASC20 we have used spack. To keep the directory tree shared over NFS, spack is placed under /opt.

    $ git clone https://github.com/spack/spack.git
    $ cd spack/bin
    $ ./spack install libelf   # test
    $ echo "export PATH=$PATH:/opt/spack/bin" >> ~/.bashrc
    $ echo ". /opt/spack/share/spack/setup-env.sh" >> ~/.bashrc
    $ bash

Dependency and version specs

$ spack install intel^gcc@9

At test time use:

$ spack load intel^gcc@9. Alternatively, run module avail to see which environment to load, then module load intel.

Adding a new compiler: once you have built and installed a compiler yourself and it can be found in PATH, run spack compiler find to register it; after that you can use this compiler (e.g. %intel) to build other packages.
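A minimal sketch of that flow (the compiler spec %intel@18.0.3 is only an example and must match what spack compilers reports):

```bash
spack compiler find                   # register compilers found in $PATH into compilers.yaml
spack compilers                       # list the compilers spack now knows about
spack install openmpi%intel@18.0.3    # build a package with a specific registered compiler
```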

Special note: when building and installing MPI/OpenMP libraries, be sure to enable the --with-rdma option so that InfiniBand is supported.

Compilers

  1. gcc (including gfortran) - versions 7.4 + 5.5 + 4.9.4 + 4.4.7
  2. icc & ifort: included in Intel Parallel Studio XE

The Intel Parallel Studio XE bundle

Parallel Studio XE: obtain and install it following This Procedure; the 19-20 license is as follows

  • Serial number S4ZD-MMZJXJ96 (if it has expired, you can apply on the Intel website in the register center below; for a spack install you only need to enter it during installation)
  • URL: parallel_studio_xe_2019_update2_cluster_edition.tgz
  • LICENSE: download from the official Registration Center and upload it to the server

icc, ifort, MKL and IntelMPI are all included in Parallel Studio XE: spack install intel

Because CUDA only supports host compilers whose headers predate the gcc-7 standard, we recommend using intel@18.0.3.

MPI

  1. OpenMPI - Version 4.0 + 3.1 + 3.0 + 2.1
  2. MPICH - Version 3.3 + 3.2.1
  3. IntelMPI: included in Intel Parallel Studio XE

Nvidia CUDA

  • CUDA Toolkit: spack install cuda@10.2, so that different versions can be supported.
  • PGI Edition: spack install pgi@19.10

Math Libraries

  1. MKL: included in Intel Parallel Studio XE
  2. OpenBLAS: spack install openblas

NetCDF I/O

Used in the IO500 problem at ASC19.

Environment Modules

Warning This is an outdated guide.

Environment Modules: modulefile directory tree structure + backup

Modulefile directory tree structure (Deprecated)

    |- /opt/modulefiles   # the layout does not fully follow conflict relations; pay attention to conflicts inside the modulefiles
       |- mpi/
          |- openmpi/
             |- 4.0
             |- 3.1
             |- ...
          |- mpich/
          |- intelmpi/
       |- math/
          |- mkl/
          |- blas/
       |- compilers/
          |- gcc/
          |- icc/
          |- ifort/
       |- cuda/
          |- nvidia/
          |- pgi/
       |- netcdf/
          |- pnetcdf/
          |- netcdf-c/
          |- netcdf-fort/

It-Support Machine

Machine overview

We have four IT-office (图信) accounts in total: GeekPieHPC{1, 2, 3, 4}. All four accounts sit behind the same IP address; check Slack for the machine's IP address.

How to connect

All of these accounts can be reached with ssh, or used with scp for file transfer; the port is 22112.

All four accounts sit behind the same IP address, see https://geekpiehpc.slack.com/archives/C0210BA22QH/p1631708325019000

  • Key: you can log in with the key used for the Epyc machine; an example configuration:

        Host geekpie<N>
            HostName <IP address of the target machine>
            User geekpie<N>
            Port 22112
            IdentityFile ~/.ssh/id_rsa_epyc
  • Password: you can find the password in Slack.

Scheduler

The IT-office machine uses PBS (Portable Batch System) for scheduling. Most of its commands start with q; for example, qstat shows the scheduler state.

The CPU queue is GeekPie_HPC and the GPU queue is GeekPie_GPU.

See DevOps/Scheduler for details on how to use PBS.
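A minimal PBS job script sketch for these queues (the resource request and program name are placeholders; adjust them to the real node shape):

```bash
#!/bin/bash
#PBS -N hello                 # job name
#PBS -q GeekPie_HPC           # CPU queue; use GeekPie_GPU for GPU jobs
#PBS -l select=1:ncpus=16     # hypothetical resource request
#PBS -j oe                    # merge stdout and stderr

cd "$PBS_O_WORKDIR"           # run from the submission directory
./my_program
```

Submit it with qsub job.sh and check it with qstat -u $USER.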

Environment management

On the IT-office machine, module is the recommended way to manage environment variables, but when dealing with compilers we still use spack to install them.

Support and help

You can contact the IT office via:

  • WeChat: Saber in the group is the IT-office contact.
  • Office: the IT office is in H1 304; H1 is the building with the clinic.

BMC fuck

We think it's worthwhile to reverse the BMC for the following reasons:

  1. Fine-grained adjustment of GPU power consumption (super TDP)
  2. It allows 1 machine with 4 cards. (Tsinghua ran 2 cards last year)
  3. PCIe device hot-swapping

Also see:

  1. https://github.com/l4rz/reverse-engineering-dell-idrac-to-get-rid-of-gpu-throttling

Salt Stack

What salt mainly does is push configuration files quickly. But it is not limited to that: together with jinja and LDAP you can build a private-key management system.

Since the previous maintainer ran off, the new students have to take over.

All of Linux's secrets are in PAM

Once you get used to using journalctl -x to debug the various state machines of system user authentication, or sshd, you will keep seeing words like PAM, xsecurity and sssd; these are user authentication protocols. SSSD is a daemon that connects the system's NSS/PAM machinery to LDAP. (The 20.04 GNOME exploit a while back was also the result of this protocol being bypassed.)

Once you are familiar with PAM, you will also understand why ~/.ssh/id_*.pub needs the r.. permission bits; that is hard-coded into the NSS user-directory protocol.

Further reading: https://jia.je/software/2021/02/15/sssd-ldap/

ref

C++

Finally, the language our school actually teaches.

C++ 17 & 20

New features are an evergreen topic; it has been a long time since C++11's l/rvalues. On Epyc the usual compiler performance ranking is AOCC > Intel > GCC >> LLVM, but MKL still has an edge over AMD's optimized libraries, so sometimes Intel with only basic x86 optimization can match AOCC. C++17 added a lot for parallelism, e.g. PSTL, for_each(threading), pmr, Intel's SYCL, and Nvidia's thrust/cub; often you can get a painless speedup just by switching the namespace. The most important C++20 features are ranges, filesystem, and so on. LLVM has always been the slowest to support new standards. icc used to accept some ancient constructs such as VLAs, but newer versions dropped them; this kind of unstable feature churn is why the big companies do not treat it as the reference standard.

This is a problem the author set in a compilers course assignment: give a semantic rule that rejects the line marked ???. It seems that only icc accepts it as standard.

template<class T>class array{ int s; T* elements; public: array(int n); // allocate "n" elements and let "elements" refer to them array(T* p, int n); // make this array refer to p[0..n-1] operator T*(){return elements;} int size()const{return s;} // the usual container operations, such as = and [], much like vector }; void h(array<double>a); //C++ void g(int m,double vla[m]); //C99 void f(int m,double vla1[m],array<double>a1) { array<double> a2(vla1,m); // a2 refers to vla1 double*p=a1; //p refers to a1's elements g(m,vla1); g(a1.size(),a1); // a bit verbose g(a1); //??? }

On black magic

Segfaults can show up with any compiler; very often switching to a different Intel minor version makes them go away.

Thoughts for when a build drives you crazy

For CMake builds, turn on make -n; for configure builds, turn on make VERBOSE=1. If you need to see expanded macros, pass the -E option to the compiler to get the preprocessed, expanded output.
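A minimal sketch of those three knobs (foo.cpp is a placeholder source file):

```bash
make -n                     # print the commands make would run, without running them
make VERBOSE=1              # CMake-generated Makefiles: echo the full compiler command lines
g++ -E foo.cpp -o foo.ii    # stop after preprocessing to inspect the expanded macros
```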

Make extensive use of man and --help.

Compiler options

LTO

To reduce the overhead of calls across different libraries or across languages, this mostly relies on LLVM's libLTO and tblgen. It is enabled automatically; the idea is to turn all libraries into LLVM bitcode and link them together. A parallel LTO is not that hard to implement either; a former team captain wrote a parallel-computing project in Rust that does this, see the source code for details.

PGO

By profiling the program's actual runtime behavior and feeding the results back to the compiler, the compiler can rearrange code to reduce instruction-cache problems and branch mispredictions, improving performance. Profile-guided optimization uses real executions to find the most frequently executed parts of the program, so the compiler can optimize the code more specifically with this information.


  • Stage 1: add -prof-gen=srcpos -prof-dir=/tmp/profdata to the compile flags; -prof-dir is the directory where the profiling files are stored.
  • Stage 2: run the compiled program, then run profmerge -prof_dir /tmp/profdata to generate the merged profile.
  • Stage 3: recompile the program with -prof-use=nomerge -prof-func-groups -prof-dir=/tmp/profdata (a sketch of the full flow follows this list).
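A sketch of the three stages end to end with icc, using only the flags listed above (main.c and app are placeholder names):

```bash
# Stage 1: instrumented build
icc -O2 -prof-gen=srcpos -prof-dir=/tmp/profdata -o app.inst main.c
# Stage 2: run representative workloads, then merge the profiles
./app.inst
profmerge -prof_dir /tmp/profdata
# Stage 3: rebuild using the collected profile
icc -O2 -prof-use=nomerge -prof-func-groups -prof-dir=/tmp/profdata -o app main.c
```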

IPO

Interprocedural optimization and profile-guided optimization can influence each other: PGO usually helps the compiler generate inlined functions, which in turn improves the effectiveness of IPO. PGO helps branch prediction the most; the likelihood of many branches cannot be determined at compile time, but with PGO the compiler can generate efficient code for frequently executed branches (hot code) versus rarely executed branches (cold code).

HLO

Some of these optimizations already appear at the LLVM IR level discussed below; icc actually has an additional high-level IR that sits before the LLVM IR. According to the documentation, it mainly performs:

  • Loop Permutation or Interchange
  • Loop Distribution
  • Loop Fusion
  • Loop Unrolling
  • Data Prefetching
  • Scalar Replacement
  • Unroll and Jam
  • Loop Blocking or Tiling
  • Partial-Sum Optimization
  • Predicate Optimization
  • Loop Reversal
  • Profile-Guided Loop Unrolling
  • Loop Peeling
  • Data Transformation: Malloc Combining and Memset Combining, Memory Layout Change
  • Loop Rerolling
  • Memset and Memcpy Recognition
  • Statement Sinking for Creating Perfect Loopnests
  • Multiversioning: Checks include Dependency of Memory References, and Trip Counts
  • Loop Collapsing

DOP

If you have written programs for game engines like UE, kernel structs that need cache alignment, or followed the recent arms race in database optimization, you will be familiar with this kind of data structure. The core idea is to pack data into AVX-aligned structs and make every operation an add/multiply over those structs. See https://neil3d.github.io/assets/img/ecs/DOD-Cpp.pdf.

LLVM

DC++/AOCC have both started using LLVM as the intermediate layer.

Roughly what icc does differently

Static analysis with the latest 2021.3.0, using saxpy as the reference kernel. SAXPY means Single-Precision A·X Plus Y; it is a level-1 BLAS routine that is often written as a kernel for tuning registers and the memory model. Its C++ version is:

    void saxpy(int n, float a, float *x, float *y) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

The LLVM IR is as follows (full link: https://godbolt.org/z/j5rrxhedG). It mostly hard-codes various pre-optimized assembly, especially the fast instruction-fetch pattern of moving high addresses. It looks as if Intel C/C++ compiler intrinsic equivalents such as VADDSS (__m128 _mm_mask_add_ss (__m128 s, __mmask8 k, __m128 a, __m128 b)) are compiled into the IR together as library functions.

.section .text .LNDBG_TX: # mark_description "Intel(R) C Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.3.0 Build 2021"; # mark_description "0609_000000"; # mark_description "-g -o /app/output.s -masm=intel -S -gxx-name=/opt/compiler-explorer/gcc-10.1.0/bin/g++ -emit-llvm"; .intel_syntax noprefix .file "example.cpp" .text ..TXTST0: .L_2__routine_start_saxpy(int, float, float*, float*)_0: # -- Begin saxpy(int, float, float*, float*) .text # mark_begin; .globl saxpy(int, float, float*, float*) # --- saxpy(int, float, float *, float *) saxpy(int, float, float*, float*): # parameter 1(n): edi # parameter 2(a): xmm0 # parameter 3(x): rsi # parameter 4(y): rdx ..B1.1: # Preds ..B1.0 # Execution count [0.00e+00] .cfi_startproc .cfi_personality 0x3,__gxx_personality_v0 ..___tag_value_saxpy(int, float, float*, float*).2: ..L3: #2.1 ..LN0: .file 1 "/app/example.cpp" .loc 1 2 is_stmt 1 push rbp #2.1 .cfi_def_cfa_offset 16 ..LN1: mov rbp, rsp #2.1 .cfi_def_cfa 6, 16 .cfi_offset 6, -16 ..LN2: sub rsp, 48 #2.1 ..LN3: mov DWORD PTR [-40+rbp], edi #2.1 ..LN4: movss DWORD PTR [-32+rbp], xmm0 #2.1 ..LN5: mov QWORD PTR [-24+rbp], rsi #2.1 ..LN6: mov QWORD PTR [-16+rbp], rdx #2.1 ..LN7: .loc 1 3 prologue_end is_stmt 1 mov DWORD PTR [-48+rbp], 0 #3.14 ..LN8: # LOE rbx rbp rsp r12 r13 r14 r15 rip ..B1.2: # Preds ..B1.3 ..B1.1 # Execution count [0.00e+00] ..LN9: mov eax, DWORD PTR [-48+rbp] #3.19 ..LN10: mov edx, DWORD PTR [-40+rbp] #3.23 ..LN11: cmp eax, edx #3.23 ..LN12: jge ..B1.4 # Prob 50% #3.23 ..LN13: # LOE rbx rbp rsp r12 r13 r14 r15 rip ..B1.3: # Preds ..B1.2 # Execution count [0.00e+00] ..LN14: .loc 1 4 is_stmt 1 movss xmm0, DWORD PTR [-32+rbp] #4.14 ..LN15: mov eax, DWORD PTR [-48+rbp] #4.18 ..LN16: movsxd rax, eax #4.16 ..LN17: imul rax, rax, 4 #4.16 ..LN18: add rax, QWORD PTR [-24+rbp] #4.16 ..LN19: movss xmm1, DWORD PTR [rax] #4.16 ..LN20: mulss xmm0, xmm1 #4.16 ..LN21: mov eax, DWORD PTR [-48+rbp] #4.25 ..LN22: movsxd rax, eax #4.23 ..LN23: imul rax, rax, 4 #4.23 ..LN24: add rax, QWORD PTR [-16+rbp] #4.23 ..LN25: movss xmm1, DWORD PTR [rax] #4.23 ..LN26: addss xmm0, xmm1 #4.23 ..LN27: mov eax, DWORD PTR [-48+rbp] #4.9 ..LN28: movsxd rax, eax #4.7 ..LN29: imul rax, rax, 4 #4.7 ..LN30: add rax, QWORD PTR [-16+rbp] #4.7 ..LN31: movss DWORD PTR [rax], xmm0 #4.7 ..LN32: .loc 1 3 is_stmt 1 mov eax, 1 #3.28 ..LN33: add eax, DWORD PTR [-48+rbp] #3.28 ..LN34: mov DWORD PTR [-48+rbp], eax #3.28 ..LN35: jmp ..B1.2 # Prob 100% #3.28 ..LN36: # LOE rbx rbp rsp r12 r13 r14 r15 rip ..B1.4: # Preds ..B1.2 # Execution count [0.00e+00] ..LN37: .loc 1 5 epilogue_begin is_stmt 1 leave #5.1 .cfi_restore 6 ..LN38: ret #5.1 ..LN39: # LOE ..LN40: .cfi_endproc # mark_end; .type saxpy(int, float, float*, float*),@function .size saxpy(int, float, float*, float*),.-saxpy(int, float, float*, float*) ..LNsaxpy(int, float, float*, float*).41: .LNsaxpy(int, float, float*, float*): .data # -- End saxpy(int, float, float*, float*) .data .section .note.GNU-stack, "" // -- Begin DWARF2 SEGMENT .debug_info .section .debug_info .debug_info_seg: .align 1 .4byte 0x000000be ....

The assembly is as follows; you can see every branch carries a probability prediction, and the loop is auto-vectorized.

saxpy(int, float, float*, float*): mov r9, rsi #2.1 test edi, edi #3.23 jle ..B1.36 # Prob 50% #3.23 cmp edi, 6 #3.3 jle ..B1.30 # Prob 50% #3.3 movsxd r8, edi #1.6 mov rax, rdx #4.16 sub rax, r9 #4.16 lea rcx, QWORD PTR [r8*4] #3.3 cmp rax, rcx #3.3 jge ..B1.5 # Prob 50% #3.3 neg rax #4.23 cmp rax, rcx #3.3 jl ..B1.30 # Prob 50% #3.3 ..B1.5: # Preds ..B1.4 ..B1.3 cmp edi, 8 #3.3 jl ..B1.38 # Prob 10% #3.3 mov r10, rdx #3.3 and r10, 15 #3.3 test r10d, r10d #3.3 je ..B1.9 # Prob 50% #3.3 test r10d, 3 #3.3 jne ..B1.38 # Prob 10% #3.3 neg r10d #3.3 add r10d, 16 #3.3 shr r10d, 2 #3.3 ..B1.9: # Preds ..B1.8 ..B1.6 lea eax, DWORD PTR [8+r10] #3.3 cmp edi, eax #3.3 jl ..B1.38 # Prob 10% #3.3 mov esi, edi #3.3 xor ecx, ecx #3.3 sub esi, r10d #3.3 and esi, 7 #3.3 neg esi #3.3 add esi, edi #3.3 mov eax, r10d #3.3 test r10d, r10d #3.3 jbe ..B1.14 # Prob 9% #3.3 ..B1.12: # Preds ..B1.10 ..B1.12 movss xmm1, DWORD PTR [r9+rcx*4] #4.16 mulss xmm1, xmm0 #4.16 addss xmm1, DWORD PTR [rdx+rcx*4] #4.23 movss DWORD PTR [rdx+rcx*4], xmm1 #4.7 inc rcx #3.3 cmp rcx, rax #3.3 jb ..B1.12 # Prob 82% #3.3 ..B1.14: # Preds ..B1.12 ..B1.10 lea rcx, QWORD PTR [r9+rax*4] #4.16 test rcx, 15 #3.3 je ..B1.18 # Prob 60% #3.3 movaps xmm1, xmm0 #1.6 shufps xmm1, xmm1, 0 #1.6 movsxd rcx, esi #3.3 ..B1.16: # Preds ..B1.16 ..B1.15 movups xmm2, XMMWORD PTR [r9+rax*4] #4.16 movups xmm3, XMMWORD PTR [16+r9+rax*4] #4.16 mulps xmm2, xmm1 #4.16 mulps xmm3, xmm1 #4.16 addps xmm2, XMMWORD PTR [rdx+rax*4] #4.23 addps xmm3, XMMWORD PTR [16+rdx+rax*4] #4.23 movups XMMWORD PTR [rdx+rax*4], xmm2 #4.7 movups XMMWORD PTR [16+rdx+rax*4], xmm3 #4.7 add rax, 8 #3.3 cmp rax, rcx #3.3 jb ..B1.16 # Prob 82% #3.3 jmp ..B1.21 # Prob 100% #3.3 ..B1.18: # Preds ..B1.14 movaps xmm1, xmm0 #1.6 shufps xmm1, xmm1, 0 #1.6 movsxd rcx, esi #3.3 ..B1.19: # Preds ..B1.19 ..B1.18 movups xmm2, XMMWORD PTR [r9+rax*4] #4.16 movups xmm3, XMMWORD PTR [16+r9+rax*4] #4.16 mulps xmm2, xmm1 #4.16 mulps xmm3, xmm1 #4.16 addps xmm2, XMMWORD PTR [rdx+rax*4] #4.23 addps xmm3, XMMWORD PTR [16+rdx+rax*4] #4.23 movups XMMWORD PTR [rdx+rax*4], xmm2 #4.7 movups XMMWORD PTR [16+rdx+rax*4], xmm3 #4.7 add rax, 8 #3.3 cmp rax, rcx #3.3 jb ..B1.19 # Prob 82% #3.3 ..B1.21: # Preds ..B1.19 ..B1.16 lea eax, DWORD PTR [1+rsi] #3.3 cmp eax, edi #3.3 ja ..B1.36 # Prob 50% #3.3 sub r8, rcx #3.3 cmp r8, 4 #3.3 jl ..B1.39 # Prob 10% #3.3 mov eax, r8d #3.3 xor r10d, r10d #3.3 and eax, -4 #3.3 lea rdi, QWORD PTR [rdx+rcx*4] #4.23 movsxd rax, eax #3.3 lea rcx, QWORD PTR [r9+rcx*4] #4.16 ..B1.24: # Preds ..B1.24 ..B1.23 movups xmm2, XMMWORD PTR [rcx+r10*4] #4.16 mulps xmm2, xmm1 #4.16 addps xmm2, XMMWORD PTR [rdi+r10*4] #4.23 movups XMMWORD PTR [rdi+r10*4], xmm2 #4.7 add r10, 4 #3.3 cmp r10, rax #3.3 jb ..B1.24 # Prob 82% #3.3 ..B1.26: # Preds ..B1.24 ..B1.39 cmp rax, r8 #3.3 jae ..B1.36 # Prob 9% #3.3 movsxd rsi, esi #4.7 lea rcx, QWORD PTR [rdx+rsi*4] #4.23 lea rdx, QWORD PTR [r9+rsi*4] #4.16 ..B1.28: # Preds ..B1.28 ..B1.27 movss xmm1, DWORD PTR [rdx+rax*4] #4.16 mulss xmm1, xmm0 #4.16 addss xmm1, DWORD PTR [rcx+rax*4] #4.23 movss DWORD PTR [rcx+rax*4], xmm1 #4.7 inc rax #3.3 cmp rax, r8 #3.3 jb ..B1.28 # Prob 82% #3.3 jmp ..B1.36 # Prob 100% #3.3 ..B1.30: # Preds ..B1.4 ..B1.2 mov eax, edi #3.3 mov esi, 1 #3.3 xor ecx, ecx #3.3 shr eax, 1 #3.3 je ..B1.34 # Prob 9% #3.3 ..B1.32: # Preds ..B1.30 ..B1.32 movss xmm1, DWORD PTR [r9+rcx*8] #4.16 mulss xmm1, xmm0 #4.16 addss xmm1, DWORD PTR [rdx+rcx*8] #4.23 movss DWORD PTR [rdx+rcx*8], xmm1 #4.7 movss xmm2, DWORD PTR [4+r9+rcx*8] #4.16 mulss xmm2, 
xmm0 #4.16 addss xmm2, DWORD PTR [4+rdx+rcx*8] #4.23 movss DWORD PTR [4+rdx+rcx*8], xmm2 #4.7 inc rcx #3.3 cmp rcx, rax #3.3 jb ..B1.32 # Prob 63% #3.3 lea esi, DWORD PTR [1+rcx+rcx] #4.7 ..B1.34: # Preds ..B1.33 ..B1.30 lea eax, DWORD PTR [-1+rsi] #3.3 cmp eax, edi #3.3 jae ..B1.36 # Prob 9% #3.3 movsxd rsi, esi #3.3 movss xmm1, DWORD PTR [-4+r9+rsi*4] #4.16 mulss xmm0, xmm1 #4.16 addss xmm0, DWORD PTR [-4+rdx+rsi*4] #4.23 movss DWORD PTR [-4+rdx+rsi*4], xmm0 #4.7 ..B1.36: # Preds ..B1.28 ..B1.21 ..B1.34 ..B1.38 ..B1.1 ret #5.1 ..B1.38: # Preds ..B1.5 ..B1.7 ..B1.9 xor esi, esi #3.3 cmp edi, 1 #3.3 jb ..B1.36 # Prob 50% #3.3 ..B1.39: # Preds ..B1.22 ..B1.38 xor eax, eax #3.3 jmp ..B1.26 # Prob 100% #3.3

Below is the LLVM IR emitted by AOCC. It does nothing special at the IR level and is basically the same as what clang emits.

; ModuleID = './a.c' source_filename = "./a.c" target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-unknown-linux-gnu" ; Function Attrs: noinline nounwind optnone uwtable define dso_local void @saxpy(i32 %n, float %a, float* %x, float* %y) #0 { entry: %n.addr = alloca i32, align 4 %a.addr = alloca float, align 4 %x.addr = alloca float*, align 8 %y.addr = alloca float*, align 8 %i = alloca i32, align 4 store i32 %n, i32* %n.addr, align 4 store float %a, float* %a.addr, align 4 store float* %x, float** %x.addr, align 8 store float* %y, float** %y.addr, align 8 store i32 0, i32* %i, align 4 br label %for.cond for.cond: ; preds = %for.inc, %entry %0 = load i32, i32* %i, align 4 %1 = load i32, i32* %n.addr, align 4 %cmp = icmp slt i32 %0, %1 br i1 %cmp, label %for.body, label %for.end for.body: ; preds = %for.cond %2 = load float, float* %a.addr, align 4 %3 = load float*, float** %x.addr, align 8 %4 = load i32, i32* %i, align 4 %idxprom = sext i32 %4 to i64 %arrayidx = getelementptr inbounds float, float* %3, i64 %idxprom %5 = load float, float* %arrayidx, align 4 %mul = fmul float %2, %5 %6 = load float*, float** %y.addr, align 8 %7 = load i32, i32* %i, align 4 %idxprom1 = sext i32 %7 to i64 %arrayidx2 = getelementptr inbounds float, float* %6, i64 %idxprom1 %8 = load float, float* %arrayidx2, align 4 %add = fadd float %mul, %8 %9 = load float*, float** %y.addr, align 8 %10 = load i32, i32* %i, align 4 %idxprom3 = sext i32 %10 to i64 %arrayidx4 = getelementptr inbounds float, float* %9, i64 %idxprom3 store float %add, float* %arrayidx4, align 4 br label %for.inc for.inc: ; preds = %for.body %11 = load i32, i32* %i, align 4 %inc = add nsw i32 %11, 1 store i32 %inc, i32* %i, align 4 br label %for.cond for.end: ; preds = %for.cond ret void } attributes #0 = { noinline nounwind optnone uwtable "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="false" "use-soft-float"="false" } !llvm.module.flags = !{!0} !llvm.ident = !{!1} !0 = !{i32 1, !"wchar_size", i32 4} !1 = !{!"AMD clang version 12.0.0 (CLANG: AOCC_3.0.0-Build#78 2020_12_10) (based on LLVM Mirror.Version.12.0.0)"}

The vectorized part is basically the same as icc's, but without the probability model. Unfortunately the cost model behind icc's branch probabilities targets Intel processors, so in the end icc and AOCC come out about even.

.text .file "a.c" .globl saxpy # -- Begin function saxpy .p2align 4, 0x90 .type saxpy,@function saxpy: # @saxpy .cfi_startproc # %bb.0: # %entry testl %edi, %edi jle .LBB0_16 # %bb.1: # %for.body.preheader movl %edi, %r9d cmpl $7, %edi jbe .LBB0_2 # %bb.7: # %vector.memcheck leaq (%rsi,%r9,4), %rax cmpq %rdx, %rax jbe .LBB0_9 # %bb.8: # %vector.memcheck leaq (%rdx,%r9,4), %rax cmpq %rsi, %rax jbe .LBB0_9 .LBB0_2: xorl %ecx, %ecx .LBB0_3: # %for.body.preheader23 movq %rcx, %rax notq %rax testb $1, %r9b je .LBB0_5 # %bb.4: # %for.body.prol movss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero mulss %xmm0, %xmm1 addss (%rdx,%rcx,4), %xmm1 movss %xmm1, (%rdx,%rcx,4) orq $1, %rcx .LBB0_5: # %for.body.prol.loopexit addq %r9, %rax je .LBB0_16 .p2align 4, 0x90 .LBB0_6: # %for.body # =>This Inner Loop Header: Depth=1 movss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero mulss %xmm0, %xmm1 addss (%rdx,%rcx,4), %xmm1 movss %xmm1, (%rdx,%rcx,4) movss 4(%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero mulss %xmm0, %xmm1 addss 4(%rdx,%rcx,4), %xmm1 movss %xmm1, 4(%rdx,%rcx,4) addq $2, %rcx cmpq %rcx, %r9 jne .LBB0_6 jmp .LBB0_16 .LBB0_9: # %vector.ph movl %r9d, %ecx andl $-8, %ecx movaps %xmm0, %xmm1 shufps $0, %xmm0, %xmm1 # xmm1 = xmm1[0,0],xmm0[0,0] leaq -8(%rcx), %rax movq %rax, %r8 shrq $3, %r8 addq $1, %r8 testq %rax, %rax je .LBB0_10 # %bb.11: # %vector.ph.new movq %r8, %rax andq $-2, %rax negq %rax xorl %edi, %edi .p2align 4, 0x90 .LBB0_12: # %vector.body # =>This Inner Loop Header: Depth=1 movups (%rsi,%rdi,4), %xmm2 movups 16(%rsi,%rdi,4), %xmm3 mulps %xmm1, %xmm2 mulps %xmm1, %xmm3 movups (%rdx,%rdi,4), %xmm4 addps %xmm2, %xmm4 movups 16(%rdx,%rdi,4), %xmm2 addps %xmm3, %xmm2 movups 32(%rdx,%rdi,4), %xmm3 movups 48(%rdx,%rdi,4), %xmm5 movups %xmm4, (%rdx,%rdi,4) movups %xmm2, 16(%rdx,%rdi,4) movups 32(%rsi,%rdi,4), %xmm2 movups 48(%rsi,%rdi,4), %xmm4 mulps %xmm1, %xmm2 addps %xmm3, %xmm2 mulps %xmm1, %xmm4 addps %xmm5, %xmm4 movups %xmm2, 32(%rdx,%rdi,4) movups %xmm4, 48(%rdx,%rdi,4) addq $16, %rdi addq $2, %rax jne .LBB0_12 # %bb.13: # %middle.block.unr-lcssa testb $1, %r8b je .LBB0_15 .LBB0_14: # %vector.body.epil movups (%rsi,%rdi,4), %xmm2 movups 16(%rsi,%rdi,4), %xmm3 mulps %xmm1, %xmm2 mulps %xmm1, %xmm3 movups (%rdx,%rdi,4), %xmm1 addps %xmm2, %xmm1 movups 16(%rdx,%rdi,4), %xmm2 addps %xmm3, %xmm2 movups %xmm1, (%rdx,%rdi,4) movups %xmm2, 16(%rdx,%rdi,4) .LBB0_15: # %middle.block cmpq %r9, %rcx jne .LBB0_3 .LBB0_16: # %for.cond.cleanup retq .LBB0_10: xorl %edi, %edi testb $1, %r8b jne .LBB0_14 jmp .LBB0_15 .Lfunc_end0: .size saxpy, .Lfunc_end0-saxpy .cfi_endproc # -- End function .ident "AMD clang version 12.0.0 (CLANG: AOCC_3.0.0-Build#78 2020_12_10) (based on LLVM Mirror.Version.12.0.0)" .section ".note.GNU-stack","",@progbits .addrsig

Another test on NVHPC; you can actually hack the CPU backend part using AOCC with nvc -march=zen2 -Mvect=simd:256 -Mcache_align -fma -S a.c.

; ModuleID = 'a.c' target datalayout = "e-p:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-pc-linux-gnu" define internal void @pgCplus_compiled.() noinline { L.entry: ret void } define void @saxpy(i32 signext %n.arg, float %a.arg, float* %x.arg, float* %y.arg) #0 !dbg !17 { L.entry: %n.addr = alloca i32, align 4 %a.addr = alloca float, align 4 %x.addr = alloca float*, align 8 %y.addr = alloca float*, align 8 %.ndi0002.addr = alloca i32, align 4 %.ndi0003.addr = alloca i32, align 4 %.vv0000.addr = alloca i8*, align 8 %.vv0001.addr = alloca i8*, align 8 %.vv0002.addr = alloca i8*, align 8 %.r1.0148.addr = alloca <8 x float>, align 4 %.lcr010001.addr = alloca i32, align 4 store i32 %n.arg, i32* %n.addr, align 4, !tbaa !29 store float %a.arg, float* %a.addr, align 4, !tbaa !29 store float* %x.arg, float** %x.addr, align 8, !tbaa !30 store float* %y.arg, float** %y.addr, align 8, !tbaa !30 %0 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23 %1 = icmp sle i32 %0, 0, !dbg !23 br i1 %1, label %L.B0005, label %L.B0014, !dbg !23 L.B0014: %2 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23 %3 = bitcast float* %2 to i8*, !dbg !23 %4 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23 %5 = bitcast float* %4 to i8*, !dbg !23 %6 = ptrtoint i8* %5 to i64, !dbg !23 %7 = sub i64 0, %6, !dbg !23 %8 = getelementptr i8, i8* %3, i64 %7, !dbg !23 %9 = icmp ule i8* %8, null, !dbg !23 br i1 %9, label %L.B0008, label %L.B0015, !dbg !23 L.B0015: %10 = bitcast float* %2 to i8*, !dbg !23 %11 = bitcast float* %4 to i8*, !dbg !23 %12 = ptrtoint i8* %11 to i64, !dbg !23 %13 = sub i64 0, %12, !dbg !23 %14 = getelementptr i8, i8* %10, i64 %13, !dbg !23 %15 = inttoptr i64 32 to i8*, !dbg !23 %16 = icmp ult i8* %14, %15, !dbg !23 br i1 %16, label %L.B0007, label %L.B0008, !dbg !23 L.B0008: store i32 0, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 %17 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23 store i32 %17, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %18 = icmp slt i32 %17, 8, !dbg !23 br i1 %18, label %L.B0011, label %L.B0016, !dbg !23 L.B0016: store i8* null, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23 %19 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23 %20 = bitcast float* %19 to i8*, !dbg !23 store i8* %20, i8** %.vv0001.addr, align 8, !tbaa !30, !dbg !23 %21 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23 %22 = bitcast float* %21 to i8*, !dbg !23 store i8* %22, i8** %.vv0002.addr, align 8, !tbaa !30, !dbg !23 %23 = sub i32 %17, 7, !dbg !23 store i32 %23, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %24 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23 %25 = insertelement <8 x float> undef, float %24, i32 0, !dbg !23 %26 = shufflevector <8 x float> %25, <8 x float> undef, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>, !dbg !23 store <8 x float> %26, <8 x float>* %.r1.0148.addr, align 1, !tbaa !29, !dbg !23 br label %L.B0012 L.B0012: %27 = load <8 x float>, <8 x float>* %.r1.0148.addr, align 4, !tbaa !29, !dbg !23 %28 = load i8*, i8** %.vv0002.addr, align 8, !tbaa !30, !dbg !23 %29 = load i8*, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23 %30 = ptrtoint i8* %29 to i64, !dbg !23 %31 = getelementptr i8, i8* %28, i64 %30, !dbg !23 %32 = bitcast i8* %31 to <8 x float>*, !dbg !23 %33 = load <8 x float>, <8 x float>* %32, align 4, !tbaa !29, !dbg !23 %34 = load i8*, i8** %.vv0001.addr, align 8, !tbaa !30, !dbg !23 %35 = getelementptr i8, i8* %34, i64 %30, !dbg !23 %36 = bitcast i8* %35 to <8 x 
float>*, !dbg !23 %37 = load <8 x float>, <8 x float>* %36, align 4, !tbaa !29, !dbg !23 %38 = call <8 x float> @llvm.fma.v8f32 (<8 x float> %27, <8 x float> %33, <8 x float> %37), !dbg !23 store <8 x float> %38, <8 x float>* %36, align 1, !tbaa !29, !dbg !23 %39 = getelementptr i8, i8* %29, i64 32, !dbg !23 store i8* %39, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23 %40 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %41 = sub i32 %40, 8, !dbg !23 store i32 %41, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %42 = icmp sgt i32 %41, 0, !dbg !23 br i1 %42, label %L.B0012, label %L.B0017, !llvm.loop !24, !dbg !23 L.B0017: %43 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %44 = add i32 %43, 7, !dbg !23 store i32 %44, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %45 = icmp eq i32 %44, 0, !dbg !23 br i1 %45, label %L.B0013, label %L.B0018, !dbg !23 L.B0018: %46 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23 %47 = and i32 %46, -8, !dbg !23 store i32 %47, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 br label %L.B0011 L.B0011: %48 = load i32, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 %49 = sext i32 %48 to i64, !dbg !23 %50 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23 %51 = getelementptr float, float* %50, i64 %49, !dbg !23 %52 = load float, float* %51, align 4, !tbaa !29, !dbg !23 %53 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23 %54 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23 %55 = getelementptr float, float* %54, i64 %49, !dbg !23 %56 = load float, float* %55, align 4, !tbaa !29, !dbg !23 %57 = call float @llvm.fma.f32 (float %53, float %56, float %52), !dbg !23 store float %57, float* %51, align 4, !tbaa !29, !dbg !23 %58 = add i32 %48, 1, !dbg !23 store i32 %58, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 %59 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %60 = sub i32 %59, 1, !dbg !23 store i32 %60, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23 %61 = icmp sgt i32 %60, 0, !dbg !23 br i1 %61, label %L.B0011, label %L.B0013, !llvm.loop !24, !dbg !23 L.B0013: br label %L.B0009, !dbg !23 L.B0007: store i32 0, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 %62 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23 store i32 %62, i32* %.lcr010001.addr, align 4, !tbaa !32, !dbg !23 br label %L.B0010 L.B0010: %63 = load i32, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 %64 = sext i32 %63 to i64, !dbg !23 %65 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23 %66 = getelementptr float, float* %65, i64 %64, !dbg !23 %67 = load float, float* %66, align 4, !tbaa !29, !dbg !23 %68 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23 %69 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23 %70 = getelementptr float, float* %69, i64 %64, !dbg !23 %71 = load float, float* %70, align 4, !tbaa !29, !dbg !23 %72 = call float @llvm.fma.f32 (float %68, float %71, float %67), !dbg !23 store float %72, float* %66, align 4, !tbaa !29, !dbg !23 %73 = add i32 %63, 1, !dbg !23 store i32 %73, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23 %74 = load i32, i32* %.lcr010001.addr, align 4, !tbaa !32, !dbg !23 %75 = icmp slt i32 %73, %74, !dbg !23 br i1 %75, label %L.B0010, label %L.B0009, !dbg !23 L.B0009: br label %L.B0005 L.B0005: ret void, !dbg !26 } declare float @llvm.fma.f32(float, float, float) declare <8 x float> @llvm.fma.v8f32(<8 x float>, <8 x float>, <8 x float>) declare i32 @__gxx_personality_v0(...) 
; Named metadata !llvm.module.flags = !{ !1, !2 } !llvm.dbg.cu = !{ !10 } ; Metadata !1 = !{ i32 2, !"Dwarf Version", i32 4 } !2 = !{ i32 2, !"Debug Info Version", i32 3 } !3 = !DIFile(filename: "a.c", directory: "/home/victoryang") ; !4 = !DIFile(tag: DW_TAG_file_type, pair: !3) !4 = !{ i32 41, !3 } !5 = !{ } !6 = !{ } !7 = !{ !17 } !8 = !{ } !9 = !{ } !10 = distinct !DICompileUnit(file: !3, language: DW_LANG_C_plus_plus, producer: " NVC++ 21.5-0", enums: !5, retainedTypes: !6, globals: !8, emissionKind: FullDebug, imports: !9) !11 = !DIBasicType(tag: DW_TAG_base_type, name: "int", size: 32, align: 32, encoding: DW_ATE_signed) !12 = !DIBasicType(tag: DW_TAG_base_type, name: "float", size: 32, align: 32, encoding: DW_ATE_float) !13 = !DIDerivedType(tag: DW_TAG_pointer_type, size: 64, align: 64, baseType: !12) !14 = !{ null, !11, !12, !13, !13 } !15 = !DISubroutineType(types: !14) !16 = !{ } !17 = distinct !DISubprogram(file: !3, scope: !10, name: "saxpy", line: 2, type: !15, spFlags: 8, unit: !10, scopeLine: 2) !18 = !DILocation(line: 2, column: 1, scope: !17) !19 = !DILexicalBlock(file: !3, scope: !17, line: 2, column: 1) !20 = !DILocation(line: 2, column: 1, scope: !19) !21 = !DILexicalBlock(file: !3, scope: !19, line: 2, column: 1) !22 = !DILocation(line: 2, column: 1, scope: !21) !23 = !DILocation(line: 3, column: 1, scope: !21) !24 = !{ !24, !25 } !25 = !{ !"llvm.loop.vectorize.enable", i1 0 } !26 = !DILocation(line: 5, column: 1, scope: !19) !27 = !{ !"PGI C[++] TBAA" } !28 = !{ !"omnipotent char", !27, i64 0 } !29 = !{ !28, !28, i64 0 } !30 = !{ !"<T>*", !28, i64 0 } !31 = !{ !"int", !28, i64 0 } !32 = !{ !31, !31, i64 0 } !33 = !{ !"float", !28, i64 0 } !34 = !{ !33, !33, i64 0 }

and

.text .file "a.ll" .globl saxpy # -- Begin function saxpy .p2align 4, 0x90 .type saxpy,@function saxpy: # @saxpy .Lfunc_begin0: .file 1 "/home/victoryang/a.c" .loc 1 2 0 # a.c:2:0 .cfi_sections .debug_frame .cfi_startproc # %bb.0: # %L.entry .loc 1 3 1 prologue_end # a.c:3:1 testl %edi, %edi jle .LBB0_19 # %bb.1: # %L.B0014 movq %rdx, %rax subq %rsi, %rax je .LBB0_11 # %bb.2: # %L.B0014 cmpq $31, %rax ja .LBB0_11 # %bb.3: # %L.B0010.preheader movl %edi, %eax cmpl $31, %edi jbe .LBB0_4 # %bb.5: # %vector.memcheck leaq (%rsi,%rax,4), %rcx cmpq %rdx, %rcx jbe .LBB0_7 # %bb.6: # %vector.memcheck .loc 1 0 1 is_stmt 0 # a.c:0:1 leaq (%rdx,%rax,4), %rcx .loc 1 3 1 # a.c:3:1 cmpq %rsi, %rcx jbe .LBB0_7 .LBB0_4: .loc 1 0 1 # a.c:0:1 xorl %ecx, %ecx .p2align 4, 0x90 .LBB0_10: # %L.B0010 # =>This Inner Loop Header: Depth=1 .loc 1 3 1 # a.c:3:1 vmovss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero vfmadd213ss (%rdx,%rcx,4), %xmm0, %xmm1 # xmm1 = (xmm0 * xmm1) + mem vmovss %xmm1, (%rdx,%rcx,4) incq %rcx cmpq %rcx, %rax jne .LBB0_10 jmp .LBB0_19 .LBB0_11: # %L.B0008 .loc 1 0 1 # a.c:0:1 xorl %ecx, %ecx .loc 1 3 1 # a.c:3:1 cmpl $8, %edi jge .LBB0_13 # %bb.12: .loc 1 0 1 # a.c:0:1 movl %edi, %eax jmp .LBB0_17 .LBB0_13: # %L.B0016 .loc 1 3 1 # a.c:3:1 vbroadcastss %xmm0, %ymm1 xorl %ecx, %ecx movl %edi, %eax .p2align 4, 0x90 .LBB0_14: # %L.B0012 # =>This Inner Loop Header: Depth=1 vmovups (%rsi,%rcx), %ymm2 movl %eax, %r8d vfmadd213ps (%rdx,%rcx), %ymm1, %ymm2 # ymm2 = (ymm1 * ymm2) + mem leal -8(%r8), %eax addl $-7, %r8d vmovups %ymm2, (%rdx,%rcx) addq $32, %rcx cmpl $8, %r8d jg .LBB0_14 # %bb.15: # %L.B0017 testl %eax, %eax je .LBB0_19 # %bb.16: # %L.B0018 andl $-8, %edi movl %edi, %ecx .LBB0_17: # %L.B0011.preheader incl %eax .p2align 4, 0x90 .LBB0_18: # %L.B0011 # =>This Inner Loop Header: Depth=1 movslq %ecx, %rcx decl %eax vmovss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero vfmadd213ss (%rdx,%rcx,4), %xmm0, %xmm1 # xmm1 = (xmm0 * xmm1) + mem vmovss %xmm1, (%rdx,%rcx,4) incl %ecx cmpl $1, %eax jg .LBB0_18 .Ltmp0: .LBB0_19: # %L.B0005 .loc 1 5 1 is_stmt 1 # a.c:5:1 vzeroupper retq .LBB0_7: # %vector.ph .Ltmp1: .loc 1 3 1 # a.c:3:1 vbroadcastss %xmm0, %ymm1 movl %eax, %ecx xorl %edi, %edi andl $-32, %ecx .p2align 4, 0x90 .LBB0_8: # %vector.body # =>This Inner Loop Header: Depth=1 vmovups (%rsi,%rdi,4), %ymm2 vmovups 32(%rsi,%rdi,4), %ymm3 vmovups 64(%rsi,%rdi,4), %ymm4 vmovups 96(%rsi,%rdi,4), %ymm5 vfmadd213ps (%rdx,%rdi,4), %ymm1, %ymm2 # ymm2 = (ymm1 * ymm2) + mem vfmadd213ps 32(%rdx,%rdi,4), %ymm1, %ymm3 # ymm3 = (ymm1 * ymm3) + mem vfmadd213ps 64(%rdx,%rdi,4), %ymm1, %ymm4 # ymm4 = (ymm1 * ymm4) + mem vfmadd213ps 96(%rdx,%rdi,4), %ymm1, %ymm5 # ymm5 = (ymm1 * ymm5) + mem vmovups %ymm2, (%rdx,%rdi,4) vmovups %ymm3, 32(%rdx,%rdi,4) vmovups %ymm4, 64(%rdx,%rdi,4) vmovups %ymm5, 96(%rdx,%rdi,4) addq $32, %rdi cmpq %rdi, %rcx jne .LBB0_8 # %bb.9: # %middle.block cmpq %rax, %rcx jne .LBB0_10 jmp .LBB0_19 .Ltmp2: .Lfunc_end0: .size saxpy, .Lfunc_end0-saxpy .cfi_endproc # -- End function .section .debug_abbrev,"",@progbits .byte 1 # Abbreviation Code .byte 17 # DW_TAG_compile_unit .byte 1 # DW_CHILDREN_yes .byte 37 # DW_AT_producer .byte 14 # DW_FORM_strp .byte 19 # DW_AT_language .byte 5 # DW_FORM_data2 .byte 3 # DW_AT_name .byte 14 # DW_FORM_strp .byte 16 # DW_AT_stmt_list .byte 23 # DW_FORM_sec_offset .byte 27 # DW_AT_comp_dir .byte 14 # DW_FORM_strp .ascii "\264B" # DW_AT_GNU_pubnames .byte 25 # DW_FORM_flag_present .byte 17 # DW_AT_low_pc .byte 1 # DW_FORM_addr .byte 18 # DW_AT_high_pc 
.byte 6 # DW_FORM_data4 .byte 0 # EOM(1) .byte 0 # EOM(2) .byte 2 # Abbreviation Code .byte 46 # DW_TAG_subprogram .byte 0 # DW_CHILDREN_no .byte 17 # DW_AT_low_pc .byte 1 # DW_FORM_addr .byte 18 # DW_AT_high_pc .byte 6 # DW_FORM_data4 .byte 64 # DW_AT_frame_base .byte 24 # DW_FORM_exprloc .byte 3 # DW_AT_name .byte 14 # DW_FORM_strp .byte 58 # DW_AT_decl_file .byte 11 # DW_FORM_data1 .byte 59 # DW_AT_decl_line .byte 11 # DW_FORM_data1 .byte 63 # DW_AT_external .byte 25 # DW_FORM_flag_present .byte 0 # EOM(1) .byte 0 # EOM(2) .byte 0 # EOM(3) .section .debug_info,"",@progbits .Lcu_begin0: .long .Ldebug_info_end0-.Ldebug_info_start0 # Length of Unit .Ldebug_info_start0: .short 4 # DWARF version number .long .debug_abbrev # Offset Into Abbrev. Section .byte 8 # Address Size (in bytes) .byte 1 # Abbrev [1] 0xb:0x35 DW_TAG_compile_unit .long .Linfo_string0 # DW_AT_producer .short 4 # DW_AT_language .long .Linfo_string1 # DW_AT_name .long .Lline_table_start0 # DW_AT_stmt_list .long .Linfo_string2 # DW_AT_comp_dir # DW_AT_GNU_pubnames .quad .Lfunc_begin0 # DW_AT_low_pc .long .Lfunc_end0-.Lfunc_begin0 # DW_AT_high_pc .byte 2 # Abbrev [2] 0x2a:0x15 DW_TAG_subprogram .quad .Lfunc_begin0 # DW_AT_low_pc .long .Lfunc_end0-.Lfunc_begin0 # DW_AT_high_pc .byte 1 # DW_AT_frame_base .byte 87 .long .Linfo_string3 # DW_AT_name .byte 1 # DW_AT_decl_file .byte 2 # DW_AT_decl_line # DW_AT_external .byte 0 # End Of Children Mark .Ldebug_info_end0: .section .debug_str,"MS",@progbits,1 .Linfo_string0: .asciz " NVC++ 21.5-0" # string offset=0 .Linfo_string1: .asciz "a.c" # string offset=14 .Linfo_string2: .asciz "/home/victoryang" # string offset=18 .Linfo_string3: .asciz "saxpy" # string offset=35 .section .debug_pubnames,"",@progbits .long .LpubNames_end0-.LpubNames_begin0 # Length of Public Names Info .LpubNames_begin0: .short 2 # DWARF Version .long .Lcu_begin0 # Offset of Compilation Unit Info .long 64 # Compilation Unit Length .long 42 # DIE offset .asciz "saxpy" # External Name .long 0 # End Mark .LpubNames_end0: .section .debug_pubtypes,"",@progbits .long .LpubTypes_end0-.LpubTypes_begin0 # Length of Public Types Info .LpubTypes_begin0: .short 2 # DWARF Version .long .Lcu_begin0 # Offset of Compilation Unit Info .long 64 # Compilation Unit Length .long 0 # End Mark .LpubTypes_end0: .section ".note.GNU-stack","",@progbits .section .debug_line,"",@progbits .Lline_table_start0:

GCC arm SVE

For supercomputing we should really be introducing the Arm A64FX, but the author thinks it is better to give a primer on SVE first; maybe it will show up at the next ISC.

saxpy with neon

    // x0 = &x[0], x1 = &y[0], x2 = &a, x3 = &n
    saxpy_:
        ldrsw   x3, [x3]                 // x3 = *n
        mov     x4, #0                   // x4 = i = 0
        ldr     d0, [x2]                 // d0 = *a
        b       .latch
    .loop:
        ldr     d1, [x0, x4, lsl #3]     // d1 = x[i]
        ldr     d2, [x1, x4, lsl #3]     // d2 = y[i]
        fmadd   d2, d1, d0, d2           // d2 += x[i] * a
        str     d2, [x1, x4, lsl #3]     // y[i] = d2
        add     x4, x4, #1               // i += 1
    .latch:
        cmp     x4, x3                   // i < n
        b.lt    .loop                    // more to do?
        ret

saxpy with sve

    // x0 = &x[0], x1 = &y[0], x2 = &a, x3 = &n
    saxpy_:
        ldrsw   x3, [x3]                       // x3 = *n
        mov     x4, #0                         // x4 = i = 0
        whilelt p0.d, x4, x3                   // p0 = while(i++ < n)
        ld1rd   z0.d, p0/z, [x2]               // p0:z0 = bcast(*a)
    .loop:
        ld1d    z1.d, p0/z, [x0, x4, lsl #3]   // p0:z1 = x[i]
        ld1d    z2.d, p0/z, [x1, x4, lsl #3]   // p0:z2 = y[i]
        fmla    z2.d, p0/m, z1.d, z0.d         // p0 ? z2 += x[i] * a
        st1d    z2.d, p0, [x1, x4, lsl #3]     // p0 ? y[i] = z2
        incd    x4                             // i += (VL/64)
    .latch:
        whilelt p0.d, x4, x3                   // p0 = while(i++ < n)
        b.first .loop                          // more to do?
        ret

There is no overhead in instruction count for the SVE version compared to the equivalent scalar code, which allows a compiler to opportunistically vectorize loops with an unknown trip count.

  1. 16 scalable predicate registers (P0-P15): predication of normal memory and arithmetic operations is limited to P0-P7, but instructions that generate predicates (vector compares) and instructions that consume predicates (logical operations) can use all of P0-P15. Analysis of compiled and hand-optimized code showed this allocation scheme is effective and relieves the predicate-register pressure observed on other architectures.

  2. Mixed element-size control: each predicate register allows the minimum granularity to go down to byte level, so each bit corresponds to 8 bits of data width.

  3. Predicate conditions: SVE instructions that produce predicates reuse the NZCV condition flags, with the NZCV bits given a different interpretation.

  4. Implicit order: predicates are interpreted with an implicit lowest-to-highest element order, corresponding to an equivalent sequential order. We refer to the first and last predicate elements, and their associated conditions, with respect to this order.

whilelt p0 predicates the final, not-fully-aligned iteration, which might otherwise drain throughput. An out-of-order core may issue that last partial vector at low occupancy and waste the issue slot; alternatively, it could issue a narrower, (2^n)-aligned instruction instead.

For hazarded execution and speculation, you can easily do gather loads by trapping the fault on the z3 register and reloading.

OpenMP

Compilers support OpenMP out of the box. The OpenMP standard for specifying threading in programming languages like C and Fortran is implemented in the compiler itself, and as such is an integral part of the compiler in question. The OpenMP and POSIX thread libraries underneath can vary, but this is normally hidden from the user. OpenMP makes use of POSIX threads, so both an OpenMP library and a POSIX thread library are needed. The POSIX thread library is normally supplied with the distribution (typically /usr/lib64/libpthread.so).

| Compiler | Flag to select OpenMP | OpenMP version supported |
| --- | --- | --- |
| Intel compilers | -qopenmp | From 17.0 on: 4.5 |
| GNU compilers | -fopenmp | From GCC 6.1 on: 4.5 |
| PGI compilers | -mp | 4.5 |
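To see the flags from the table side by side, a minimal sketch (omp_demo.c is a placeholder OpenMP source file):

```bash
icc  -qopenmp -O2 omp_demo.c -o omp_demo_intel
gcc  -fopenmp -O2 omp_demo.c -o omp_demo_gnu
pgcc -mp      -O2 omp_demo.c -o omp_demo_pgi
OMP_NUM_THREADS=8 ./omp_demo_gnu
```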

You definitely need to watch Fanrui's PPT and understand the implementation of OpenMP in Clang.

Ref

  1. https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/lecture-slides/MIT6_172F18_lec9.pdf
  2. 程序员的自我修养
  3. 不同编译器的编译行为比较
  4. The ARM Scalable Vector Extension
  5. https://www.stonybrook.edu/commcms/ookami/support/_docs/1%20-%20Intro%20to%20A64FX.pdf
  6. https://llvm.org/devmtg/2017-10/slides/Ferguson-Enabling%20Parallel%20Computing%20in%20Chapel.pdf

Chisel

Is it worth learning a hardware description language to implement a CPU or a router? I think it is absolutely necessary: it lets everyone look from the bottom up at how your data and instructions move.

Fortran

This language is downright criminal, yet it was the author's former boss's favorite language, and F77 at that; may Fortran perish. Still, Fortran shows up surprisingly often in supercomputing competitions. When working on the weather application earlier we did not dare touch it; now that we have a program under 150k lines, the author is attempting to modify it.

The PGI compiler is a commercial compiler, later acquired by NV, which added a lot of CUDA DSL usable from Fortran; this undoubtedly extended Fortran's life considerably. nvfortran in NVHPC can print a lot of compiler-optimization logs.

Basic syntax: a module is roughly a struct in C; program (main) / function (a normal function) correspond to defining functions.

real function square(x) implicit none real, intent(in) :: x square = x * x return end function square program main integer :: n, i, errs, argcount real, dimension(:), allocatable :: a, b, r, e n = 1000000 call square(n) end program

A subroutine is roughly like a trait: it needs a generic function to implement it.

OpenACC

A simple addition example

module mpoint type point real :: x, y, z end type type(point) :: base(1000) end module subroutine vecaddgpu( r, n ) use mpoint type(point) :: r(:) integer :: n !$acc parallel loop present(base) copyout(r(:)) do i = 1, n r(i)%x = base(i)%x r(i)%y = sqrt( base(i)%y*base(i)%y + base(i)%z*base(i)%z ) r(i)%z = 0 enddo end subroutine

Remember to use the Makefile to see the optimization info from the compiler. Also note the directives: loop, present and copyout specify what runs on the GPU and which data is copied back.

nvfortran -Minfo -Mbounds

At runtime you can see the symbol, the source file, and which GPU is being used.

NVCOMPILER_ACC_NOTIFY=1 /root/yyw/cmake-openacc/cmake-build-debug-nvhpc/acc_test

Let's compare with the CUDA kernel version. Both are built with -O0 -g.

#include <iostream> #include <cassert> #include <cuda_runtime.h> __global__ void vecaddgpu(int **a, int **b, int **c, int i) { *c[i] = *a[i] + *b[i]; } int main(void) { int n = 1000000000; int *a = static_cast<int *>(malloc(n * sizeof(int))); int *b = static_cast<int *>(malloc(n * sizeof(int))); int *c = static_cast<int *>(malloc(n * sizeof(int))); // host copies of a, b, c int *e = static_cast<int *>(malloc(n * sizeof(int))); // result int **d_a, **d_b, **d_c; // device copies of a, b, c int size = sizeof(int); int err = 0; for (int i = 0; i < n; i++) { a[i] = i; b[i] = 1000 * i; e[i] = a[i] + b[i]; } // Allocate space for device copies of a, b, c cudaMalloc((void **) &d_a, size * n); cudaMalloc((void **) &d_b, size * n); cudaMalloc((void **) &d_c, size * n); // Copy inputs to device cudaMemcpy(d_a, reinterpret_cast<const void *>(a), size * n, cudaMemcpyHostToDevice); cudaMemcpy(d_b, reinterpret_cast<const void *>(b), size * n, cudaMemcpyHostToDevice); // Launch vecaddgpu() kernel on GPU with N blocks vecaddgpu<<<1, 1024>>>(d_a, d_b, d_c, n); // Copy result back to host cudaMemcpy(c, d_c, size * n, cudaMemcpyDeviceToHost); // Cleanup for (int i = 0; i < n; i++) { if (c[i] != e[i]) err++; } free(a); free(b); free(c); cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); return 0; }

Efficiency comparison

The pure CUDA kernel is 1.5x faster.

Compiler options across different compilers

Go

This language is very easy to pick up, and because the big companies use it so heavily, the open-source tooling is very usable. The author learned it during an internship; reimplementing module-style package management on a parallel filesystem with channel + context was very quick, just a few dozen lines of state machine handling parallel queries, and we also wrote some eBPF code in Go to get better real-time file-IO metrics.

Some small tools

Rust

Rust 是好活,自从我校2016年以 Rust 语言开设 CS100(程序语言设计)开始,上科大就成为了宣传Rust的堡垒,中国 Rust 之父张汉东先生及宣发 Rust 的各界人士选择推广 Rust 的最佳地点就会选择上科大,这雨与 riscv 类似。写小的bash工具( Zero Cost Abstraction 的 cffi )、写大型(10w+ line)的系统方向程序必备。由于语言特性有很多的静态检查,会指导大家对于内存管理,异步编程有更深刻的理解。

Learning resources

  1. rCore - a teaching operating system maintained by Tsinghua
  2. Libra - a blockchain database maintained by Facebook
  3. Feishu (Lark) - the Tokio code pool

Sync and Send in asynchronous code

https://kaisery.github.io/trpl-zh-cn/ch16-04-extensible-concurrency-sync-and-send.html

Send

```rust
use std::rc::Rc;
use std::sync::Mutex;
use std::thread;

fn main() {
    let num = Rc::new(Mutex::new(0));
    let mut handlers = vec![];
    for _ in 1..10 {
        let num_copy = num.clone();
        let handle = thread::spawn(move || {
            *num_copy.lock().unwrap() += 1;
        });
        handlers.push(handle);
    }
    for handler in handlers {
        handler.join();
    }
    println!("{}", num.lock().unwrap());
}
```

This code does not compile: when num_copy is moved into the thread, multiple threads could modify the reference count concurrently. That is why Rc does not implement the Send trait in Rust: it is not allowed to transfer ownership across threads.

```
error[E0277]: `Rc<Mutex<i32>>` cannot be sent between threads safely
  --> src/main.rs:10:22
   |
10 |           let handle = thread::spawn(move || {
   |  ______________________^^^^^^^^^^^^^_-
   | |                      |
   | |                      `Rc<Mutex<i32>>` cannot be sent between threads safely
11 | |             *num_copy.lock().unwrap() += 1;
12 | |         });
   | |_________- within this `[closure@src/main.rs:10:36: 12:10]`
   |
  ::: /home/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:624:8
   |
624 |     F: Send + 'static,
   |        ---- required by this bound in `spawn`
   |
   = help: within `[closure@src/main.rs:10:36: 12:10]`, the trait `Send` is not implemented for `Rc<Mutex<i32>>`
   = note: required because it appears within the type `[closure@src/main.rs:10:36: 12:10]`
```

Sync

```rust
use std::cell::Cell;
use std::thread;

fn main() {
    let num = Cell::new(0);
    let mut handlers = vec![];
    for i in 1..10 {
        let handle = thread::spawn(|| {
            num.set(i);
        });
        handlers.push(handle);
    }
    for h in handlers {
        h.join();
    }
    println!("{}", num.get());
}
```

This code does not compile either: when &Cell<T> is shared among multiple threads, several threads could modify the inner value concurrently, which is not safe. Therefore Cell<T> is marked !Sync.

```
error[E0277]: `Cell<i32>` cannot be shared between threads safely
 --> src/main.rs:8:22
  |
8 |         let handle = thread::spawn(|| {
  |                      ^^^^^^^^^^^^^ `Cell<i32>` cannot be shared between threads safely
  |
 ::: /home/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:624:8
  |
624 |     F: Send + 'static,
  |        ---- required by this bound in `spawn`
  |
  = help: the trait `Sync` is not implemented for `Cell<i32>`
  = note: required because of the requirements on the impl of `Send` for `&Cell<i32>`
  = note: required because it appears within the type `[closure@src/main.rs:8:36: 10:10]`
```

Libs

This section collects installation and usage notes for commonly used libraries.

SVML

NUMA on Linux is a pain, but it behaves fine on EPYC.

```
Every 2.0s: numastat                  epyc.node1: Mon Aug 30 07:17:40 2021

                         node0            node1
numa_hit           11605557098      17090418391
numa_miss                    0                0
numa_foreign                 0                0
interleave_hit           83929            83526
local_node         11605248266      17089868634
other_node              308832           549757
```

Boost

website

Spack

```bash
spack info boost
spack install boost
```

Source

```bash
./bootstrap.sh --help
# Select your configuration options and invoke ./bootstrap.sh again without
# the --help option. Unless you have write permission in your system's
# /usr/local/ directory, you'll probably want to at least use
./bootstrap.sh --prefix=path/to/installation/prefix
# to install somewhere else. Also, consider using the --show-libraries and
# --with-libraries=library-name-list options to limit the long wait you'll
# experience if you build everything. Finally,
./b2 install
# will leave Boost binaries in the lib/ subdirectory of your installation
# prefix. You will also find a copy of the Boost headers in the include/
# subdirectory of the installation prefix, so you can henceforth use that
# directory as an #include path in place of the Boost root directory.
# and add to PATH and LD and INCLUDE
```

Version-related issues

This is Version 3 of the Filesystem library. Version 2 is no longer supported; 1.49.0 was the last release of Boost to supply Version 2.
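If you are not sure which Boost release your toolchain actually picks up (useful when chasing version issues such as the Filesystem v2/v3 split above), a quick check like the following minimal sketch can help; the build line is only an example and assumes the Boost headers are already on the include path:

```cpp
// boost_version.cpp -- print the Boost release the compiler actually sees.
// Example build: g++ boost_version.cpp -o boost_version   (add -I<prefix>/include if needed)
#include <boost/version.hpp>
#include <iostream>

int main() {
    // BOOST_VERSION is encoded as XXYYZZ: XX = major, YY = minor, ZZ = patch.
    std::cout << "Boost " << BOOST_VERSION / 100000 << "."
              << BOOST_VERSION / 100 % 1000 << "."
              << BOOST_VERSION % 100
              << " (BOOST_LIB_VERSION " << BOOST_LIB_VERSION << ")" << std::endl;
    return 0;
}
```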

ArmForge

Arm Forge is Arm's tool suite for high-performance programs. Its biggest strength is that it works equally well for CPUs and GPUs, and it bundles Arm DDT and Arm MAP. Arm DDT is an industry-leading parallel debugger supporting MPI, CUDA and OpenMP; Arm MAP is a low-overhead, line-level profiler for MPI, OpenMP and vectorized programs.

uProf

A perf-style tool from AMD. It adds some metrics specific to x86 extensions, but the UI is rather ugly.

x86 subset

You can refer to the CONFIG_X86_AMD_PSTATE tuning notes in ZenStates-Linux. The relevant perf events can be found in the Linux perf tools. Also check /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver to see whether the amd-pstate driver is active. The feature has been available since Linux 5.17 (Ubuntu 22.04).
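As a minimal sketch (written in C++ rather than a shell one-liner, and assuming the usual sysfs layout), the check can be scripted like this:

```cpp
// scaling_driver.cpp -- print the active cpufreq scaling driver for every CPU.
// `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver` from a shell does the same.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    namespace fs = std::filesystem;
    for (const auto &entry : fs::directory_iterator("/sys/devices/system/cpu")) {
        const fs::path driver_file = entry.path() / "cpufreq" / "scaling_driver";
        if (!fs::exists(driver_file))
            continue;                      // skip entries that are not cpuN
        std::ifstream in(driver_file);
        std::string driver;
        std::getline(in, driver);          // e.g. "amd-pstate" or "acpi-cpufreq"
        std::cout << entry.path().filename().string() << ": " << driver << "\n";
    }
    return 0;
}
```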

Reference

  1. https://faculty.sites.uci.edu/zhouli/files/2022/01/oakland22.pdf
  2. https://indico.cern.ch/event/730908/contributions/3153163/attachments/1730954/2810149/epyc.pdf
  3. https://www.nextplatform.com/2019/08/15/a-deep-dive-into-amds-rome-epyc-architecture/
  4. https://github.com/FPSG-UIUC/lotr

Vtune

They say computer-architecture people understand profiling best. A good profiling tool must provide a good abstraction of the CPU's real-time performance; the simplest primitive is rdtsc, and Arm has a similar counter.
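As a minimal sketch of that primitive (x86 only, using the __rdtsc intrinsic; the loop and file name are just for illustration, and raw TSC counts are only meaningful relative to each other):

```cpp
// rdtsc_demo.cpp -- time a loop in time-stamp-counter cycles.
#include <x86intrin.h>
#include <cstdint>
#include <iostream>

int main() {
    volatile double sum = 0.0;             // volatile so the loop is not optimized away
    uint64_t start = __rdtsc();            // read the TSC before the work
    for (int i = 0; i < 1000000; ++i)
        sum += i * 0.5;
    uint64_t end = __rdtsc();              // and after
    std::cout << "cycles: " << (end - start) << ", sum = " << sum << std::endl;
    return 0;
}
```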

ABI support

A profiler needs implementations of the various metrics of Intel processors. While installing VTune, a shared library for the current system's PMU is compiled, i.e. ABI support for the Intel PMU; for the EPYC machines our team uses there is a modified version of the PMU tools that works on EPYC. The architecture-security community currently studies the PMU very intensively, because it leaks part of the CPU's real-time state, from which useful information can be extracted.

The perf events x86 has to support are fairly limited; on Linux, officially supported sampling mechanisms such as kprobes and uprobes are very useful, and they have already been adopted by eBPF.

When optimizing for Broadwell and newer architectures, the Intel compiler mainly does three things that have a large impact on performance:

  1. Aggressive cross-basic-block optimization + vectorization + loop unrolling.
  2. Aggressive reordering of loads and stores within the limits of TSO, together with aggressive data coalescing, store-buffer bypass and non-temporal (movnt) stores (see the sketch after this list). This is also the main source of bugs in the icc backend, and one reason the big companies do not use it much; outside HPC, people generally follow gcc's behavior.
  3. Its own TBB thread pool (very fast), its own malloc_align, and its own related libraries.
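To make the non-temporal (movnt) store point concrete, here is a minimal, hedged sketch of a streaming-store loop written by hand with AVX intrinsics; it only illustrates the instruction class the compiler may emit for large streaming writes, not Intel's actual code generation:

```cpp
// nt_store.cpp -- fill a large buffer with non-temporal (movnt) stores.
// Assumes a 32-byte-aligned buffer whose length is a multiple of 8 floats;
// a real implementation needs a scalar prologue/epilogue. Build with -mavx.
#include <immintrin.h>
#include <cstdlib>

void fill_streaming(float *dst, std::size_t n, float value) {
    const __m256 v = _mm256_set1_ps(value);
    for (std::size_t i = 0; i < n; i += 8)
        _mm256_stream_ps(dst + i, v);   // vmovntps: bypasses the cache hierarchy
    _mm_sfence();                       // make the streaming stores globally visible
}

int main() {
    const std::size_t n = 1 << 24;      // 16M floats (64 MiB)
    float *buf = static_cast<float *>(std::aligned_alloc(32, n * sizeof(float)));
    fill_streaming(buf, n, 1.0f);
    std::free(buf);
    return 0;
}
```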

For how to better target Intel CPUs, see the LAMMPS Intel package, which applies a visitor pattern to the Intel processor's register resources.

As for user-space file systems, VTune provides real-time measurement of persistent-memory (PM) bandwidth, a metric that otherwise seems very hard to obtain.

How to use it on the cluster

```bash
spack load intel-parallel-studio   # choose the right version
amplxe-cl
```

A quick Spack tutorial

```
❯ spack info gcc
AutotoolsPackage:   gcc

Description:
    The GNU Compiler Collection includes front ends for C, C++, Objective-C,
    Fortran, Ada, and Go, as well as libraries for these languages.

Homepage: https://gcc.gnu.org

Maintainers: @michaelkuhn @alalazo

Externally Detectable:
    True (version, variants)

Tags:
    None

Preferred version:
    11.2.0    https://ftpmirror.gnu.org/gcc/gcc-11.2.0/gcc-11.2.0.tar.xz

Safe versions:
    master    [git] git://gcc.gnu.org/git/gcc.git on branch master
    11.2.0    https://ftpmirror.gnu.org/gcc/gcc-11.2.0/gcc-11.2.0.tar.xz
    11.1.0    https://ftpmirror.gnu.org/gcc/gcc-11.1.0/gcc-11.1.0.tar.xz
    10.3.0    https://ftpmirror.gnu.org/gcc/gcc-10.3.0/gcc-10.3.0.tar.xz
    10.2.0    https://ftpmirror.gnu.org/gcc/gcc-10.2.0/gcc-10.2.0.tar.xz
    10.1.0    https://ftpmirror.gnu.org/gcc/gcc-10.1.0/gcc-10.1.0.tar.xz
    9.4.0     https://ftpmirror.gnu.org/gcc/gcc-9.4.0/gcc-9.4.0.tar.xz
    9.3.0     https://ftpmirror.gnu.org/gcc/gcc-9.3.0/gcc-9.3.0.tar.xz
    9.2.0     https://ftpmirror.gnu.org/gcc/gcc-9.2.0/gcc-9.2.0.tar.xz
    9.1.0     https://ftpmirror.gnu.org/gcc/gcc-9.1.0/gcc-9.1.0.tar.xz
    8.5.0     https://ftpmirror.gnu.org/gcc/gcc-8.5.0/gcc-8.5.0.tar.xz
    8.4.0     https://ftpmirror.gnu.org/gcc/gcc-8.4.0/gcc-8.4.0.tar.xz
    8.3.0     https://ftpmirror.gnu.org/gcc/gcc-8.3.0/gcc-8.3.0.tar.xz
    8.2.0     https://ftpmirror.gnu.org/gcc/gcc-8.2.0/gcc-8.2.0.tar.xz
    8.1.0     https://ftpmirror.gnu.org/gcc/gcc-8.1.0/gcc-8.1.0.tar.xz
    7.5.0     https://ftpmirror.gnu.org/gcc/gcc-7.5.0/gcc-7.5.0.tar.xz
    7.4.0     https://ftpmirror.gnu.org/gcc/gcc-7.4.0/gcc-7.4.0.tar.xz
    7.3.0     https://ftpmirror.gnu.org/gcc/gcc-7.3.0/gcc-7.3.0.tar.xz
    7.2.0     https://ftpmirror.gnu.org/gcc/gcc-7.2.0/gcc-7.2.0.tar.xz
    7.1.0     https://ftpmirror.gnu.org/gcc/gcc-7.1.0/gcc-7.1.0.tar.bz2
    6.5.0     https://ftpmirror.gnu.org/gcc/gcc-6.5.0/gcc-6.5.0.tar.bz2
    6.4.0     https://ftpmirror.gnu.org/gcc/gcc-6.4.0/gcc-6.4.0.tar.bz2
    6.3.0     https://ftpmirror.gnu.org/gcc/gcc-6.3.0/gcc-6.3.0.tar.bz2
    6.2.0     https://ftpmirror.gnu.org/gcc/gcc-6.2.0/gcc-6.2.0.tar.bz2
    6.1.0     https://ftpmirror.gnu.org/gcc/gcc-6.1.0/gcc-6.1.0.tar.bz2
    5.5.0     https://ftpmirror.gnu.org/gcc/gcc-5.5.0/gcc-5.5.0.tar.bz2
    5.4.0     https://ftpmirror.gnu.org/gcc/gcc-5.4.0/gcc-5.4.0.tar.bz2
    5.3.0     https://ftpmirror.gnu.org/gcc/gcc-5.3.0/gcc-5.3.0.tar.bz2
    5.2.0     https://ftpmirror.gnu.org/gcc/gcc-5.2.0/gcc-5.2.0.tar.bz2
    5.1.0     https://ftpmirror.gnu.org/gcc/gcc-5.1.0/gcc-5.1.0.tar.bz2
    4.9.4     https://ftpmirror.gnu.org/gcc/gcc-4.9.4/gcc-4.9.4.tar.bz2
    4.9.3     https://ftpmirror.gnu.org/gcc/gcc-4.9.3/gcc-4.9.3.tar.bz2
    4.9.2     https://ftpmirror.gnu.org/gcc/gcc-4.9.2/gcc-4.9.2.tar.bz2
    4.9.1     https://ftpmirror.gnu.org/gcc/gcc-4.9.1/gcc-4.9.1.tar.bz2
    4.8.5     https://ftpmirror.gnu.org/gcc/gcc-4.8.5/gcc-4.8.5.tar.bz2
    4.8.4     https://ftpmirror.gnu.org/gcc/gcc-4.8.4/gcc-4.8.4.tar.bz2
    4.7.4     https://ftpmirror.gnu.org/gcc/gcc-4.7.4/gcc-4.7.4.tar.bz2
    4.6.4     https://ftpmirror.gnu.org/gcc/gcc-4.6.4/gcc-4.6.4.tar.bz2
    4.5.4     https://ftpmirror.gnu.org/gcc/gcc-4.5.4/gcc-4.5.4.tar.bz2

Variants:
    Name [Default]             Allowed values        Description
    =========================  ====================  ===================================================
    binutils [off]             on, off               Build via binutils
    bootstrap [on]             on, off               Enable 3-stage bootstrap
    graphite [off]             on, off               Enable Graphite loop optimizations (requires ISL)
    languages [c,c++,fortran]  ada, brig, c, c++,    Compilers and runtime libraries to build
                               fortran, go, java,
                               jit, lto, objc,
                               obj-c++
    nvptx [off]                on, off               Target nvptx offloading to NVIDIA GPUs
    piclibs [off]              on, off               Build PIC versions of libgfortran.a and libstdc++.a
    strip [off]                on, off               Strip executables to reduce installation size

Installation Phases:
    autoreconf    configure    build    install

Build Dependencies:
    binutils  cuda  diffutils  flex  gmp  gnat  iconv  isl  mpc  mpfr  zip  zlib  zstd

Link Dependencies:
    binutils  cuda  gmp  gnat  iconv  isl  mpc  mpfr  zlib  zstd

Run Dependencies:
    binutils

Virtual Packages:
    gcc@7: languages=go provides golang@:1.8
    gcc@6: languages=go provides golang@:1.6.1
    gcc@5: languages=go provides golang@:1.4
    gcc@4.9: languages=go provides golang@:1.2
    gcc@4.8.2: languages=go provides golang@:1.1.2
    gcc@4.8: languages=go provides golang@:1.1
    gcc@4.7.1: languages=go provides golang@:1
    gcc@4.6: languages=go provides golang
```

To see the dependencies involved, you can check what Spack would install right now:

```
❯ spack spec gcc
Input spec
--------------------------------
gcc

Concretized
--------------------------------
gcc@11.2.0%apple-clang@12.0.5~binutils+bootstrap~graphite~nvptx~piclibs~strip languages=c,c++,fortran patches=ecc5ac43951b34cbc5db15f585b4e704c42e2e487f9ed4c24fadef3f3857930b arch=darwin-bigsur-skylake
    ^diffutils@2.8.1%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^gmp@6.2.1%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^autoconf@2.71%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^automake@1.16.4%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^libtool@2.4.6%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^m4@1.4.6%apple-clang@12.0.5+sigsegv patches=c0a408fbffb7255fcc75e26bd8edab116fc81d216bfd18b473668b7739a4158e arch=darwin-bigsur-skylake
    ^libiconv@1.16%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^mpc@1.1.0%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^mpfr@4.1.0%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^autoconf-archive@2019.01.06%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^texinfo@4.8%apple-clang@12.0.5 arch=darwin-bigsur-skylake
    ^zlib@1.2.11%apple-clang@12.0.5+optimize+pic+shared arch=darwin-bigsur-skylake
    ^zstd@1.5.0%apple-clang@12.0.5~ipo~legacy~lz4~lzma~multithread+programs+shared+static~zlib build_type=RelWithDebInfo arch=darwin-bigsur-skylake
    ^cmake@3.21.1%apple-clang@12.0.5~doc+ncurses+openssl+ownlibs~qt build_type=Release arch=darwin-bigsur-skylake
```

If you want to use specific dependencies, or depend on packages already installed on the system, they end up in ~/.spack/packages.yaml; see the AMD page for how to use this.

```
❯ spack external find
❯ cat ~/.spack/packages.yaml
packages:
  autoconf:
    externals:
    - spec: autoconf@2.71
      prefix: /usr/local
  automake:
    externals:
    - spec: automake@1.16.4
      prefix: /usr/local
  bash:
    externals:
    - spec: bash@3.2.57
      prefix: /
  bazel:
    externals:
    - spec: bazel@4.1.0
      prefix: /usr/local
  bison:
    externals:
    - spec: bison@2.3
      prefix: /usr
  bzip2:
    externals:
    - spec: bzip2@1.0.6
      prefix: /usr
  cmake:
    externals:
    - spec: cmake@3.21.1
      prefix: /usr/local
  diffutils:
    externals:
    - spec: diffutils@2.8.1
      prefix: /usr
  ...
```

The install options you will need most often: -j N sets the number of jobs, --no-checksum skips checksum verification, and --no-restage continues the build after you have modified the staged sources, which usually live under /tmp/root/spack-stage/spack-stage-amdscalapack-3.0-qwvyrumhsizxiaujwdsppcovijr5k5ri/spack-src/. Some packages accept cflags, cxxflags and fcflags; some accept cuda_arch. When you meet a new piece of software, append whatever parameters it needs.

```
❯ spack install -j 8 --no-checksum llvm+mlir+flang+all_targets+python+shared_libs cflags="-O3" cxxflags="-O3"
[+] /usr/local (external cmake-3.21.1-cdhzbrts4k5ylrvlpspfl75zgeht4swi)
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/libiconv-1.16-ropgshv657ooz7kfzojv4s6srscgimnw
[+] /usr/local (external pkg-config-0.29.2-4nv7fo7lbjybt2u3xzb2vxzvgvaz5xmw)
[+] /usr/local (external xz-5.2.5-p37wr6fna4ysoh2xn2wnmmzttm3bi37o)
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/zlib-1.2.11-lci2s4zd6x77rmexa3uuarbl5cvneskw
[+] /usr (external perl-5.30.2-4zkfgqml35km4ly7xmxn7ooz44dxtgqp)
[+] /usr/local (external python-3.9.6-shbb7dthsqe4lu26jugddyi2k7pl3jbl)
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/pcre-8.44-g4df4jqpudoxhjsrubrqhv3uwxajofet
[+] /usr/local (external z3-4.8.12-hvhfxnxuachtpi524zf55znqn55vanod)
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/ncurses-6.2-xilcz3bhw4otebvysduddyldezxhxvy6
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/libxml2-2.9.10-mlrnjcbnjt3w7635xrietes7terwhko6
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/perl-data-dumper-2.173-cv4kwshixb7tmk6p7icxrqpicppkx5gr
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/py-setuptools-50.3.2-hwyhyijgi3yjokddm67tb6aulefteudx
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/swig-4.0.2-vajpijk4isacr52dzgk2gqbvyunadwkc
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/libedit-3.1-20210216-6h4xokftdnxe2h3o7tie2cnbzbhfrr4h
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/hwloc-2.5.0-z2brjfcvnend5gorjmeqqgirccqerdwd
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/py-six-1.15.0-c63zkkdjpvegqai2f4jjg4mutsuchoov
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/llvm-12.0.1-n6c5z7sqfo7olnaqswu7jqhcdkyyk6nh
```

Take hdf5 built with the nvhpc compiler as an example. The problem I ran into was that, for the given MPI, the nvc and nvfortran behind the wrappers could not be found. A manual build looks for cc on PATH, or directly for CC, FC and CXX, so at that point you need to define an FC yourself.

```python
# Sketch (pseudo-code) of the fix inside the package's package.py:
if spec.compiler.name == "nvhpc":      # i.e. the Fortran compiler is nvfortran
    env.set("FC", "/path/to/mpifort")  # point FC at the MPI Fortran wrapper
    args.append("-DCMAKE_Fortran_COMPILER=/path/to/mpifort")
```

The Spack installation on the supercomputer has two upstreams; if you think a change is important you can open a PR against the original repo directly, and we usually keep a backup on the school's internal GitLab.

Spack and Modules

When the system has Environment Modules installed, Spack automatically adds its module file directory to MODULEPATH, so module load works right away.

Notes on Spack errors

```
$ spack load boost@1.70
==> Error: No compilers for operating system debian10 satisfy spec gcc@10.2.0
```

When this error appears, check whether .spack/linux/compilers.yaml contains all the compilers Spack should know about (spack compiler find can re-detect them).

Package

The original environment usually carries all sorts of environment variables that can interfere when packaging for Spack. This SOP therefore proposes using docker for environment isolation, so that Spack packaging can be run in different environments without affecting the original one.

How to run

```bash
$ cd <Dockerfile-Folder>
$ docker build -t <image-name> .
$ docker run -it -d -v <spack-folder-your-machine>:<spack-folder-your-machine> --name <docker-name> <image-name>
$ docker exec -it <docker-name> bash
```

Dockerfile ( Long-Term Support )

```dockerfile
FROM ubuntu:20.04
RUN apt-get update
RUN apt-get install -y ca-certificates
RUN sed -i "s/archive.ubuntu.com/mirrors.shanghaitech.edu.cn/g" /etc/apt/sources.list
RUN apt-get update
RUN apt-get install -y python python3 gcc build-essential wget nano vim gfortran curl less libnl-nf-3-200
```

Debug a package

```bash
spack cd <spec>
spack build-env <spec> <shell/command>
```

Architecture

This section holds material about computer architecture.

Memory Model

Memory Coherence

Memory coherence: a memory system is coherent if any read of a data item returns the most recently written value of that data item.

Coherent memory system:

  1. For a memory address written by a processor P, subsequent reads by P should return the written value.
  2. For a memory address written by a processor P1, after enough time another processor P2 can read the value written by P1.
  3. Writes to a single memory address are serialized, so if there are two writes to that address by any processors, no processor can observe the two results in different orders.

The coherence model does not define when a value written by P1 becomes visible to P2; that is the job of the memory consistency model.

Memory Consistency

Memory consistency: A memory consistency model for a shared address space specifies constraints on the order in which memory operations must appear to be performed (i.e. to become visible to the processors) with respect to one another.(when a written value will be returned/seen by a read).

The memory consistency model defines the ordering of pairs of operations to different addresses.

Sequential consistency model

  1. In each processor, the read operation should always get the value of the last write operation in program order.
```
# Processor 1
Flag1 = 1
if (Flag2 == 0)
    do sth

# Processor 2
Flag2 = 1
if (Flag1 == 0)
    do sth
```

For P1, SC guarantees that if the value of Flag2 is 0, the write to Flag1 happens before P2's write and read. So at most one processor is in the do sth section (it is also possible that neither processor enters the critical section).

  2. There is only one order visible to all processors. For two write operations W1 and W2 (possibly performed by different processors), every processor must observe the same sequence.
```
# Processor 1
A = 1

# Processor 2
if (A == 1)
    B = 1

# Processor 3
if (B == 1)
    get(A)
```

If P3 sees B == 1, then the value of A it reads must be 1, because every processor observes the same write order A = 1 -> B = 1 (P2 only writes B after it has seen A = 1).

Sequential consistency can still produce non-deterministic results, because the interleaving of operations across processors can differ between runs of the program. All memory operations must happen in program order.

Relaxed memory consistency models

Suppose A->B means for one processor, the operation A is done before operation B.

If the W->R ordering can be violated (a later read may complete before an earlier write to a different address), the model is Total Store Ordering (TSO); it is used by the x86-64 architecture.

If the W->W ordering can also be violated, the model is Partial Store Ordering (PSO).

...

More memory models are described at

https://en.wikipedia.org/wiki/Consistency_model

Synchronizes-with and happens-before

The synchronizes-with relationship exists only between suitably tagged operations (the default, memory_order_seq_cst, is a suitable tag) on atomic types (data structures such as a mutex contain these atomic types). If A writes x and B reads the value that A wrote, there is a synchronizes-with relationship between A and B.

The happens-before relationship specifies which operations see the effects of which other operations. For a single thread, the happens-before relationship can be easily determined by the program order. For multi-threading, if operation A on one thread inter-thread happens-before operation B on another thread, then A happens-before B. The inter-thread happens-before relies on the synchronizes-with relationship. If operation A in one thread synchronizes-with operation B in another thread, then A inter-thread happens-before B. This relationship is transitive.

These rules mean that if you make changes in one thread, you need only one synchronizes-with relationship for the data to be visible to subsequent operations on other threads.

C++ Memory Order

C++ has 6 memory ordering options on atomic types.

memory_order_relaxed, memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel, memory_order_seq_cst.

They represent three memory models:

  • Sequential consistency: memory_order_seq_cst
  • Relaxed: memory_order_relaxed
  • Acquire-release: memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel

For the x86-64 architecture, acquire-release ordering does not require additional instructions, and sequentially consistent ordering only adds a small cost to store operations. All of these orderings also constrain the compiler's instruction reordering, however, so every option except memory_order_relaxed has a potential cost.

In non-sequentially consistent memory orderings, threads don’t have to agree on the order of events on atomic variables. In the absence of other ordering constraints, the only requirement is that all threads agree on the modification order of each individual variable.

std::memory_order_seq_cst

If all operations on instances of atomic types are sequentially consistent, the behavior of a multithreaded program is as if all these operations were performed in some particular sequence by a single thread. This is by far the easiest memory ordering to understand, which is why it’s the default: all threads must see the same order of operations. ... It also means that operations can’t be reordered; if your code has one operation before another in one thread, that ordering must be seen by all other threads.

A sequentially consistent store synchronizes-with a sequentially consistent load of the same variable that reads the value stored.

-- C++ concurrency in action 2nd edition, P124

std::memory_order_relaxed

Operations on atomic types performed with relaxed ordering don’t participate in synchronizes-with relationships. Operations on the same variable within a single thread still obey happens-before relationships, but there’s almost no requirement on ordering relative to other threads.

-- C++ concurrency in action 2nd edition, P127

```cpp
// Processor 1
x.store(true, std::memory_order_relaxed);
y.store(true, std::memory_order_relaxed);

// Processor 2
while (!y.load(std::memory_order_relaxed))
    ;
if (x.load(std::memory_order_relaxed))
    ++z;
```

Here z can be 0, since there is no guarantee about the order in which the stores to x and y become visible to the other thread.

Relaxed ordering still gives well-defined behavior under multi-threading, in contrast to the volatile keyword or a plain variable. The semantics are atomic, and the fetch_add method is atomic too, which means you can use it as a counter. On x86, fetch_add with std::memory_order_relaxed is implemented as lock xadd, the same as with std::memory_order_seq_cst (but the former can be reordered by the compiler, and other architectures may implement it differently).
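A minimal sketch of such a relaxed counter (the lock xadd remark above is x86-specific; the thread and iteration counts here are arbitrary):

```cpp
// relaxed_counter.cpp -- a statistics counter using relaxed atomics.
// Only the final total matters, not its ordering relative to other data,
// so memory_order_relaxed is enough.
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<long> hits{0};

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([] {
            for (int i = 0; i < 1000000; ++i)
                hits.fetch_add(1, std::memory_order_relaxed);  // atomic, no ordering constraint
        });
    for (auto &w : workers)
        w.join();
    std::cout << hits.load(std::memory_order_relaxed) << std::endl;  // always 4000000
    return 0;
}
```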

A problem you may encounter when using relaxed ordering: OOTA (out-of-thin-air values).

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1217r2.html

std::memory_order_acquire && std::memory_order_release

std::memory_order_release is only used in store(), and std::memory_order_acquire only in load(). Such a store/load pair can form a synchronizes-with relationship.

  • Any writes or reads before store() should not be moved after it.
  • Any writes or reads after load() should not be moved before it.

Typical usage:

```cpp
// Processor 1
data = 100;                                     // A
ready.store(true, std::memory_order_release);   // B

// Processor 2
while (!ready.load(std::memory_order_acquire))  // C
    ;
assert(data == 100);  // never fails            // D
```

TODO: std::memory_order_consume

Double-checking locking

```cpp
if (!x_init.load(memory_order_acquire)) {
    lock_guard<mutex> _(x_init_mutex);
    if (!x_init.load(memory_order_relaxed)) {   // <- already holding the lock!
        // initialize x;
        x_init.store(true, memory_order_release);
    }
}
```

Initial load for compare-exchange

```cpp
unsigned long expected = x.load(memory_order_relaxed);  // <- the result does not affect correctness,
                                                        //    since the CAS will check again
while (!x.compare_exchange_weak(expected, f(expected))) {}
```


SVE

  • Scalable vector length increasing parallelism while allowing implementation choice.
  • Rich addressing modes enabling non-linear data accesses.
  • Per-lane predication allowing vectorization of loops containing complex control flow.
  • Predicate-driven loop control and management reduce vectorization overhead relative to scalar code, and a rich set of horizontal operations applies to more types of reducible loop-carried dependencies.
  • Vector partitioning and software-managed speculation enabling vectorization of loops with data-dependent exits.
  • Scalarized intra-vector sub-loops permitting vectorization of loops with more complex loop-carried dependencies.

Predicates are used to mask the lanes of the scalable registers, as sketched below.

This state provides thirty-two new scalable vector registers (Z0–Z31). Their width is implementation dependent within the aforementioned range. The new registers extend the thirty-two 128-bit wide Advanced SIMD registers (V0–V31) to provide scalable containers for 64-, 32-, 16-, and 8-bit data elements.
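A minimal sketch of what per-lane predication and predicate-driven loop control look like with the ACLE SVE intrinsics (the function and array names are just for illustration; this assumes an SVE-enabled toolchain, e.g. compiling with -march=armv8-a+sve):

```cpp
// sve_vadd.cpp -- vector add with per-lane predication: no scalar tail loop is
// needed because svwhilelt masks off the lanes past n in the last iteration.
#include <arm_sve.h>
#include <cstdint>

void vadd(float *c, const float *a, const float *b, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {          // svcntw(): number of 32-bit lanes
        svbool_t pg = svwhilelt_b32_s64(i, n);           // predicate: lanes with i + k < n
        svfloat32_t va = svld1_f32(pg, a + i);           // predicated loads
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, c + i, svadd_f32_x(pg, va, vb));   // predicated add and store
    }
}
```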

Fujitsu A64FX implementation

References

  1. https://github.com/fujitsu/A64FX/tree/master/doc
  2. https://www.youtube.com/watch?v=Qma7UuYifhM
  3. https://www.youtube.com/watch?v=3TYVqodc8w4
  4. https://www.youtube.com/watch?v=H3COrJQxBkQ