Introduction
First of all, welcome to GeekPie_HPC, a technology-neutral competition team that aims to stand shoulder to shoulder with Tsinghua. We care about winning, but we care even more about building practical skills (fluency with all kinds of tools) and communication skills (pitching ideas, picking up responsibility, and bringing up new members). If you need to spend most of your time grinding GPA, this club is not for you: almost every competition lands right around the midterm and final exams, so what we look for is a YOLO spirit.
On being a good technical communicator: whether in academia or research, value is created by brainstorming with others. Your work only matters if it creates value for someone else; being a lone technical expert by itself achieves little. Treasure the chance to work with excellent people, watch how others do things in the Slack weekly meetings, and contribute whatever you can.
This is the third GeekPie_HPC wiki, hosted on GitHub Pages. Part of the content lives in GeekPie's wiki.js instance, and a small part sits on the conference service of the GeekPie on-campus server; to guard against the day the ops team deletes everything and runs away, this copy is kept on GitHub.
This wiki is published as both static and dynamic pages:
- Static: generated with GitHub Actions + mdBook. https://hpc.geekpie.club/wiki/
- Dynamic: served by wiki.js, which supports live editing and requires the campus intranet. https://wiki.geekpie.club/
Adding files
Commit markdown files directly to the main branch; the wiki is updated about half a minute later.
If you create a new file, also update the SUMMARY.md file in the repository root.
File names follow the little-endian, capitalized convention.
Requesting access
- Ask murez for write access to the Git repository; he can be found in the GeekPie makerspace or on Slack.
- On the campus intranet, you can register on wiki.js directly with your ShanghaiTech email. If you run into trouble, ask for help in the Slack #general channel.
Getting in and out
Recruiting announcements: Slack accounts can be registered with a ShanghaiTech email. Information about summer internships and research can be found in the Slack workspace. From time to time we invite students and collaborators from other universities to give talks on campus.
The generated GitHub Pages site is powered by mdBook.
Algorithm
This section collects algorithms that commonly appear in the applications and benchmarks.
DGemm
A problem that appears widely in convolution, HPL, and HPCG.
https://zhuanlan.zhihu.com/p/464740681
SPMV
Numerical linear algebra is a basic building block of scientific computing. Solving linear systems, linear least-squares problems, eigenvalue and singular value problems is the most computationally intensive part of scientific computing. With the rise of numerical programming, it became very effective to solve such problems with well-designed subroutine libraries. When we write code that involves linear algebra, we usually decompose the computation into basic subroutines such as dot products or matrix-vector multiplications. Structured programming thus emerged: basic building blocks were specified and identified by unique mnemonic names, and, to improve the efficiency with which algorithms use these routines, the names and argument lists of a set of basic operations were standardized.
From 1973 to 1977, the first "level" of the Basic Linear Algebra Subprograms (BLAS) identified a set of kernel operations, mainly Fortran specifications and implementations of subroutines for scalar and vector operations [1]. With the advent of vector machines, hierarchical memory, and shared-memory parallel machines, the second "level" of BLAS for matrix-vector operations and the third "level" for matrix-matrix operations were specified between 1984 and 1988 [2,3]. The three "levels" of BLAS not only mark stages of its development but also measure the complexity of the algorithms [4]. To develop BLAS further, a BLAS Technical Forum started at a University of Tennessee symposium in 1995 to discuss the overall functionality of BLAS: sparse BLAS, dense BLAS for distributed memory, extended- and mixed-precision BLAS, interval BLAS, and extensions to the existing BLAS.
As the BLAS library has continued to develop, it has been ported to many hardware platforms and serves numerical computation across many industries. Among its routines, General Matrix-Matrix Multiplication (GEMM) is the basic operation of scientific computing (high-performance computing, machine learning) and of engineering and data applications, and people keep searching, platform by platform, for optimizations that make it run faster.
Problem description
For decades, General Matrix-Matrix Multiplication (GEMM) has been a standard benchmark for computing performance. GEMM is the most commonly used computational pattern in high-performance computing: whether in the HPC field (FFT, convolution, correlation, filtering, etc.) or in deep learning (convolution layers, fully connected layers, etc.), the core algorithms can be converted directly or indirectly into matrix-multiplication operations. The GEMM calculation formula is as follows:
\[ C \leftarrow \alpha \ op(A)\ op(B) + \beta \ C \]
Here \(op(X)\) denotes the matrix X itself, its transpose \(X^{T}\), or its conjugate transpose \(X^{H}\); \(\alpha\) and \(\beta\) are scalars; matrix A has m rows and k columns, matrix B has k rows and n columns, and matrix C has m rows and n columns.
There are two possible combinations of different numerical types and precisions in mixed precision:
- All scalar parameters and output parameters (scalar or array) are double precision, and at least one array is single precision. The possible combinations are as follows (S = single real, D = double real, C = single complex, Z = double complex):
\(\alpha\) | A | B | \(\beta\) | C |
---|---|---|---|---|
D | S | S | D | D |
D | S | D | D | D |
D | D | S | D | D |
Z | C | C | Z | Z |
Z | C | Z | Z | Z |
Z | Z | C | Z | Z |
- The precision of all floating-point parameters must be either all single or all double precision. All scalar parameters and output parameters (scalars or arrays) are complex (unless mathematical requirements force some scalars to be real, as in HERK). The possible combinations are as follows:
\(\alpha\) | A | B | \(\beta\) | C |
---|---|---|---|---|
C | S | S | C | C |
C | S | C | C | C |
C | C | S | C | C |
Z | D | D | Z | Z |
Z | D | Z | Z | Z |
Z | Z | D | Z | Z |
BLAS implementations are usually optimized for speed on a specific machine, so using them brings significant performance advantages. This competition focuses on the performance of single-precision real matrix multiplication (SGEMM) on a domestic advanced computing platform. Competitors can refer to the rocBLAS library [5, 6] for the relevant background; the API function for batched SGEMM in rocBLAS is rocblas_sgemm_strided_batched.
As data volumes keep increasing, so do the matrix sizes in numerical computations. Batched matrix multiplication has been proposed to accelerate the computation, because multi-batch matrix multiplication makes better use of the compute resources of hardware accelerators. The sub-matrices within a batch are separated by a fixed stride address offset and share the same size. The computation is as follows:
\[ C[i \cdot stride_{c}] \leftarrow \alpha \cdot op(A[i \cdot stride_{a}]) \cdot op(B[i \cdot stride_{b}]) + \beta \cdot C[i \cdot stride_{c}], \quad i \in [0, \text{batch\_count} - 1] \]
To further improve the efficiency of the matrix computation, batched and strided strategies are introduced on top of plain matrix multiplication. To take full advantage of the GPU-like heterogeneous accelerators of the cluster under this computation pattern, the function implementation needs further optimization.
Test Methods
The example function implementing strided batched matrix-matrix operations is shown below. To help competitors optimize it, the function and its parameters are explained as follows:
sgemm_strided_batched(sgemm_operation trans_a,
                      sgemm_operation trans_b,
                      int m,
                      int n,
                      int k,
                      const float* alpha,
                      const float* A,
                      int lda,
                      int stride_a,
                      const float* B,
                      int ldb,
                      int stride_b,
                      const float* beta,
                      float* C,
                      int ldc,
                      int stride_c,
                      int batch_count)
typedef enum sgemm_operation_ {
    operation_none = 0,
    operation_transpose = 1,
    operation_conjugate_transpose = 2
} sgemm_operation;
Input parameters:
Parameter trans_a: sgemm_operation type. Specifies the form of op(A) used in the matrix multiplication:
if trans_a = operation_none, then op(A) = A;
if trans_a = operation_transpose, then op(A) = A^T;
if trans_a = operation_conjugate_transpose, then op(A) = conjg(A^T).
Parameter trans_b: sgemm_operation type. Defined the same way as trans_a;
Parameter m: the number of rows of matrix A, m > 0;
Parameter n: the number of columns of matrix B, n > 0;
Parameter k: the number of columns of matrix A and the number of rows of matrix B, k > 0;
Parameter alpha: a single-precision real number, the scalar coefficient of matrix A;
Parameter A: a pointer to matrix A, stored on the GPU as single-precision real numbers;
Parameter lda: the leading dimension of matrix A as actually stored, i.e. if the matrix is stored row-major, lda ≥ k; if it is stored column-major, lda ≥ m;
Parameter stride_a: the stride from the start of one A matrix to the next;
Parameter B: a pointer to matrix B, stored on the GPU as single-precision real numbers;
Parameter ldb: the leading dimension of matrix B as actually stored, with the same meaning as lda; stride_b is the stride from the start of one B matrix to the next;
Parameter beta: a single-precision real number, the scalar coefficient of matrix C. If beta = 0, matrix C does not need to be initialized;
Parameter C: a pointer to matrix C, stored on the GPU as single-precision real numbers;
Parameter ldc: the leading dimension of matrix C as actually stored, with the same meaning as lda;
Parameter stride_c: the stride from the start of one C matrix to the next;
Parameter batch_count: the number of sgemm operations in the batch.
Output parameters
Parameter C: the result matrix C, overwriting the input.
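As a reference for the semantics of these parameters, here is a minimal, unoptimized CPU sketch (our own, assuming column-major storage, no transposition, and alpha/beta passed by value; the name sgemm_strided_batched_ref is hypothetical). It only illustrates how m, n, k, the leading dimensions, the strides and batch_count interact, and is not meant to be fast:

```c
#include <stddef.h>

/* Naive reference: C[i] = alpha * A[i] * B[i] + beta * C[i] for each batch i.
 * Assumes column-major storage and trans_a == trans_b == operation_none. */
void sgemm_strided_batched_ref(int m, int n, int k,
                               float alpha,
                               const float *A, int lda, int stride_a,
                               const float *B, int ldb, int stride_b,
                               float beta,
                               float *C, int ldc, int stride_c,
                               int batch_count)
{
    for (int i = 0; i < batch_count; ++i) {
        const float *a = A + (size_t)i * (size_t)stride_a;
        const float *b = B + (size_t)i * (size_t)stride_b;
        float       *c = C + (size_t)i * (size_t)stride_c;
        for (int col = 0; col < n; ++col) {
            for (int row = 0; row < m; ++row) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)   /* dot product along the k dimension */
                    acc += a[row + (size_t)p * lda] * b[p + (size_t)col * ldb];
                c[row + (size_t)col * ldc] = alpha * acc + beta * c[row + (size_t)col * ldc];
            }
        }
    }
}
```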
Requirements:
- Competitors optimize matrix-multiplication performance behind the given interface function. The API interface in the provided test code cannot be changed; the non-fixed parameters may be tuned according to the matrix sizes involved in the computation.
- In the test samples, the code uses dense matrices filled with pseudo-random numbers as test data to verify the performance of the algorithm. The main test cases are the following three sizes:
Case | M | N | K | Batch |
---|---|---|---|---|
1 | 64 | 64 | 32 | 30 |
2 | 128 | 128 | 64 | 20 |
3 | 128 | 512 | 256 | 10 |
- Competitors need to submit the code implementing the function, the build instructions for the function library, the generated dynamic link library, the test samples, the test procedure, and the encrypted sequence of the run results.
Note: contestants should improve the function implementation on top of the test script provided by the organizing committee. To avoid affecting the results, compile and run the code and upload the encrypted timing sequence generated for the corresponding matrices to the designated location on the web page.
SGEMM Kernel Optimization on VEGA
https://github.com/victoryang00/SGEMM_on_VEGA
Reference
- C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh. Basic Linear Algebra Subprograms for FORTRAN usage. ACM Trans. Math. Software, 5:308-323, 1979.
- J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling. A set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Software, 16:1-28, 1990.
- J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Software, 14:1-32, 1988.
- https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
- https://i-techx.github.io/iTechX/courses?course_code=CS121
FFT
An algorithm widely used in PDE solvers and numerical simulation.
Single thread
Parallel FFT
ScalaFFT
DFT
References
- https://i-techx.github.io/iTechX/courses?course_code=CS121
MHM2 Adjusting k-mers
A method of visualizing k-mers, the k-mer spectrum, shows the multiplicity of each k-mer in a sequence versus the number of k-mers with that multiplicity. It requires a DHT to store the sequence.
The default parameters are good enough for the dataset in the competition.
- Modifying those parameters will influence accuracy.
  - Adding an iteration slightly increases the number of long sequences; the result stays within an acceptable range, but the run is about 1/7 slower than the original.
  - Removing an iteration greatly increases speed (about 1/7 faster than the original), but the result differs dramatically from the reference.
  - Adjusting the values of k does not make MHM2 much faster or slower, and the result remains acceptable as long as k is not changed dramatically.
- From the paper, we learn that the preset k is good enough for most cases.
  - Too large a k is not fair to low-coverage genomes.
  - Too small a k may not be able to detect errors produced by the sequencer.
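As a side note, a k-mer spectrum is cheap to compute for a single in-memory sequence. The toy C sketch below (our own, serial, ACGT-only, k ≤ 31) encodes each k-mer in 2 bits per base, sorts the codes, and histograms the run lengths; MHM2 performs the same counting at scale through its distributed hash table.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

static int cmp_u64(const void *x, const void *y) {
    uint64_t a = *(const uint64_t *)x, b = *(const uint64_t *)y;
    return (a > b) - (a < b);
}

/* Toy k-mer spectrum: for each multiplicity, how many distinct k-mers have it. */
static void kmer_spectrum(const char *seq, int k) {
    size_t len = strlen(seq);
    if ((int)len < k) return;
    size_t n = len - k + 1;
    uint64_t *codes = malloc(n * sizeof *codes);
    for (size_t i = 0; i < n; ++i) {
        uint64_t code = 0;
        for (int j = 0; j < k; ++j) {
            int base = 0;                      /* A=0, C=1, G=2, T=3 */
            switch (seq[i + j]) {
                case 'C': base = 1; break;
                case 'G': base = 2; break;
                case 'T': base = 3; break;
            }
            code = (code << 2) | (uint64_t)base;
        }
        codes[i] = code;
    }
    qsort(codes, n, sizeof *codes, cmp_u64);
    enum { MAXMULT = 64 };
    size_t hist[MAXMULT + 1] = {0};
    size_t run = 1;
    for (size_t i = 1; i <= n; ++i) {
        if (i < n && codes[i] == codes[i - 1]) { ++run; continue; }
        hist[run > MAXMULT ? MAXMULT : run]++;   /* close the current run */
        run = 1;
    }
    for (size_t m = 1; m <= MAXMULT; ++m)
        if (hist[m]) printf("multiplicity %zu: %zu distinct k-mers\n", m, hist[m]);
    free(codes);
}

int main(void) {
    kmer_spectrum("ACGTACGTACGA", 4);   /* tiny example sequence */
    return 0;
}
```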
Applications
This section collects our experience with the applications from the various competitions GeekPie_HPC has taken part in.
CESM
Build & Running
OneKeyConf
./create_newcase -res 0.47x0.63_gx1v6 -compset B -case ../EXP2 -mach pleiades-ivy
mkdir nobackup
ln -s /home/cesm/data/inputdata_EXP1/ nobackup/inputdata
# EXP1: ./xmlchange -file env_run.xml -id DOCN_SOM_FILENAME -val pop_frc.gx1v6.091112.nc
./xmlchange -file env_build.xml -id CESMSCRATCHROOT -val `pwd`'/nobackup/$USER'
./xmlchange -file env_build.xml -id EXEROOT -val `pwd`'/nobackup/$CCSMUSER/$CASE/bld'
./xmlchange -file env_run.xml -id RUNDIR -val `pwd`'/nobackup/$CCSMUSER/$CASE/run'
./xmlchange -file env_run.xml -id DIN_LOC_ROOT -val `pwd`'/nobackup/inputdata'
./xmlchange -file env_run.xml -id DIN_LOC_ROOT_CLMFORC -val `pwd`'/nobackup/inputdata/atm/datm7'
./xmlchange -file env_run.xml -id DOUT_S_ROOT -val `pwd`'/nobackup/$CCSMUSER/archive/$CASE'
./xmlchange -file env_run.xml -id RUN_STARTDATE -val 2000-01-01
./xmlchange -file env_build.xml -id BUILD_THREADED -val TRUE
# edit Macro SLIBS -lnetcdff
# edit env_mach_specific
./cesm_setup
ybs.sh
./EXP2.clean_build all
./cesm_setup -clean
rm -rf $build_dir
./cesm_setup
./EXP2.build
PBS
#PBS -N dappur
#PBS -q pub_blad_2
#PBS -j oe
#PBS -l walltime=00:01:00
#PBS -l nodes=1:ppn=28
Performance Tuning
Trouble Shooting
High sys percentage in top (>20%)
This is apparently a communication problem. Switching to Intel MPI brought the sys percentage way down (<1%).
ERROR: remap transport: bad departure points
Warning: Departure points out of bounds in remap
my_task, i, j = 182 4 8
dpx, dpy = -5925130.21408796 -0.368922055964299
HTN(i,j), HTN(i+1,j) = 72848.1354852604 72848.1354852604
HTE(i,j), HTE(i,j+1) = 59395.4550164223 59395.4550164223
istep1, my_task, iblk = 1095001 182 1
Global block: 205
Global i and j: 35 47
(shr_sys_abort) ERROR: remap transport: bad departure points
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 182
This error may be due to multiple reasons.
One significant one is a bad grid division. We were once using one PE per processor core, so the total number of PEs was not a power of 2. Then we used 128 (and later 256) PEs and the error went away, until it showed up again after 6 months of simulated time...
Another relevant parameter is xndt_dyn, see link. This parameter had already been set to 2 after solving the previous problem (originally 1). We then tried increasing it again: the run passed the 6-month mark but crashed again after another 3 months. We kept increasing the value, but it crashed faster and faster. We stopped at about 20 months of simulated time and switched to the GNU compiler with Intel MPI.
However, this does not mean the Intel compiler is at fault. A direct comparison between the Intel and GNU compilers is unfair because the combination of the Intel compiler, xndt_dyn=1 and, most importantly, the correct PE count has not been tried. Maybe try xndt_dyn=1 from the beginning next time, with the Intel compiler.
OpenMP failed
Still not solved, but very promising for improving performance. (Fixed in WRF.)
QuEST analysis
program goal analysis
What the code actually does is simulate quantum computing.
different bit states - qubits
3 states: 1, 0, and a superposition of 0/1
Amplitudes are stored as qreal pairs forming a complex number a+bi; a state can be written as \( \left(\begin{smallmatrix}0.123124 & 0 \\ 0 & 0.876876\end{smallmatrix}\right) \), where the diagonal entries sum to 1. Also note that GPUs do not support long double arithmetic, so the quad-precision qreal (QuEST_PREC=4) is not supported in the GPU simulation.
/*
* Single precision, which uses 4 bytes per amplitude component
*/
# if QuEST_PREC==1
# define qreal float
// \cond HIDDEN_SYMBOLS
# define MPI_QuEST_REAL MPI_FLOAT
# define MPI_MAX_AMPS_IN_MSG (1LL<<29) // must be 2^int
# define REAL_STRING_FORMAT "%.8f"
# define REAL_QASM_FORMAT "%.8g"
# define REAL_EPS 1e-5
# define absReal(X) fabs(X) // not fabsf(X) - better to return doubles where possible
// \endcond
/*
* Double precision, which uses 8 bytes per amplitude component
*/
# elif QuEST_PREC==2
# define qreal double
// \cond HIDDEN_SYMBOLS
# define MPI_QuEST_REAL MPI_DOUBLE
# define MPI_MAX_AMPS_IN_MSG (1LL<<28) // must be 2^int
# define REAL_STRING_FORMAT "%.14f"
# define REAL_QASM_FORMAT "%.14g"
# define REAL_EPS 1e-13
# define absReal(X) fabs(X)
// \endcond
/*
* Quad precision, which uses 16 bytes per amplitude component.
* This is not compatible with most GPUs.
*/
# elif QuEST_PREC==4
# define qreal long double
// \cond HIDDEN_SYMBOLS
# define MPI_QuEST_REAL MPI_LONG_DOUBLE
# define MPI_MAX_AMPS_IN_MSG (1LL<<27) // must be 2^int
# define REAL_STRING_FORMAT "%.17Lf"
# define REAL_QASM_FORMAT "%.17Lg"
# define REAL_EPS 1e-14
# define absReal(X) fabsl(X)
// \endcond
# endif
many matrix computations
Each gate corresponds to one manipulation of the qubits.
Basic operations on a and b: https://arxiv.org/pdf/quant-ph/0207118.pdf
random variables = density matrix
hermitian: \(\rho^{\dagger}=\rho\)
positive semidefinite: eigenvalues \(\geq 0\)
trace: \(\Sigma(\text{diagonal elements})=1\)
dirac notation: ket \(v_{\phi}=|\phi\rangle=\left(\begin{array}{l}\phi_{0} \\ \phi_{1}\end{array}\right)\)
bra \( v_{\phi}^{\dagger}=\langle\phi|=\left(\begin{array}{ll}\phi_{0} & \phi_{1}\end{array}\right)\)
\(\langle\phi \mid \psi\rangle\) = inner product of \(\langle\phi|\) and \(|\psi\rangle\); note that \(\langle\phi \mid \phi\rangle=1\)
\(|\phi\rangle|\psi\rangle\) = tensor product of \(|\phi\rangle\) and \(|\psi\rangle\)
two special notations: \(u_{0}=|0\rangle=\left(\begin{array}{l}1 \\ 0\end{array}\right) \quad u_{1}=|1\rangle=\left(\begin{array}{l}0 \\ 1\end{array}\right)\)
the density matrix \(\rho=\left(\begin{array}{cc}q_{0} & 0 \\ 0 & q_{1}\end{array}\right)\) (with \(q_{0}+q_{1}=1\)) can be written as \(\rho=q_{0}|0\rangle\langle 0|+q_{1}|1\rangle\langle 1|\)
so \(\rho|0\rangle=\left(q_{0}|0\rangle\langle 0|+q_{1}|1\rangle\langle 1|\right)|0\rangle=q_{0}|0\rangle\)
tensor product (from classical bits to qubits): \( |a b\rangle=|a\rangle \otimes|b\rangle=v_{00}|00\rangle+v_{01}|01\rangle+v_{10}|10\rangle + v_{11}|11\rangle \rightarrow\left[\begin{array}{l}v_{00} \\ v_{01} \\ v_{10} \\ v_{11}\end{array}\right] \)
for example, in classical bits 5 = 101b, while in qubits \(|5\rangle_{3}=|101\rangle=|1\rangle|0\rangle|1\rangle=\left(\begin{array}{l}0 \\ 1\end{array}\right)\otimes\left(\begin{array}{l}1 \\ 0\end{array}\right)\otimes\left(\begin{array}{l}0 \\ 1\end{array}\right)=\left(\begin{array}{l}0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \\ 0\end{array}\right)\)
Hadamard gate operations
\begin{aligned}H(|0\rangle) &=\frac{1}{\sqrt{2}}|0\rangle+\frac{1}{\sqrt{2}}|1\rangle=:|+\rangle \end{aligned}
\begin{aligned} H(|1\rangle) &=\frac{1}{\sqrt{2}}|0\rangle-\frac{1}{\sqrt{2}}|1\rangle=:|-\rangle \end{aligned}
\begin{aligned} H\left(\frac{1}{\sqrt{2}}|0\rangle+\frac{1}{\sqrt{2}}|1\rangle\right) &=\frac{1}{2}(|0\rangle+|1\rangle)+\frac{1}{2}(|0\rangle-|1\rangle)=|0\rangle \end{aligned}
\begin{aligned} H\left(\frac{1}{\sqrt{2}}|0\rangle-\frac{1}{\sqrt{2}}|1\rangle\right) &=\frac{1}{2}(|0\rangle+|1\rangle)-\frac{1}{2}(|0\rangle-|1\rangle)=|1\rangle\end{aligned}
corresponding matrix operation in dirac notation: \(H_{1}=\frac{1}{\sqrt{2}}\left(\begin{array}{cc}1 & 1 \\ 1 & -1\end{array}\right)\)
some specialty:
- \(H=\frac{|0\rangle+|1\rangle}{\sqrt{2}}\langle 0|+\frac{|0\rangle-|1\rangle}{\sqrt{2}}\langle 1|\)
- Since \(HH^{\dagger}=I\) where I is the identity matrix, H is a unitary matrix (like all other quantum logical gates). Also, it is its own unitary inverse, \(H=H^{\dagger}\).
One application of the Hadamard gate to either a 0 or 1 qubit will produce a quantum state that, if observed, will be a 0 or 1 with equal probability (as seen in the first two operations). This is exactly like flipping a fair coin in the standard probabilistic model of computation. However, if the Hadamard gate is applied twice in succession (as is effectively being done in the last two operations), then the final state is always the same as the initial state.
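Before reading the kernel, it may help to see the index arithmetic it relies on in isolation: for a target qubit, the amplitudes whose indices differ only in that qubit's bit form the (indexUp, indexLo) pairs that the gate mixes. A minimal CPU-side sketch of that pairing (our own illustration, using the same sizeHalfBlock/sizeBlock arithmetic as the kernels below):

```c
#include <stdio.h>

/* Enumerate the (indexUp, indexLo) amplitude pairs acted on by a gate on
 * qubit `target` in an n-qubit state vector: indexLo = indexUp + 2^target,
 * i.e. the two indices differ only in bit `target`. */
void list_pairs(int numQubits, int target) {
    long long sizeHalfBlock = 1LL << target;       /* 2^target        */
    long long sizeBlock     = 2LL * sizeHalfBlock; /* 2^(target+1)    */
    long long numAmps       = 1LL << numQubits;
    long long numTasks      = numAmps >> 1;        /* one task per pair */
    for (long long task = 0; task < numTasks; ++task) {
        long long thisBlock = task / sizeHalfBlock;
        long long indexUp   = thisBlock * sizeBlock + task % sizeHalfBlock;
        long long indexLo   = indexUp + sizeHalfBlock;
        printf("pair: %lld <-> %lld\n", indexUp, indexLo);
    }
}

int main(void) {
    list_pairs(3, 1);  /* 3 qubits, gate on qubit 1: pairs (0,2), (1,3), (4,6), (5,7) */
    return 0;
}
```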
__global__ void statevec_hadamardKernel (Qureg qureg, const int targetQubit){
// ----- sizes
long long int sizeBlock, // size of blocks
sizeHalfBlock; // size of blocks halved
// ----- indices
long long int thisBlock, // current block
indexUp,indexLo; // current index and corresponding index in lower half block
// ----- temp variables
qreal stateRealUp,stateRealLo, // storage for previous state values
stateImagUp,stateImagLo; // (used in updates)
// ----- temp variables
long long int thisTask; // task based approach for expose loop with small granularity
const long long int numTasks=qureg.numAmpsPerChunk>>1;
sizeHalfBlock = 1LL << targetQubit; // size of blocks halved
sizeBlock = 2LL * sizeHalfBlock; // size of blocks
// ---------------------------------------------------------------- //
// rotate //
// ---------------------------------------------------------------- //
//! fix -- no necessary for GPU version
qreal *stateVecReal = qureg.deviceStateVec.real;
qreal *stateVecImag = qureg.deviceStateVec.imag;
qreal recRoot2 = 1.0/sqrt(2.0);
thisTask = blockIdx.x*blockDim.x + threadIdx.x;
if (thisTask>=numTasks) return;
thisBlock = thisTask / sizeHalfBlock;
indexUp = thisBlock*sizeBlock + thisTask%sizeHalfBlock;
indexLo = indexUp + sizeHalfBlock;
// store current state vector values in temp variables
stateRealUp = stateVecReal[indexUp];
stateImagUp = stateVecImag[indexUp];
stateRealLo = stateVecReal[indexLo];
stateImagLo = stateVecImag[indexLo];
stateVecReal[indexUp] = recRoot2*(stateRealUp + stateRealLo);
stateVecImag[indexUp] = recRoot2*(stateImagUp + stateImagLo);
stateVecReal[indexLo] = recRoot2*(stateRealUp - stateRealLo);
stateVecImag[indexLo] = recRoot2*(stateImagUp - stateImagLo);
}
void statevec_hadamard(Qureg qureg, const int targetQubit)
{
int threadsPerCUDABlock, CUDABlocks;
threadsPerCUDABlock = 128;
CUDABlocks = ceil((qreal)(qureg.numAmpsPerChunk>>1)/threadsPerCUDABlock);
statevec_hadamardKernel<<<CUDABlocks, threadsPerCUDABlock>>>(qureg, targetQubit);
}
Pauli-X/Y/Z gate
The Pauli-X gate acts on a single qubit. It is the quantum equivalent of the classical NOT gate, with matrix \( X=\left[\begin{array}{ll}0 & 1 \\ 1 & 0\end{array}\right]\).
void pauliX(Qureg qureg, const int targetQubit) {
validateTarget(qureg, targetQubit, __func__);
statevec_pauliX(qureg, targetQubit);
if (qureg.isDensityMatrix) {
statevec_pauliX(qureg, targetQubit+qureg.numQubitsRepresented);
}
qasm_recordGate(qureg, GATE_SIGMA_X, targetQubit);
}
the real computing part
void statevec_pauliXLocal(Qureg qureg, const int targetQubit)
{
long long int sizeBlock, sizeHalfBlock;
long long int thisBlock, // current block
indexUp,indexLo; // current index and corresponding index in lower half block
qreal stateRealUp,stateImagUp;
long long int thisTask;
const long long int numTasks=qureg.numAmpsPerChunk>>1;
// set dimensions
sizeHalfBlock = 1LL << targetQubit;
sizeBlock = 2LL * sizeHalfBlock;
// Can't use qureg.stateVec as a private OMP var
qreal *stateVecReal = qureg.stateVec.real;
qreal *stateVecImag = qureg.stateVec.imag;
# ifdef _OPENMP
# pragma omp parallel \
default (none) \
shared (sizeBlock,sizeHalfBlock, stateVecReal,stateVecImag) \
private (thisTask,thisBlock ,indexUp,indexLo, stateRealUp,stateImagUp)
# endif
{
# ifdef _OPENMP
# pragma omp for schedule (static)
# endif
for (thisTask=0; thisTask<numTasks; thisTask++) {
thisBlock = thisTask / sizeHalfBlock;
indexUp = thisBlock*sizeBlock + thisTask%sizeHalfBlock;
indexLo = indexUp + sizeHalfBlock;
stateRealUp = stateVecReal[indexUp];
stateImagUp = stateVecImag[indexUp];
stateVecReal[indexUp] = stateVecReal[indexLo];
stateVecImag[indexUp] = stateVecImag[indexLo];
stateVecReal[indexLo] = stateRealUp;
stateVecImag[indexLo] = stateImagUp;
}
}
}
void statevec_pauliXDistributed (Qureg qureg,
ComplexArray stateVecIn,
ComplexArray stateVecOut)
{
long long int thisTask;
const long long int numTasks=qureg.numAmpsPerChunk;
qreal *stateVecRealIn=stateVecIn.real, *stateVecImagIn=stateVecIn.imag;
qreal *stateVecRealOut=stateVecOut.real, *stateVecImagOut=stateVecOut.imag;
# ifdef _OPENMP
# pragma omp parallel \
default (none) \
shared (stateVecRealIn,stateVecImagIn,stateVecRealOut,stateVecImagOut) \
private (thisTask)
# endif
{
# ifdef _OPENMP
# pragma omp for schedule (static)
# endif
for (thisTask=0; thisTask<numTasks; thisTask++) {
stateVecRealOut[thisTask] = stateVecRealIn[thisTask];
stateVecImagOut[thisTask] = stateVecImagIn[thisTask];
}
}
}
__global__ void statevec_pauliXKernel(Qureg qureg, const int targetQubit){
// ----- sizes
long long int sizeBlock, // size of blocks
sizeHalfBlock; // size of blocks halved
// ----- indices
long long int thisBlock, // current block
indexUp,indexLo; // current index and corresponding index in lower half block
// ----- temp variables
qreal stateRealUp, // storage for previous state values
stateImagUp; // (used in updates)
// ----- temp variables
long long int thisTask; // task based approach for expose loop with small granularity
const long long int numTasks=qureg.numAmpsPerChunk>>1;
sizeHalfBlock = 1LL << targetQubit; // size of blocks halved
sizeBlock = 2LL * sizeHalfBlock; // size of blocks
// ---------------------------------------------------------------- //
// rotate //
// ---------------------------------------------------------------- //
//! fix -- no necessary for GPU version
qreal *stateVecReal = qureg.deviceStateVec.real;
qreal *stateVecImag = qureg.deviceStateVec.imag;
thisTask = blockIdx.x*blockDim.x + threadIdx.x;
if (thisTask>=numTasks) return;
thisBlock = thisTask / sizeHalfBlock;
indexUp = thisBlock*sizeBlock + thisTask%sizeHalfBlock;
indexLo = indexUp + sizeHalfBlock;
// store current state vector values in temp variables
stateRealUp = stateVecReal[indexUp];
stateImagUp = stateVecImag[indexUp];
stateVecReal[indexUp] = stateVecReal[indexLo];
stateVecImag[indexUp] = stateVecImag[indexLo];
stateVecReal[indexLo] = stateRealUp;
stateVecImag[indexLo] = stateImagUp;
}
void statevec_pauliX(Qureg qureg, const int targetQubit)
{
int threadsPerCUDABlock, CUDABlocks;
threadsPerCUDABlock = 128;
CUDABlocks = ceil((qreal)(qureg.numAmpsPerChunk>>1)/threadsPerCUDABlock);
statevec_pauliXKernel<<<CUDABlocks, threadsPerCUDABlock>>>(qureg, targetQubit);
}
source code analysis
tree
.
├── CMakeLists.txt
├── include
│ ├── QuEST_complex.h //determine to use native external cpp support or c complex support.
│ ├── QuEST.h //main func claim
│ └── QuEST_precision.h //define the precision
└── src
├── CMakeLists.txt
├── CPU
│ ├── CMakeLists.txt
│ ├── QuEST_cpu.c
│ ├── QuEST_cpu_distributed.c //distributed activator and implementation
│ ├── QuEST_cpu_internal.h //other cpu related headers here
│ └── QuEST_cpu_local.c //only cpu implementation
├── GPU
│ ├── CMakeLists.txt
│ └── QuEST_gpu.cu //gpu counterpart
├── mt19937ar.c //Mersenne Twister - pseudo-random number generation
├── mt19937ar.h
├── QuEST.c //main func definition
├── QuEST_common.c //func activator defined here
├── QuEST_debug.h //debug information here
├── QuEST_internal.h
├── QuEST_qasm.c //is a quantum record standard, defined qasm assertion here.
├── QuEST_qasm.h
├── QuEST_validation.c //assert number of qubit here
└── QuEST_validation.h
https://www.quantum-inspire.com/kbase/introduction-to-quantum-computing
testcase analysis
mytimer.hpp
#include <time.h>
#include <sys/time.h>
double get_wall_time(){
/* A time value that is accurate to the nearest
microsecond but also has a range of years. */
struct timeval time;
// __time_t tv_sec; /* Seconds. */
// __suseconds_t tv_usec; /* Microseconds. */
if (gettimeofday(&time,NULL)){
// Handle error
return 0;
}
return (double)time.tv_sec + (double)time.tv_usec * .000001;
}
double get_cpu_time(){
return (double)clock() / CLOCKS_PER_SEC; // read the processor clock and divide by CLOCKS_PER_SEC
}
random.c
- random manipulation
// total number of qubits: 30
// total number of qubit operations: 667
// estimated time: 3783.9266747315614 second.
#include "QuEST.h"
#include "mytimer.hpp"
#include "stdio.h"
int main(int narg, char *argv[])
{
QuESTEnv Env = createQuESTEnv();
double t1 = get_wall_time();//define starting time
FILE *fp = fopen("probs.dat", "w");//open file for result
if (fp == NULL) {
printf(" open probs.dat failed, Bye!");
return 0;
}
FILE *fvec = fopen("stateVector.dat", "w");
if (fp == NULL) {
printf(" open stateVector.dat failed, Bye!");
return 0;
}
Qureg q = createQureg(30, Env);//define qubits registers
float q_measure[30];// defined q's size
// possible execution.
tGate(q, 25);
controlledNot(q, 28, 21);
controlledRotateX(q, 17, 5, 0.3293660327520663);
tGate(q, 3);
rotateX(q, 10, 4.734238389048838);
rotateY(q, 8, 4.959946047271496);
rotateZ(q, 5, 1.0427019597472071);
pauliZ(q, 0);
...
printf("\n");
for (long long int i = 0; i < 30; ++i) {
q_measure[i] = calcProbOfOutcome(q, i, 1);
printf(" probability for q[%2lld]==1 : %lf \n", i, q_measure[i]);
fprintf(fp, "Probability for q[%2lld]==1 : %lf \n", i, q_measure[i]);
}
fprintf(fp, "\n");
printf("\n");
for (int i = 0; i < 10; ++i) {
Complex amp = getAmp(q, i);
printf("Amplitude of %dth state vector: %12.6f,%12.6f\n", i, amp.real,
amp.imag);
}
double t2 = get_wall_time();
printf("Complete the simulation takes time %12.6f seconds.", t2 - t1);
printf("\n");
destroyQureg(q, Env);
destroyQuESTEnv(Env);
return 0;
}
GHZ_QFT.c
- only controlled manipulation
/* GHZ quantum circuit */
hadamard(q, 0);
controlledNot(q, 0, 1);
controlledNot(q, 1, 2);
controlledNot(q, 2, 3);
controlledNot(q, 3, 4);
controlledNot(q, 4, 5);
controlledNot(q, 5, 6);
controlledNot(q, 6, 7);
controlledNot(q, 7, 8);
controlledNot(q, 8, 9);
controlledNot(q, 9, 10);
controlledNot(q, 10, 11);
controlledNot(q, 11, 12);
controlledNot(q, 12, 13);
controlledNot(q, 13, 14);
controlledNot(q, 14, 15);
controlledNot(q, 15, 16);
controlledNot(q, 16, 17);
controlledNot(q, 17, 18);
controlledNot(q, 18, 19);
controlledNot(q, 19, 20);
controlledNot(q, 20, 21);
controlledNot(q, 21, 22);
controlledNot(q, 22, 23);
controlledNot(q, 23, 24);
controlledNot(q, 24, 25);
controlledNot(q, 25, 26);
controlledNot(q, 26, 27);
controlledNot(q, 27, 28);
controlledNot(q, 28, 29);
/* end of GHZ circuit */
/* QFT starts */
hadamard(q, 0);
controlledRotateZ(q, 0, 1, 1.5708);
hadamard(q, 1);
controlledRotateZ(q, 0, 2, 0.785398);
controlledRotateZ(q, 1, 2, 1.5708);
hadamard(q, 2);
controlledRotateZ(q, 0, 3, 0.392699);
controlledRotateZ(q, 1, 3, 0.785398);
controlledRotateZ(q, 2, 3, 1.5708);
...
available test machine
- 2 nodes, 16 cores each, mpi:omp = 2:16

#!/bin/sh
module purge
spack load intel ## openmpi@3.1.5/3.1.2
export PRECISION=4 ## 1/2/4
CC=icc CXX=icpc cmake -DGPUACCELERATED=0 -DDISTRIBUTED=1 ..
make
export OMP_NUM_THREADS=16
export FI_PROVIDER=tcp
mpirun -machinefile mac -np 2 ./demo
profiling result
the most time-consuming part is statevec_compactUnitaryLocal
- 2 nodes, 16 cores each, mpi:omp = 1:32
- 1 node, 1 Tesla V100

script

#!/bin/sh
module purge
spack load gcc@6
spack load cuda@10.1 ## 10.2
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/cuda/lib64
export PRECISION=2 ## 1/2
CC=gcc CXX=g++ cmake -DGPUACCELERATED=1 -DGPU_COMPUTE_CAPABILITY=70 ..
make
./demo
profiling result
summary
Profiling both the CPU and the GPU runs shows that most of the time is spent in the actual compute kernels, so we believe the available compute power is fully utilized.
Speedup of the single-node MPI+OpenMP run: 319.799/220.807 = 1.45x
Speedup of the single-GPU run: 319.799/19.328 = 16.55x
power consumption:
- over CPU:
- over GPU: 111 W on average
Our future plans:
- deploy the GPU code on multiple GPUs using NCCL.
- address global-memory store and load efficiency.
misc
Loves from Github
- https://github.com/QuEST-Kit/QuEST/issues/220
Hi Jiachen,
There are no plans currently to combine distribution with GPU-acceleration. Note there are a few ways this can be done, and I suspect none really align with QuEST's design philosophy, nor are practical due to memory overheads. I've wanted to pen these thoughts for a while, so read on below if interested! :)
Firstly, QuEST uses its hardware to accelerate the simulation of a single quantum register at a time. While I think there are good uses of multi-GPU to speedup simultaneous simulation of multiple registers, this would be a totally new pattern to QuEST's simulation style. So let's consider using multi-GPU to accelerate a single register.
There are a few ways you can have "multiple GPUs":
multiple NVlinked GPUs
This is when you have multiple GPUs tightly connected with a high-bandwidth fabric (e.g. this). The bandwidth is enough that you sort of can imagine it as a single big GPU, and hence it would be worthwhile for accelerating single-register simulation. However, this only exists right now as NVLink and NVSwitch, compatible only with IBM's POWER architecture - you could argue this is still esoteric, and not worth a big refactor. Note it wouldn't actually be very hard to refactor QuEST for this platform - indeed QuEST works out-of-the-box with POWER8. But it's not something on our TODO list currently.
multiple local GPUs
This is when you have multiple GPUs on the same machine, but maybe on different sockets and hence with a much lower bandwidth between them. The most common case is two GPUs - is it worthwhile using two GPUs over one to speedup single register simulation? Often, no!
In big QC simulation, having to move memory around is often the big killer, and should be avoided where possible. Unfortunately, simulating unitaries on registers often requires moving memory. If all the memory stays in the GPU (very high "internal bandwidth"), this is ok, but copying memory to the other GPU (across the socket) will introduce a huge per-gate overhead!
Hence, using two GPUs to simulate the same register size can be slower than using just one, especially as the simulation size grows and saturates the sockets!
There's hardly a benefit from the extra VRAM too, because doubling the memory enables simulation of one additional qubit. This is not worth the slowdown, or the hardware!
Even with more than two GPUs, the connections are likely hierarchical and so even more prone to saturation.
distributed GPUs
This is when you have a GPU(s) on each distributed node of a cluster. In this circumstance, simulating a unitary gate which requires data exchange not only costs us a VRAM to RAM overhead (similar to before), but a networking overhead to talk to the other nodes! This can be somewhat improved by having a direct GPU to network-card connection (and MPI abstraction), but I believe that's pretty cutting-edge.
Let's say you have n nodes, each with a GPU and a multicore CPU, and you're resolved to a distributed simulation. When is it worthwhile to pay the extra memory overhead locally copying from RAM to VRAM (and use the GPU), over using just the CPUs? This is now the same trade-off to consider in the previous cases. So may or may not be worthwhile.
TL-DR: besides the somewhat esoteric case of having multiple tightly-connected GPUs, multi-GPU simulation introduces a new memory overhead that doesn't exist in single-GPU simulation. This overhead is almost always way longer than the time the GPU spends simulating the gate. As to whether the whole simulation is sped up by the use of multi-GPU is system and simulation specific.
- https://github.com/NVIDIA/nccl/pull/316 This is a PR for people to review and provide feedback on the p2p branch (issue #212).
Looking forward to applying the P2P function to increase the power of my project!
- THU published their modified version as an ICS best paper.
- NUDT modified the code to offload GPU memory to main DRAM.
ISC
Awards
- One overall champion, awarded to the team with the highest total score across the benchmark runs and the on-site presentation.
- One HPL champion, awarded to the team with the highest HPL score.
- One fan-favourite award, given to the team that receives the most votes from ISC13 attendees during the competition.
Tasks
Benchmarks such as HPL, four other applications, and one mystery application.
ISC 21
Rewind: https://victoryang00.cn/wordpress/2021/06/27/isc-21hui-gu/
AutoTuning is just a simple competitive-programming exercise.
This task was originally a small internal NVIDIA tool for OSU-style testing, handed to us as a problem. It asks for simple tuning based on the data-exchange capability between different ranks.
Task 1-3: Understand MPI_alltoallv calls
Write a program with an input flag selecting the pattern, run on the Niagara cluster using 4 nodes, each with 40 ppn (full), 160 ppn in total, with balanced and unbalanced patterns.
The program should run 1000 iterations of MPI_Alltoallv using the following characteristics.
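The exact traffic characteristics from the task sheet are not reproduced here; as a rough illustration only, the sketch below shows the kind of skeleton meant: a C program that picks a balanced or unbalanced send-count pattern from a command-line flag (names and the skew rule are our own assumptions) and times 1000 MPI_Alltoallv iterations.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Skeleton only: the balanced pattern sends the same count to every rank,
 * the unbalanced pattern skews counts towards low-numbered destinations.
 * Run e.g.:  mpirun -np 160 ./alltoallv_bench unbalanced                */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int unbalanced = (argc > 1 && strcmp(argv[1], "unbalanced") == 0);

    int *sendcounts = malloc(size * sizeof(int));
    int *recvcounts = malloc(size * sizeof(int));
    int *sdispls    = malloc(size * sizeof(int));
    int *rdispls    = malloc(size * sizeof(int));
    int base = 1024;
    for (int r = 0; r < size; ++r)
        sendcounts[r] = unbalanced ? base * (size - r) / size + 1 : base;
    /* every rank uses the same rule, so what we receive from rank s is what
     * s computed for destination `rank`; that value depends only on `rank` */
    for (int s = 0; s < size; ++s)
        recvcounts[s] = unbalanced ? base * (size - rank) / size + 1 : base;

    int stotal = 0, rtotal = 0;
    for (int r = 0; r < size; ++r) {
        sdispls[r] = stotal; stotal += sendcounts[r];
        rdispls[r] = rtotal; rtotal += recvcounts[r];
    }
    double *sendbuf = calloc(stotal, sizeof(double));
    double *recvbuf = calloc(rtotal, sizeof(double));

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int iter = 0; iter < 1000; ++iter)
        MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                      recvbuf, recvcounts, rdispls, MPI_DOUBLE, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("%s pattern: %.3f s for 1000 iterations\n",
               unbalanced ? "unbalanced" : "balanced", t1 - t0);

    free(sendbuf); free(recvbuf);
    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    MPI_Finalize();
    return 0;
}
```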
Task 4: Use Go + a front end to visualize the alltoallv pattern
We chose Go for the message passing because we did not need the performance dynamically, and we drew the charts with Ant Design components rather than directly with gnuplot, which is legacy for display. The downside is that the front end occupies too much memory and takes a little longer, especially for WRF.
Task 5: Write an online algorithm that re-maps the pattern to make it faster
Static calculation using MPI static analysis, plus a DP swap for the red (hot) part of the data heatmap. The code
LAMMPS
Problem Discovery
- The Intel package by W. Michael Brown speeds up the CPU performance roughly 2x on both the Broadwell and Skylake chassis. The only difference is -xCORE-AVX512.
- Communication overhead is extremely unbalanced in the Protein case because Comm::Brick::reverse_comm calls MPI_Waitany too many times. This is solvable by defining the grid box.
- Kokkos (from Sandia) is extremely useful for resource allocation on the GPU. However, the GPU does not bring an aggressive improvement, possibly because of the sparse data.
Result
- Intel package buffer: cache friendly and vectorized.
- FFTW comparison by project-gemmi/: mostly butterfly 3D FFT operations, for which FFTW is the best.
Lesson Learned
- Environment variable settings may affect the efficacy of the execution of the application, as well as its efficiency.
- The architecture may affect performance.
- AVX-512 may reduce the CPU frequency and hence reduce performance.
- More nodes do not guarantee a performance improvement.
- Communication overhead may eat up the performance gain.
- A dedicated package may bring additional performance gains.
- Most of the gain comes from the USER-INTEL package (by Intel®).
- We found that CMake was too clever with a compiler option, which halved the addme array in the half-neighbor computation; once we switched to plain make, the problem was solved.
- The Protein case still segfaults when using the Intel package on NSCC; we rolled back to no package for that single case.
GPAW
- Cython program
  - Pros and cons of hybrid MPI/OMP
  - 70% of the runtime is in C, 30% in Python
- Computation-intensive program
  - Highly dependent on the math library
- Hybrid MPI/OpenMP program
  - Pros and cons of hybrid MPI/OMP
  - Balance of MPI/OpenMP
GPU Accelerated
- ELPA
  - A highly efficient and highly scalable direct eigensolver for symmetric (Hermitian) matrices.
  - With this math library, the performance can increase 3x-5x.
Profiling
- According to the IPM profile, MPI_Allreduce is the most time-consuming call.
- We tried to profile the ratio of MPI to OpenMP since it is a hybrid MPI/OpenMP program, but the performance is unstable, since different Python scripts using GPAW may follow different calculation routines.
Lesson Learned
- The Python GIL sometimes makes profiling difficult.
- Cython programs usually have their time-consuming parts in the C code; optimize that part.
- A general math library (such as MKL) may not help much with a specific program, but a small domain-specific library will.
MHM2
The code is written in UPC++
Intro
- Multiple UPC++ backends: ibv, mpi, smp, udp
- When based on MPI, the UPC++ backend uses InfiniBand by default.
- There is no significant performance difference between mpi and ibv.
- The performance degradation as nodes are added is more serious than expected: more compute nodes give better DHT performance but more network overhead.
  - Discussed in the next few slides.
- Profiling is a little bit difficult.
Conduit | Build Type | Report | System CPU | User CPU | Nodes |
---|---|---|---|---|---|
**mpi** | **Release** | **37.36** | **02:54.9** | **1:35:15** | **4** |
mpi | Release | 60.74 | 01:37.4 | 1:19:27 | 2 |
**ibv** | **Release** | **37.27** | **02:57.3** | **1:36:37** | **4** |
ibv | Release | 61.69 | 01:36.6 | 1:19:33 | 2 |
ibv | Debug | 112.3 | 03:44.6 | 4:54:57 | 4 |
mpi | Debug | 134.4 | 06:11.6 | 5:57:13 | 4 |
mpi | Release | 37.79 | 07:31.1 | 1:39:17 | 4 |
mpi | Release | 545.35 | 1:18:27 | 18:15:26 | 4 |
mpi | Release | 104.88 | 02:54.6 | 1:08:33 | 1 |
Profiling
- Profiler: Intel VTune Amplifier/Profiler, version 2019.6. UPC++ can rely on MPI, but InfiniBand has to be disabled to profile the MPI model.
- CPU utilization will be 80% if hyperthreading is disabled.
- Overall overhead is insignificant for the small dataset (800 MB).
- For the large dataset (40 GB), the overhead is not negligible:
  - Not I/O bound; the network is the bottleneck.
  - A lot of data is exchanged between nodes.
- We examine the following two aspects: k-mers and the DHT period.
DHT Analysis
- Three periods: write-only, read & write, read-only.
- Write-only period: data is stored locally.
- Hyperscale all-to-all data transmission in the read-only period.
- Bottleneck: transmission restrictions cause functions to wait. This is corroborated by the rate of performance degradation as the number of nodes increases: how do we improve efficiency on larger clusters?
Innovation
- Highly redundant distributed hash table:
  - Reduce the order of the complete graph: as much as memory allows.
  - Transfer data during the write-only period: network IO is not significant then, so generate the redundancy there.
  - For clusters with more memory: multiple redundancy.
  - Reduces both the compute-alns part and the read-only part.
- Data reduction (a toy sketch of the idea follows this list):
  - RAID5-like memory model
  - Uses XOR to compute the parity data
- Hyperparameter configuration
  - Adjust the k values in the k-mers analysis.
  - We can achieve better results and less time consumption by tuning the k parameter.
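As a toy illustration of the RAID5-like idea above (our own sketch, not MHM2 code): keep one XOR parity block alongside the data blocks, and any single lost block can be rebuilt by XOR-ing the parity with the remaining blocks.

```c
#include <stdio.h>
#include <string.h>

/* RAID5-style parity over NBLK data blocks: parity = b0 ^ b1 ^ ... ^ b{NBLK-1}.
 * Any single missing block equals the XOR of the parity with the others. */
#define NBLK  4
#define BLKSZ 8

static void xor_into(unsigned char *dst, const unsigned char *src, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] ^= src[i];
}

int main(void) {
    unsigned char blocks[NBLK][BLKSZ], parity[BLKSZ] = {0};
    for (int b = 0; b < NBLK; ++b)
        for (int i = 0; i < BLKSZ; ++i)
            blocks[b][i] = (unsigned char)(b * 17 + i);   /* some sample data */
    for (int b = 0; b < NBLK; ++b)
        xor_into(parity, blocks[b], BLKSZ);

    /* pretend block 2 was lost; rebuild it from the parity and the others */
    unsigned char rebuilt[BLKSZ];
    memcpy(rebuilt, parity, BLKSZ);
    for (int b = 0; b < NBLK; ++b)
        if (b != 2) xor_into(rebuilt, blocks[b], BLKSZ);
    printf("block 2 recovered correctly: %s\n",
           memcmp(rebuilt, blocks[2], BLKSZ) == 0 ? "yes" : "no");
    return 0;
}
```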
Lesson Learned
- Setting up the environment on the cluster
  - Use Spack and Modules to manage user-mode packages.
  - Learn how to use PBS and Slurm.
  - Balance the number of cores occupied against the queueing time.
- Any optimization in a parallel program is very difficult.
  - Network, IO, memory and core scheduling all need to be considered thoroughly.
- Profiling in UPC++ can be hard:
  - Try to use other parallelization methods.
WRF
Damn Fortran. It's 2021 and people still use Fortran.
It would be best to find someone who works in meteorology and ask about the parameter settings; unfortunately I couldn't find one.
This is a weather simulation system for the earth sciences; any other earth-science application parallelized with Fortran can use this section as a reference.
Task links and introductions
3 Domain Problem for ISC21 SCC
Install
required libs
HDF5, NetCDF-C, NetCDF-Fortran (building them manually may be better; MPI is required)
HDF5
./configure --prefix=/path/to/hdf5 --enable-fortran --enable-fortran2003 --enable-parallel
make -j 48
make install
# vi ~/.bashrc
export HDF5=/path/to/hdf5
export PATH=$HDF5/bin:$PATH
export LD_LIBRARY_PATH=$HDF5/lib:$LD_LIBRARY_PATH
export INCLUDE=$HDF5/include:$INCLUDE
# source ~/.bashrc
NetCDF-C
./configure --prefix=/path/to/netcdf LDFLAGS="-L$HDF5/lib" CPPFLAGS="-I$HDF5/include" CC=mpiicc --disable-dap
make -j 48
make install
# vi ~/.bashrc
export NETCDF=/usr/local/netcdf
export PATH=$NETCDF/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$LD_LIBRARY_PATH
export INCLUDE=$NETCDF/include:$INCLUDE
# source ~/.bashrc
NetCDF-Fortran
./configure --prefix=/path/to/netcdf CPPFLAGS="-I$HDF5/include -I$NETCDF/include" LDFLAGS="-L$HDF5/lib -L$NETCDF/lib" CC=mpiicc FC=mpiif90 F77=mpiif90 # install into the same directory as NetCDF-C
make -j 48
make install
Advanced lib
PNetCDF: a parallel I/O library for NetCDF file access.
With 4 nodes it has a negative effect; you need 8 nodes or more before it differs from NetCDF.
See the official site for installation instructions.
Main Program
In our tests, Intel MPI hits a segmentation fault while OpenMPI does not; on the other hand, Intel MPI did not seem to bring much improvement anyway. The problem can probably be attacked from the stack-size angle.
env setting
intel openmpi hdf5 netcdf
config and build
./configure
checking for perl5... no
checking for perl... found /usr/bin/perl (perl)
Will use NETCDF in dir: /global/software/centos-7.x86_64/modules/intel/2020.1.217/netcdf/4.7.4
HDF5 not set in environment. Will configure WRF for use without.
PHDF5 not set in environment. Will configure WRF for use without.
Will use 'time' to report timing information
$JASPERLIB or $JASPERINC not found in environment, configuring to build without grib2 I/O...
------------------------------------------------------------------------
Please select from among the following Linux x86_64 options:
1. (serial) 2. (smpar) 3. (dmpar) 4. (dm+sm) PGI (pgf90/gcc)
5. (serial) 6. (smpar) 7. (dmpar) 8. (dm+sm) PGI (pgf90/pgcc): SGI MPT
9. (serial) 10. (smpar) 11. (dmpar) 12. (dm+sm) PGI (pgf90/gcc): PGI accelerator
13. (serial) 14. (smpar) 15. (dmpar) 16. (dm+sm) INTEL (ifort/icc)
17. (dm+sm) INTEL (ifort/icc): Xeon Phi (MIC architecture)
18. (serial) 19. (smpar) 20. (dmpar) 21. (dm+sm) INTEL (ifort/icc): Xeon (SNB with AVX mods)
22. (serial) 23. (smpar) 24. (dmpar) 25. (dm+sm) INTEL (ifort/icc): SGI MPT
26. (serial) 27. (smpar) 28. (dmpar) 29. (dm+sm) INTEL (ifort/icc): IBM POE
30. (serial) 31. (dmpar) PATHSCALE (pathf90/pathcc)
32. (serial) 33. (smpar) 34. (dmpar) 35. (dm+sm) GNU (gfortran/gcc)
36. (serial) 37. (smpar) 38. (dmpar) 39. (dm+sm) IBM (xlf90_r/cc_r)
40. (serial) 41. (smpar) 42. (dmpar) 43. (dm+sm) PGI (ftn/gcc): Cray XC CLE
44. (serial) 45. (smpar) 46. (dmpar) 47. (dm+sm) CRAY CCE (ftn $(NOOMP)/cc): Cray XE and XC
48. (serial) 49. (smpar) 50. (dmpar) 51. (dm+sm) INTEL (ftn/icc): Cray XC
52. (serial) 53. (smpar) 54. (dmpar) 55. (dm+sm) PGI (pgf90/pgcc)
56. (serial) 57. (smpar) 58. (dmpar) 59. (dm+sm) PGI (pgf90/gcc): -f90=pgf90
60. (serial) 61. (smpar) 62. (dmpar) 63. (dm+sm) PGI (pgf90/pgcc): -f90=pgf90
64. (serial) 65. (smpar) 66. (dmpar) 67. (dm+sm) INTEL (ifort/icc): HSW/BDW
68. (serial) 69. (smpar) 70. (dmpar) 71. (dm+sm) INTEL (ifort/icc): KNL MIC
72. (serial) 73. (smpar) 74. (dmpar) 75. (dm+sm) FUJITSU (frtpx/fccpx): FX10/FX100 SPARC64 IXfx/Xlfx
Enter selection [1-75] :
dm+sm: OMP+MPI
./compile -j 6 em_real >& build_wrf.log
tail -15 build_wrf.log
finish
All executables are in the run folder.
Run
for i in ../WRF/run/* ; do ln -sf $i /path/to/your/data ; done
namelist.input is the input file; it has many parameters to set. See the WRF NAMELIST.INPUT FILE DESCRIPTION for details.
slurm script
#!/bin/bash -l
#SBATCH -N 4
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=2
#SBATCH --ntasks=80
#SBATCH -J wrf3Dom_mpi_80_omp_2
#SBATCH -p compute
#SBATCH -t 2:00:00
#SBATCH -o wrf3Dom-%j.out
sleep 300
module load NiaEnv/2019b
module load intel/2019u4 openmpi/4.0.1
#hdf5/1.10.5
#module load netcdf/4.6.3
ulimit -c unlimited
ulimit -s unlimited
module list
export HDF5=/home/l/lcl_uotiscscc/lcl_uotiscsccs1034/scratch/nonspack/hdf5
export PATH=$HDF5/bin:$PATH
export LD_LIBRARY_PATH=$HDF5/lib:$LD_LIBRARY_PATH
export INCLUDE=$HDF5/include:$INCLUDE
export NETCDF=/home/l/lcl_uotiscscc/lcl_uotiscsccs1034/scratch/nonspack/netcdf
export PATH=$NETCDF/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$LD_LIBRARY_PATH
export INCLUDE=$NETCDF/include:$INCLUDE
export KMP_STACKSIZE=20480000000
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
cd ~/scratch/pl/orifiles
mpirun -np 80 -cpus-per-rank $SLURM_CPUS_PER_TASK ./wrf.exe
Important Notice
stack size and segmentation faults
ulimit sets the OS limits for the program. KMP_STACKSIZE tells the OpenMP implementation how much stack to actually allocate for each thread's stack, so depending on your OS defaults you might need both. By the way, you should rather use OMP_STACKSIZE, since KMP_STACKSIZE is the environment variable used by the Intel and clang compilers; OMP_STACKSIZE is the standard way of setting the stack size of OpenMP threads.
Note that this problem is usually more exposed with Fortran, as Fortran tends to keep more data on the stack, especially arrays. Some compilers can move such arrays to the heap automatically; see for instance -heap-arrays for the Intel compiler.
Fortran OpenMP threads put a lot of data on the stack and frequently overflow it, so applications that use Fortran and OpenMP need something like export KMP_STACKSIZE=20480000000; also note that gcc honours OMP_STACKSIZE while icc uses KMP_STACKSIZE.
Fortran and MPI
Not sure whether this is a Slurm or a Fortran problem, but Slurm cannot automatically assign CPU cores to Fortran MPI programs, so it has to be set manually:
mpirun -np 16 -cpus-per-rank $SLURM_CPUS_PER_TASK ./wrf.exe
This tells MPI how many CPU cores each MPI rank gets for OpenMP.
IPM Report env setting
IPM is a profiler that monitors MPI usage. Using IPM only requires preloading the IPM library, but to generate the full report graphics you need to set the following variables:
export IPM_REPORT=full
export IPM_LOG=full
When using IPM, set the environment variables above to make sure you get the right XML to visualize, or use https://files.slack.com/files-pri/TAXMW9014-F02586VN27L/download/ipm.ipynb to visualize.
Others
Training is taking off.
Incompact3D
It solves the incompressible Navier–Stokes equations using sixth-order compact schemes for spatial discretization, basically implementing an ODE integration with numerical methods called multistep methods.
The Poisson equation is fully solved in spectral space using Fast Fourier Transform (FFT) routines.
Intro to the algorithm and implementation
Test Case Taylor
Build for MKL/FFTW3
Reminder:
- Enable MKL speedup on the AMD platform (a build/usage sketch follows this list):
int mkl_serv_intel_cpu_true() {
return 1;
}
- Migrate FFTW3 to cuFFT.
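For the MKL-on-AMD item above, the usual way to apply the override (an assumption about the build steps, not something from the Incompact3d docs) is to compile it into a small shared library and LD_PRELOAD it:

```c
/* fakeintel.c - a common workaround, not part of Incompact3d itself:
 * build as a shared library and preload it so MKL's CPU-vendor check
 * reports an Intel CPU and keeps the fast code paths on AMD.
 *
 *   gcc -shared -fPIC -o libfakeintel.so fakeintel.c
 *   export LD_PRELOAD=$PWD/libfakeintel.so
 */
int mkl_serv_intel_cpu_true(void) {
    return 1;
}
```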
Build libnbc
with Spack
I don't know why... it fails to build when MPICXX is set...
Here is a quick hack:
class Libnbc(AutotoolsPackage):
"""LibNBC is a prototypic implementation of a nonblocking
interface for MPI collective operations. Based on ANSI C and
MPI-1, it supports all MPI-1 collective operations in a
nonblocking manner. LibNBC is distributed under the BSD license.
"""
homepage = "http://unixer.de/research/nbcoll/libnbc/"
url = "http://unixer.de/research/nbcoll/libnbc/libNBC-1.1.1.tar.gz"
version('1.1.1', sha256='63aa5f75f84c191da0688cb551ebd0e9e46928edfba350b2a534eb0c704dd9c3')
depends_on("mpi")
+ def configure_args(self):
+ args = []
+ args.append("MPICXX=")
+ return args
Reference
- Incompact3d: A powerful tool to tackle turbulence problems with up to \(O\left(10^{5}\right)\) computational cores
NWChem
ICON
Prepare
git clone https://gitlab.dkrz.de/icon-scc-isc22/icon-scc
cd /path/to/icon-scc
git submodule init
git submodule update
How to run
spack compile
spack install -j $(nproc) -vvvv icon%gcc@6.4.0
There are some variants:
- debug
- cuda
- openmp
run copy scripts
cd {ICON_BUILD_DIR}
export ICON_DIR={ICON_DIR}
# Copy runscript-related files when building out-of-source:
if test $(pwd) != $(cd "${ICON_DIR}"; pwd); then
echo "Copying runscript input files from the source directory..."
rsync -uavz ${ICON_DIR}/run . --exclude='*.in' --exclude='.*' --exclude='standard_*'
ln -sf -t run/ ${ICON_DIR}/run/standard_*
ln -sf set-up.info run/SETUP.config
rsync -uavz ${ICON_DIR}/externals . --exclude='.git' --exclude='*.f90' --exclude='*.F90' --exclude='*.c' --exclude='*.h' --exclude='*.Po' --exclude='tests' --exclude='rrtmgp*.nc' --exclude='*.mod' --exclude='*.o'
rsync -uavz ${ICON_DIR}/make_runscripts .
ln -sf ${ICON_DIR}/data
ln -sf ${ICON_DIR}/vertical_coord_tables
fi
Gen sbatch
cd {ICON_BUILD_DIR}
export ICON_DIR={ICON_DIR}
cd {ICON_BUILD_DIR}/run
$ICON_DIR/utils/mkexp/mkexp standard_experiments/scc.config CO2=2850
The setup is OK if you see:
Script directory: '/mnt/nfs4/node1/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/experiments/exp_scc2850/scripts'
Data directory: '/mnt/nfs4/node1/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/experiments/exp_scc2850/outdata'
Work directory: '/mnt/nfs4/node1/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/experiments/exp_scc2850/work'
Modify sbatch
In experiments/exp_scc2850/scripts/exp_scc2850.run_start
- FIX SLURM args
- FIX path
- no
/build/
- no
/home/qinfr
- no
- BUILD_DIR=/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/BUILD
+ BUILD_DIR=/mnt/nfs4/node1/home/qinfr/spack/opt/spack/linux-ubuntu20.04-zen/gcc-6.4.0/icon-2021-isc-scc-hw7pyldsuxsug2jrnmhdulvk5knzbzw6/
+ export PATH={cdo-1.9.10_BUILD_DIR}/bin:$PATH
...
- Substitute all /home/qinfr with /mnt/nfs4/node1/home/qinfr/
Run
sbatch exp_scc2850.run_start
Tips
How to check if compiled code uses SSE and AVX instructions?
objdump -d cgribexlib.o | awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmu
lsd|vaddsd|vmosd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vf
nmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub2
13pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpb
roadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherd
d|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsll
vq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpc
mpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcomp
ressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2
ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64
x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb
|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvt
tpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vc
vttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|
vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vf
ixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrs
qrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|v
pminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatter
qq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcast
mw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgath
erpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0q
ps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpcl
assss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|
vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|v
p4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vps
hrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec
|vaesdeclast|vaesenc|vaesenclast)[ \t]/'
MiniVite
Overview
Materials
ghosh2018.pdf minivite-indyscc.pdf
Algorithm
For each community one can compute a modularity; the modularity of the whole graph is the sum over all communities. Changing the community assignment changes the modularity.
Goal: maximize modularity.
The Louvain method is iterative; initially every node is its own community.
For a node $u$, consider every neighbor $v$ (a node connected by an edge). Moving $u$ into $v$'s community changes the modularity by some $\Delta Q$, which can be computed quickly. After scanning all neighbors, take the largest $\Delta Q$; if $\Delta Q>0$, move $u$ into that neighbor's community.
One iteration considers every node $u$ once; the algorithm stops when $\Delta\text{Modularity}<\text{threshold}$.
Parallelization: partition the graph's vertex set, give each compute node a subset of vertices, and process vertices in parallel.
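For reference (standard Louvain notation, our addition rather than something from the miniVite materials), the modularity being maximized and the gain $\Delta Q$ of moving an isolated node $u$ into a neighboring community $C$ are:

\[ Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j) \]

\[ \Delta Q = \frac{\Sigma_{in} + 2 k_{u,in}}{2m} - \left( \frac{\Sigma_{tot} + k_u}{2m} \right)^{2} - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^{2} - \left( \frac{k_u}{2m} \right)^{2} \right] \]

where $m$ is the total edge weight, $A_{ij}$ the weight of edge $(i,j)$, $k_u$ the weighted degree of $u$, $k_{u,in}$ the weight of edges from $u$ into $C$, and $\Sigma_{in}$, $\Sigma_{tot}$ the internal and total incident edge weights of $C$.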
The code is fairly short and worth reading.
Build
You can use the miniVite package in Spack, but the version is rather old and package.py needs modifying.
You can also build directly from the GitHub source; with gcc you may need to replace -xHost in the Makefile with -march=native and -qopenmp with -fopenmp.
https://github.com/ECP-ExaGraph/miniVite
Run
Excerpted from the README:
mpiexec -n 2 bin/./minivite -f karate.bin
mpiexec -n 2 bin/./minivite -l -n 100
mpiexec -n 2 bin/./minivite -n 100
mpiexec -n 2 bin/./minivite -p 2 -n 100
[On Cray systems, pass MPICH_MAX_THREAD_SAFETY=multiple or
pass -DDISABLE_THREAD_MULTIPLE_CHECK while building miniVite.]
Possible options (can be combined):
1. -f <bin-file> : Specify input binary file after this argument.
2. -b : Only valid for real-world inputs. Attempts to distribute approximately
equal number of edges among processes. Irregular number of vertices
owned by a particular process. Increases the distributed graph creation
time due to serial overheads, but may improve overall execution time.
3. -n <vertices> : Only valid for synthetically generated inputs. Pass total number of
vertices of the generated graph.
4. -l : Use distributed LCG for randomly choosing edges. If this option
is not used, we will use C++ random number generator (using
std::default_random_engine).
5. -p <percent> : Only valid for synthetically generated inputs. Specify percent of overall
edges to be randomly generated between processes.
6. -t <threshold> : Specify threshold quantity (default: 1.0E-06) used to determine the
exit criteria in an iteration of Louvain method.
7. -w : Only valid for synthetically generated inputs. Use Euclidean distance as edge weight.
If this option is not used, edge weights are considered as 1.0. Generate
edge weight uniformly between (0,1) if Euclidean distance is not available.
8. -r <nranks> : This is used to control the number of aggregators in MPI I/O and is
meaningful when an input binary graph file is passed with option "-f".
naggr := (nranks > 1) ? (nprocs/nranks) : nranks;
9. -s : Print graph data (edge list along with weights).
Homework
Assignment
Access the following server and download the two graph inputs (they are in a binary format). Server: "sftp indyscc@N/A" Password: "N/A"
The homework consists of two parts, and each part has two/three questions (checking the appropriate documents from the code repository can save time):
- Establishing baseline performance: Download and build the default/main/master branch of miniVite (https://github.com/ECP-ExaGraph/miniVite), run it using the provided com-orkut and webbase-2001 input graphs on 1-20 nodes (to perform strong scaling experiments). Answer the following questions: How are these two input graphs different? What arguments did you choose to run miniVite? Does increasing the number of OpenMP threads help the performance (try 2-3 combinations of threads-per-process, keeping the “processes*threads-per-process” quantity the same)? Why or why not?
- Performing further optimizations: Find a combination of miniVite arguments and/or macros (arguments are discussed in the README, but for macros, you may need to look elsewhere), in addition to the baseline arguments/options that you ran miniVite with in the previous step, that improves the overall performance and scalability. Compare baseline performance with the improved version – plot it (X-axis: #Processes(nodes) and Y-axis: “Average total time (in s)” as reported by miniVite), and discuss. Does your set of options affect the output quality (expressed via modularity and MODS) in any way? If so, discuss.

Submission Instructions: The assignment is assigned to all students. However, a single submission per team is sufficient. One member of the team can submit the assignment. The report can be a PDF file (preferred method) or a link to a google doc (we will check the timestamp for when it was last edited). Please include your team name and the university in the report.
Modifying Spack's package.py
Some extra compile options are needed, so the build recipe has to be modified:
# Copyright 2013-2022 Lawrence Livermore National Security, LLC and other
# Spack Project Developers. See the top-level COPYRIGHT file for details.
#
# SPDX-License-Identifier: (Apache-2.0 OR MIT)
from spack.package import *
class Minivite(MakefilePackage):
    """miniVite is a proxy application that implements a single phase of
    Louvain method in distributed memory for graph community detection.
    """

    tags = ["proxy-app", "ecp-proxy-app"]

    homepage = "https://hpc.pnl.gov/people/hala/grappolo.html"
    git = "https://github.com/Exa-Graph/miniVite.git"

    version("develop", branch="master")
    version("1.0", tag="v1.0")
    version("1.1", tag="v1.1")

    variant("openmp", default=True, description="Build with OpenMP support")
    variant("opt", default=True, description="Optimization flags")
    variant("mode", default="default", description="mode",
            values=("collective", "sendrecv", "rma", "default", "rma_accu"))
    variant("omp_schedule", default=False, description="Enable OMP schedule")
    variant("use_32_bit_graph", default=False, description="Use 32bit graph")

    depends_on("mpi")

    @property
    def build_targets(self):
        targets = []
        cxxflags = ["-std=c++11 -g -DCHECK_NUM_EDGES -DPRINT_EXTRA_NEDGES"]
        ldflags = []

        if "+openmp" in self.spec:
            cxxflags.append(self.compiler.openmp_flag)
            ldflags.append(self.compiler.openmp_flag)
        if "+opt" in self.spec:
            cxxflags.append(" -O3 ")
        if self.spec.variants["mode"].value == "collective":
            cxxflags.append("-DUSE_MPI_COLLECTIVES")
        elif self.spec.variants["mode"].value == "sendrecv":
            cxxflags.append("-DUSE_MPI_SENDRECV")
        elif self.spec.variants["mode"].value == "rma":
            cxxflags.append("-DUSE_MPI_RMA")
        elif self.spec.variants["mode"].value == "rma_accu":
            cxxflags.append("-DUSE_MPI_RMA -DUSE_MPI_ACCUMULATE ")
        if "+omp_schedule" in self.spec:
            cxxflags.append("-DOMP_SCHEDULE_RUNTIME")
        if "+use_32_bit_graph" in self.spec:
            cxxflags.append("-DUSE_32_BIT_GRAPH")

        targets.append("CXXFLAGS={0}".format(" ".join(cxxflags)))
        targets.append("OPTFLAGS={0}".format(" ".join(ldflags)))
        targets.append("CXX={0}".format(self.spec["mpi"].mpicxx))

        return targets

# rest omitted
For this assignment, enabling USE_MPI_RMA gave a significant performance improvement.
Report
Feedback
- As a response to the first question, why do you think orkut's running time is longer even though it is smaller in size compared to webbase?
- How many OpenMP threads per process for the baseline strong scaling experiments?
- In part 1, you provide a brief discussion on observed load imbalance. But, you do not mention how you mitigated it in part 2.
- It would have been interesting to modulate the threshold and measure the effect on performance, and check the impact on MODS
3/5 + 5/5 + 30/40 + 25/25 + 18/25 = 81/100
Final
Assignment
This assignment has two parts, strong scale and weak scale. Like in homework #1, you will download and build miniVite: https://github.com/ECP-ExaGraph/miniVite
Strong scale
Use the com-friendster graph as input to miniVite, and the optimization arguments that you learned about during the last homework to perform strong scaling experiments (any option that improves the performance is acceptable, even if quality in terms of modularity is affected somewhat).
For this input, there will be startup issues (out-of-memory related crash or slowness) if you use a relatively small #nodes to begin or limited optimization arguments.
The goal of this exercise is to find a set of arguments and options (which may differ among process configurations) that maximizes strong scalability for this input, without compromising quality/modularity too much (rounding off final modularity to the first decimal place should yield similar values no matter your choice of optimizations). (Don’t try to use -DDUSE_32BIT_GRAPH, it won’t work)
i. Pick x where x is the startup node, and then scale the #nodes by incrementing x by a fixed stride to get the next process/node configuration, continue until x == 20. (pick any combination of processes-per-node and threads-per-process that yields better performance) You can vary processes-per-node as you see fit. How did you pick the base x?
ii. Report graph loading/construction times, #iterations to convergence, the time to perform the Louvain graph clustering as reported by miniVite running on the nodes as per 1.a.i.
Also, mention the arguments that you passed to miniVite and options you build it with.
Weak scale
Use the miniVite options to generate a distributed input graph (see FAQs and README) that scales with the #processes. Pick a reasonable number of vertices (this is governed by a formula – see FAQ, if miniVite complains, just adjust the #vertices or #processes)
i. Start with 1 node (any #processes-per-node and #threads-per-process configuration that makes sense to you) and end at 20 nodes. Plot the time to generate the graph, time to perform graph clustering (using data returned by miniVite) on 1-20 nodes.
ii. How large is the graph you generated on 20 nodes vs. 1 node? (Larger is better, but too large will take too much time in graph generation).
For submission, Create two directories called weak_scale and strong_scale and put the documents that answers the questions for each category in their respective directories.
Problem analysis
- Strong scaling: the problem size is fixed while the number of parallel workers grows, reducing the runtime. The ideal case is $\text{time with } n \text{ nodes}=\frac{\text{time with 1 node}}{n}$.
- Weak scaling: the work per node is fixed while the number of parallel workers (and hence the problem size) grows. The ideal case is constant runtime (no parallel overhead at all). A quick way to turn measured times into efficiency numbers is sketched below.
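When writing the report, a few lines of Python turn the measured runtimes into speedup and parallel efficiency for both modes. The timings below are placeholders, not our measurements.

```python
# Strong scaling: fixed total problem size; ideal runtime halves when nodes double.
# Weak scaling: fixed work per node; ideal runtime stays constant.
# Placeholder timings in seconds, indexed by node count (replace with measurements).
strong = {1: 1200.0, 5: 260.0, 10: 140.0, 20: 80.0}
weak = {1: 100.0, 5: 108.0, 10: 115.0, 20: 130.0}

for n, t in strong.items():
    speedup = strong[1] / t
    print(f"strong {n:>2} nodes: speedup {speedup:5.2f}, efficiency {speedup / n:6.1%}")

for n, t in weak.items():
    # for weak scaling, efficiency is t(1 node) / t(n nodes)
    print(f"weak   {n:>2} nodes: efficiency {weak[1] / t:6.1%}")
```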
OpenMPI & OpenMP parameter tuning
Each machine has two E5-2660 v3 CPUs, 20 cores in total.
After some experimentation, one OpenMP thread per rank with 20 MPI ranks per node performed best.
| PPN | OMP_NUM_THREADS | Clustering time (s) |
| --- | --- | --- |
| 20 | 1 | 100.445 |
| 20 | 2 | 102.023 |
| 20 | 4 | 108.56 |
| 12 | 3 | 132.041 |
| 10 | 2 | 146.281 |
This seems to show that OpenMP parallelism is less effective than MPI here, possibly because the OpenMP atomics are too expensive, but we did not profile it so we cannot be sure. Note that every MPI rank allocates a data structure holding information for all vertices of the whole graph, so the memory overhead is large.
mpirun --hostfile ./hostfile -n 400 -map-by core --bind-to core miniVite -f com-friendster.ungraph.bin -b -t 0.0015
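A back-of-the-envelope estimate of that per-rank memory overhead for com-friendster (the vertex count is roughly 65 million; the bytes-per-vertex figure is a guess for illustration, not a measured value) shows why small node counts run out of memory:

```python
# Rough estimate of miniVite's per-rank memory for state indexed by global vertex ID.
# n_vertices is approximately com-friendster's vertex count; bytes_per_vertex is an
# assumed size (a few 8-byte fields), not taken from the source code.
n_vertices = 65_000_000
bytes_per_vertex = 32

per_rank_gib = n_vertices * bytes_per_vertex / 2**30
print(f"~{per_rank_gib:.1f} GiB per rank for global vertex state")
print(f"~{per_rank_gib * 20:.0f} GiB per node with 20 ranks per node")
```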
Profile
The original program uses std::set, std::map, std::unordered_set and std::unordered_map, which are too slow; replacing them with a fast third-party hash-table implementation gives a multi-fold speedup.
The original algorithm also allocates an unnecessary vector that can be optimized away.
Weak scale
The graph is generated with miniVite's built-in generator, which is very slow, and the number of processes must be a power of two ($2^n$), so we could only run with $1,2,4,8,16,32,64,128,256$ processes. We did not oversubscribe, since that does not really match the spirit of weak scaling and the resulting data would likely be noisy (large fluctuations). A helper for picking the rank count is sketched below.
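Since the generator only accepts power-of-two rank counts, a small helper (assuming 20 ranks per node, as on our machines, and no oversubscription) maps a node count to the largest usable number of MPI ranks:

```python
# Pick the largest power-of-two MPI rank count that fits on `nodes` machines,
# assuming `ppn` ranks per node (20 on our E5-2660 v3 nodes) and no oversubscription.
def usable_ranks(nodes, ppn=20):
    total = nodes * ppn
    p = 1
    while p * 2 <= total:
        p *= 2
    return p

for nodes in (1, 2, 4, 8, 13, 20):
    print(nodes, "nodes ->", usable_ranks(nodes), "ranks")   # 20 nodes -> 256 ranks
```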
Submission
strong-scale-report.pdf weak-scale-report.pdf 0001-modify.patch
HPL & HPL Hero Run
The HPL benchmark in this IndySCC competition consists of two assignments: one is tuning HPL on a single node, the other is running HPL on the whole cluster (about 300 nodes).
HPL
Hi teams
Here is the assignment on HPL!
1. Compute the theoretical peak FLOPs for the processor type available on Chameleon cloud for the IndySCC competition. (10 points)
2. Build and install the HPL benchmark using your choice of linear algebra library and MPI library. Which linear algebra library did you choose and why? Which MPI library did you choose and why? (20 points)
3. Run the HPL benchmark on a node using a fixed problem size (N) and by varying the number of cores from 2 to 20, doubling the cores at each trial. Which parameters did you need to change? Plot the GFLOPs number for each run vs. no. of cores. (30 points)
4. Run the HPL benchmark using all 20 cores and tune the HPL.dat file to achieve the best GFLOPs number. Which parameters did you tune? What were your results using the unoptimized parameter(s) vs. the optimized parameter(s)? (40 points)
5. (Bonus) Run HPL on 2-nodes. Was the GFLOPs number exactly twice that of your single-node performance? Why or why not? (10 points)

Deliverables: Submit a report answering the questions in the assignment. While describing your steps, be brief and to the point. You should also include a description of your cluster and how you created the cluster. For each of the steps 2-5, include the scripts that you used to build and run HPL. Provide sufficient documentation for your codes. Please note that your environment and scripts should be reproducible, i.e., the judges should be able to run them on an empty cluster. For the tuning step 4, include the final HPL.dat file that gave you the best performance.

Submission Instructions: The assignment is assigned to all students. However, a single submission per team is sufficient. One member of the team can submit the assignment. The report can be a PDF file (preferred method) or a link to a google doc (we will check the timestamp for when it was last edited). Please include your team name and the university in the report.
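For the first question, the theoretical peak is simply sockets × cores × clock × FLOPs-per-cycle. The sketch below plugs in the E5-2660 v3 nodes used elsewhere in this competition (2 sockets × 10 Haswell cores, 2.6 GHz base clock, AVX2 + FMA giving 16 double-precision FLOPs per core per cycle); substitute the actual Chameleon node parameters for the submission.

```python
# Theoretical double-precision peak = sockets * cores * clock * flops_per_cycle.
sockets = 2
cores_per_socket = 10        # E5-2660 v3 (Haswell)
clock_ghz = 2.6              # base clock; AVX turbo clocks differ
flops_per_cycle = 16         # 2 FMA units * 4 doubles per 256-bit vector * 2 flops per FMA

peak_gflops = sockets * cores_per_socket * clock_ghz * flops_per_cycle
print(f"Per-node peak: {peak_gflops:.0f} GFLOPS")   # ~832 GFLOPS
# HPL efficiency is then measured_gflops / peak_gflops.
```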
HPL Hero Run
Between now and Oct 20 you are allowed to burst up to 30 compute nodes to test scaling your node deployment processes.

Hero HPL Runs

On Oct 20 we will begin hero HPL runs. Each team will get a 23 hour window where they will be allowed to use up 300 compute nodes to complete the best HPL score they can get.

Ground Rules

During this time, the other teams will only be allowed to use their 1 head node to continue testing. At the beginning of this phase we will shut down any compute nodes still running. Resource usage during this time will be closely monitored - any interference, accidental or otherwise, with the competing team will be penalized per the IndySCC rules and at the discretion of the committee.

Schedule

Each window will begin at 9 am Eastern Time and will end at 8 am Eastern Time the following day. We will need an hour to make sure your nodes are shut down and ready to go for the next team. The schedule was randomly generated and assigned and is listed below. If you would like to swap dates, you are responsible for finding another team to swap with. You may use the Google classroom to ask if anyone else needs to swap and is willing. Once the first team begins, the schedule will be locked in place. Otherwise, we cannot change any dates unless mutually agreed upon and there isn’t enough time left in the month to have alternative days.

- Thu Oct 20 - Durham University
- Fri Oct 21 - CUHK
- Sat Oct 22 - Georgia Tech
- Sun Oct 23 - SUSTech
- Mon Oct 24 - ShanghaiTech
- Tue Oct 25 - CSC/Finland
- Wed Oct 26 - Monash University
- Thu Oct 27 - Clemson
- Fri Oct 28 - U Indonesia
- Sat Oct 29 - UTEP
- Sun Oct 30 - TAMUCC

Helpful Hints

A successful team may strategize to run HPL in increasing increments of nodes, rather than try for the full 300 nodes at once. It is OK if you don’t use the full number of nodes with your final score; getting hundreds of nodes to work nicely together in real-life benchmarking exercises takes time and may be difficult to do in the short period you have them. It would be advantageous to have a very good score on a smaller set of nodes than to struggle to get the full 300 running, run out of time, and not have a score at all.

Also keep in mind that this hardware is fairly old and there are likely a handful of bad/slow nodes. This is where slowly increasing the number of nodes you are using can come in handy.

Run Into Hardware Problems or Outages?
If you identify a slow node, we will not be able to fix it as the only spare parts are in the form of the other nodes, and it’s unlikely we will be able to fix it by the end of your window. You should exclude the slow node and move onto another node. If you scale up to the full 300 nodes, you may deploy extras, just make note of the slow nodes and document that your final submission uses no more than 300.
If there is any disruption of resource availability during your window, such as a Chameleon outage or power/networking outage at the Purdue site, you may receive some make-up time in certain situations, as described in the next paragraph.
If the resources are cumulatively unavailable for more than 4 hours, at the discretion of the committee, you may receive an equivalent time slot (ie, if you were unable to access for 5 hours, you would get another 5 hours) plus an extra hour for node spin-up at the end to try again. This time would be disjoint and come at the end of the above schedule as we can’t shift the rest of the schedule for the remaining teams.
If you are the last team, we will ask that you shut down at your designated time, as we need time to determine the outage length adjustments for everyone. You’ll be able to restart at a later time, and that will keep things fair as the other teams wouldn’t have the ability to just extend their window.
Outages of less than 4 hours will unfortunately be considered lost time. These shorter blips are part of real life challenges, and it would not be practical to allow make-up time for shorter times as you’d spend more time just spinning nodes back up.
We also cannot accommodate internet outages at your locale as we can’t verify those outages, so you may want to have a plan to find another connection if that happens.
Once all teams are done running, we will open the nodes back up for you to configure for the final 48 hour competition.

Submitting results
Final score submissions are due 1 hour after your window ends, right as the next team is starting. We will follow up with more details on this.
Report
References
- Official Documentation
- AMD HPL Benchmark
- Run HPL on Threadripper
- 基于 HPL测试的集群系统性能分析与优化
Approach
Tuning HPL.dat
See the report for the detailed process.
Multiple parameter sets can be listed in HPL.dat, so a single HPL run produces several measurements, which is more efficient. {.is-info}
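The problem size N is usually the dominant knob: HPL factorizes one N × N double-precision matrix, so a common starting point is to size N to fill roughly 80-90% of the available memory and round it down to a multiple of NB. A small sketch (the memory size and NB below are assumptions, not our exact values):

```python
import math

# Estimate the HPL problem size N: fill ~85% of the available RAM with the
# N x N double-precision matrix, then round N down to a multiple of NB.
def estimate_N(mem_gib_per_node, nodes=1, fraction=0.85, nb=232):
    total_bytes = mem_gib_per_node * nodes * 1024**3 * fraction
    n = int(math.sqrt(total_bytes / 8))      # 8 bytes per double
    return n - n % nb

print(estimate_N(64))              # single node, assuming 64 GiB of RAM
print(estimate_N(64, nodes=300))   # full hero run, same per-node assumption
```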
Tuning MPI parameters
Processes or threads
The math library underneath HPL can use multiple threads, so one HPL process can drive a whole socket. Which layout performs better has to be measured.
Intel's math library offers three threading modes: sequential, OpenMP, and Intel TBB; the latter two can use multiple cores.
See the report for the tuning details.
Core binding
numactl can be used to bind the memory used by HPL to a specific NUMA node. For core-binding tutorials see: IBM MPI documentation, Understanding-MPI-map-by-and-bind-to-option.
Author
@Zecheng Li
Tutorial
Video: https://drive.google.com/file/d/1c2bD3gZw5ZeJS81i1uXY6eAXf70t8bZo/view
NAMD 2.14 User’s Guide: https://www.ks.uiuc.edu/Research/namd/2.14/ug/
NAMD Tutorial: http://www.ks.uiuc.edu/Training/Tutorials/namd/namd-tutorial-unix-html/index.html
VMD User’s Guide: https://www.ks.uiuc.edu/Research/vmd/current/ug/
VMD Tutorial: https://www.ks.uiuc.edu/Training/Tutorials/vmd/tutorial-html/
Building
We use spack as the package manager. To build a simple version without MPI support:
spack install namd
In our competition we need to support multiple nodes, so we chose to install charm++ with the MPI backend and SMP enabled. The TCL interface is included for parsing the input file. The command below depends on OpenMPI built with UCX support.
spack install -v namd%gcc interface=tcl ^charmpp backend=mpi ^openmpi fabrics=ucx
A pure MPI build is also possible, but unlike some other applications NAMD is optimized for multi-threading, so the SMP build is usually faster than pure MPI when a node has many cores. For larger-scale jobs we should use more communication threads (one per NUMA domain). A helper for deriving the SMP launch parameters is sketched below.
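The charm++ SMP launch parameters follow mechanically from the core count and NUMA layout. The sketch below derives +ppn/+pemap/+commap for one process per NUMA domain, reserving the first core of each domain for the communication thread; it assumes cores are numbered consecutively within a domain, so check your actual topology (e.g. with lscpu) before using the result. The commands in the Assignment section below use a different mapping that matches our machines' interrupt layout.

```python
# Sketch: derive charm++ SMP launch parameters for an MPI-SMP NAMD build,
# assuming one process per NUMA domain, consecutively numbered cores per domain,
# and the first core of each domain reserved for the communication thread.
def smp_layout(cores_per_node=20, numa_domains=2):
    per_domain = cores_per_node // numa_domains
    comm_cores, worker_cores = [], []
    for d in range(numa_domains):
        first = d * per_domain
        comm_cores.append(first)                                   # comm thread core
        worker_cores.append(f"{first + 1}-{first + per_domain - 1}")  # worker cores
    return {
        "processes_per_node": numa_domains,
        "+ppn": per_domain - 1,
        "+pemap": ",".join(worker_cores),
        "+commap": ",".join(str(c) for c in comm_cores),
    }

# {'processes_per_node': 2, '+ppn': 9, '+pemap': '1-9,11-19', '+commap': '0,10'}
print(smp_layout())
```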
Tuning performance
We could try different compilers to build NAMD. The performance-critical part of NAMD is the force calculation implemented in NAMD's own source (rather than in an external math library), so compiler optimization is crucial for performance. We tried the following (oneAPI is not supported by the charm++ version that NAMD 2.14 depends on):
spack install -v namd%intel interface=tcl ^charmpp%intel backend=mpi ^openmpi fabrics=ucx
spack install -v namd%nvhpc interface=tcl ^charmpp%nvhpc backend=mpi ^openmpi fabrics=ucx
spack install -v namd%clang interface=tcl ^charmpp%clang backend=mpi ^openmpi fabrics=ucx
You may find that the Intel compiler yields better performance than gcc, while the NVHPC and clang builds are extremely slow. Don't worry; have a look at NAMD's build script:
spack edit namd
if self.spec.satisfies("^charmpp@:6.10.1"):
    optims_opts = {
        "gcc": m64
        + "-O3 -fexpensive-optimizations \
            -ffast-math -lpthread "
        + archopt,
        "intel": "-O2 -ip -qopenmp-simd" + archopt,
        "aocc": m64
        + "-O3 -ffp-contract=fast -ffast-math \
            -fopenmp "
        + archopt,
    }
else:
    optims_opts = {
        "gcc": m64
        + "-O3 -fexpensive-optimizations \
            -ffast-math -lpthread "
        + archopt,
        "intel": "-O2 -ip " + archopt,
        "aocc": m64
        + "-O3 -ffp-contract=fast \
            -ffast-math "
        + archopt,
    }
It does not set optimization flags for clang and NVHPC; we can add them ourselves. Below is an example; try different flags depending on your architecture.
"clang": m64 + "-O3 -ffp-contract=fast -ffast-math -mprefer-vector-width=256 " + archopt,
"nvhpc": m64 + "-O3 -fast " + archopt,
Then we could build NAMD with clang and NVHPC. Surprisingly, clang is faster than Intel compiler on our machine (Haswell).
From the charm++ documentation we can see that different load-balancer modules can be compiled into NAMD with different flags, but the default spack build script does not include them. Having a load balancer suited to your architecture and interconnect is important. Some load balancers may cause a compile error due to multiple definitions.
-module TreeLB -module RecBipartLB ... links the listed LB modules into an application, which can then be used at runtime via the +balancer option.
spack edit namd
# in function def edit(self, spec, prefix): add line
opts.extend(["-module CentralLB -module DistributedLB"])
In our experience, the choice between the default load balancer, CentralLB, and DistributedLB should be made based on the input and the architecture; it makes roughly a 5-10% performance difference. You can also experiment with other load balancers or even write your own (not that hard!).
There are also some other flags that could be tuned. Since our machine does not support AVX512, we did not try the Intel-optimized AVX512 blocking version of NAMD.
Assignment
- Quick Start & Homework: https://gitlab.msu.edu/vermaaslab/indysccnamd/-/tree/main
- Running Command:
# One per node with SMP
mpirun -np 20 -hostfile ~/work/host20 -bind-to core -map-by ppr:1:node -x PATH namd2 +ppn 19 +pemap 1-19 +commap 0 run.namd
# One per NUMA with SMP (You should check your NUMA topology first)
mpirun -np 40 -hostfile ~/work/host20 --bind-to core --map-by ppr:2:node -x PATH namd2 +ppn 9 +pemap 1-4,10-14,5-9,15-19 +commap 0,5 ./run.namd
# Run with 19 replicas
mpirun -np 19 -hostfile ~/work/host20 --bind-to core --map-by ppr:1:node -x PATH namd2 +replicas 19 +balancer DistributedLB +ppn 20 +pemap 0-19 +commap 0 +stdout output/out.%d.log ./replicaconfig.namd
Here we oversubscribe the cores: since core 0 is lightly loaded when communication is not heavy, we can also assign it to computation. Note that in the above commands we always assign core 0 or core 5 to communication, because we have set most of the interrupt affinity to cores 0 and 5; using them gives better performance for both communication and computation (using other cores can be up to 5% slower).
- MPI Reference: https://www.ks.uiuc.edu/Research/namd/2.9/ug/node87.html
- Our submission: https://drive.google.com/file/d/1HqxWP6YJIr06wz6ANMHog3v59HnhV7T2/view?usp=share_link
Final
- Problem Set: https://drive.google.com/file/d/1zyWpv-bfN2uzke7RqnpS8PI6Q-AQFtKb/view?usp=share_link
- Our Submission: https://drive.google.com/drive/folders/1dpVS6027vJTsbxlOjfMEGdftOfQyCmlO?usp=share_link
Experience
NAMD configuration file parameters:
- NAMD configuration parameters: https://www.ks.uiuc.edu/Research/namd/2.9/ug/node12.html
- Non-bonded Interaction & Parameters: https://www.ks.uiuc.edu/Research/namd/2.10b2/ug/node23.html
Tuning approach:
- Overall idea: as long as the simulation does not blow up, make the products timestep × nonbondedFreq and timestep × fullElectFrequency as large as possible.
- The output interval also affects performance; lowering it gained us roughly 5-10%.
- Reference:
SC21
Website: https://sc21.supercomputing.org/program/studentssc/student-cluster-competition/ Rewind: https://victoryang00.cn/wordpress/2021/11/18/sc21-shi-bai-hui-gu/
Quantum Espresso
https://github.com/QEF/q-e
compile
Could not find MPI (Missing MPI_FORTRAN_FOUND)
solve: -DMPIEXEC_EXECUTABLE=${MPI_HOME}/bin/mpiexec
The compiled version does not support OpenMP and only supports up to 4 MPI processes.
Add the options:
-DQE_ENABLE_OPENMP=ON
-DCMAKE_Fortran_COMPILER=${MPI_HOME}/bin/mpifort
-DOpenMP_C_FLAGS=-fopenmp=lomp
-DOpenMP_CXX_FLAGS=-fopenmp=lomp
-DOpenMP_C_LIB_NAMES=libomp
-DOpenMP_CXX_LIB_NAMES=libomp
-DOpenMP_libomp_LIBRARY=/usr/lib/x86_64-linux-gnu/libomp.so.5
Change the toolchain to System. Add -g to CMakeLists.txt to get additional debug information:
set(CMAKE_CXX_FLAGS -g)
set(CMAKE_C_FLAGS -g)
set(CMAKE_Fortran_FLAGS -g)
https://www.quantum-espresso.org/Doc/user_guide/
library configure: https://www.quantum-espresso.org/Doc/user_guide/node11.html
test
In the directory /q-e/test-suite/, run make run-tests to test the correctness of basic functionality.
run
spack load ucx/gji
/home/qe/q-e/bin/pw.x
To control the number of processors in each group, command line switches: -nimage, -npools, -nband, -ntg, -ndiag or -northo (shorthands, respectively: -ni, -nk, -nb, -nt, -nd) are used. As an example consider the following command line:
mpirun -np 4096 ./neb.x -ni 8 -nk 2 -nt 4 -nd 144 -i my.input
This executes a NEB calculation on 4096 processors, 8 images (points in the configuration space in this case) at the same time, each of which is distributed across 512 processors. k-points are distributed across 2 pools of 256 processors each, 3D FFT is performed using 4 task groups (64 processors each, so the 3D real-space grid is cut into 64 slices), and the diagonalization of the subspace Hamiltonian is distributed to a square grid of 144 processors (12x12).
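A small sanity check of such a layout before submitting a job (a sketch: it only encodes the basic divisibility rules implied by the passage above, not every constraint QE enforces):

```python
import math

# Sketch: check the basic divisibility rules for QE's parallelization levels
# (-ni images, -nk pools, -nt task groups, -nd diagonalization processors).
def check_qe_layout(nprocs, ni=1, nk=1, nt=1, nd=1):
    assert nprocs % ni == 0, "image count must divide total ranks"
    per_image = nprocs // ni
    assert per_image % nk == 0, "pool count must divide ranks per image"
    per_pool = per_image // nk
    assert per_pool % nt == 0, "task-group count must divide ranks per pool"
    side = math.isqrt(nd)
    assert side * side == nd and nd <= per_pool, "-nd must be a square <= ranks per pool"
    return {"ranks/image": per_image, "ranks/pool": per_pool,
            "ranks/task group": per_pool // nt, "diag grid": f"{side}x{side}"}

# the neb.x example above: 4096 ranks, 8 images, 2 pools, 4 task groups, 144 diag procs
print(check_qe_layout(4096, ni=8, nk=2, nt=4, nd=144))
```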
mpirun -np 24 -x PATH --oversubscribe -x OMP_NUM_THREADS=4 -x LD_LIBRARY_PATH=/opt/nonspack/ucx-1.10.0-gcc/lib --allow-run-as-root /home/qe/q-e/bin/pw.x < ./ausurf.in
First run with 24 processes and 4 threads each:
Problem: OMP threads can only use up to 200% CPU per process even with 256 threads per process.
Analyze
Static Analysis
Using lizard
| Language | Total nloc | Avg.NLOC | AvgCCN | Avg.token | Fun Cnt | Warning cnt | Fun Rt | nloc Rt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fortran | 599949 | 54.1 | 10.6 | 569.7 | 9939 | 1693 | 0.17 | 0.58 |
| C | 52039 | 152.5 | 3.0 | 1050.3 | 323 | 19 | 0.06 | 0.53 |
| Python | 8864 | 18.3 | 5.0 | 146.0 | 298 | 21 | 0.07 | 0.26 |
Profiling result
All the GPU-version test cases seem to hit IEEE underflow, triggered by the FFT library; this should eventually be fixed, since the development team is still actively adapting the application to GPUs.
We chose the case si.scf.david.in to profile on a single GPU. Here is the profiling result.
=117847== Profiling application: /home/qe/bin/pw.x -i ./si.scf.david.in
==117847== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 8.72% 22.118ms 140 157.98us 157.82us 159.81us usnldiag_collinear_79_gpu
6.46% 16.390ms 1360 12.051us 11.840us 189.41us init_us_2_base_gpu_216_gpu
5.29% 13.411ms 10 1.3411ms 1.3407ms 1.3417ms rotate_wfc_k_gpu_146_gpu
4.24% 10.763ms 370 29.090us 28.704us 32.928us ylmr2_gpum_ylmr2_gpu_kernel_
3.71% 9.4250ms 1127 8.3620us 6.5280us 17.664us volta_zgemm_32x32_nn
3.23% 8.1880ms 1224 6.6890us 6.5920us 7.1040us init_us_2_base_gpu_220_gpu
2.68% 6.8026ms 680 10.003us 9.8560us 10.784us init_us_2_base_gpu_185_gpu
2.67% 6.7818ms 340 19.946us 19.744us 21.280us init_us_2_base_gpu_206_gpu
2.61% 6.6090ms 340 19.438us 19.295us 21.504us init_us_2_base_gpu_158_gpu
2.46% 6.2396ms 689 9.0560us 7.2000us 14.432us void zgemm_largek_warp<bool=1, bool=0, bool=1, bool=0, int=3, int=3, int=4, int=3, int=2, int=2, int=9>(double2*, double2 const *, double2 const *, int, int, int, int, int, int, double2 const *, double2 const *, double2, double2, int, int, int*, int*)
2.28% 5.7953ms 159 36.448us 19.392us 43.200us cegterg_gpu_493_gpu
2.20% 5.5704ms 1104 5.0450us 4.1600us 11.488us void composite_2way_fft<unsigned int=20, unsigned int=4, unsigned int=32, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=5, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
2.17% 5.4956ms 478 11.497us 11.359us 12.864us add_vuspsi_k_gpu_242_gpu
1.98% 5.0265ms 239 21.031us 10.208us 40.384us vloc_psi_k_gpu_464_gpu
1.86% 4.7254ms 219 21.577us 12.319us 33.824us void sytd2_upper_cta<double2, double, int=4>(int, double2*, unsigned long, double*, double*, double2*)
1.71% 4.3307ms 219 19.774us 19.743us 20.960us laxlib_cdiaghg_gpu_349_gpu
1.64% 4.1660ms 239 17.430us 17.248us 19.488us vloc_psi_k_gpu_477_gpu
1.48% 3.7585ms 1 3.7585ms 3.7585ms 3.7585ms force_corr_gpu_103_gpu
1.45% 3.6914ms 239 15.444us 15.264us 16.704us vloc_psi_k_gpu_456_gpu
1.40% 3.5579ms 2320 1.5330us 1.4080us 13.056us [CUDA memcpy DtoH]
1.36% 3.4570ms 219 15.785us 15.712us 16.352us laxlib_cdiaghg_gpu_317_gpu
1.34% 3.4099ms 159 21.445us 21.280us 23.136us g_psi_gpu_53_gpu
1.28% 3.2424ms 1979 1.6380us 1.2160us 13.120us [CUDA memcpy HtoD]
1.22% 3.0915ms 552 5.6000us 4.2880us 9.0560us void composite_2way_fft<unsigned int=20, unsigned int=4, unsigned int=16, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=5, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>)
1.19% 3.0239ms 239 12.652us 10.816us 14.240us h_psi__gpu_158_gpu
1.14% 2.8893ms 219 13.193us 9.2160us 20.192us void trsm_ln_up_kernel<double2, unsigned int=32, unsigned int=32, unsigned int=4, bool=0>(int, int, double2 const *, int, double2*, int, double2, double2 const *, int, int*)
1.12% 2.8463ms 1095 2.5990us 2.4960us 3.2640us copy_info_kernel(int, int*)
1.06% 2.6975ms 170 15.867us 15.647us 16.544us init_us_2_base_gpu_119_gpu
1.02% 2.5845ms 40 64.612us 64.320us 72.960us stres_us_k_gpu_702_gpu
1.01% 2.5699ms 159 16.162us 16.096us 16.704us reorder_evals_cevecs_707_gpu
0.99% 2.5005ms 40 62.512us 62.240us 70.656us stres_us_k_gpu_817_gpu
0.97% 2.4644ms 159 15.499us 15.232us 16.576us cegterg_gpu_427_gpu
0.96% 2.4360ms 70 34.799us 34.720us 35.424us cegterg_gpu_265_gpu
0.89% 2.2453ms 40 56.131us 55.840us 63.040us stres_knl_gpu_100_gpu
0.86% 2.1855ms 40 54.636us 54.463us 56.832us stres_us_k_gpu_543_gpu
0.82% 2.0773ms 243 8.5480us 7.2320us 11.904us fft_scalar_cufft_cfft3d_gpu_586_gpu
0.82% 2.0749ms 280 7.4100us 7.3280us 7.8720us get_rho_gpu_954_gpu
0.80% 2.0350ms 212 9.5990us 9.4080us 10.016us dp_dev_memcpy_c2d_770_gpu
0.71% 1.7922ms 689 2.6010us 2.4960us 3.7440us void scal_kernel<double2, double2, int=1, bool=1, int=5, int=4, int=4, int=4>(cublasTransposeParams<double2>, double2 const *, double2*, double2 const *)
0.70% 1.7640ms 159 11.094us 10.912us 11.744us cegterg_gpu_376_gpu
0.67% 1.7032ms 508 3.3520us 3.1670us 4.4480us void reduce_1Block_kernel<double, int=128, int=7, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>>(double const *, double, double, int, double const *, double, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>, cublasPointerMode_t, cublasLtEpilogue_t, cublasGemvTensorStridedBatched<biasType<cublasGemvTensorStridedBatched<double>::value_type, double>::type const >)
0.67% 1.7000ms 508 3.3460us 3.1680us 4.8640us void dot_kernel<double, int=128, int=0, cublasDotParams<cublasGemvTensor<double const >, cublasGemvTensorStridedBatched<double>>>(double const )
0.66% 1.6738ms 40 41.843us 41.760us 42.944us stres_us_k_gpu_617_gpu
0.66% 1.6658ms 159 10.476us 10.432us 11.136us reorder_evals_cevecs_700_gpu
0.54% 1.3789ms 219 6.2960us 5.1840us 8.9280us void potrf_alg2_cta_upper<double2, double, int=32>(int, int, double2*, unsigned long, int*)
0.53% 1.3506ms 170 7.9440us 7.8400us 8.6080us init_us_2_base_gpu_134_gpu
0.53% 1.3341ms 438 3.0450us 2.4960us 188.80us void lapack_identity_kernel<double, int=8>(int, int, double*, int)
0.52% 1.3279ms 219 6.0630us 5.0880us 8.6400us void trsm_right_kernel<double2, int=256, int=4, bool=0, bool=0, bool=0, bool=1, bool=0>(cublasTrsmParams<double2>, double2, double2 const *, int)
0.52% 1.3185ms 219 6.0200us 4.3200us 8.2880us void ormql_cta_kernel<double2, int=4, int=1>(int, int, int, double2 const *, unsigned long, double2 const *, double2*, unsigned long, int, int, int, int)
0.52% 1.3185ms 90 14.649us 14.496us 15.072us dylmr2_gpu_78_gpu
0.51% 1.2925ms 209 6.1840us 6.1440us 6.4640us dp_dev_memcpy_r1d_270_gpu
0.50% 1.2803ms 71 18.033us 17.983us 18.687us cegterg_gpu_615_gpu
0.50% 1.2592ms 438 2.8740us 2.7200us 3.8720us void kernel_extract_uplo_A<double2, int=5, int=3>(int, double2 const *, unsigned long, double2*, unsigned long, int)
0.50% 1.2586ms 163 7.7210us 7.5840us 8.0000us dp_dev_memset_c2d_1851_gpu
0.47% 1.1830ms 408 2.8990us 2.4960us 3.7440us __pgi_dev_cumemset_16n
0.47% 1.1818ms 80 14.772us 14.496us 17.216us g2_kin_gpu_40_gpu
0.44% 1.1150ms 169 6.5970us 5.6960us 9.1200us void trsm_left_kernel<double2, int=256, int=4, bool=0, bool=1, bool=1, bool=1, bool=0>(cublasTrsmParams<double2>, double2, double2 const *, int)
0.42% 1.0619ms 52 20.420us 18.944us 27.136us volta_zgemm_32x32_cn
0.42% 1.0610ms 70 15.157us 15.104us 16.032us sum_band_k_gpu_837_gpu
0.40% 1.0224ms 219 4.6680us 4.2240us 5.4720us void lansy_M_stage1<double2, double, int=8>(int, double2 const *, unsigned long, double*, int)
0.40% 1.0046ms 90 11.162us 11.040us 11.488us dylmr2_gpu_90_gpu
0.39% 984.57us 80 12.307us 12.223us 12.928us atomic_wfc___gpu_396_gpu
0.37% 946.72us 80 11.833us 11.744us 12.224us compute_deff_gpu_41_gpu
0.36% 909.82us 689 1.3200us 1.2480us 2.0160us [CUDA memset]
0.34% 856.35us 219 3.9100us 3.8080us 5.6000us void batch_symmetrize_kernel<double2, int=5, int=3>(int, double2*, unsigned long, __int64, int, int)
0.34% 855.00us 30 28.500us 28.352us 29.568us gen_us_dy_gpu_229_gpu
0.33% 842.37us 90 9.3590us 9.2480us 9.8240us dylmr2_gpu_101_gpu
0.33% 827.00us 90 9.1880us 9.0230us 10.048us dylmr2_gpu_60_gpu
0.30% 772.22us 219 3.5260us 3.4870us 4.8000us void lansy_M_stage2<double, int=8>(int, double*)
0.29% 745.95us 30 24.865us 24.831us 25.120us gen_us_dy_gpu_198_gpu
0.28% 703.80us 30 23.460us 23.423us 24.128us gen_us_dy_gpu_146_gpu
0.27% 690.78us 219 3.1540us 3.0720us 3.7120us void lapack_lacpy_kernel<double, int=8>(int, int, double const *, int, double*, int, int, int)
0.27% 685.82us 219 3.1310us 3.0390us 3.6480us void laed0_phase1_kernel<double, int=8>(int, double const *, int, int const *, double*, int, int, int)
0.25% 644.64us 219 2.9430us 2.8800us 3.9040us void stedcx_convert_kernel<double2, double, int=8>(int, int, double const *, int, double2*, int)
0.25% 642.30us 219 2.9320us 2.8800us 3.2960us void lacpy_kernel<double2, double2, int=5, int=3>(int, int, double2 const *, unsigned long, double2*, unsigned long, int, int)
0.25% 623.36us 219 2.8460us 2.8150us 3.2000us potrf_alg2_reset_info(int*)
0.24% 598.37us 219 2.7320us 2.6880us 2.8800us dtrsv_init_up(int*, int)
0.24% 596.93us 219 2.7250us 2.6880us 3.2320us potrf_alg2_set_info(int, int, int*)
0.22% 558.62us 30 18.620us 18.432us 18.911us gen_us_dy_gpu_85_gpu
0.21% 525.28us 70 7.5030us 7.4560us 7.6160us diag_bands_k_693_gpu
0.18% 457.21us 30 15.240us 15.136us 15.968us force_us_gpu_104_gpu
0.18% 456.89us 50 9.1370us 8.9910us 14.144us void trsm_lt_up_kernel<double2, unsigned int=32, unsigned int=32, unsigned int=4, bool=0, bool=1>(int, int, double2 const *, int, double2*, int, double2, double2 const *, int, int*)
0.18% 454.24us 30 15.141us 15.040us 17.024us gen_us_dy_gpu_185_gpu
0.18% 453.47us 70 6.4780us 6.4320us 6.7520us dp_dev_memset_r2d_1431_gpu
0.17% 437.12us 20 21.856us 21.632us 23.712us atomic_wfc_gpu_108_gpu
0.17% 427.58us 20 21.379us 20.992us 23.104us interp_atwfc_gpu_30_gpu
0.15% 381.34us 30 12.711us 12.608us 13.184us gen_us_dy_gpu_102_gpu
0.14% 362.69us 60 6.0440us 5.9510us 6.2720us gen_us_dy_gpu_220_gpu
0.13% 334.53us 78 4.2880us 3.9040us 5.5360us void gemv2N_kernel<int, int, double2, double2, double2, double2, int=128, int=16, int=4, int=4, int=1, bool=0, cublasGemvParams<cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>, double2>>(double2 const )
0.12% 298.91us 1 298.91us 298.91us 298.91us compute_dvloc_gpum_compute_dvloc_gpu_
0.10% 255.07us 10 25.507us 25.280us 27.392us gen_us_dj_gpu_206_gpu
0.10% 248.74us 10 24.873us 24.800us 25.216us gen_us_dj_gpu_173_gpu
0.10% 243.93us 10 24.393us 24.256us 25.440us gen_us_dj_gpu_119_gpu
0.08% 204.67us 30 6.8220us 6.7520us 6.9760us gen_us_dy_gpu_112_gpu
0.08% 198.24us 52 3.8120us 3.5520us 4.9280us void splitKreduce_kernel<double2, double2, double2, double2>(cublasSplitKParams<double2>, double2 const *, double2 const *, double2*, double2 const *, double2 const *, double2 const *)
0.08% 197.82us 52 3.8040us 3.6480us 4.7040us void gemvNSP_kernel<double2, double2, double2, double2, int=1, int=32, int=4, int=1024, cublasGemvParams<cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>, double2>>(double2 const )
0.08% 194.37us 10 19.436us 19.072us 20.832us init_wfc_gpu_295_gpu
0.07% 186.46us 10 18.646us 18.592us 18.816us gen_us_dj_gpu_73_gpu
0.07% 182.18us 10 18.217us 18.176us 18.399us stres_knl_gpu_84_gpu
0.07% 173.02us 20 8.6510us 8.6400us 8.8320us cegterg_gpu_288_gpu
0.07% 172.42us 20 8.6200us 8.5120us 9.0560us stres_us_gpu_131_gpu
0.07% 171.01us 10 17.100us 17.024us 17.376us atomic_wfc_gpu_70_gpu
0.06% 152.13us 10 15.212us 15.071us 16.384us gen_us_dj_gpu_160_gpu
0.05% 137.73us 50 2.7540us 2.7200us 2.9760us dtrsv_init(int*)
0.05% 135.39us 2 67.695us 64.959us 70.432us force_corr_gpu_124_gpu
0.05% 123.78us 20 6.1880us 5.8880us 6.7520us void gemv2T_kernel_val<int, int, double2, double2, double2, double2, int=128, int=16, int=4, int=4, bool=1, bool=0, cublasGemvParams<cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>, double2>>(double2 const , double2, double2)
0.05% 120.93us 20 6.0460us 5.9520us 6.3680us gen_us_dj_gpu_197_gpu
0.04% 103.62us 10 10.361us 10.304us 10.848us stres_us_gpu_91_gpu
0.04% 96.448us 7 13.778us 13.568us 14.176us dfunct_gpum_newd_gpu_311_gpu
0.04% 94.400us 1 94.400us 94.400us 94.400us stres_ewa_gpu_155_gpu
0.03% 72.992us 10 7.2990us 7.1360us 8.4160us init_wfc_gpu_391_gpu
0.03% 72.800us 2 36.400us 34.432us 38.368us force_lc_gpu_119_gpu
0.03% 72.768us 1 72.768us 72.768us 72.768us stres_har_gpu_77_gpu
0.03% 69.888us 10 6.9880us 6.8480us 7.4240us atomic_wfc_gpu_85_gpu
0.03% 67.520us 1 67.520us 67.520us 67.520us stres_loc_gpu_155_gpu
0.02% 59.712us 10 5.9710us 5.8880us 6.2080us rotate_wfc_k_gpu_132_gpu
0.01% 24.384us 6 4.0640us 3.7760us 4.9600us void reduce_1Block_kernel<double2, int=64, int=6, cublasGemvTensorStridedBatched<double2>, cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>>(double2 const *, double2, double2, int, double2 const *, double2, cublasGemvTensorStridedBatched<double2>, double2 const , cublasPointerMode_t, cublasLtEpilogue_t, cublasGemvTensorStridedBatched<biasType<double2 const value_type, double2>::type const >)
0.01% 24.224us 6 4.0370us 3.7760us 4.8960us void dot_kernel<double2, int=64, int=1, cublasDotParams<cublasGemvTensorStridedBatched<double2 const >, cublasGemvTensorStridedBatched<double2>>>(double2 const )
0.01% 21.568us 1 21.568us 21.568us 21.568us stres_loc_gpu_98_gpu
0.01% 15.264us 6 2.5440us 2.4640us 2.8160us __pgi_dev_cumemset_4n
0.00% 9.7280us 1 9.7280us 9.7280us 9.7280us dvloc_of_g_gpu_184_gpu
API calls: 56.54% 877.99ms 1715 511.95us 489ns 409.99ms cudaFree
19.84% 308.14ms 900 342.37us 1.4400us 295.87ms cudaDeviceSynchronize
7.03% 109.13ms 20152 5.4150us 4.5100us 310.44us cudaLaunchKernel
4.31% 66.931ms 1542 43.405us 4.6000us 3.8148ms cudaMemcpy
2.19% 34.061ms 2479 13.739us 3.8100us 180.48us cudaMemcpyAsync
2.12% 32.959ms 2557 12.889us 4.6510us 239.27us cudaEventSynchronize
1.43% 22.244ms 20 1.1122ms 822.92us 2.3907ms cuDeviceTotalMem
1.11% 17.296ms 6645 2.6020us 749ns 186.38us cudaEventRecord
0.93% 14.380ms 1744 8.2450us 1.8290us 1.3001ms cudaMalloc
0.75% 11.621ms 1977 5.8780us 149ns 1.6835ms cuDeviceGetAttribute
0.57% 8.8800ms 20143 440ns 330ns 287.69us cudaDeviceGetAttribute
0.49% 7.6111ms 1656 4.5960us 4.0700us 31.689us cuLaunchKernel
0.33% 5.1501ms 10579 486ns 330ns 239.62us cudaGetDevice
0.29% 4.4656ms 6 744.27us 448.31us 2.1013ms cudaGetDeviceProperties
0.28% 4.4199ms 10835 407ns 150ns 2.2176ms cudaGetLastError
0.25% 3.8660ms 1384 2.7930us 1.8200us 8.4200us cudaStreamSynchronize
0.20% 3.1513ms 689 4.5730us 3.3890us 20.390us cudaMemsetAsync
0.19% 3.0171ms 2557 1.1790us 1.0100us 11.680us cudaEventElapsedTime
0.15% 2.3771ms 256 9.2850us 1.9900us 152.75us cudaSetDevice
0.15% 2.2786ms 1524 1.4950us 780ns 12.790us cudaEventQuery
0.14% 2.1870ms 145 15.083us 7.2200us 21.080us cudaMemcpy2D
0.11% 1.7847ms 147 12.140us 4.5000us 738.97us cudaMallocHost
0.11% 1.7611ms 2336 753ns 469ns 12.960us cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
0.09% 1.3806ms 20 69.028us 41.230us 387.93us cuDeviceGetName
0.09% 1.3584ms 133 10.213us 4.9500us 107.07us cudaMemcpyToSymbol
0.09% 1.3446ms 508 2.6460us 2.2900us 14.350us cudaFuncGetAttributes
0.05% 771.33us 146 5.2830us 3.7500us 20.409us cudaFreeHost
0.04% 625.29us 44 14.211us 1.3800us 205.11us cudaStreamCreate
0.02% 380.08us 552 688ns 510ns 3.6400us cudaStreamIsCapturing
0.02% 359.66us 44 8.1740us 3.8090us 92.571us cudaStreamDestroy
0.01% 195.34us 267 731ns 620ns 15.100us cudaEventCreate
0.01% 170.44us 562 303ns 200ns 1.2400us cuCtxPushCurrent
0.01% 158.23us 562 281ns 200ns 810ns cuCtxPopCurrent
0.01% 116.94us 146 800ns 480ns 2.9910us cudaPointerGetAttributes
0.00% 54.041us 90 600ns 460ns 2.8110us cudaEventCreateWithFlags
0.00% 40.090us 3 13.363us 2.4000us 32.530us cudaStreamCreateWithFlags
0.00% 20.707us 24 862ns 250ns 6.3000us cuDeviceGet
0.00% 18.040us 4 4.5100us 1.8300us 9.0200us cuDeviceGetPCIBusId
0.00% 17.489us 4 4.3720us 2.5690us 9.3200us cuInit
0.00% 16.104us 45 357ns 180ns 1.9900us cudaGetFuncBySymbol
0.00% 13.147us 8 1.6430us 1.3110us 3.2490us cudaEventDestroy
0.00% 5.2070us 20 260ns 150ns 580ns cuDeviceGetUuid
0.00% 3.3580us 7 479ns 230ns 940ns cuDeviceGetCount
0.00% 2.6790us 10 267ns 180ns 360ns cuCtxGetCurrent
0.00% 1.2700us 2 635ns 190ns 1.0800us cudaGetDeviceCount
0.00% 1.1300us 4 282ns 240ns 380ns cuDriverGetVersion
0.00% 920ns 5 184ns 170ns 200ns cuCtxGetDevice
0.00% 309ns 1 309ns 309ns 309ns cudaDriverGetVersion
0.00% 200ns 1 200ns 200ns 200ns cudaRuntimeGetVersion
Compile with ICC
Compile with the Intel icc compiler and the FFTW library.
spack load intel-oneapi-compilers@2021.1.2
spack load intel-parallel-studio@cluster-2020.2
spack load netlib-lapack@3.9.1/nbc
spack load openmpi@4.1.1/jip
./configure --prefix=/home/qe/fftw-3.3.9 F77=ifort CC=icc CFLAGS="-O3 -g -march=native" FFLAGS="-O3 -g" -enable-openmp
make -j 128 all
If the option -march=native is added to FFLAGS, ifort throws an error:
ifort: error #10106: Fatal error in /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/compiler/2021.1.2/linux/bin/intel64/../../bin/intel64/fortcom, terminated by segmentation violation
Tuning with different numbers of MPI processes and OpenMP threads on one node, 32 processes with 8 threads each gave the best performance on the AUSURF112 test case.
PWSCF : 37m 3.31s CPU 4m46.48s WALL
Compile with AOCC
spack load aocc@3.0.0/46t
spack load amdfftw@3.0
spack load openmpi@4.1.1/nqq
export F90=flang
export F77=flang
export FC=flang
export CC=clang
export CXX=clang++
./configure --enable-parallel --enable-openmp CFLAGS="-O3 -g -march=znver2" FFLAGS="-O3 -g -march=znver2" FFT_LIBS="/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3.a /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3_omp.a /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3_threads.a" BLAS_LIBS=/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdblis-3.0-avcgn4ja67j4wz5euv6usv4rt2okvytg/lib/libblis-mt.a LAPACK_LIBS=/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdlibflame-3.0-6tev4j6setn6jmojmydlnz3qi4bn5qrs/lib/libflame.a MPI_LIBS="-L/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/openmpi-4.1.1-nqqearshseiwkncy5roqcqij5dieen3p/lib" DFLAGS="-D__FFTW3 -D__MPI" IFLAGS="-I/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/include -I/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdblis-3.0-avcgn4ja67j4wz5euv6usv4rt2okvytg/include -I/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdlibflame-3.0-6tev4j6setn6jmojmydlnz3qi4bn5qrs/include -I/home/qe/q-e/include"
Pitfall: QE's configure does not recognize flang; F90=flang has to be set in make.inc manually.
This build does not pass the test suite, and the AUSURF112 benchmark does not converge (the errors may come from the libraries).
All done. ERROR: only 166 out of 221 tests passed.
Failed tests in:
/home/qe/q-e/test-suite/pw_b3lyp/
/home/qe/q-e/test-suite/pw_berry/
/home/qe/q-e/test-suite/pw_cluster/
/home/qe/q-e/test-suite/pw_electric/
/home/qe/q-e/test-suite/pw_lda+U/
/home/qe/q-e/test-suite/pw_lsda/
/home/qe/q-e/test-suite/pw_md/
/home/qe/q-e/test-suite/pw_metaGGA/
/home/qe/q-e/test-suite/pw_metal/
/home/qe/q-e/test-suite/pw_noncolin/
/home/qe/q-e/test-suite/pw_pawatom/
/home/qe/q-e/test-suite/pw_realspace/
/home/qe/q-e/test-suite/pw_relax/
/home/qe/q-e/test-suite/pw_scf/
/home/qe/q-e/test-suite/pw_spinorbit/
/home/qe/q-e/test-suite/pw_uspp/
/home/qe/q-e/test-suite/pw_vc-relax/
/home/qe/q-e/test-suite/pw_vdw/
/home/qe/q-e/test-suite/pw_workflow_relax_relax/
/home/qe/q-e/test-suite/pw_workflow_scf_dos/
/home/qe/q-e/test-suite/pw_workflow_vc-relax_dos/
/home/qe/q-e/test-suite/pw_workflow_vc-relax_scf/
starting charge 1230.69946, renormalised to 1232.00000
negative rho (up, down): 3.043E+00 0.000E+00
Starting wfcs are 1008 randomized atomic wfcs
[epyc.node1:216922] 127 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[epyc.node1:216922] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[epyc.node1:216922] 127 more processes have sent help message help-btl-vader.txt / knem permission denied
total cpu time spent up to now is 22.9 secs
Self-consistent Calculation
iteration # 1 ecut= 25.00 Ry beta= 0.70
Davidson diagonalization with overlap
ethr = 1.00E-02, avg # of iterations = 5.0
Threshold (ethr) on eigenvalues was too large:
Diagonalizing with lowered threshold
Davidson diagonalization with overlap
ethr = 4.37E-04, avg # of iterations = 18.5
negative rho (up, down): 2.992E+00 0.000E+00
total cpu time spent up to now is 430.1 secs
total energy = -11423.48971757 Ry
estimated scf accuracy < 6.31636318 Ry
iteration # 2 ecut= 25.00 Ry beta= 0.70
Davidson diagonalization with overlap
ethr = 5.13E-04, avg # of iterations = 15.5
negative rho (up, down): 2.993E+00 0.000E+00
total cpu time spent up to now is 795.7 secs
total energy = -11408.37987998 Ry
estimated scf accuracy < 196.19698446 Ry
End of self-consistent calculation
convergence NOT achieved after 2 iterations: stopping
Writing output data file ./ausurf.save/
[epyc:216930:0:216930] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fc7000)
==== backtrace (tid: 216930) ====
0 /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.3.0/ucx-1.10.1-xby34b5gbwxi5cknbevj4wlbs34hyri6/lib/libucs.so.0(ucs_handle_error+0x254) [0x7fd0b3b587d4]
1 /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.3.0/ucx-1.10.1-xby34b5gbwxi5cknbevj4wlbs34hyri6/lib/libucs.so.0(+0x269b7) [0x7fd0b3b589b7]
2 /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.3.0/ucx-1.10.1-xby34b5gbwxi5cknbevj4wlbs34hyri6/lib/libucs.so.0(+0x26c8e) [0x7fd0b3b58c8e]
3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7fd0b4180730]
4 /home/qe/q-e/bin/pw.x() [0x11e3890]
5 /home/qe/q-e/bin/pw.x() [0x11e3e47]
6 /home/qe/q-e/bin/pw.x() [0x11ef0ce]
7 /home/qe/q-e/bin/pw.x() [0x117a124]
8 /home/qe/q-e/bin/pw.x() [0x9087e0]
9 /home/qe/q-e/bin/pw.x() [0x9085c7]
10 /home/qe/q-e/bin/pw.x() [0x9084f7]
11 /home/qe/q-e/bin/pw.x() [0x906c58]
12 /home/qe/q-e/bin/pw.x() [0x920797]
13 /home/qe/q-e/bin/pw.x() [0x682772]
14 /home/qe/q-e/bin/pw.x() [0x67ca67]
15 /home/qe/q-e/bin/pw.x() [0x6a889f]
16 /home/qe/q-e/bin/pw.x() [0x4c8406]
17 /home/qe/q-e/bin/pw.x() [0x18baa23]
18 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fd0b3fd109b]
19 /home/qe/q-e/bin/pw.x() [0x4c81da]
=================================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node epyc exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Compile with GCC
Specify the mkl libraries manually.
spack load gcc@10.2.0/3xz
spack load openmpi@4.1.1/n46
./configure --enable-parallel --with-scalapack=yes --enable-openmp CFLAGS="-O3 -g -march=znver2" FFLAGS="-O3 -g -march=znver2 -fallow-argument-mismatch" FFT_LIBS="/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3.a \
/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3_omp.a \
/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib/libfftw3_threads.a" \
BLAS_LIBS="/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_gf_lp64.a \
/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_sequential.a \
/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_core.a" \
LAPACK_LIBS=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_lapack95_lp64.a \
SCALAPACK_LIBS="/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_scalapack_ilp64.a \
/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_blacs_openmpi_lp64.a" \
MPI_LIBS="-L/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/openmpi-4.1.1-n46i3ctamj3tnmnd7qfzhabdweajbgsn/lib" \
DFLAGS="-D__FFTW3 -D__MPI -D__SCALAPACK" \
IFLAGS="-I/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/include -I/opt/spack/opt/spack/linux-debian10-zen2/aocc-3.0.0/amdblis-3.0-avcgn4ja67j4wz5euv6usv4rt2okvytg/include -I/home/qe/q-e/include"
Error to be fixed:
/usr/bin/ld: /opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-parallel-studio-cluster-2020.2-wouhr4mlxyn4ye5a5hpoas3s5evum5o3/mkl/lib/intel64/libmkl_core.a(mkl_memory_patched.o): undefined reference to symbol 'dlclose@@GLIBC_2.2.5' /usr/bin/ld: //lib/x86_64-linux-gnu/libdl.so.2: error adding symbols: DSO missing from command line collect2: error: ld returned 1 exit status
Misc
The libraries used in QE when compiled with the Intel compiler:
BLAS_LIBS= -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
SCALAPACK_LIBS=-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64
FFT_LIBS= fftw-3.3.9
init_run : 158.19s CPU 21.00s WALL ( 1 calls)
electrons : 2063.54s CPU 264.73s WALL ( 1 calls)
Called by init_run:
wfcinit : 148.40s CPU 19.08s WALL ( 1 calls)
potinit : 1.84s CPU 0.24s WALL ( 1 calls)
hinit0 : 2.63s CPU 0.50s WALL ( 1 calls)
Called by electrons:
c_bands : 1937.22s CPU 247.62s WALL ( 3 calls)
sum_band : 116.01s CPU 15.64s WALL ( 3 calls)
v_of_rho : 2.32s CPU 0.30s WALL ( 3 calls)
newd : 12.90s CPU 1.87s WALL ( 3 calls)
mix_rho : 0.29s CPU 0.04s WALL ( 3 calls)
Called by c_bands:
init_us_2 : 1.41s CPU 0.29s WALL ( 14 calls)
cegterg : 1931.14s CPU 246.85s WALL ( 6 calls)
Called by *egterg:
cdiaghg : 304.65s CPU 38.94s WALL ( 81 calls)
h_psi : 656.99s CPU 84.10s WALL ( 85 calls)
s_psi : 145.97s CPU 18.38s WALL ( 85 calls)
g_psi : 0.31s CPU 0.05s WALL ( 77 calls)
Called by h_psi:
h_psi:calbec : 183.87s CPU 23.70s WALL ( 85 calls)
vloc_psi : 321.07s CPU 41.10s WALL ( 85 calls)
add_vuspsi : 150.67s CPU 19.07s WALL ( 85 calls)
General routines
calbec : 232.51s CPU 30.03s WALL ( 91 calls)
fft : 3.38s CPU 0.44s WALL ( 40 calls)
ffts : 0.93s CPU 0.15s WALL ( 6 calls)
fftw : 348.65s CPU 44.30s WALL ( 37782 calls)
interpolate : 0.26s CPU 0.03s WALL ( 3 calls)
davcio : 0.04s CPU 0.27s WALL ( 6 calls)
The compiler option -march=native has no significant effect on speed.
Trying to run on two nodes (failed)
spack load intel-parallel-studio@cluster-2020.2
spack load openmpi@4.1.1/jip
spack load ucx/gji
mpirun --prefix /opt/spack/opt/spack/linux-debian10-zen2/intel-2021.1.2/openmpi-4.1.1-jipfb67ngxddcblg4rcsjuu47pskabrs/ -np 64 -hostfile ./hostfile -mca pml ucx -x UCX_TLS=rc_x,sm,self -x UCX_NET_DEVICES=mlx5_0:1 -x PATH -x LD_LIBRARY_PATH --oversubscribe /home/qe/q-e/bin/pw.x < ./ausurf.in
Setting up the remote node for non-interactive logins
Add the following to .bashrc:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nonspack/ucx-1.10.0-gcc/lib
. /opt/spack/share/spack/setup-env.sh
spack load intel-parallel-studio@cluster-2020.2
spack load openmpi@4.1.1/jip
spack load ucx/gji
A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find.
Host: epyc.node2 Framework: pml Component: ucx
Arm Forge MAP Result
Original code compiled by intel compiler with mkl. testcase AUSURF112.
Profiling : /home/qe/q-e/bin/pw.x -i ./ausurf.in
Allinea sampler : preload
MPI implementation : Auto-Detect (Open MPI)
* MPI arguments
* number of processes : 32
* number of nodes : 1
* Allinea MPI wrapper : preload (precompiled)
Input file : <stdin>
Working directory : /home/qe/benchmarks/sb/AUSURF112
Number of OpenMP threads : 8
Queue enabled : No
System config file : /home/qe/.allinea/system.config
OMP_NUM_THREADS (env var) : 8
Full target path : /home/qe/q-e/PW/src/pw.x
Launched from host : epyc.node1
Run started : Sat Aug 28 07:04:24 2021
Sampling started : Sat Aug 28 07:04:24 2021
Sampling stopped : Sat Aug 28 07:09:39 2021
Runtime : 354s
Sampled runtime : 315s
CPU floating-point: 38.2%
CPU memory access: 15.9%
CPU fp vector: 38.0%
CPU branch: 7.4%
Memory usage: 676MB
The pcegterg_IP_ functions spent a lot of time in mpi_barrier synchronization, even more than the actual computation time.
Compile Option
NVHPC
# LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/nvhpc-21.5-qrsvxrpkmqhxy2coxes2qzcfhirsy5uv/Linux_x86_64/21.5/comm_libs/openmpi4/openmpi-4.0.5/lib
spack load nvhpc@21.5/djb
spack load /tyv #hdf5
OneAPI
LD_LIBRARY_PATH=/opt/intel/oneapi/vpl/2021.4.0/lib:/opt/intel/oneapi/tbb/2021.3.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.3.1//libfabric/lib:/opt/intel/oneapi/mpi/2021.3.1//lib/release:/opt/intel/oneapi/mpi/2021.3.1//lib:/opt/intel/oneapi/mkl/2021.3.0/lib/intel64:/opt/intel/oneapi/itac/2021.3.0/slib:/opt/intel/oneapi/ipp/2021.3.0/lib/intel64:/opt/intel/oneapi/ippcp/2021.3.0/lib/intel64:/opt/intel/oneapi/ipp/2021.3.0/lib/intel64:/opt/intel/oneapi/dnnl/2021.3.0/cpu_dpcpp_gpu_dpcpp/lib:/opt/intel/oneapi/debugger/10.1.2/gdb/intel64/lib:/opt/intel/oneapi/debugger/10.1.2/libipt/intel64/lib:/opt/intel/oneapi/debugger/10.1.2/dep/lib:/opt/intel/oneapi/dal/2021.3.0/lib/intel64:/opt/intel/oneapi/compiler/2021.3.0/linux/lib:/opt/intel/oneapi/compiler/2021.3.0/linux/lib/x64:/opt/intel/oneapi/compiler/2021.3.0/linux/lib/emu:/opt/intel/oneapi/compiler/2021.3.0/linux/lib/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2021.3.0/linux/lib/oclfpga/linux64/lib:/opt/intel/oneapi/compiler/2021.3.0/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/ccl/2021.3.0/lib/cpu_gpu_dpcpp:/media/victoryang/NetDisk/Documents/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/nvhpc-21.5-qrsvxrpkmqhxy2coxes2qzcfhirsy5uv/Linux_x86_64/21.5/compilers/lib:/media/victoryang/NetDisk/Documents/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/openssl-1.1.1k-v735mywfwhu5wwrc6rcppju7lxvoxegh/lib:/media/victoryang/NetDisk/Documents/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/zlib-1.2.11-aim3z46oucbopx4jmsvi6rj23psecql5/lib:/media/victoryang/NetDisk/Documents/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/ncurses-6.2-zdp3gdfsnlvphj7kpsgsfk3jvtxvuvz7/lib:/opt/intel/oneapi/mpi/2021.3.1//lib/release/
pitfalls
- https://github.com/MPAS-Dev/MPAS-Model/issues/554
- https://forums.developer.nvidia.com/t/problem-with-nvfortran-and-r/155366
- LibGOMP not IMPLEMENTED: fftw/scalapack/hdf5/elpa are not dependent on the compiler's lib.
performance
nvfortran 21.2-0 LLVM 64-bit target on x86-64 Linux -tp zen
NVIDIA Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
1. GPU single thread
```bash
real 1m51.316s
user 51m9.972s
sys 4m59.190s
- GPU 4 thread
real 1m34.486s
user 2m12.550s
- 4 GPU 4 threads
real 6m26.432s
user 4h20m2.947s
sys 4h24.789s
- 8 GPU 2 node 4 threads
real 4m42.563s
user 1h24m6.227s
sys 2h0m4.267s
MPI + CUDA seems to go through different GPU code paths, and the communication is always the bottleneck.
#pragma acc host_data use_device(s_buf)
MPI_Send(s_buf,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
...
#pragma acc update host(s_buf[0:size])
MPI_Send(s_buf,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
So we are going to try GPU-direct MPI.
#if defined(__GPU_MPI)
ierr = cudaDeviceSynchronize() ! This syncs __GPU_MPI case
CALL bcast_integer_gpu( msg_d, msglen, source, group )
RETURN ! Sync done by MPI call (or inside bcast_xxx_gpu)
But CUBLAS and other GPU code is just fine for one thread.
#if defined(__CUDA)
USE cudafor
USE cublas
#endif
IMPLICIT NONE
SAVE
PRIVATE
REAL(DP) :: one, zero, two, minus_one, minus_two
PARAMETER ( one = 1.0d0, zero = 0.0d0, two = 2.0d0, minus_one = -1.0d0 )
PARAMETER ( minus_two = -2.0d0 )
COMPLEX(DP) :: cone, czero, mcone
PARAMETER ( cone = (1.0d0, 0.0d0), czero = (0.0d0, 0.0d0) )
PARAMETER ( mcone = (-1.0d0, 0.0d0) )
REAL(DP) :: small = 1.0d-14
LOGICAL :: use_parallel_diag
PUBLIC :: sigset
PUBLIC :: tauset
PUBLIC :: rhoset
PUBLIC :: ortho_iterate
PUBLIC :: updatc, calphi_bgrp
PUBLIC :: mesure_diag_perf, mesure_mmul_perf
PUBLIC :: use_parallel_diag
PUBLIC :: bec_bgrp2ortho
REAL(DP), ALLOCATABLE DEVICEATTR :: tmp1(:,:), tmp2(:,:), dd(:,:), tr1(:,:), tr2(:,:)
REAL(DP), ALLOCATABLE DEVICEATTR :: con(:,:), x1(:,:)
CONTAINS
SUBROUTINE allocate_local_arrays(ldx)
INTEGER, INTENT(IN) :: ldx
IF( ALLOCATED( tr1 ) ) THEN
IF( SIZE( tr1, 1 ) /= ldx ) THEN
DEALLOCATE( tmp1, tmp2, dd, x1, con )
DEALLOCATE( tr1, tr2 )
END IF
END IF
IF( .NOT. ALLOCATED( tr1 ) ) THEN
ALLOCATE( tr1(ldx,ldx), tr2(ldx,ldx) )
ALLOCATE( tmp1(ldx,ldx), tmp2(ldx,ldx), dd(ldx,ldx), x1(ldx,ldx), con(ldx,ldx) )
END IF
END SUBROUTINE allocate_local_arrays
SUBROUTINE deallocate_local_arrays()
IF( ALLOCATED( tr1 ) ) DEALLOCATE( tr1 )
IF( ALLOCATED( tr2 ) ) DEALLOCATE( tr2 )
IF( ALLOCATED( tmp1 ) ) DEALLOCATE( tmp1 )
IF( ALLOCATED( tmp2 ) ) DEALLOCATE( tmp2 )
IF( ALLOCATED( dd ) ) DEALLOCATE( dd )
IF( ALLOCATED( x1 ) ) DEALLOCATE( x1 )
IF( ALLOCATED( con ) ) DEALLOCATE( con )
END SUBROUTINE deallocate_local_arrays
SUBROUTINE clear_unused_elements( x, idesc )
!
! Clear elements not involved in the orthogonalization
!
IMPLICIT NONE
REAL(DP) DEVICEATTR :: x(:,:)
INTEGER, INTENT(IN) :: idesc(:)
INTEGER :: nr, nc, i, j
INCLUDE 'laxlib.fh'
IF( idesc(LAX_DESC_ACTIVE_NODE) < 0 ) then
x = 0.0d0
ELSE
nr = idesc(LAX_DESC_NR)
nc = idesc(LAX_DESC_NC)
!$cuf kernel do(2) <<<*,*>>>
do j = nc + 1, SIZE( x, 2 )
do i = 1, SIZE( x, 1 )
x( i, j ) = 0.0d0
end do
end do
!$cuf kernel do(2) <<<*,*>>>
do j = 1, SIZE( x, 2 )
do i = nr + 1, SIZE( x, 1 )
x( i, j ) = 0.0d0
end do
end do
END IF
END SUBROUTINE
ramBLe
turn off hyperthreading
sudo su
echo off > /sys/devices/system/cpu/smt/control
/home/opc/ramBLe
boost 1.70.0 & mvapich2.3.3
Gdrive
wget https://github.com/prasmussen/gdrive/releases/download/2.1.1/gdrive_2.1.1_linux_amd64.tar.gz
tar -zxvf gdrive_2.1.1_linux_amd64.tar.gz
wget https://forensics.cert.org/cert-forensics-tools-release-el7.rpm
sudo rpm -Uvh cert-forensics-tools-release*rpm
sudo yum --enablerepo=forensics install musl-libc -y
Initialize environment variables
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/nfs/cluster/boost_1_70_0/stage/lib
source /home/opc/ramBLe/env.sh
mpi
source /opt/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-redhat7.9-x86_64/hpcx-init.sh
hpcx_load
mpirun -np 4 --display-map --map-by node -x MXM_RDMA_PORTS=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1
run
mpirun -np 144 --display-map --hostfile hostfiles -x MXM_RDMA_PORTS=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 ./ramble -f test/coronary.csv -n 6 -m 1841 -d -o test/coronary.dot
[opc@inst-dahrf-splendid-walrus ramBLe]$ cat hostfiles
hpc-node-1 slots=36
hpc-node-2 slots=36
hpc-node-3 slots=36
hpc-node-4 slots=36
tab is $'\t'
Run the experiment with Python
at /nfs/cluster/ramBle_hpcg
python common/scripts/ramble_experiments.py \
-p 16 -r 1 -a gs -d /nfs/scratch/C1_discretized.tsv -s '\t' -v \
--results result\c1.csv
mpirun -np 144 --display-map --hostfile hostfiles -x MXM_RDMA_PORTS=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 ./ramble -f /nfs/scratch/C1_discretized.tsv -m 29150 -n 5164 -s $'\t' -v -i -d -o test/c1.dot
mpirun -np 1 \
--display-map \
--hostfile hostfiles \
-x MXM_RDMA_PORTS=mlx5_0:1 \
-mca btl_openib_if_include mlx5_0:1 \
-x UCX_NET_DEVICES=mlx5_0:1 \
./ramble -f /nfs/scratch/C1_discretized.tsv -s $'\t' \
-n 29150 -m 5164 \
-c -v -i -d -o test/c1_2.dot >> result/hp_1
mpirun -np 144 \
--hostfile hostfiles \
-x MXM_RDMA_PORTS=mlx5_0:1 \
-mca btl_openib_if_include mlx5_0:1 \
-x UCX_NET_DEVICES=mlx5_0:1 \
./ramble -f test/coronary.csv -s ',' -n 6 -m 1841 -d -o test/coronary.dot
Auto Run script
Gdrive
gdrive download 1UdrvrUPBQRjQafeOn5gHENz9wCrOrX-F # ramBLe_hpcx.tar.gz
gdrive download 1QmW1RF6mvnepQ3hawMNK46MoDRNq8YGx # boost_1_70_0_compiled.tar.gz
Install
Lib
Just add the following code to `SConstruct` to tell scons where the Boost library is.
libPaths.append("/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/boost-1.70.0-m4ttgcfqixwe22z5kz7bpp7mbqdspdbg/lib")
cppPaths.append("/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/boost-1.70.0-m4ttgcfqixwe22z5kz7bpp7mbqdspdbg/include")
Cardioid
Repo: https://github.com/LLNL/cardioid
Build
If you use mfem, you may need to specify its path manually.
Automatic build
The upstream cardioid package has a problem that makes the build hang, so the package recipe has to be patched manually.
First run `spack edit cardioid`; spack will open a text editor. Then add the following at the beginning of the class `Cardioid(CMakePackage)`:
patch('https://gist.githubusercontent.com/KiruyaMomochi/cc4dfde7da51c3b11e45ab1079662693/raw/cardioid-cmake.patch',
sha256='27e2b01a2a181d7364cf786f9da31193407b1aa9c20d0175965a3c772cc7378b')
Then continue the build with `spack -d install -v cardioid`.
Manual build with Spack
Using the fish shell as an example.
source /opt/spack/share/spack/setup-env.fish
spack stage cardioid+cuda
spack cd cardioid+cuda
spack build-env cardioid+cuda fish
Fully manual build
TODO
Troubleshooting
Seg Fault with jemalloc
Happens when -nd >= 4
SIGTERM after finishing the job with -np >= 60
This is some issue in the openmpi@4.1.1/jip build; use Intel MPI instead.
spack load intel-oneapi-compilers@2021.1.2
export F90=ifort
export F77=ifort
export FC=ifort
export CC=icc
export LD_LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mkl/2021.2.0/lib/intel64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/lib:$LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/lib/release_mt:$LD_LIBRARY_PATH
export LIBRARY_PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/lib/release_mt:$LIBRARY_PATH
export PATH=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/bin:$PATH
export MPI_LIBS=-L/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/libc
./configure --enable-parallel --with-scalapack=yes --enable-openmp CFLAGS="-march=core-avx2 -fma -ftz -fomit-frame-pointer -g" FFLAGS="-O3 -march=core-avx2 -align array64byte -fma -ftz -fomit-frame-pointer -g" SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -mkl=parallel -lifcore" IFLAGS="-I/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mkl/2021.2.0/include -I/home/qe/q-e/include -I/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/intel-oneapi-compilers-2021.1.2-7ah54yk3newzc6hdcs3glm63clwyzgs7/mpi/2021.2.0/include"
This version is slower than before. Observations for the AUSURF112 test case: it cannot utilize hyperthreading efficiently; 128 processes bound to cores with OMP_NUM_THREADS=1 is faster than any combination of process count and OMP_NUM_THREADS that uses all the hyperthreads.
About fftw
When we pass FFT_LIBS to the configure script of quantum-espresso 6.6, the FFT-related macros are not defined. If FFTW_INCLUDE is defined, __FFTW is defined. Changing to amdfftw does not affect the running time.
export FFT_LIBS=-L/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/lib
export FFTW_INCLUDE=/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/amdfftw-3.0-di7xmgpsu564qqvfhajkazsnk5kknxwd/include
Debugging a CMake project
To debug a CMake project:
- Find the cmake command from spack's log
- Create a new directory to save the build files
- `cd` to that directory and run the cmake command
- Append `--trace-source=[path to CMakeLists.txt]` to the cmake command
Use the `message` command to print intermediate results. For more information see the CMake documentation.
Spack install takes too long
Use `spack -d install -v [package name]` to print debug logs.
If the problem occurs during the cmake step, it may be caused by the handling of the `interface_link_libraries` function.
That function recursively collects the include paths of every subproject, so the same dependency gets included many times. Because include paths in a spack environment are very long, this generates an extremely long (exponentially growing) include path and cmake appears to hang.
Solution 1
Add a check to the recursive generation:
foreach(lib ${libs})
list(FIND searched ${lib} lib_has_been_searched)
#message(SEND_ERROR "+++ ${lib} ${lib_has_been_searched}")
if (lib_has_been_searched EQUAL -1)
get_recursive_list(recursive_val ${lib} ${prop} ${searched})
foreach(val ${retval})
if(NOT recursive_val)
list(APPEND val ${recursive_val})
else()
if (val IN_LIST recursive_val)
#message("Duplicate val!")
else()
list(APPEND val ${recursive_val})
endif()
endif()
endforeach()
endif()
endforeach()
Solution 2
Use the following patch, which removes duplicates after the recursion.
diff --git a/elec/CMakeLists.txt b/elec/CMakeLists.txt
index 4a526cb..ca92d2d 100644
--- a/elec/CMakeLists.txt
+++ b/elec/CMakeLists.txt
@@ -271,7 +271,7 @@ function(get_recursive_list retvar target prop)
list(APPEND searched ${target})
#message(SEND_ERROR "=== ${target} ${prop} ${searched}")
- set(${retval} "")
+ set(retval "")
get_property(propval TARGET ${target} PROPERTY ${prop} SET)
if (propval)
get_target_property(propval ${target} ${prop})
@@ -288,6 +288,10 @@ function(get_recursive_list retvar target prop)
endif()
endforeach()
+ if(NOT retval)
+ list(REMOVE_DUPLICATES retval)
+ endif()
+
set(${retvar} ${retval} PARENT_SCOPE)
#message(SEND_ERROR "--- ${target} ${prop} ${retval}")
endfunction()
No access to the international Internet
Set the HTTP proxy to `192.168.100.5:1082`, or use `proxychains -q [command]`.
Configure a proxy for git
Set proxy
git config --global http.proxy http://192.168.100.5:1082
Unset proxy
git config --global --unset http.proxy
Rules
Install
# Load gcc and openmpi:
module load gcc-9.2.0
module load mpi
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Accept license agreements and select the
# right install location, probably not your
# home directory since disk quota is limited.
# Activate conda, if you skipped its auto initialization
source /path/to/your/conda/install/bin/activate
# Turn on the conda base environment
conda activate
# Install pytorch
conda install pytorch cudatoolkit=11.3 -c pytorch
# Install build dependencies for larcv3:
conda install cmake hdf5 scikit-build
# Install Tensorflow:
pip install tensorflow
# NOTE: if you don't install tensorflow, you need to pip install numpy!
# Clone larcv and install it:
git clone https://github.com/DeepLearnPhysics/larcv3.git
cd larcv3
git submodule update --init
python setup.py build -j 64
python setup.py install
# Install mpi4py:
pip install --force-reinstall mpi4py --no-cache-dir
# Install horovod with tensorflow or if you want it with pytorch:
pip install --force-reinstall horovod --no-cache-dir
Parameters
mode.optimizer.gradient_accumulation <= 1
mode.optimizer.learning_rate=123.456
mode.optimizer.name = "rmsprop" "adam"
mode.weights_location -> load checkpoint
mode.no_summary_images
run.compute_mode = DPCPP # DPC++ (Data Parallel C++)? Intel MKL-optimized CPU path
gradient_accumulation.....: 1
conf['mode']['optimizer']['learning_rate'] = 10.**random.uniform(-3.5, -2.5)
conf['mode']['optimizer']['loss_balance_scheme'] = random.choice(["none", "light", "focal"])
checkpoint_iteration........: 500
Swept hyperparameters: learning_rate, loss_balance_scheme (an example of setting them from the command line follows).
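These options are Hydra-style overrides, so they can also be set when launching a run. A minimal sketch; the entry point `bin/exec.py` is an assumption, so substitute the application's real launcher:

```bash
# Hydra-style overrides on the command line (entry point is assumed)
python bin/exec.py \
  run.compute_mode=GPU \
  mode.optimizer.name=adam \
  mode.optimizer.learning_rate=0.001 \
  mode.optimizer.gradient_accumulation=1
```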
SCC_21.yml
defaults:
- _self_
- network: SCC_21
- framework: torch
- mode: train
- data: real
data:
downsample: 0
run:
distributed: true
iterations: 500
compute_mode: GPU
aux_minibatch_size: ${run.minibatch_size}
aux_iterations: 10
id: ???
precision: float32
profile: false
output_dir: output/${framework.name}/${network.name}/${run.id}/
minibatch_size: 2
mode:
optimizer: adam
loss_balance_scheme: light
iotest
\[
\text{running number}/\text{iteration} = \text{minibatch}/\text{rank}
\]
\[
\text{throughput} = \frac{\text{all running number}}{\text{running time}}
= \frac{(\text{running number}/\text{iteration}) \times \text{iteration}}{\text{iteration} \times \text{average running time}}
= \frac{\text{minibatch}/\text{rank}}{\text{average running time}}
= \frac{\text{minibatch}}{\text{rank} \times (\text{reading time} + \text{compute time})}
\]
SC 20
Rewind: https://victoryang00.cn/wordpress/2020/11/12/vscc20-%e6%80%bb%e7%bb%93/
Benchmark
This section holds material on the benchmarks used in HPC competitions; the documentation part is used to train new members.
.dat Specs
ASC20
HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
384 256 256
60
how to run
export PATH=/opt/nonspack/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu18.04-x86_64/ompi/bin:/opt/nonspack/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu18.04-x86_64/ompi/tests/osu-micro-benchmarks-5.6.2/:$PATH
export LD_LIBRARY_PATH=/opt/nonspack/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu18.04-x86_64/ompi/lib:$LD_LIBRARY_PATH
source /etc/profile.d/modules.sh
module use /opt/nonspack/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu18.04-x86_64/modulefiles
module load hpcx
mpirun --allow-run-as-root --hostfile host2_gpu4 --mca pml_base_verbose 100 --mca btl_base_verbose 100 --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 --mca orte_base_help_aggregate=0 -x xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80
SC21
HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
256 256 512
1800
how to run
see binder.sh
HPL .dat config file
ASC18
The following is the HPL .dat
configuration file template from ASC18.
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
67200 65280 62976 65280 96000 65280 38400 96000 102400 168960 153600 76800 142848 153600 142848 124416 96256 142848 124416 115200 110592 96256 Ns
1 # of NBs
384 768 384 768 1024 768 896 768 1024 512 384 640 768 896 960 1024 1152 1280 384 640 960 768 640 256 960 512 768 1152 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 1 2 1 Ps
1 2 2 4 Qs
16.0 threshold
1 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
2 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
2 0 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
192 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
ASC20
The following is the HPL .dat
configuration file template from ASC20.
Machine Spec : 8 Tesla V100
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
2 # of problems sizes (N)
175104 178176 165888 168960 172032 175104 Ns
2 # of NBs
384 256 128 256 384 192 288 320 384 384 768 1024 768 896 768 1024 512 384 640 768 896 960 1024 1152 1280 384 640 960 768 640 256 960 512 768 1152 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
4 2 8 1 2 1 Ps
4 8 2 2 4 Qs
16.0 threshold
1 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
2 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
2 0 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
192 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
SC21
The following is the HPL .dat
configuration file template from SC21.
Machine Spec : 8 Tesla A100
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
2 # of problems sizes (N)
346122 348122 352122 Ns
2 # of NBs
384 256 128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
4 2 8 1 2 1 Ps
4 8 2 2 4 Qs
16.0 threshold
1 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
2 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
2 0 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
192 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
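As a rule of thumb (our note, not part of the original templates), the problem size N is chosen so that the matrix fills most of the memory available to the HPL implementation (GPU memory for GPU-accelerated HPL, host memory otherwise), and NB is then tuned around the values listed above:

\[
N \approx \sqrt{\frac{\alpha \times M_{\text{total}}}{8\ \text{bytes}}}, \qquad \alpha \approx 0.8\text{--}0.9
\]

Here \(M_{\text{total}}\) is the aggregate memory in bytes and 8 bytes is the size of one double-precision matrix element.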
Binder
#!/bin/bash
cd $1
# Global settings
export UCX_RNDV_SCHEME=put_zcopy
export UCX_IB_PCI_RELAXED_ORDERING=on
export UCX_MEMTYPE_CACHE=n
export UCX_MAX_RNDV_RAILS=1
export UCX_RNDV_THRESH=8192
APP="$2"
me=`hostname`
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
case ${lrank} in
0)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP"
source ../source.sh
export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
1)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP"
export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
2)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
3)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
4)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
5)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
6)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
7)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
8)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP"
source ../source.sh
export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
9)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP"
export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
10)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
11)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
12)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
13)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
14)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
15)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
16)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP"
source ../source.sh
export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
17)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP"
export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
18)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
19)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
20)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
21)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
22)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
23)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
24)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP"
source ../source.sh
export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
25)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP"
export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
26)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
27)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
28)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
29)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
30)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
31)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
32)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP"
source ../source.sh
export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
33)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP"
export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
34)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
35)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
36)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
37)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
38)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
39)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
40)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP"
source ../source.sh
export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
41)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP"
export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
42)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
43)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
44)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
45)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
46)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
47)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
48)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP"
source ../source.sh
export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
49)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP"
export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
50)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
51)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
52)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
53)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
54)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
55)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
56)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP"
source ../source.sh
export CUDA_VISIBLE_DEVICES=2; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
57)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP"
export CUDA_VISIBLE_DEVICES=3; numactl --cpunodebind=0 taskset -c 0-23 $APP
;;
58)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=7; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
59)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP"
export CUDA_VISIBLE_DEVICES=6; numactl --cpunodebind=2 taskset -c 48-71 $APP
;;
60)
#ldd $APP
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
61)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP"
export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 taskset -c 24-47 $APP
;;
62)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=5; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
63)
echo "host=$me rank= $OMPI_COMM_WORLD_RANK lrank = $lrank cores=$CPU_CORES_PER_RANK bin=$APP"
#set GPU and CPU affinity of local rank
echo "export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP"
export CUDA_VISIBLE_DEVICES=4; numactl --cpunodebind=3 taskset -c 72-95 $APP
;;
esac
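The case statement above repeats the same (GPU, NUMA node, core range) mapping every eight local ranks. A compact equivalent, as a sketch under the assumption that only that mapping matters (the diagnostic echoes are omitted):

```bash
#!/bin/bash
# Compact sketch of binder.sh: local ranks cycle through
# (GPU, NUMA node, core range) tuples with period 8.
cd "$1"
APP="$2"
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
GPUS=(2 3 7 6 0 1 5 4)
NODES=(0 0 2 2 1 1 3 3)
CORES=(0-23 0-23 48-71 48-71 24-47 24-47 72-95 72-95)
idx=$((lrank % 8))
# the original script sources ../source.sh for every 8th local rank
if [ "$idx" -eq 0 ]; then source ../source.sh; fi
export CUDA_VISIBLE_DEVICES=${GPUS[$idx]}
exec numactl --cpunodebind=${NODES[$idx]} taskset -c ${CORES[$idx]} $APP
```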
DevOps
This section holds material related to maintaining HPC environments.
BeeGFS
BeeGFS is a hardware-independent POSIX parallel file system developed with a strong focus on performance and designed for ease of use, simple installation, and management.
Please have a look at BeeGFS Architecture overview before continuing.
ℹ️ Note: For linux kernels 5.x
Currently, the BeeGFS kernel module is not compatible with the Linux kernel 5.x. We need to patch it manually.
Some work has been done by Build kernel module against kernel version 5.8.x and tobydarling/beegfs-7.1.4-kernel-5.6.4.
Installation
Please follow the Quick Start Guide to install.
Here we will only give you additional notes, assuming the operating system is Debian 10.
Step 1: Package Download and Installation
- Find the latest version from the BeeGFS Package Repository.
- Find the link to the repository file; it should be something like `https://www.beegfs.io/release/beegfs_7.2.4/dists/beegfs-deb10.list`, where `7.2.4` is the version number and `deb10` is the distribution name & version.
- Download and save the file to `/etc/apt/sources.list.d/beegfs.list`:
  ```bash
  curl -Lo /etc/apt/sources.list.d/beegfs.list <the download link>
  ```
- Update the package list:
  ```bash
  apt-get update
  ```
- Install the packages from the repository.
  To avoid errors, you should only install the packages you need. For example, you don't need to install `beegfs-mgmtd` if this machine is only a BeeGFS client.
  ```bash
  # only install the package you need!
  # management service
  apt-get install beegfs-mgmtd
  # metadata service; libbeegfs-ib is only required for RDMA
  apt install beegfs-meta libbeegfs-ib
  # storage service; libbeegfs-ib is only required for RDMA
  apt install beegfs-storage libbeegfs-ib
  # client and command-line utils
  apt install beegfs-client beegfs-helperd beegfs-utils
  ```
- For your convenience, consider appending the BeeGFS binary path `/opt/beegfs/sbin/` to `PATH`.
Step 2: Client Kernel Module Autobuild
Since we are using RDMA and installed InfiniBand kernel modules from Mellanox OFED, we should use buildArgs
like this:
# /etc/beegfs/beegfs-client-autobuild.conf
buildArgs=-j8 BEEGFS_OPENTK_IBVERBS=1 OFED_INCLUDE_PATH=/usr/src/ofa_kernel/default/include
Step 3: Basic Configuration
Please read the official guide carefully first, or you will waste a lot of time.
Assume the following configuration:

- `epyc.node1`: management + metadata + storage + client
- `epyc.node2`: storage + client

We also assume you have appended `/opt/beegfs/sbin/` to `PATH`. Otherwise, prepend this path to the commands used below.
Then on node1, the commands are:
# node1
# setup management service
beegfs-setup-mgmtd -p /geekpie/beegfs_mgmtd
# setup metadata service
beegfs-setup-meta -p /geekpie/beegfs_meta -m epyc.node1
# setup storage service
beegfs-setup-storage -p /geekpie/hpc/ -i 101 -m epyc.node1
# setup client
beegfs-setup-client -m epyc.node1
On node2, the commands are:
# node2
# setup storage service
beegfs-setup-storage -p /geekpie/hpc/ -i 201 -m epyc.node2
# setup client
beegfs-setup-client -m epyc.node2
If you have run the setup more than once, please check the configuration files manually, since they may contain errors.
Step 4: Service Setup
With the same assumption as above, we can start the services on node1 and node2:
# node1
# start services
systemctl start beegfs-mgmtd beegfs-meta beegfs-storage beegfs-helperd beegfs-client
# node2
# start services
systemctl start beegfs-storage beegfs-helperd beegfs-client
Step 5: Check Connectivity
We can check the connectivity using these commands:
beegfs-ctl --listnodes --nodetype=meta --nicdetails
beegfs-ctl --listnodes --nodetype=storage --nicdetails
beegfs-ctl --listnodes --nodetype=client --nicdetails
beegfs-net # Displays connections the client is actually using
beegfs-check-servers # Displays possible connectivity of the services
beegfs-df # Displays free space and inodes of storage and metadata targets
Check configuration
You can check the configuration by inspecting the config files located in `/etc/beegfs/`.
Note that if you have set up BeeGFS more than once, you may need to manually fix some configuration files, such as `beegfs-storage.conf`.
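If you later need to tune striping for a workload, `beegfs-ctl` can set the stripe pattern per directory; a sketch (the flags follow the BeeGFS 7.x documentation and the path is a placeholder for your client mount point, so verify against your version):

```bash
# Stripe files created under this directory across 2 storage targets,
# using 1m chunks, then show the resulting pattern
beegfs-ctl --setpattern --numtargets=2 --chunksize=1m /mnt/beegfs/data
beegfs-ctl --getentryinfo /mnt/beegfs/data
```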
Grafana
Data source: telegraf
PBS
PBS stands for Portable Batch System; it is used to control jobs across multiple machines.
Typical usage looks like the following, where `qcmd` stands for any PBS command:
# Name for the job
qcmd -N ramBLe_128
# Name of destination queue
qcmd -q GeekPie_CPU
# Required resources
qcmd -l nodes=4:ppn=32:amd
qcmd -l walltime=00:10:00
# Redirect stdout/stderr
qcmd -o /public/home/geekpie2/ramble-amd/ramBLe/submit/pbs-com-single-${PBS_JOBID}.out
qcmd -e /public/home/geekpie2/ramble-amd/ramBLe/submit/pbs-com-single-${PBS_JOBID}.err
Common commands
- qsub: submit a job or start an interactive shell
- qstat: check job status
  - Use the -f option to show detailed information
  - Use the -Q option followed by a queue name to check queue status
  - For example:
    qstat -Qf GeekPie-CPU
- qdel: delete a job
Common options

| Option | Value | Description |
|---|---|---|
| -q | queue, server, or a queue on a server | Set where the job is executed |
| -N | job name | Set the job name |
| -l | resource list, comma separated | Set the required resources; may be specified multiple times |
| -o | output file | stdout is redirected to this file; an absolute path is recommended |
| -e | error file | stderr is redirected to this file; an absolute path is recommended |
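Putting the options above together, a minimal batch script might look like this (a sketch: the queue, resource line, and paths mirror the examples above, while the mpirun line is illustrative):

```bash
#!/bin/bash
#PBS -N ramBLe_128
#PBS -q GeekPie_CPU
#PBS -l nodes=4:ppn=32:amd
#PBS -l walltime=00:10:00
#PBS -o /public/home/geekpie2/ramble-amd/ramBLe/submit/pbs-com-single.out
#PBS -e /public/home/geekpie2/ramble-amd/ramBLe/submit/pbs-com-single.err

# run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"
mpirun -np 128 ./ramble -f test/coronary.csv -n 6 -m 1841 -d -o test/coronary.dot
```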
References
Slurm
This cluster uses Slurm. For the detailed configuration, see 配合某戏精使用的 slurm 踩坑日记 (our Slurm pitfalls diary).
Singularity
A user-space container runtime from Berkeley, i.e. a place to run Docker-style containers without root.
Kanidm
Kanidm is an identity management server. We use it to manage users across multiple nodes.
We have two groups: a posix group `geekpie-hpc` for everyone, and an admin group `geekpie_admins`.

`geekpie_admins` is used for managing accounts and is a subgroup of:

- `idm_people_manage_priv`: create new persons
- `idm_group_write_priv`: add a person into a group
- `idm_account_unix_extend_priv`: enable posix for a person
- `idm_account_write_priv`: add an ssh key to a person
To begin with, export the environment variable KANIDM_URL and log in with your `geekpie_admins` user.
export KANIDM_URL="https://hpc-idm.geekpie.icu:8443"
kanidm login --name geekpie
Create a user
To create a user called John Smith, and add it to geekpie-hpc
group:
kanidm person create jsmith "John Smith"
kanidm person update jsmith --mail "jsmith@shanghaitech.edu.cn" # --legalname
kanidm group add-members geekpie-hpc jsmith
Then enable posix, set ssh key and password.
# In kanidm the uid is the same as the gid. We recommend manually allocating a gid.
# Please see https://github.com/geekpiehpc/AnsiblePlaybook/blob/main/group_vars/epyc.yml for old uids.
kanidm person posix set jsmith --gidnumber 2345 # --shell /usr/bin/bash
kanidm person ssh add-publickey jsmith id_rsa (cat ~/.ssh/id_rsa.pub)
# Not needed if the user does not need sudo
kanidm person posix set-password jsmith
Install
curl -L -o kanidm.deb https://github.com/kanidm/kanidm/releases/download/latest/kanidm_Ubuntu_22.04_1.1.0-beta.13-2023051108041ddac86_x86_64.deb
curl -L -o kanidm_unixd.deb https://github.com/kanidm/kanidm/releases/download/latest/kanidm-unixd_Ubuntu_22.04_1.1.0-beta.13-2023051108091ddac86_x86_64.deb
sudo dpkg -i kanidm.deb kanidm_unixd.deb
/etc/kanidm/unixd
:
pam_allowed_login_groups = ["geekpie-hpc"]
default_shell = "/usr/bin/bash"
home_alias = "name"
use_etc_skel = true
uid_attr_map = "name"
gid_attr_map = "name"
/etc/kanidm/config
:
uri = "https://hpc-idm.geekpie.icu:8443"
verify_ca = true
verify_hostnames = true
Edit `/usr/share/pam-configs/kanidm-unixd` and change the priority to 0, otherwise you will be asked for the sudo password twice!
Restart services
sudo systemctl restart kanidm-unixd
sudo systemctl restart kanidm-unixd-tasks.service
Setup PAM and nsswitch
PAM
# THIS IS A DIRTY HACK AND IS ACTUALLY AN UPSTREAM PACKAGING PROBLEM
sudo mv /etc/pam.d/kanidm-unixd /usr/share/pam-configs/
sudo pam-auth-update # check kanidm
For nsswitch, edit /etc/nsswitch.conf
:
passwd: files systemd kanidm
group: files systemd [SUCCESS=merge] kanidm
Then add a sudoers file
echo '%geekpie-hpc ALL=(ALL:ALL) ALL' | sudo EDITOR='tee -a' visudo /etc/sudoers.d/geekpie
Add ssh config by creating /etc/ssh/sshd_config.d/60-kanidm.conf
:
AuthorizedKeysCommand /usr/bin/env kanidm_ssh_authorizedkeys %u
AuthorizedKeysCommandUser nobody
Restart sshd service
sudo systemctl restart sshd.service
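To verify the whole chain, you can check that the user resolves through NSS and that the AuthorizedKeysCommand returns the stored key (`jsmith` is the example user created above):

```bash
# Confirm the user resolves via nsswitch/kanidm-unixd
getent passwd jsmith
id jsmith
# Confirm sshd's AuthorizedKeysCommand returns the stored public key
/usr/bin/env kanidm_ssh_authorizedkeys jsmith
```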
The Oracle cluster uses Ansible to manage machine power-on.
We forked the Ansible playbooks above, modified them, and put them on GitHub.
The Oracle cluster came with a telegraf instance deployed by the organizers.
We deployed a Grafana instance with another port and binary to show real-time machine information, but later found that with its latency it is only useful as a history view; the operators also kept losing SSDs, and Ceph over HDD without replicas is really unreliable, so we gave up on it.
The dashboard looked roughly like this:
Machine parameters used at ISC
ISC21
ISC22
NSCC
Used in ISC21
The login node is very old: E5-2690 with CentOS 6. The file system is an old Lustre without flock(), so you have to disable spack's lock check. The scheduling system is OpenPBS.
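One way to disable spack's file locking on such a filesystem (assuming your spack is recent enough to have `spack config add`; the relevant key is `config:locks`):

```bash
# Tell spack not to use file locks (needed on Lustre without flock())
spack config add config:locks:false
```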
$ cat outputfile.o
Checking The CPU and Network
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping: 2
CPU MHz: 1200.000
BogoMIPS: 5187.61
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-5
NUMA node1 CPU(s): 6-11
NUMA node2 CPU(s): 12-17
NUMA node3 CPU(s): 18-23
lspci | grep Mel
81:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
======================================================================================
Resource Usage on 2020-04-25 10:08:30.888043:
JobId: 9954616.wlm01
Project: 21120227
Exit Status: 0
NCPUs Requested: 1 NCPUs Used: 1
CPU Time Used: 00:00:00
Memory Requested: 100mb Memory Used: 0kb
Vmem Used: 0kb
Walltime requested: 00:10:00 Walltime Used: 00:00:00
Execution Nodes Used: (std1708:ncpus=1:mem=102400kb)
======================================================================================
The DGX node is good for its raised TDP on the V100-16GB.
https://help.nscc.sg/wp-content/uploads/AI_System_QuickStart.pdf
[davidcho@nscc03 ~]$ cat !$
cat dgx4105.txt
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2794.907
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
BogoMIPS: 4390.10
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 51200K
NUMA node0 CPU(s): 0-19,40-59
NUMA node1 CPU(s): 20-39,60-79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
total used free shared buff/cache available
Mem: 503 58 340 0 105 442
Swap: 0 0 0
OFED-internal-4.4-2.0.7:
Ubuntu 18.04.2 LTS \n \l
Linux dgx4105 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Filesystem Size Used Avail Use% Mounted on
udev 252G 0 252G 0% /dev
tmpfs 51G 3.2M 51G 1% /run
/dev/sda2 440G 395G 22G 95% /
tmpfs 252G 12K 252G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 252G 0 252G 0% /sys/fs/cgroup
/dev/sda1 487M 6.1M 481M 2% /boot/efi
/dev/sdb1 7.0T 4.9T 1.8T 74% /raid
192.168.160.101:/home 3.4P 2.1P 1.4P 61% /home
192.168.156.29@o2ib,192.168.156.30@o2ib:/scratch 2.8P 1.8P 993T 65% /scratch
tmpfs 51G 0 51G 0% /run/user/0
Sat May 23 06:15:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:0B:00.0 Off | 0 |
| N/A 35C P0 43W / 300W | 0MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
06:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
85:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
86:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
89:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
8a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
0c:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
84:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
8b:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 12.23.1020
node_guid: ec0d:9a03:00a4:bbde
sys_image_guid: ec0d:9a03:00a4:bbde
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2180110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 251
port_lid: 1417
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_3
transport: InfiniBand (0)
fw_ver: 12.23.1020
node_guid: ec0d:9a03:00aa:2960
sys_image_guid: ec0d:9a03:00aa:2960
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2180110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 251
port_lid: 1419
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.23.1020
node_guid: ec0d:9a03:00aa:29b8
sys_image_guid: ec0d:9a03:00aa:29b8
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2180110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 251
port_lid: 1416
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_2
transport: InfiniBand (0)
fw_ver: 12.23.1020
node_guid: ec0d:9a03:00aa:2988
sys_image_guid: ec0d:9a03:00aa:2988
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2180110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 251
port_lid: 1422
port_lmc: 0x00
link_layer: InfiniBand
If NSCC is used again in a competition, contact your NTU or NUS students; they have four dedicated login nodes for logging in to the DGX nodes.
Bridges
Thor
Reference
- Performance Characteristics of the BlueField-2 SmartNIC
- https://developer.nvidia.com/blog/offloading-and-isolating-data-center-workloads-with-bluefield-dpu/
- https://docs.nvidia.com/networking/display/BlueFieldSWv35011563/Virtual+Switch+on+BlueField+DPU
Niagara
Used in ISC21
The login node is the same as the training nodes. Only the Cascade Lake and Ice Lake parts are open; since HPC/HPCG/HPCC require both CPU and GPU, please pin your tasks to those nodes, normally the ones after gia1000.
$ ssh -Y lclmaoroph@niagara.scinet.utoronto.ca
Warning: Permanently added 'niagara.scinet.utoronto.ca' (RSA) to the list of known hosts.
Password:
===============================================================================
SciNet welcomes you to the NIAGARA supercomputer.
This is a Niagara login node. Use this node to develop and compile code,
to run short tests, and to submit computations to the scheduler.
Remember that /scratch is never backed-up.
Documentation: https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart
Support: support@scinet.utoronto.ca or niagara@computecanada.ca
===============================================================================
lclmaoroph@nia-login06:~$
The filesystem is GPFS, an IBM-initiated FS that does not have full POSIX semantics and provides only eventual consistency. But it is really fast for writes and can scale up to 50 PB.
The SCRATCH area is CVFS, a temporary fast cache for scratch scripts.
The node start-up scripts have a quirk: do not put `echo "bla"` in .bashrc, use `echo "bla" 1>&2` instead. PBS relies on bash to figure out who you are, so you might be able to hack the quota or escalate to root. We attempted to use a middleware FS, mounted it from .bashrc on the allocated nodes, and eventually made it work.
Azure
This section holds information about operating the Azure cloud servers.
CycleCloud
Azure CycleCloud is a tool for deploying HPC clusters in Azure and managing their workloads.
For a detailed introduction to CycleCloud, see CycleCloud Introduction. For a step-by-step guide to using CycleCloud, see Create, customize and manage an HPC cluster in Azure with Azure CycleCloud.
I hate to say it... but the best way is still to read the official documentation first, to understand what it can do and how its templates are written... There are quite a few new concepts, and you can learn while doing.
Note, however, that the documentation is incomplete and some of it is outdated. If you want to confirm the latest CycleCloud behavior, the most direct way is to start a machine with it, download the contents of /opt/cycle from that machine, and read the code.
Introduction: ...So what is CycleCloud?
...To explain this, let's start with a bit of history.
CycleCloud originally belonged to Cycle Computing, which was later acquired by Microsoft. Before the acquisition, CycleCloud could be used on many platforms, including Amazon Web Services, Google Compute Engine, and even in-house clusters.
Its job is to help you conveniently manage a pile of HPC resources. For example, suppose I want to start 15 machines on AWS as my HPC cluster: normally I might start them one by one, while a smarter person would write a script and request all the resources at once.
Even so, you probably still have to initialize every machine: configure the NIC, software, users, and so on. There are higher-level tools such as cloud-init that make this initialization easier, but configuring a pile of software with it is still painful. The more modern solution is Ansible: given an inventory file listing all the machines in the cluster, it automatically does all kinds of odd initialization work for you (writing it feels a bit like GitHub Actions).
One more thing: you need to monitor whether these machines are healthy, and remove some of them if they are not. You may also want to autoscale the number of machines based on the workload. Cloud providers do not necessarily do this for you; autoscaling brings k8s to mind, but running HPC on k8s is probably still a bit deadly right now, isn't it?
Having said all that, CycleCloud is exactly this kind of tool: it automates the control of HPC resources in the cloud, so that a few clicks give you a usable, stable HPC cluster.
...😢 In practice it may not be that nice any more, but at least that should be their vision...
Prerequisites: What do I need to know to learn it?
Template
The most important concept in CycleCloud is the template: a file describing all the hardware and software a cluster needs. CycleCloud creates the cluster from this file, and the schedulers, filesystems, etc. you see after clicking the plus button are all templates; you can even find them in the Azure GitHub organization.
Templates use a format that looks like ini but is more advanced. If you have no idea where to start, take a look at toml first.
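A minimal sketch of what a template can look like (the structure follows the public Azure CycleCloud templates; the cluster name, region, machine types and project name are placeholders, and the image URN is the one discussed in the custom-image section below):

[cluster geekpie-demo]

    [[node defaults]]
    Credentials = azure
    Region = southeastasia
    SubnetId = my-rg/my-vnet/compute
    ImageName = microsoft-dsvm:ubuntu-hpc:2004:latest

    [[node scheduler]]
    MachineType = Standard_D4s_v3

    [[nodearray execute]]
    MachineType = Standard_HB120rs_v2

        [[[cluster-init myproject:default:1.0.0]]]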
Cluster-init? Project?
You will notice that the documentation talks about cluster-init or project. The two are synonyms. Do not confuse them with cloud-init, which is unrelated to what we are discussing here; to avoid confusion we will use the word project from now on.
Both cloud-init and project are used for initialization, but cloud-init is lower level and handled by Azure, while project is handled by CycleCloud. In other words, once cloud-init has finished your machine already exists; CycleCloud then performs a second round of initialization, which concretely means running a series of scripts and Chef cookbooks.
While learning CycleCloud you should also look into Chef Infra: peel any project open and you will find Chef cookbooks, which, like Ansible playbooks, are used to initialize machines.
When learning Chef you will find that you are really learning Ruby... there is no way around that. Ruby's syntax takes some getting used to, especially if you have never touched it, and pay particular attention to the Chef version CycleCloud ships (its binary is under /opt/cycle/) so you do not use features the old version lacks.
Beware that the Ruby used by CycleCloud's Chef has trouble with some SSL sites, see https://bugs.ruby-lang.org/issues/15594. If you need to download something, either use http or write your own Chef resource that calls Ruby's ::URI.open with ssl_verify_mode set to OpenSSL::SSL::VERIFY_NONE.
Cloud-Init
CycleCloud supports cloud-init, but only halfway.
Try a MIME multi-part archive and it fails; try a Jinja template and it fails; change the file while the cluster is half-created and the change does not seem to take effect...
Worse, when the CycleCloud dashboard loads, or when the CLI lists all clusters, every cloud-init file seems to be returned as part of the configuration, so after a few long cloud-init files CycleCloud becomes noticeably slower.
For these reasons, use the include-file format and keep the real cloud-init files on another server.
For example, our cloud-init can be written as:
#include-once
https://example.com/kyaru/base.yml
https://example.com/kyaru/sb/head.yml
The cloud-init files behind those URLs can be written however you like, in whatever format you like. A double win!
CycleCloud CLI
Install cli from About page of CycleCloud dashboard.
Custom image
More images are available than CycleCloud's built-in ones, and to make the GPU work we must use an image with Generation 2 support.
Azure HPC VM Images lists all images currently available; choose the one you need and specify it by its URN.
Note: some images are incompatible with CycleCloud's built-in templates. For example, you can't use microsoft-dsvm:ubuntu-hpc:2004: with the Slurm template; we have a custom template to work around this.
Note: Install on Windows
The CycleCloud CLI on Windows depends on the cryptography package, so for the install script to finish, do the following before running it:
- Install OpenSSL: choco install openssl -pre -y
- Append C:\Program Files\OpenSSL-Win64\lib; to the environment variable LIB
- Append C:\Program Files\OpenSSL-Win64\include; to the environment variable INCLUDE
Useful links
SGX Explained
- ifconfig ib0 192.168.*
- change ~/.bashrc for (non-)header conditional setup.
if [ -f /dev/ ]; then #头节点
setup eth .xxx ib() .xxx
fi
- 100 机器 nfs
- slurm 启动开始跑,机器 one by one启动,任务复用、
- 脚本allocate 起停 可以轮流睡觉 轮流slurm
- MIG 启动两套命令/ rmmod nvdia*
- Prometheus + Grafana for all chassis
Performance
echo 2 > /proc/sys/vm/overcommit_memory
ulimit -a ulimited
echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
[ -f "/shared/opt/home/q-e" ] sudo mount 10.0.0.8:/mnt/exports/shared/home /shared/home
...
GeekPie Machine
机器简介
Currently, we are using the SuperMicro 4124GS-TNR server.
- Product Spec
- AS-4124GS-TNR User manual
- H12DSG-O-CPU Motherboard manual
- Information for Lot 9 of ErP (Ecodesign)
CPU: Epyc 7742
AMD claims the theoretical floating-point performance can be calculated as: double-precision peak = #real_cores × 8 DP FLOP/clk × core frequency. For a 2-socket system this is 2 × 64 cores × 8 DP FLOP/clk × 2.2 GHz = 2252.8 GFLOPS (counting an FMA as two FLOPs).
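Written out as a formula (just restating the vendor numbers above):
\[ P_{\text{peak}} = 2 \times 64\ \text{cores} \times 8\ \tfrac{\text{DP FLOP}}{\text{cycle}} \times 2.2\ \text{GHz} = 2252.8\ \text{GFLOPS} \]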
GPU
RDMA
a1:00.0 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
a1:00.1 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
- Official Documention
- IB 卡的通讯协议 https://www.rdmamojo.com/2013/06/01/which-queue-pair-type-to-use/
- OpenMPI 使用 http://scc.ustc.edu.cn/zlsc/user_doc/html/mpi-application/mpi-application.html
RAID
e6:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller [1b4b:9230] (rev 11) (prog-if 01 [AHCI 1.0])
Official brief: https://www.marvell.com/content/dam/marvell/en/public-collateral/storage/marvell-storage-88se92xx-product-brief-2012-04.pdf
To configure the RAID controller, the easiest way is to press Ctrl+M during booting.
If you want to boot a system on RAID, please use Legacy mode. If you switched to UEFI only, you can't find the controller even if you change it back later. To solve it, see Supermicro FAQ Entry
Firmware
It's possible to flash firmware, see Marvell 9230 Firmware Updates and such. Our current firmware is 1070 (BIOS OpROM version). If you want to flash another firmware, you might need to make a FreeDOS bootable disk.
Note: do a backup before flashing!
Many links to firmware or utilities are broken; Station Drivers may still work. Also refer to the Marvell 92xx A1 Firmware Image Repository, which has a full collection of firmware images.
You can find Supermicro's firmware on the official site but you can't download it there. Try downloading from http://members.iinet.net.au/~michaeldd/.
NVMe
Installed with https://www.asus.com/us/Motherboards-Components/Motherboards/Accessories/HYPER-M-2-X16-CARD-V2/.
21:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
22:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
23:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
24:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
⚠️ The PCIe slot holding the NVMe card must be configured for 4x4x4x4 bifurcation so that the drives are recognized correctly.
The card may have problems. If you find it doesn't work correctly, ask in Slack.
Other links
RAID Controller
MegaRAID
LSI_SAS_EmbMRAID_SWUG.pdf 2006 LSI_SAS_EmbMRAID_SWUG.pdf
ASrock
Win-Raid
Help-Problem-to-flash-the-Marvel-SE-card-resolve
Syba-SI-PEX-PCIe-Card-with-Marvell-SATA-Controller
http://members.iinet.net.au/~michaeldd/CDR-A1-UP_1.01_for_Intel_A1_UP_platform.zip
Supermicro superserver bios change cause 960 nvme disappear
https://tinkertry.com/supermicro-superserver-bios-change-can-cause-960-pro-and-evo-to-hide-heres-the-fix
Background knowledge
- Don't do it: consumer-grade solid-state drives (SSD) in Storage Spaces Direct
- PCI Option ROM
- UEFI Validation Option ROM Guidance
Software
Currently we are using Ubuntu Server 20.04.3 LTS.
Boot: systemd-boot
We have replaced grub
with systemd-boot
. For introduction, see systemd-boot - ArchWiki (archlinux.org).
To configure systemd-boot
, use bootctl
.
To change kernel parameters, modify /etc/kernel/postinst.d/zz-update-systemd-boot
.
GitHub backup: https://gist.github.com/KiruyaMomochi/9df313c2abc55c1736d457d48abc0f54
Network: netplan
Since Systemd v197, network interfaces use predictable naming schemes. See systemd.net-naming-scheme (www.freedesktop.org) for detail.
Ubuntu uses netplan to configure the network. It reads network configuration from /etc/netplan/*.yaml, then converts it to systemd-networkd configuration.
Netplan configuration examples: https://netplan.io/examples/.
Drivers
InfiniBand
- Download drivers from Linux InfiniBand Drivers (mellanox.com).
tar -xzf MLNX_OFED_LINUX-5.4-3.1.0.0-ubuntu20.04-x86_64.tgz
cd into the extracted directory, then run
sudo ./mlnxofedinstall --add-kernel-support
Configure IPoIB
For RHEL/CentOS, see IP over InfiniBand (IPoIB) - MLNX_OFED v5.1-0.6.6.0 - Mellanox Docs.
For Ubuntu, create /etc/netplan/10-infiniband.yaml
with:
network:
version: 2
ethernets:
ibp161s0f0: # Name of the InfiniBand interface
addresses:
- 11.4.3.20/24 # Change to your IP address
You may need to change interface name and ip address to your own.
Ansible
To manage the two servers at the same time, it's easier to use Ansible.
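A minimal sketch (the inventory file name and group name are made up; node1/node2 are the hostnames used elsewhere on this page):

# hosts -- a tiny INI inventory
[geekpie]
node1
node2

# check connectivity, then run an ad-hoc command on both nodes
ansible -i hosts geekpie -m ping
ansible -i hosts geekpie -m shell -a "uptime"

For anything repeatable, put the steps into a playbook and run ansible-playbook -i hosts site.yml.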
Network File System (NFS)
NFS is exported from node1
. Only NFS v4 is supported:
/srv/nfs4 *(rw,sync,fsid=0,crossmnt,no_subtree_check)
/srv/nfs4/home *(rw,sync,no_subtree_check)
/proc/fs/nfsd/versions: -2 -3 +4 +4.1 +4.2
It is mounted on all nodes at /mnt/nfs4
and /mnt/home
:
node1:/ /mnt/nfs4 nfs rw,noauto,x-systemd.automount 0 0
node1:/home /mnt/home nfs rw 0 0
You can use /mnt/home/<user>
as your home directory:
# On node 1
sudo mkdir /srv/nfs4/home/<user>
# On all nodes
sudo usermod -d /mnt/home/<user> <user>
Other tools
systemd-nspawn
See systemd-nspawn
Tuning
Enable / Disable SMT (HyperThreading)
Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading.
# From https://docs.kernelcare.com/how-to/
# Check the SMT state
cat /sys/devices/system/cpu/smt/active
#Enable SMT
echo on > /sys/devices/system/cpu/smt/control
#Disable SMT
echo off > /sys/devices/system/cpu/smt/control
Tick-free CPU
When kernel is booted with nohz_full=1-127
set, CPU 1-127 are isolated. Refer CPU Isolation - Nohz_full - by SUSE Labs (part 3) | SUSE Communities for more details.
Also see:
- 3.13. Isolating CPUs Using tuned-profiles-realtime Red Hat Enterprise Linux for Real Time 7 | Red Hat Customer Portal.
- NO_HZ: Reducing Scheduling-Clock Ticks
A full list of kernel parameters is available at https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html.
Set kernel.yama.ptrace_scope
to 0
To apply it temporarily, use the following command:
sudo sysctl -w kernel.yama.ptrace_scope=0
To make it permanent, edit /etc/sysctl.d/10-ptrace.conf.
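For example, the drop-in only needs one line (reload afterwards with sudo sysctl --system):

# /etc/sysctl.d/10-ptrace.conf
kernel.yama.ptrace_scope = 0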
For documentation, see The Linux kernel user’s and administrator’s guide » Linux Security Module Usage - Yama.
Kernel
We use a custom kernel with NOHZ support enabled.
Build Kernel on Debian/Ubuntu
To build kernel, refer to Chapter 4. Common kernel-related tasks (pages.debian.net).
Current kernel config is at /usr/src/linux-headers-$(uname -r)/.config
.
Kernel/BuildYourOwnKernel - Ubuntu Wiki and BuildADebianKernelPackage - Debian Wiki are obsolete, do not use them.
If you don't want to use module signing:
scripts/config --disable MODULE_SIG
scripts/config --disable SYSTEM_TRUSTED_KEYS
Also consider disabling debug info:
scripts/config --disable DEBUG_INFO
systemd-nspawn
systemd-nspawn is like the chroot command, but it is a chroot on steroids. See systemd-nspawn - ArchWiki (archlinux.org) and nspawn - Debian Wiki for introduction.
Bootstrap
We can bootstrap a Debian machine using debootstrap
, but also try mkosi
.
For example, bootstrap a openSUSE image:
python3 -m pip install --user git+git://github.com/systemd/mkosi.git
sudo .local/bin/mkosi -d opensuse -t directory -p systemd-container --checksum --password password -o /var/lib/machines/opensuse-test
RDMA
Install
Although there is no document for systemd-nspawn, we can refer to How-to: Deploy RDMA accelerated Docker container over InfiniBand fabric.
Make sure these tools have the same version as the host.
We only need to install userspace tools into nspawn container without updating firmware:
./mlnxofedinstall --user-space-only --without-fw-update
Edit .nspawn
file
Edit .nspawn
file of the container, which is located at /etc/systemd/nspawn/<machine-name>.nspawn
.
If such a file does not exist, create one.
Then, add following content
[Exec]
Capability=CAP_IPC_LOCK
LimitMEMLOCK=infinity
[Files]
Bind=/dev/infiniband/
Bind=/dev/hugepages
Also consider using the host network by adding
[Network]
VirtualEthernet=no
Add DeviceAllow
Create a drop-in file with the command
sudo systemctl edit systemd-nspawn@<machine-name>
with content of
[Service]
DeviceAllow=/dev/infiniband/uverbs0 rwm
DeviceAllow=/dev/infiniband/uverbs1 rwm
Put all of devices you want to allow there.
Test
Show status with ibstat. Test RDMA with perftest.
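For example (assuming the perftest package is installed in the container; the device name and server hostname are placeholders):

# on the server
ib_write_bw -d mlx5_0
# on the client, pointing at the server
ib_write_bw -d mlx5_0 node1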
If you find that tools like perftest do not work, it may be related to:
- https://gist.github.com/zshi-redhat/c7cfe9e0be63f0330952a28792acff2b
- the limit on memlock, see below for the solution.
Disable the memlock limit
IB tools may fail to allocate memory if the memlock limit is too small.
To show the current memlock limit, use
sudo systemctl show systemd-nspawn@<machine-name> --property LimitMEMLOCK
To disable limit, use
sudo systemctl edit systemd-nspawn@<machine-name>
And add LimitMEMLOCK=infinity
to [Service]
section, then restart your container.
Troubleshooting
No color in terminal
See Arch wiki for "broken colors" problem.
Create file /etc/systemd/system/container-getty@.service.d/term.conf
in container with following contents:
[Service]
Environment=TERM=xterm-256color
Archived
Pages under this path may be outdated and may not reflect the current setup.
Cluster Setup
Warning This is an outdated guide.
完整流程 & 踩坑笔录
机器信息 & 硬件准备
- 节点:4 节点 (node1~4,node1 为 主节点)
- 网络:Ethernet(
192.168.<A>.x
)与 IB(192.168.<B>.x
)- 星状拓扑
- Setup 时主节点需连接外网
- 硬盘:每个节点一块系统盘,主节点额外挂一块 SSD 作为共享存储
- Clonezilla 镜像 U 盘 * 1(镜像直接解压即可,故下述安装时需要 BIOS 设置为 UEFI 模式)
- Clean Minimal CentOS7 镜像 U 盘 * 1(同上)
CentOS 操作系统安装
下载 CentOS-7 Minimal 镜像 于 U 盘,插于主节点
如果主板 BIOS 启动模式不是 UEFI 则勿忘在启动时修改 ;( 主节点需要使用外置 Clonezilla 镜像 U 盘,故也把 U 盘启动顺序置前
主节点开机 “install CentOS 7”
如果 Install 后触发了
dracut-init... timeout
,在之后会跳入dracut
命令行,输入lsblk
后找到 U 盘设备,记下LABEL=AAA
的值,而后reboot
;然后在选择界面按e
,修改第二行中的LABEL=BBB
的第一段 为AAA
,然后ctrl+x
即可 另一种方法是将LABEL=CentOS\x207\x20x\86_64修改为LABEL=CentOS\x207\x20x\8 https://blog.csdn.net/qq_36937234/article/details/82996998 https://access.redhat.com/solutions/2515741
需调整项如下:
- 磁盘分区
/
+/boot
即可,根目录各子目录不分散分区,格为ext4 - 主机名 Hostname 设为
node1
待安装完成,以 root
用户可正常登陆
关闭 SELinux:修改 /etc/selinux/config
,设置 SELINUX=disabled
关闭 Firewall 防火墙:
systemctl stop firewalld.service
systemctl disable firewalld.service
很多问题都会由上述两个安全服务引起,在超算比赛内网环境下无用,先全关闭
以太网连接配置
先配置主节点连接外网,再将各节点内网连接
Ethernet 外网
连接外网以太网线(记住对应网口 <INTERFACE>
,e.g. eno2
)
使用 ip
指令检查 DNS 地址等信息,而后在输入 nmtui
进入 GUI 网络设置界面,设置外网连接为 DHCP 模式,填入 DNS 服务器地址,而后使用 curl
进行校网登录:
$ dhclient -v <INTERFACE>
$ curl -X POST --data “userName=<USERNAME>&password=<PASSWD>&hasValidateCode=false&authLan=zh_CN” https://10.15.44.172:8445/PortalServer//Webauth/webAuthAction\!login.action
此时可以连接外网,需要记录下本机 ip 地址以便远程连接(校网中途不关机则 DHCP ip 地址应该不会改变);curl <URL>
检查外网连接
Ethernet 内网
同样使用 ip
工具看到网关地址等信息,使用 nmtui
GUI 工具对内网网口(e.g. eno1
)进行配置,e.g. 主节点 192.168.<A>.1
驱动下载 & 安装
IB 驱动 & Nvidia 驱动,安装在默认位置(因为共享盘还未配置),故在拷盘前做好
IB 驱动和配置
Nvidia 驱动
yum install kernel-dev epel-release dkms
来添加 Redhat
源 及 其他 Nvidia 驱动依赖
关闭默认 nouveau
显卡驱动:
$ vi /etc/default/grub # `GRUB_CMDLINE_LINUX` 选项中添加 `nouveau.modeset=0`
$ grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
$ reboot
不要用官网的 rpm 包安装驱动 ;(
重启后,在官网查询所用卡对应的最新驱动版本 <VER.SUB>
(e.g. V100 目前最新为 410.79
),获取安装脚本并安装:
$ wget http://us.download.nvidia.com/XFree86/Linux-x86_64/<VER.SUB>/NVIDIA-Linux-x86_64-<VER.SUB>.run
$ bash NVIDIA-Linux-x86_64-<VER.SUB>.run --kernel-source-path /usr/src/kernels/xxx # 若报错加此选项安装试试
试 nvidia-smi
指令看能否获取到显卡信息
克隆创建子节点
先安装必要的基本工具以减少重复工作:yum -y install <TOOL-NAME>
:
- NFS:
nfs-utils rpcbind
- Lmod:
environment-modules
- 其他:
gcc gcc-c++ perl wget
(通过yum
预安装 gcc 用于并行库工具等的编译安装)
主节点关机,插入 Clonezilla 工具盘,从它启动,将主节点系统盘克隆至子节点系统盘内(勿搞错拷贝 Source & Target 盘方向):https://www.tecmint.com/linux-centos-ubuntu-disk-cloning-backup-using-clonezilla
子节点插入系统盘后,分别登陆各子节点,修改主机名和静态 ip 地址(内网网口),以便互联识别身份,注意 4 节点 ip 和 主机名 互不相同:
# e.g. node2 节点
$ hostnamectl set-hostname node2
$ vi /etc/sysconfig/network-scripts/ifcfg-<INTERFACE> #修改 IPADDR=192.168.<A>.2
数据盘 NFS 共享(over IB RDMA)
howto-configure-nfs-over-rdma--roce-x
Maybe useful according to teacher Zhang
opensmd
openibd
数据盘 NFS 共享(over TCP)备选
主节点插上用作共享盘的硬盘,lsblk
查看新硬盘已插上及名称,可看到出现 e.g. sdb1
盘(根据大小判断那个为共享盘,勿搞错)
格式化磁盘流程备忘:
$ fdisk /dev/sdb1
$ n # 新建分区
$ p 1 [Enter] [Enter] # 整个盘建立为一个大主分区
挂载该磁盘并在其中建立欲共享的目录(/home
和 /opt
):
$ mount /dev/nvme0n1 /mnt/nfs
$ mkdir /mnt/nfs/home
$ mkdir /mnt/nfs/opt
主节点启动 NFS server,编辑共享目录配置 /etc/exports
,添加条目(注意不多加空格):
/mnt/nfs/home 192.168.<A>.0/24(rw,no_root_squash,no_all_squash,sync)
/mnt/nfs/opt 192.168.<A>.0/24(rw,no_root_squash,no_all_squash,sync)
参数解释:
rw
:可读写no_*_squash
:客户节点以 * 身份使用时不降级为匿名普通用户sync
:各端的写操作同步至磁盘
开启服务并设置开机自启:
$ exportfs -r
$ service rpcbind start
$ service nfs start
$ chkconfig rpcbind on
$ chkconfig nfs on
设置主节点防火墙允许 NFS 访问请求:
$ firewall-cmd --permanent --add-service=mountd
$ firewall-cmd --permanent --add-service=nfs
$ firewall-cmd --permanent --add-service=rpc-bind
$ firewall-cmd --reload
修改 /etc/fstab
,使主节点将共享目录 bind mount (目录树到目录树挂载) 到 /home
/opt
,子节点由 NFS 将主节点目录挂载:
# On node1,在 /etc/fstab 文件末尾添加
/dev/nvme0n1 /mnt/nfs ext4 rw,user,exec,suid,dev,auto,async
/mnt/nfs/home /home none rw,user,exec,suid,dev,auto,async,bind
/mnt/nfs/opt /opt none rw,user,exec,suid,dev,auto,async,bind
# On node2~4,在 /etc/fstab 文件末尾添加
node1:/mnt/nfs/home /home nfs rw,user,exec,suid,dev,auto,async
node1:/mnt/nfs/home /opt nfs rw,user,exec,suid,dev,auto,async
而后每次开机后,各节点均登入 root 用户,先在主节点 mount -a
,后在各子节点 mount -a
即可成功挂载共享目录
全手动挂载方式备忘,开机后首先在主节点:
$ mount /dev/nvme0n1 /mnt/nfs
$ mount --bind /mnt/nfs/home /home
$ mount --bind /mnt/nfs/opt /opt
而后在各子节点:
$ showmount -e node1 # 检测是否有来自主节点的 nfs 共享
$ mount -t nfs node1:/mnt/nfs/home /home
$ mount -t nfs node1:/mnt/nfs/opt /opt
出现 “Stale file handle” 问题 / “Access denied” 问题,在主节点重启 NFS:
systemctl restart nfs
后再挂载一遍即可
SSH 免密码登录配置
首先配置 root 用户相互 ssh 免密登陆,所有节点对之间均需配置,e.g. 在主节点 /root
下:
$ ssh-keygen # 位置名称默认
$ ssh-copy-id node1 # 自身节点也需拷贝
$ ...
$ ssh-copy-id node4
而后在各自节点 均 创建普通用户,注意 相同名称 & 相同 uid & 相同 group (gid) & 相同密码:
$ useradd <USERNAME> -m
$ passwd <USERNAME>
$ [Type new PASSWORD] [Type again] # 设置密码,不要通过 useradd 的 -p 选项,密码不规范时会失败
密码通过
passwd
指令设置,否则密码不规范时-p
选项可能失败且不会给出提示 按相同顺序创建即是,可以通过cat /etc/passwd
检查
任意节点进入普通用户,生成并拷贝密钥(注意普通用户 Home 目录共享):
$ su testuser
$ cd
$ ssh-keygen
[Enter] [Enter] [Enter]
$ ssh-copy-id localhost
编译器、并行库和环境的安装
环境安装目录文件树放置于 /opt
下
所需环境及安装流程 - 见 “Environment Installation”
环境 Environment Modules 配置
前面已下载过 Lmod 工具;共享盘中 mkdir /opt/modulefiles
作为 modulefile 存储位置,而后在 每个 节点固定 modulefile 搜索路径,于 /etc/environment
中添加行:
export MODULEPATH=/opt/modulefiles
勿忘
source /etc/environment
曾用 modulefile 文件 - 见 “Modulefile Records”
Environment Installation
Warning This is an outdated guide.
环境安装方式 + 目录树位置
安装目录树结构
|- /opt/
|- openmpi/
|- 4.0
|- 3.1
|- ...
|- mpich/
|- intel/ # Intel 全家福
|- blas/
|- gcc/
|- cuda/ # Nvidia CUDA
|- pgi/ # CUDA PGi Edition
|- netcdf/
|- netcdf-c/
|- netcdf-fort/
|- pnetcdf
编译安装基本六连:
$ wget [SOURCE_URL]
$ tar zxvf openmpi-4.0.0.tar.gz
$ cd openmpi-4.0.0/
$ ./configure --prefix=‘/opt/mpi/openmpi/4.0’ # 注意规划好位置
$ make -j8
$ make install
包管理系统
spack Environment Modules 二选一,
spack
基本上就是对系统级的modules
的高层API,从ASC20开始,我们开始使用spack
。为保证目录树nfs共享结构,把spack
放在/opt 目录下。
$ git clone https://github.com/spack/spack.git
$ cd spack/bin
$ ./spack install libelf   # test
$ echo "export PATH=$PATH:/opt/spack/bin" >> ~/.bashrc
$ echo ". /opt/spack/share/spack/setup-env.sh" >> ~/.bashrc
$ bash
依赖以及版本spec
$ spack install intel^gcc@9
When testing, use
$ spack load intel^gcc@9
Alternatively, run module avail to see which module environment needs loading, then module load intel.
To add a new compiler: once you have built and installed a compiler yourself and it can be found in PATH, run spack compiler find; after that you can select it (e.g. %intel) to build further packages, including newer compilers.
Special note: when building and installing mpi and omp, be sure to enable the RDMA option (--with-rdma) so that InfiniBand is supported.
编译器
- gcc (包含 gfortran) - Version 7.4 + 5.5 + 4.9.4 + 4.4.7
CentOS
自带的gcc版本过老,可以使用scl enable devtoolset-9 bash
以支持最新gcc特性。- 7.4:gcc-7.4.0.tar.gz
- 5.5:gcc-5.5.0.tar.gz
- 4.4.7:gcc-4.4.7.tar.gz
- icc & ifort:包含于 Intel Parallel Studio XE 中
Intel Parallel Studio XE 全家桶
Parallel Studio XE:按照 This Procedure 获取和安装,19-20 授权如下
- 序列号 S4ZD-MMZJXJ96 (若失效,可以前往英特尔官网申请,在下方register center,若为spack 安装只需在安装过程中输入即可)
- URL:parallel_studio_xe_2019_update2_cluster_edition.tgz
- LICENSE:官网 Registration Center 下载后传至 Server
icc
ifort
mkl
IntelMPI
均包含于 Parallel Studio XE 中spack install intel
由于cuda只支持编译它的编译器头文件在gcc-7标准以前,所以建议使用intel@18.0.3
MPI
- OpenMPI - Version 4.0 + 3.1 + 3.0 + 2.1
- MPICH - Version 3.3 + 3.2.1
- 3.3:mpich-3.3.tar.gz
- 3.2.1:mpich-3.2.1.tar.gz
- IntelMPI:包含于 Intel Parallel Studio XE 中
Nvidia CUDA
- CUDA Toolkit:
spack install cuda@10.2
以支持不同版本。 - PGi Edition:
spack install pgi@19.10
Math Libraries
- MKL:包含于 Intel Parallel Studio XE 中
- OpenBLAS:
spack install openblas
NetCDF I/O
用于ASC19 的IO500题目中。
Environment Modules
Warning This is an outdated guide.
Environment Module: Modulefiles 目录树结构 + 备份
Modulefiles 目录树结构 (Deprecated)
|- /opt/modulefiles # 路径并非完全按照冲突关系组织,modulefile 中冲突关系要注意
|- mpi/
|- openmpi/
|- 4.0
|- 3.1
|- ...
|- mpich/
|- intelmpi/
|- math/
|- mkl/
|- blas/
|- compilers/
|- gcc/
|- icc/
|- ifort/
|- cuda/
|- nvidia/
|- pgi/
|- netcdf/
|- pnetcdf/
|- netcdf-c/
|- netcdf-fort/
It-Support Machine
机器简介
我们一共拥有四个图信账号,分别是:GeekPieHPC{1, 2, 3, 4}
。 四个账号均位于同一 IP 地址下,请到 Slack 查看机器的 IP 地址。
连接方式
这些账号均可使用 ssh
命令连接,或使用 scp
命令进行文件传输,端口号为 22112
。
四个账号均位于同一 IP 地址下,见 https://geekpiehpc.slack.com/archives/C0210BA22QH/p1631708325019000。
-
密钥: 您可以使用连接到 Epyc 机器的密钥进行登录,示例配置如下
Host geekpie<N>
    HostName <IP address of the target machine>
    User geekpie<N>
    Port 22112
    IdentityFile ~/.ssh/id_rsa_epyc
-
密码: 您可在 Slack 中找到密码。
调度器
图信机器使用 PBS (Portable Batch System) 进行调度,其多数命令以 q
开头,如可以使用 qstat
查看调度器的状态。
CPU 队列为 GeekPie_HPC
,GPU 队列为 GeekPie_GPU
。
PBS 的具体使用方式请看 DevOps/Scheduler。
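A minimal job-script sketch (queue name from above; the resource-selection line and core count are placeholders and may need adjusting to this site's PBS flavor):

#!/bin/bash
#PBS -N hpl-test
#PBS -q GeekPie_HPC
#PBS -l select=1:ncpus=40
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
mpirun ./xhpl

Submit it with qsub job.pbs and watch it with qstat.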
环境管理
在图信机器上,首推使用 module
进行环境变量的管理,不过和编译器打交道时,还是使用 spack 安装编译器。
支持与帮助
可以使用以下方式联系图信
- 微信: 群中的 Saber 为图信联络人。
- 办公室: 图信办公室位置是
H1 304
,H1 楼是医务室那栋楼。
BMC fuck
We think it's worthwhile to reverse the BMC for the following reasons:
- Fine-grained adjustment of GPU power consumption (beyond TDP)
- Could let one machine drive 4 cards (Tsinghua ran on 2 cards last year)
- PCIe device hot-swapping
Also see:
- https://github.com/l4rz/reverse-engineering-dell-idrac-to-get-rid-of-gpu-throttling
Salt Stack
Salt's main job is pushing configuration files around quickly. It can do more than that, though: together with Jinja and LDAP it can serve as a private-key management system.
Since the previous maintainer ran off, the newcomers have to take over.
All of Linux's secrets are in PAM
When you debug the various state machines of system user authentication with journalctl -x, or poke at sshd, you will see words like PAM, XSecurity and sssd; these are the user-authentication plumbing. SSSD is a daemon that wires the system's NSS/PAM machinery up to LDAP. (The Ubuntu 20.04 GNOME break-in a while back was also the result of this protocol being bypassed.)
Once you are familiar with PAM you will also understand why ~/.ssh/id_*.pub needs r.. permissions; that is hard-wired into the NSS user-directory protocol.
Further reading: https://jia.je/software/2021/02/15/sssd-ldap/
ref
- RFC2307 An Approach for Using LDAP as a Network Information Service
- RFC3062 LDAP Password Modify Extended Operation
- RFC4511 Lightweight Directory Access Protocol (LDAP): The Protocol
- RFC4512 Lightweight Directory Access Protocol (LDAP): Directory Information Models
- RFC4513 Lightweight Directory Access Protocol (LDAP): Authentication Methods and Security Mechanisms
- RFC4517 Lightweight Directory Access Protocol (LDAP): Syntaxes and Matching Rules
- RFC4519 Lightweight Directory Access Protocol (LDAP): Schema for User Applications
C++
At last, the language our school specializes in teaching.
C++ 17 & 20
New features are an evergreen topic, and it has been a long time since C++11 introduced lvalue/rvalue references. On Epyc the usual compiler performance ranking is AOCC > Intel > GCC >> LLVM, but MKL still holds an edge over AMD's optimized libraries, so Intel with only the baseline x86 optimizations can sometimes match AOCC. C++17 did a lot for parallelism, e.g. the parallel STL, for_each with execution policies, pmr, Intel's SYCL and NVIDIA's Thrust/CUB; often changing a namespace is enough for a painless speedup. The most important C++20 features are ranges and friends (filesystem actually arrived with C++17). LLVM's support for new standards has always been the slowest. Older icc accepted some ancient extensions such as VLAs but dropped them in newer versions; this kind of unstable feature churn is why the big players hesitate to treat it as a standard.
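As a small taste of the C++17 parallel algorithms mentioned above (a sketch; the parallel execution policies need a toolchain that actually provides them, e.g. g++ -std=c++17 ... -ltbb, or icpc):

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0);
    // y = 3.0 * x + y, run in parallel and vectorization-friendly
    std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(), y.begin(),
                   [](double xi, double yi) { return 3.0 * xi + yi; });
    // parallel reduction
    double sum = std::reduce(std::execution::par, y.begin(), y.end(), 0.0);
    return sum > 0 ? 0 : 1;
}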
This is a problem the author set in a compilers-course assignment: give a semantic rule that rejects the line marked ???. It only seems to be accepted as standard by icc, though.
template<class T>class array{
int s;
T* elements;
public:
array(int n); // allocate "n" elements and let "elements" refer to them
array(T* p, int n); // make this array refer to p[0..n-1]
operator T*(){return elements;}
int size()const{return s;}
// the usual container operations, such as = and [], much like vector
};
void h(array<double>a); //C++
void g(int m,double vla[m]); //C99
void f(int m,double vla1[m],array<double>a1) {
array<double> a2(vla1,m); // a2 refers to vla1
double*p=a1; //p refers to a1's elements
g(m,vla1);
g(a1.size(),a1); // a bit verbose
g(a1); //???
}
On black magic
Any compiler can segfault; very often switching to a different Intel minor version makes the problem go away.
Thoughts for when the build drives you crazy
With CMake, turn on make VERBOSE=1 (or make -n for a dry run) to see the real command lines; do the same for configure-based builds. If you need to inspect macro expansion, pass -E to the compiler to get the preprocessed, expanded source.
Make heavy use of man and --help.
Compiler options
LTO
This targets the call overhead between different libraries and across languages; the machinery is basically LLVM's libLTO and tblgen. In some toolchains it is enabled automatically; the idea is to turn the libraries into LLVM bitcode and link them as one unit. A parallel LTO is not hard to implement either; a former team leader wrote one in Rust as a parallel-computing project, see its source for details.
PGO
Profile-guided optimization feeds the program's actual runtime behavior back to the compiler, which can then rearrange code to reduce instruction-cache misses and branch mispredictions. By measuring which parts of the program really run hottest, the compiler can optimize those parts more specifically.
- Stage 1: add -prof-gen=srcpos -prof-dir=/tmp/profdata to the compile flags; -prof-dir is the directory that stores the profiling files.
- Stage 2: run the compiled program, then run profmerge -prof_dir /tmp/profdata to produce the merged summary file.
- Stage 3: recompile with -prof-use=nomerge -prof-func-groups -prof-dir=/tmp/profdata (the full command sequence is sketched below).
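Putting the three stages together (a sketch with icc; the source and input file names are placeholders):

# stage 1: instrumented build
icc -O2 -prof-gen=srcpos -prof-dir=/tmp/profdata -o app main.c
# stage 2: run a typical workload, then merge the profiles
./app typical_input
profmerge -prof_dir /tmp/profdata
# stage 3: rebuild using the profile
icc -O2 -prof-use=nomerge -prof-func-groups -prof-dir=/tmp/profdata -o app main.c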
IPO
Interprocedural optimization and profile-guided optimization influence each other: PGO usually helps the compiler inline functions, which in turn makes IPO more effective. PGO helps branch prediction the most; many branch probabilities cannot be judged at compile time, and with profile data the compiler can generate efficient code for frequently executed branches (hot code) versus rarely executed ones (cold code).
HLO
Some of these optimizations already show up at the LLVM IR level discussed below, but icc actually has a high-level IR above LLVM IR. According to the documentation it mainly performs:
- Loop Permutation or Interchange
- Loop Distribution
- Loop Fusion
- Loop Unrolling
- Data Prefetching
- Scalar Replacement
- Unroll and Jam
- Loop Blocking or Tiling
- Partial-Sum Optimization
- Predicate Optimization
- Loop Reversal
- Profile-Guided Loop Unrolling
- Loop Peeling
- Data Transformation: Malloc Combining and Memset Combining, - Memory Layout Change
- Loop Rerolling
- Memset and Memcpy Recognition
- Statement Sinking for Creating Perfect Loopnests
- Multiversioning: Checks include Dependency of Memory References, - and Trip Counts
- Loop Collapsing
DOP
If you have written code for a game engine like UE, or kernel structs that need cache alignment, or followed the recent arms race of database optimization, this data layout will look familiar. The key idea is to pack the data into AVX-aligned structs so that all the hot operations are adds and multiplies over those structs. See https://neil3d.github.io/assets/img/ecs/DOD-Cpp.pdf; a small sketch follows.
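A hedged sketch of the idea: switch from an array of structs to a struct of AVX-aligned arrays so that the hot loop touches contiguous, aligned data (the names and sizes here are made up):

#include <cstddef>

constexpr std::size_t N = 1024;

// Array of Structs: the fields of one particle are interleaved in memory.
struct ParticleAoS { float x, y, z, mass; };

// Struct of Arrays: each field is a contiguous, 32-byte-aligned array,
// so a loop over one field is unit-stride and easy to auto-vectorize with AVX.
struct ParticlesSoA {
    alignas(32) float x[N];
    alignas(32) float y[N];
    alignas(32) float z[N];
    alignas(32) float mass[N];
};

void push(ParticlesSoA& p, float dt) {
    for (std::size_t i = 0; i < N; ++i)
        p.x[i] += dt * p.mass[i];  // contiguous loads/stores, no gather needed
}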
LLVM
Both DPC++ and AOCC have started to use LLVM as their middle layer.
Roughly what ICC does
Static analysis with the latest 2021.3.0, using saxpy as the reference kernel. The name means Single-Precision A·X Plus Y; it is a level-1 BLAS routine that is frequently written as a kernel for tuning parameters, registers and the memory model. Its C++ version is as follows.
void saxpy(int n, float a, float * x, float * y)
{
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
The icc output is shown below (full listing at https://godbolt.org/z/j5rrxhedG). It mostly hard-codes all kinds of pre-optimized assembly, in particular the high-address mov fast-fetch pattern. It looks as though intrinsics such as VADDSS __m128 _mm_mask_add_ss (__m128 s, __mmask8 k, __m128 a, __m128 b) (the Intel C/C++ Compiler Intrinsic Equivalent) are compiled in along with the code, as if they were library functions.
.section .text
.LNDBG_TX:
# mark_description "Intel(R) C Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.3.0 Build 2021";
# mark_description "0609_000000";
# mark_description "-g -o /app/output.s -masm=intel -S -gxx-name=/opt/compiler-explorer/gcc-10.1.0/bin/g++ -emit-llvm";
.intel_syntax noprefix
.file "example.cpp"
.text
..TXTST0:
.L_2__routine_start_saxpy(int, float, float*, float*)_0:
# -- Begin saxpy(int, float, float*, float*)
.text
# mark_begin;
.globl saxpy(int, float, float*, float*)
# --- saxpy(int, float, float *, float *)
saxpy(int, float, float*, float*):
# parameter 1(n): edi
# parameter 2(a): xmm0
# parameter 3(x): rsi
# parameter 4(y): rdx
..B1.1: # Preds ..B1.0
# Execution count [0.00e+00]
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
..___tag_value_saxpy(int, float, float*, float*).2:
..L3:
#2.1
..LN0:
.file 1 "/app/example.cpp"
.loc 1 2 is_stmt 1
push rbp #2.1
.cfi_def_cfa_offset 16
..LN1:
mov rbp, rsp #2.1
.cfi_def_cfa 6, 16
.cfi_offset 6, -16
..LN2:
sub rsp, 48 #2.1
..LN3:
mov DWORD PTR [-40+rbp], edi #2.1
..LN4:
movss DWORD PTR [-32+rbp], xmm0 #2.1
..LN5:
mov QWORD PTR [-24+rbp], rsi #2.1
..LN6:
mov QWORD PTR [-16+rbp], rdx #2.1
..LN7:
.loc 1 3 prologue_end is_stmt 1
mov DWORD PTR [-48+rbp], 0 #3.14
..LN8:
# LOE rbx rbp rsp r12 r13 r14 r15 rip
..B1.2: # Preds ..B1.3 ..B1.1
# Execution count [0.00e+00]
..LN9:
mov eax, DWORD PTR [-48+rbp] #3.19
..LN10:
mov edx, DWORD PTR [-40+rbp] #3.23
..LN11:
cmp eax, edx #3.23
..LN12:
jge ..B1.4 # Prob 50% #3.23
..LN13:
# LOE rbx rbp rsp r12 r13 r14 r15 rip
..B1.3: # Preds ..B1.2
# Execution count [0.00e+00]
..LN14:
.loc 1 4 is_stmt 1
movss xmm0, DWORD PTR [-32+rbp] #4.14
..LN15:
mov eax, DWORD PTR [-48+rbp] #4.18
..LN16:
movsxd rax, eax #4.16
..LN17:
imul rax, rax, 4 #4.16
..LN18:
add rax, QWORD PTR [-24+rbp] #4.16
..LN19:
movss xmm1, DWORD PTR [rax] #4.16
..LN20:
mulss xmm0, xmm1 #4.16
..LN21:
mov eax, DWORD PTR [-48+rbp] #4.25
..LN22:
movsxd rax, eax #4.23
..LN23:
imul rax, rax, 4 #4.23
..LN24:
add rax, QWORD PTR [-16+rbp] #4.23
..LN25:
movss xmm1, DWORD PTR [rax] #4.23
..LN26:
addss xmm0, xmm1 #4.23
..LN27:
mov eax, DWORD PTR [-48+rbp] #4.9
..LN28:
movsxd rax, eax #4.7
..LN29:
imul rax, rax, 4 #4.7
..LN30:
add rax, QWORD PTR [-16+rbp] #4.7
..LN31:
movss DWORD PTR [rax], xmm0 #4.7
..LN32:
.loc 1 3 is_stmt 1
mov eax, 1 #3.28
..LN33:
add eax, DWORD PTR [-48+rbp] #3.28
..LN34:
mov DWORD PTR [-48+rbp], eax #3.28
..LN35:
jmp ..B1.2 # Prob 100% #3.28
..LN36:
# LOE rbx rbp rsp r12 r13 r14 r15 rip
..B1.4: # Preds ..B1.2
# Execution count [0.00e+00]
..LN37:
.loc 1 5 epilogue_begin is_stmt 1
leave #5.1
.cfi_restore 6
..LN38:
ret #5.1
..LN39:
# LOE
..LN40:
.cfi_endproc
# mark_end;
.type saxpy(int, float, float*, float*),@function
.size saxpy(int, float, float*, float*),.-saxpy(int, float, float*, float*)
..LNsaxpy(int, float, float*, float*).41:
.LNsaxpy(int, float, float*, float*):
.data
# -- End saxpy(int, float, float*, float*)
.data
.section .note.GNU-stack, ""
// -- Begin DWARF2 SEGMENT .debug_info
.section .debug_info
.debug_info_seg:
.align 1
.4byte 0x000000be
....
The optimized assembly is below. Note that every branch carries a predicted probability, and the loop has been auto-vectorized.
saxpy(int, float, float*, float*):
mov r9, rsi #2.1
test edi, edi #3.23
jle ..B1.36 # Prob 50% #3.23
cmp edi, 6 #3.3
jle ..B1.30 # Prob 50% #3.3
movsxd r8, edi #1.6
mov rax, rdx #4.16
sub rax, r9 #4.16
lea rcx, QWORD PTR [r8*4] #3.3
cmp rax, rcx #3.3
jge ..B1.5 # Prob 50% #3.3
neg rax #4.23
cmp rax, rcx #3.3
jl ..B1.30 # Prob 50% #3.3
..B1.5: # Preds ..B1.4 ..B1.3
cmp edi, 8 #3.3
jl ..B1.38 # Prob 10% #3.3
mov r10, rdx #3.3
and r10, 15 #3.3
test r10d, r10d #3.3
je ..B1.9 # Prob 50% #3.3
test r10d, 3 #3.3
jne ..B1.38 # Prob 10% #3.3
neg r10d #3.3
add r10d, 16 #3.3
shr r10d, 2 #3.3
..B1.9: # Preds ..B1.8 ..B1.6
lea eax, DWORD PTR [8+r10] #3.3
cmp edi, eax #3.3
jl ..B1.38 # Prob 10% #3.3
mov esi, edi #3.3
xor ecx, ecx #3.3
sub esi, r10d #3.3
and esi, 7 #3.3
neg esi #3.3
add esi, edi #3.3
mov eax, r10d #3.3
test r10d, r10d #3.3
jbe ..B1.14 # Prob 9% #3.3
..B1.12: # Preds ..B1.10 ..B1.12
movss xmm1, DWORD PTR [r9+rcx*4] #4.16
mulss xmm1, xmm0 #4.16
addss xmm1, DWORD PTR [rdx+rcx*4] #4.23
movss DWORD PTR [rdx+rcx*4], xmm1 #4.7
inc rcx #3.3
cmp rcx, rax #3.3
jb ..B1.12 # Prob 82% #3.3
..B1.14: # Preds ..B1.12 ..B1.10
lea rcx, QWORD PTR [r9+rax*4] #4.16
test rcx, 15 #3.3
je ..B1.18 # Prob 60% #3.3
movaps xmm1, xmm0 #1.6
shufps xmm1, xmm1, 0 #1.6
movsxd rcx, esi #3.3
..B1.16: # Preds ..B1.16 ..B1.15
movups xmm2, XMMWORD PTR [r9+rax*4] #4.16
movups xmm3, XMMWORD PTR [16+r9+rax*4] #4.16
mulps xmm2, xmm1 #4.16
mulps xmm3, xmm1 #4.16
addps xmm2, XMMWORD PTR [rdx+rax*4] #4.23
addps xmm3, XMMWORD PTR [16+rdx+rax*4] #4.23
movups XMMWORD PTR [rdx+rax*4], xmm2 #4.7
movups XMMWORD PTR [16+rdx+rax*4], xmm3 #4.7
add rax, 8 #3.3
cmp rax, rcx #3.3
jb ..B1.16 # Prob 82% #3.3
jmp ..B1.21 # Prob 100% #3.3
..B1.18: # Preds ..B1.14
movaps xmm1, xmm0 #1.6
shufps xmm1, xmm1, 0 #1.6
movsxd rcx, esi #3.3
..B1.19: # Preds ..B1.19 ..B1.18
movups xmm2, XMMWORD PTR [r9+rax*4] #4.16
movups xmm3, XMMWORD PTR [16+r9+rax*4] #4.16
mulps xmm2, xmm1 #4.16
mulps xmm3, xmm1 #4.16
addps xmm2, XMMWORD PTR [rdx+rax*4] #4.23
addps xmm3, XMMWORD PTR [16+rdx+rax*4] #4.23
movups XMMWORD PTR [rdx+rax*4], xmm2 #4.7
movups XMMWORD PTR [16+rdx+rax*4], xmm3 #4.7
add rax, 8 #3.3
cmp rax, rcx #3.3
jb ..B1.19 # Prob 82% #3.3
..B1.21: # Preds ..B1.19 ..B1.16
lea eax, DWORD PTR [1+rsi] #3.3
cmp eax, edi #3.3
ja ..B1.36 # Prob 50% #3.3
sub r8, rcx #3.3
cmp r8, 4 #3.3
jl ..B1.39 # Prob 10% #3.3
mov eax, r8d #3.3
xor r10d, r10d #3.3
and eax, -4 #3.3
lea rdi, QWORD PTR [rdx+rcx*4] #4.23
movsxd rax, eax #3.3
lea rcx, QWORD PTR [r9+rcx*4] #4.16
..B1.24: # Preds ..B1.24 ..B1.23
movups xmm2, XMMWORD PTR [rcx+r10*4] #4.16
mulps xmm2, xmm1 #4.16
addps xmm2, XMMWORD PTR [rdi+r10*4] #4.23
movups XMMWORD PTR [rdi+r10*4], xmm2 #4.7
add r10, 4 #3.3
cmp r10, rax #3.3
jb ..B1.24 # Prob 82% #3.3
..B1.26: # Preds ..B1.24 ..B1.39
cmp rax, r8 #3.3
jae ..B1.36 # Prob 9% #3.3
movsxd rsi, esi #4.7
lea rcx, QWORD PTR [rdx+rsi*4] #4.23
lea rdx, QWORD PTR [r9+rsi*4] #4.16
..B1.28: # Preds ..B1.28 ..B1.27
movss xmm1, DWORD PTR [rdx+rax*4] #4.16
mulss xmm1, xmm0 #4.16
addss xmm1, DWORD PTR [rcx+rax*4] #4.23
movss DWORD PTR [rcx+rax*4], xmm1 #4.7
inc rax #3.3
cmp rax, r8 #3.3
jb ..B1.28 # Prob 82% #3.3
jmp ..B1.36 # Prob 100% #3.3
..B1.30: # Preds ..B1.4 ..B1.2
mov eax, edi #3.3
mov esi, 1 #3.3
xor ecx, ecx #3.3
shr eax, 1 #3.3
je ..B1.34 # Prob 9% #3.3
..B1.32: # Preds ..B1.30 ..B1.32
movss xmm1, DWORD PTR [r9+rcx*8] #4.16
mulss xmm1, xmm0 #4.16
addss xmm1, DWORD PTR [rdx+rcx*8] #4.23
movss DWORD PTR [rdx+rcx*8], xmm1 #4.7
movss xmm2, DWORD PTR [4+r9+rcx*8] #4.16
mulss xmm2, xmm0 #4.16
addss xmm2, DWORD PTR [4+rdx+rcx*8] #4.23
movss DWORD PTR [4+rdx+rcx*8], xmm2 #4.7
inc rcx #3.3
cmp rcx, rax #3.3
jb ..B1.32 # Prob 63% #3.3
lea esi, DWORD PTR [1+rcx+rcx] #4.7
..B1.34: # Preds ..B1.33 ..B1.30
lea eax, DWORD PTR [-1+rsi] #3.3
cmp eax, edi #3.3
jae ..B1.36 # Prob 9% #3.3
movsxd rsi, esi #3.3
movss xmm1, DWORD PTR [-4+r9+rsi*4] #4.16
mulss xmm0, xmm1 #4.16
addss xmm0, DWORD PTR [-4+rdx+rsi*4] #4.23
movss DWORD PTR [-4+rdx+rsi*4], xmm0 #4.7
..B1.36: # Preds ..B1.28 ..B1.21 ..B1.34 ..B1.38 ..B1.1
ret #5.1
..B1.38: # Preds ..B1.5 ..B1.7 ..B1.9
xor esi, esi #3.3
cmp edi, 1 #3.3
jb ..B1.36 # Prob 50% #3.3
..B1.39: # Preds ..B1.22 ..B1.38
xor eax, eax #3.3
jmp ..B1.26 # Prob 100% #3.3
Below is the LLVM IR emitted by AOCC; it does nothing special at the IR level and is basically the same as what clang emits.
; ModuleID = './a.c'
source_filename = "./a.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
; Function Attrs: noinline nounwind optnone uwtable
define dso_local void @saxpy(i32 %n, float %a, float* %x, float* %y) #0 {
entry:
%n.addr = alloca i32, align 4
%a.addr = alloca float, align 4
%x.addr = alloca float*, align 8
%y.addr = alloca float*, align 8
%i = alloca i32, align 4
store i32 %n, i32* %n.addr, align 4
store float %a, float* %a.addr, align 4
store float* %x, float** %x.addr, align 8
store float* %y, float** %y.addr, align 8
store i32 0, i32* %i, align 4
br label %for.cond
for.cond: ; preds = %for.inc, %entry
%0 = load i32, i32* %i, align 4
%1 = load i32, i32* %n.addr, align 4
%cmp = icmp slt i32 %0, %1
br i1 %cmp, label %for.body, label %for.end
for.body: ; preds = %for.cond
%2 = load float, float* %a.addr, align 4
%3 = load float*, float** %x.addr, align 8
%4 = load i32, i32* %i, align 4
%idxprom = sext i32 %4 to i64
%arrayidx = getelementptr inbounds float, float* %3, i64 %idxprom
%5 = load float, float* %arrayidx, align 4
%mul = fmul float %2, %5
%6 = load float*, float** %y.addr, align 8
%7 = load i32, i32* %i, align 4
%idxprom1 = sext i32 %7 to i64
%arrayidx2 = getelementptr inbounds float, float* %6, i64 %idxprom1
%8 = load float, float* %arrayidx2, align 4
%add = fadd float %mul, %8
%9 = load float*, float** %y.addr, align 8
%10 = load i32, i32* %i, align 4
%idxprom3 = sext i32 %10 to i64
%arrayidx4 = getelementptr inbounds float, float* %9, i64 %idxprom3
store float %add, float* %arrayidx4, align 4
br label %for.inc
for.inc: ; preds = %for.body
%11 = load i32, i32* %i, align 4
%inc = add nsw i32 %11, 1
store i32 %inc, i32* %i, align 4
br label %for.cond
for.end: ; preds = %for.cond
ret void
}
attributes #0 = { noinline nounwind optnone uwtable "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="false" "use-soft-float"="false" }
!llvm.module.flags = !{!0}
!llvm.ident = !{!1}
!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"AMD clang version 12.0.0 (CLANG: AOCC_3.0.0-Build#78 2020_12_10) (based on LLVM Mirror.Version.12.0.0)"}
The vectorized part is essentially the same as icc's, but without the branch-probability annotations; unfortunately the cost model behind those probabilities targets Intel processors, so in the end icc and AOCC come out about even.
.text
.file "a.c"
.globl saxpy # -- Begin function saxpy
.p2align 4, 0x90
.type saxpy,@function
saxpy: # @saxpy
.cfi_startproc
# %bb.0: # %entry
testl %edi, %edi
jle .LBB0_16
# %bb.1: # %for.body.preheader
movl %edi, %r9d
cmpl $7, %edi
jbe .LBB0_2
# %bb.7: # %vector.memcheck
leaq (%rsi,%r9,4), %rax
cmpq %rdx, %rax
jbe .LBB0_9
# %bb.8: # %vector.memcheck
leaq (%rdx,%r9,4), %rax
cmpq %rsi, %rax
jbe .LBB0_9
.LBB0_2:
xorl %ecx, %ecx
.LBB0_3: # %for.body.preheader23
movq %rcx, %rax
notq %rax
testb $1, %r9b
je .LBB0_5
# %bb.4: # %for.body.prol
movss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero
mulss %xmm0, %xmm1
addss (%rdx,%rcx,4), %xmm1
movss %xmm1, (%rdx,%rcx,4)
orq $1, %rcx
.LBB0_5: # %for.body.prol.loopexit
addq %r9, %rax
je .LBB0_16
.p2align 4, 0x90
.LBB0_6: # %for.body
# =>This Inner Loop Header: Depth=1
movss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero
mulss %xmm0, %xmm1
addss (%rdx,%rcx,4), %xmm1
movss %xmm1, (%rdx,%rcx,4)
movss 4(%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero
mulss %xmm0, %xmm1
addss 4(%rdx,%rcx,4), %xmm1
movss %xmm1, 4(%rdx,%rcx,4)
addq $2, %rcx
cmpq %rcx, %r9
jne .LBB0_6
jmp .LBB0_16
.LBB0_9: # %vector.ph
movl %r9d, %ecx
andl $-8, %ecx
movaps %xmm0, %xmm1
shufps $0, %xmm0, %xmm1 # xmm1 = xmm1[0,0],xmm0[0,0]
leaq -8(%rcx), %rax
movq %rax, %r8
shrq $3, %r8
addq $1, %r8
testq %rax, %rax
je .LBB0_10
# %bb.11: # %vector.ph.new
movq %r8, %rax
andq $-2, %rax
negq %rax
xorl %edi, %edi
.p2align 4, 0x90
.LBB0_12: # %vector.body
# =>This Inner Loop Header: Depth=1
movups (%rsi,%rdi,4), %xmm2
movups 16(%rsi,%rdi,4), %xmm3
mulps %xmm1, %xmm2
mulps %xmm1, %xmm3
movups (%rdx,%rdi,4), %xmm4
addps %xmm2, %xmm4
movups 16(%rdx,%rdi,4), %xmm2
addps %xmm3, %xmm2
movups 32(%rdx,%rdi,4), %xmm3
movups 48(%rdx,%rdi,4), %xmm5
movups %xmm4, (%rdx,%rdi,4)
movups %xmm2, 16(%rdx,%rdi,4)
movups 32(%rsi,%rdi,4), %xmm2
movups 48(%rsi,%rdi,4), %xmm4
mulps %xmm1, %xmm2
addps %xmm3, %xmm2
mulps %xmm1, %xmm4
addps %xmm5, %xmm4
movups %xmm2, 32(%rdx,%rdi,4)
movups %xmm4, 48(%rdx,%rdi,4)
addq $16, %rdi
addq $2, %rax
jne .LBB0_12
# %bb.13: # %middle.block.unr-lcssa
testb $1, %r8b
je .LBB0_15
.LBB0_14: # %vector.body.epil
movups (%rsi,%rdi,4), %xmm2
movups 16(%rsi,%rdi,4), %xmm3
mulps %xmm1, %xmm2
mulps %xmm1, %xmm3
movups (%rdx,%rdi,4), %xmm1
addps %xmm2, %xmm1
movups 16(%rdx,%rdi,4), %xmm2
addps %xmm3, %xmm2
movups %xmm1, (%rdx,%rdi,4)
movups %xmm2, 16(%rdx,%rdi,4)
.LBB0_15: # %middle.block
cmpq %r9, %rcx
jne .LBB0_3
.LBB0_16: # %for.cond.cleanup
retq
.LBB0_10:
xorl %edi, %edi
testb $1, %r8b
jne .LBB0_14
jmp .LBB0_15
.Lfunc_end0:
.size saxpy, .Lfunc_end0-saxpy
.cfi_endproc
# -- End function
.ident "AMD clang version 12.0.0 (CLANG: AOCC_3.0.0-Build#78 2020_12_10) (based on LLVM Mirror.Version.12.0.0)"
.section ".note.GNU-stack","",@progbits
.addrsig
Another test on NVHPC. You can actually steer the CPU back end much like AOCC by using nvc -march=zen2 -Mvect=simd:256 -Mcache_align -fma -S a.c.
; ModuleID = 'a.c'
target datalayout = "e-p:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"
define internal void @pgCplus_compiled.() noinline {
L.entry:
ret void
}
define void @saxpy(i32 signext %n.arg, float %a.arg, float* %x.arg, float* %y.arg) #0 !dbg !17 {
L.entry:
%n.addr = alloca i32, align 4
%a.addr = alloca float, align 4
%x.addr = alloca float*, align 8
%y.addr = alloca float*, align 8
%.ndi0002.addr = alloca i32, align 4
%.ndi0003.addr = alloca i32, align 4
%.vv0000.addr = alloca i8*, align 8
%.vv0001.addr = alloca i8*, align 8
%.vv0002.addr = alloca i8*, align 8
%.r1.0148.addr = alloca <8 x float>, align 4
%.lcr010001.addr = alloca i32, align 4
store i32 %n.arg, i32* %n.addr, align 4, !tbaa !29
store float %a.arg, float* %a.addr, align 4, !tbaa !29
store float* %x.arg, float** %x.addr, align 8, !tbaa !30
store float* %y.arg, float** %y.addr, align 8, !tbaa !30
%0 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23
%1 = icmp sle i32 %0, 0, !dbg !23
br i1 %1, label %L.B0005, label %L.B0014, !dbg !23
L.B0014:
%2 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23
%3 = bitcast float* %2 to i8*, !dbg !23
%4 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23
%5 = bitcast float* %4 to i8*, !dbg !23
%6 = ptrtoint i8* %5 to i64, !dbg !23
%7 = sub i64 0, %6, !dbg !23
%8 = getelementptr i8, i8* %3, i64 %7, !dbg !23
%9 = icmp ule i8* %8, null, !dbg !23
br i1 %9, label %L.B0008, label %L.B0015, !dbg !23
L.B0015:
%10 = bitcast float* %2 to i8*, !dbg !23
%11 = bitcast float* %4 to i8*, !dbg !23
%12 = ptrtoint i8* %11 to i64, !dbg !23
%13 = sub i64 0, %12, !dbg !23
%14 = getelementptr i8, i8* %10, i64 %13, !dbg !23
%15 = inttoptr i64 32 to i8*, !dbg !23
%16 = icmp ult i8* %14, %15, !dbg !23
br i1 %16, label %L.B0007, label %L.B0008, !dbg !23
L.B0008:
store i32 0, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
%17 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23
store i32 %17, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%18 = icmp slt i32 %17, 8, !dbg !23
br i1 %18, label %L.B0011, label %L.B0016, !dbg !23
L.B0016:
store i8* null, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23
%19 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23
%20 = bitcast float* %19 to i8*, !dbg !23
store i8* %20, i8** %.vv0001.addr, align 8, !tbaa !30, !dbg !23
%21 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23
%22 = bitcast float* %21 to i8*, !dbg !23
store i8* %22, i8** %.vv0002.addr, align 8, !tbaa !30, !dbg !23
%23 = sub i32 %17, 7, !dbg !23
store i32 %23, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%24 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23
%25 = insertelement <8 x float> undef, float %24, i32 0, !dbg !23
%26 = shufflevector <8 x float> %25, <8 x float> undef, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>, !dbg !23
store <8 x float> %26, <8 x float>* %.r1.0148.addr, align 1, !tbaa !29, !dbg !23
br label %L.B0012
L.B0012:
%27 = load <8 x float>, <8 x float>* %.r1.0148.addr, align 4, !tbaa !29, !dbg !23
%28 = load i8*, i8** %.vv0002.addr, align 8, !tbaa !30, !dbg !23
%29 = load i8*, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23
%30 = ptrtoint i8* %29 to i64, !dbg !23
%31 = getelementptr i8, i8* %28, i64 %30, !dbg !23
%32 = bitcast i8* %31 to <8 x float>*, !dbg !23
%33 = load <8 x float>, <8 x float>* %32, align 4, !tbaa !29, !dbg !23
%34 = load i8*, i8** %.vv0001.addr, align 8, !tbaa !30, !dbg !23
%35 = getelementptr i8, i8* %34, i64 %30, !dbg !23
%36 = bitcast i8* %35 to <8 x float>*, !dbg !23
%37 = load <8 x float>, <8 x float>* %36, align 4, !tbaa !29, !dbg !23
%38 = call <8 x float> @llvm.fma.v8f32 (<8 x float> %27, <8 x float> %33, <8 x float> %37), !dbg !23
store <8 x float> %38, <8 x float>* %36, align 1, !tbaa !29, !dbg !23
%39 = getelementptr i8, i8* %29, i64 32, !dbg !23
store i8* %39, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23
%40 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%41 = sub i32 %40, 8, !dbg !23
store i32 %41, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%42 = icmp sgt i32 %41, 0, !dbg !23
br i1 %42, label %L.B0012, label %L.B0017, !llvm.loop !24, !dbg !23
L.B0017:
%43 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%44 = add i32 %43, 7, !dbg !23
store i32 %44, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%45 = icmp eq i32 %44, 0, !dbg !23
br i1 %45, label %L.B0013, label %L.B0018, !dbg !23
L.B0018:
%46 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23
%47 = and i32 %46, -8, !dbg !23
store i32 %47, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
br label %L.B0011
L.B0011:
%48 = load i32, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
%49 = sext i32 %48 to i64, !dbg !23
%50 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23
%51 = getelementptr float, float* %50, i64 %49, !dbg !23
%52 = load float, float* %51, align 4, !tbaa !29, !dbg !23
%53 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23
%54 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23
%55 = getelementptr float, float* %54, i64 %49, !dbg !23
%56 = load float, float* %55, align 4, !tbaa !29, !dbg !23
%57 = call float @llvm.fma.f32 (float %53, float %56, float %52), !dbg !23
store float %57, float* %51, align 4, !tbaa !29, !dbg !23
%58 = add i32 %48, 1, !dbg !23
store i32 %58, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
%59 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%60 = sub i32 %59, 1, !dbg !23
store i32 %60, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%61 = icmp sgt i32 %60, 0, !dbg !23
br i1 %61, label %L.B0011, label %L.B0013, !llvm.loop !24, !dbg !23
L.B0013:
br label %L.B0009, !dbg !23
L.B0007:
store i32 0, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
%62 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23
store i32 %62, i32* %.lcr010001.addr, align 4, !tbaa !32, !dbg !23
br label %L.B0010
L.B0010:
%63 = load i32, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
%64 = sext i32 %63 to i64, !dbg !23
%65 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23
%66 = getelementptr float, float* %65, i64 %64, !dbg !23
%67 = load float, float* %66, align 4, !tbaa !29, !dbg !23
%68 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23
%69 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23
%70 = getelementptr float, float* %69, i64 %64, !dbg !23
%71 = load float, float* %70, align 4, !tbaa !29, !dbg !23
%72 = call float @llvm.fma.f32 (float %68, float %71, float %67), !dbg !23
store float %72, float* %66, align 4, !tbaa !29, !dbg !23
%73 = add i32 %63, 1, !dbg !23
store i32 %73, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
%74 = load i32, i32* %.lcr010001.addr, align 4, !tbaa !32, !dbg !23
%75 = icmp slt i32 %73, %74, !dbg !23
br i1 %75, label %L.B0010, label %L.B0009, !dbg !23
L.B0009:
br label %L.B0005
L.B0005:
ret void, !dbg !26
}
declare float @llvm.fma.f32(float, float, float)
declare <8 x float> @llvm.fma.v8f32(<8 x float>, <8 x float>, <8 x float>)
declare i32 @__gxx_personality_v0(...)
; Named metadata
!llvm.module.flags = !{ !1, !2 }
!llvm.dbg.cu = !{ !10 }
; Metadata
!1 = !{ i32 2, !"Dwarf Version", i32 4 }
!2 = !{ i32 2, !"Debug Info Version", i32 3 }
!3 = !DIFile(filename: "a.c", directory: "/home/victoryang")
; !4 = !DIFile(tag: DW_TAG_file_type, pair: !3)
!4 = !{ i32 41, !3 }
!5 = !{ }
!6 = !{ }
!7 = !{ !17 }
!8 = !{ }
!9 = !{ }
!10 = distinct !DICompileUnit(file: !3, language: DW_LANG_C_plus_plus, producer: " NVC++ 21.5-0", enums: !5, retainedTypes: !6, globals: !8, emissionKind: FullDebug, imports: !9)
!11 = !DIBasicType(tag: DW_TAG_base_type, name: "int", size: 32, align: 32, encoding: DW_ATE_signed)
!12 = !DIBasicType(tag: DW_TAG_base_type, name: "float", size: 32, align: 32, encoding: DW_ATE_float)
!13 = !DIDerivedType(tag: DW_TAG_pointer_type, size: 64, align: 64, baseType: !12)
!14 = !{ null, !11, !12, !13, !13 }
!15 = !DISubroutineType(types: !14)
!16 = !{ }
!17 = distinct !DISubprogram(file: !3, scope: !10, name: "saxpy", line: 2, type: !15, spFlags: 8, unit: !10, scopeLine: 2)
!18 = !DILocation(line: 2, column: 1, scope: !17)
!19 = !DILexicalBlock(file: !3, scope: !17, line: 2, column: 1)
!20 = !DILocation(line: 2, column: 1, scope: !19)
!21 = !DILexicalBlock(file: !3, scope: !19, line: 2, column: 1)
!22 = !DILocation(line: 2, column: 1, scope: !21)
!23 = !DILocation(line: 3, column: 1, scope: !21)
!24 = !{ !24, !25 }
!25 = !{ !"llvm.loop.vectorize.enable", i1 0 }
!26 = !DILocation(line: 5, column: 1, scope: !19)
!27 = !{ !"PGI C[++] TBAA" }
!28 = !{ !"omnipotent char", !27, i64 0 }
!29 = !{ !28, !28, i64 0 }
!30 = !{ !"<T>*", !28, i64 0 }
!31 = !{ !"int", !28, i64 0 }
!32 = !{ !31, !31, i64 0 }
!33 = !{ !"float", !28, i64 0 }
!34 = !{ !33, !33, i64 0 }
and
.text
.file "a.ll"
.globl saxpy # -- Begin function saxpy
.p2align 4, 0x90
.type saxpy,@function
saxpy: # @saxpy
.Lfunc_begin0:
.file 1 "/home/victoryang/a.c"
.loc 1 2 0 # a.c:2:0
.cfi_sections .debug_frame
.cfi_startproc
# %bb.0: # %L.entry
.loc 1 3 1 prologue_end # a.c:3:1
testl %edi, %edi
jle .LBB0_19
# %bb.1: # %L.B0014
movq %rdx, %rax
subq %rsi, %rax
je .LBB0_11
# %bb.2: # %L.B0014
cmpq $31, %rax
ja .LBB0_11
# %bb.3: # %L.B0010.preheader
movl %edi, %eax
cmpl $31, %edi
jbe .LBB0_4
# %bb.5: # %vector.memcheck
leaq (%rsi,%rax,4), %rcx
cmpq %rdx, %rcx
jbe .LBB0_7
# %bb.6: # %vector.memcheck
.loc 1 0 1 is_stmt 0 # a.c:0:1
leaq (%rdx,%rax,4), %rcx
.loc 1 3 1 # a.c:3:1
cmpq %rsi, %rcx
jbe .LBB0_7
.LBB0_4:
.loc 1 0 1 # a.c:0:1
xorl %ecx, %ecx
.p2align 4, 0x90
.LBB0_10: # %L.B0010
# =>This Inner Loop Header: Depth=1
.loc 1 3 1 # a.c:3:1
vmovss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero
vfmadd213ss (%rdx,%rcx,4), %xmm0, %xmm1 # xmm1 = (xmm0 * xmm1) + mem
vmovss %xmm1, (%rdx,%rcx,4)
incq %rcx
cmpq %rcx, %rax
jne .LBB0_10
jmp .LBB0_19
.LBB0_11: # %L.B0008
.loc 1 0 1 # a.c:0:1
xorl %ecx, %ecx
.loc 1 3 1 # a.c:3:1
cmpl $8, %edi
jge .LBB0_13
# %bb.12:
.loc 1 0 1 # a.c:0:1
movl %edi, %eax
jmp .LBB0_17
.LBB0_13: # %L.B0016
.loc 1 3 1 # a.c:3:1
vbroadcastss %xmm0, %ymm1
xorl %ecx, %ecx
movl %edi, %eax
.p2align 4, 0x90
.LBB0_14: # %L.B0012
# =>This Inner Loop Header: Depth=1
vmovups (%rsi,%rcx), %ymm2
movl %eax, %r8d
vfmadd213ps (%rdx,%rcx), %ymm1, %ymm2 # ymm2 = (ymm1 * ymm2) + mem
leal -8(%r8), %eax
addl $-7, %r8d
vmovups %ymm2, (%rdx,%rcx)
addq $32, %rcx
cmpl $8, %r8d
jg .LBB0_14
# %bb.15: # %L.B0017
testl %eax, %eax
je .LBB0_19
# %bb.16: # %L.B0018
andl $-8, %edi
movl %edi, %ecx
.LBB0_17: # %L.B0011.preheader
incl %eax
.p2align 4, 0x90
.LBB0_18: # %L.B0011
# =>This Inner Loop Header: Depth=1
movslq %ecx, %rcx
decl %eax
vmovss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero
vfmadd213ss (%rdx,%rcx,4), %xmm0, %xmm1 # xmm1 = (xmm0 * xmm1) + mem
vmovss %xmm1, (%rdx,%rcx,4)
incl %ecx
cmpl $1, %eax
jg .LBB0_18
.Ltmp0:
.LBB0_19: # %L.B0005
.loc 1 5 1 is_stmt 1 # a.c:5:1
vzeroupper
retq
.LBB0_7: # %vector.ph
.Ltmp1:
.loc 1 3 1 # a.c:3:1
vbroadcastss %xmm0, %ymm1
movl %eax, %ecx
xorl %edi, %edi
andl $-32, %ecx
.p2align 4, 0x90
.LBB0_8: # %vector.body
# =>This Inner Loop Header: Depth=1
vmovups (%rsi,%rdi,4), %ymm2
vmovups 32(%rsi,%rdi,4), %ymm3
vmovups 64(%rsi,%rdi,4), %ymm4
vmovups 96(%rsi,%rdi,4), %ymm5
vfmadd213ps (%rdx,%rdi,4), %ymm1, %ymm2 # ymm2 = (ymm1 * ymm2) + mem
vfmadd213ps 32(%rdx,%rdi,4), %ymm1, %ymm3 # ymm3 = (ymm1 * ymm3) + mem
vfmadd213ps 64(%rdx,%rdi,4), %ymm1, %ymm4 # ymm4 = (ymm1 * ymm4) + mem
vfmadd213ps 96(%rdx,%rdi,4), %ymm1, %ymm5 # ymm5 = (ymm1 * ymm5) + mem
vmovups %ymm2, (%rdx,%rdi,4)
vmovups %ymm3, 32(%rdx,%rdi,4)
vmovups %ymm4, 64(%rdx,%rdi,4)
vmovups %ymm5, 96(%rdx,%rdi,4)
addq $32, %rdi
cmpq %rdi, %rcx
jne .LBB0_8
# %bb.9: # %middle.block
cmpq %rax, %rcx
jne .LBB0_10
jmp .LBB0_19
.Ltmp2:
.Lfunc_end0:
.size saxpy, .Lfunc_end0-saxpy
.cfi_endproc
# -- End function
.section .debug_abbrev,"",@progbits
.byte 1 # Abbreviation Code
.byte 17 # DW_TAG_compile_unit
.byte 1 # DW_CHILDREN_yes
.byte 37 # DW_AT_producer
.byte 14 # DW_FORM_strp
.byte 19 # DW_AT_language
.byte 5 # DW_FORM_data2
.byte 3 # DW_AT_name
.byte 14 # DW_FORM_strp
.byte 16 # DW_AT_stmt_list
.byte 23 # DW_FORM_sec_offset
.byte 27 # DW_AT_comp_dir
.byte 14 # DW_FORM_strp
.ascii "\264B" # DW_AT_GNU_pubnames
.byte 25 # DW_FORM_flag_present
.byte 17 # DW_AT_low_pc
.byte 1 # DW_FORM_addr
.byte 18 # DW_AT_high_pc
.byte 6 # DW_FORM_data4
.byte 0 # EOM(1)
.byte 0 # EOM(2)
.byte 2 # Abbreviation Code
.byte 46 # DW_TAG_subprogram
.byte 0 # DW_CHILDREN_no
.byte 17 # DW_AT_low_pc
.byte 1 # DW_FORM_addr
.byte 18 # DW_AT_high_pc
.byte 6 # DW_FORM_data4
.byte 64 # DW_AT_frame_base
.byte 24 # DW_FORM_exprloc
.byte 3 # DW_AT_name
.byte 14 # DW_FORM_strp
.byte 58 # DW_AT_decl_file
.byte 11 # DW_FORM_data1
.byte 59 # DW_AT_decl_line
.byte 11 # DW_FORM_data1
.byte 63 # DW_AT_external
.byte 25 # DW_FORM_flag_present
.byte 0 # EOM(1)
.byte 0 # EOM(2)
.byte 0 # EOM(3)
.section .debug_info,"",@progbits
.Lcu_begin0:
.long .Ldebug_info_end0-.Ldebug_info_start0 # Length of Unit
.Ldebug_info_start0:
.short 4 # DWARF version number
.long .debug_abbrev # Offset Into Abbrev. Section
.byte 8 # Address Size (in bytes)
.byte 1 # Abbrev [1] 0xb:0x35 DW_TAG_compile_unit
.long .Linfo_string0 # DW_AT_producer
.short 4 # DW_AT_language
.long .Linfo_string1 # DW_AT_name
.long .Lline_table_start0 # DW_AT_stmt_list
.long .Linfo_string2 # DW_AT_comp_dir
# DW_AT_GNU_pubnames
.quad .Lfunc_begin0 # DW_AT_low_pc
.long .Lfunc_end0-.Lfunc_begin0 # DW_AT_high_pc
.byte 2 # Abbrev [2] 0x2a:0x15 DW_TAG_subprogram
.quad .Lfunc_begin0 # DW_AT_low_pc
.long .Lfunc_end0-.Lfunc_begin0 # DW_AT_high_pc
.byte 1 # DW_AT_frame_base
.byte 87
.long .Linfo_string3 # DW_AT_name
.byte 1 # DW_AT_decl_file
.byte 2 # DW_AT_decl_line
# DW_AT_external
.byte 0 # End Of Children Mark
.Ldebug_info_end0:
.section .debug_str,"MS",@progbits,1
.Linfo_string0:
.asciz " NVC++ 21.5-0" # string offset=0
.Linfo_string1:
.asciz "a.c" # string offset=14
.Linfo_string2:
.asciz "/home/victoryang" # string offset=18
.Linfo_string3:
.asciz "saxpy" # string offset=35
.section .debug_pubnames,"",@progbits
.long .LpubNames_end0-.LpubNames_begin0 # Length of Public Names Info
.LpubNames_begin0:
.short 2 # DWARF Version
.long .Lcu_begin0 # Offset of Compilation Unit Info
.long 64 # Compilation Unit Length
.long 42 # DIE offset
.asciz "saxpy" # External Name
.long 0 # End Mark
.LpubNames_end0:
.section .debug_pubtypes,"",@progbits
.long .LpubTypes_end0-.LpubTypes_begin0 # Length of Public Types Info
.LpubTypes_begin0:
.short 2 # DWARF Version
.long .Lcu_begin0 # Offset of Compilation Unit Info
.long 64 # Compilation Unit Length
.long 0 # End Mark
.LpubTypes_end0:
.section ".note.GNU-stack","",@progbits
.section .debug_line,"",@progbits
.Lline_table_start0:
GCC arm SVE
For supercomputing we really ought to introduce the Arm A64FX, but the author thinks it is better to give a primer on SVE first; it may well show up at the next ISC.
saxpy with neon
// x0 = &x[0], x1 = &y[0], x2 = &a, x3 = &n
saxpy_:
    ldrsw   x3, [x3]                 // x3 = *n
    mov     x4, #0                   // x4 = i = 0
    ldr     d0, [x2]                 // d0 = *a
    b       .latch
.loop:
    ldr     d1, [x0, x4, lsl #3]     // d1 = x[i]
    ldr     d2, [x1, x4, lsl #3]     // d2 = y[i]
    fmadd   d2, d1, d0, d2           // d2 += x[i] * a
    str     d2, [x1, x4, lsl #3]     // y[i] = d2
    add     x4, x4, #1               // i += 1
.latch:
    cmp     x4, x3                   // i < n
    b.lt    .loop                    // more to do?
    ret
saxpy with sve
// x0 = &x[0], x1 = &y[0], x2 = &a, x3 = &n
saxpy_:
    ldrsw   x3, [x3]                       // x3 = *n
    mov     x4, #0                         // x4 = i = 0
    whilelt p0.d, x4, x3                   // p0 = while(i++ < n)
    ld1rd   z0.d, p0/z, [x2]               // p0: z0 = bcast(*a)
.loop:
    ld1d    z1.d, p0/z, [x0, x4, lsl #3]   // p0: z1 = x[i]
    ld1d    z2.d, p0/z, [x1, x4, lsl #3]   // p0: z2 = y[i]
    fmla    z2.d, p0/m, z1.d, z0.d         // p0? z2 += x[i] * a
    st1d    z2.d, p0, [x1, x4, lsl #3]     // p0? y[i] = z2
    incd    x4                             // i += (VL/64)
.latch:
    whilelt p0.d, x4, x3                   // p0 = while(i++ < n)
    b.first .loop                          // more to do?
    ret
There is no instruction-count overhead in the SVE version compared with the equivalent scalar code, which allows a compiler to opportunistically vectorize loops with an unknown trip count.
- 16 scalable predicate registers (P0-P15): control of ordinary memory and arithmetic operations is limited to P0-P7, but instructions that generate predicates (vector compares) and those that consume them (logical operations) use the full P0-P15. Compiler analysis and hand optimization showed this split works well and relieves the predicate-register pressure observed on other architectures.
- Mixed element sizes: each predicate register goes down to byte granularity, so every bit corresponds to 8 bits of data width.
- Predicate conditions: the SVE instructions that produce predicates reuse the NZCV condition-code flags, which are given a different interpretation.
- Implicit ordering: predicates have an implicit lowest-to-highest element order, corresponding to an equivalent sequential order; the "first" and "last" predicate elements and their associated conditions refer to this order.
whilelt p0 predicates the final, partially filled vector before the last full-alignment boundary, which can drain throughput: an out-of-order core may issue that low-occupancy last operation and waste the slot, or it may instead fall back to a smaller power-of-two-aligned operation.
For hazardous or speculative execution, you can easily do a gather load, trap the fault on the vector register, and reload.
OpenMP
The compilers support OpenMP by default. The OpenMP standard for specifying threading in languages like C and Fortran is implemented in the compiler itself and is an integral part of the compiler in question. The OpenMP runtime and the POSIX thread library underneath can vary, but this is normally hidden from the user. OpenMP builds on POSIX threads, so both an OpenMP library and a POSIX thread library are needed; the latter is normally supplied with the distribution (typically /usr/lib64/libpthread.so).
\[ \begin{array}{|l|c|c|} \hline \text { Compiler } & \text { Flag to select OpenMP } & \text { OpenMP version supported } \\ \hline \text { Intel compilers } & \text {-qopenmp } & \text { From } 17.0 \text { on : } 4.5 \\ \hline \text { GNU compilers } & \text {-fopenmp } & \text { From GCC 6.1 on : 4.5 } \\ \hline \text { PGI compilers } & -\mathrm{mp} & 4.5 \\ \hline \end{array} \]
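A minimal OpenMP example to go with the flag table above (compile with, e.g., icc -qopenmp, gcc -fopenmp or pgcc -mp):

#include <omp.h>
#include <stdio.h>

#define N (1 << 20)
static double x[N], y[N];

int main(void) {
    // each thread gets its own chunk of iterations
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        y[i] = 2.0 * x[i] + y[i];
    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}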
You definitely need to watch Fanrui's PPT and understand the implementation of OpenMP in Clang.
Ref
- https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/lecture-slides/MIT6_172F18_lec9.pdf
- 程序员的自我修养
- 不同编译器的编译行为比较
- The ARM Scalable Vector Extension
- https://www.stonybrook.edu/commcms/ookami/support/_docs/1%20-%20Intro%20to%20A64FX.pdf
- https://llvm.org/devmtg/2017-10/slides/Ferguson-Enabling%20Parallel%20Computing%20in%20Chapel.pdf
Chisel
Is it worth learning a hardware description language to implement a CPU or a router? I think it absolutely is: it lets you watch, from the bottom up, how your data and instructions actually move.
Fortran
This language is a crime against decency, yet it was the author's former boss's favorite, and F77 at that; may heaven strike Fortran down. Still, Fortran shows up in supercomputing competitions quite often. We did not dare touch it when working on the weather application; now that there is a program under 150k lines, the author is having a go at modifying it.
The PGI compiler is a commercial compiler; after NVIDIA acquired it, a lot of CUDA DSL became usable from Fortran, which undoubtedly bought Fortran more time. nvfortran in NVHPC prints plenty of compiler-optimization logs worth reading.
Basic syntax: a module is roughly a struct in C; program (main) and function (an ordinary function) define procedures.
real function square(x)
    implicit none
    real, intent(in) :: x
    square = x * x
end function square

program main
    implicit none
    real :: square   ! external function declaration
    real :: y
    y = square(2.0)  ! a function is used in an expression, not with "call"
    print *, y
end program
A `subroutine` plays a role somewhat like a trait: it is invoked with `call` and returns no value, and generic behaviour is obtained by writing a generic interface that dispatches to concrete implementations.
OpenACC
A simple vector addition:
module mpoint
type point
real :: x, y, z
end type
type(point) :: base(1000)
end module
subroutine vecaddgpu( r, n )
use mpoint
type(point) :: r(:)
integer :: n
!$acc parallel loop present(base) copyout(r(:))
do i = 1, n
r(i)%x = base(i)%x
r(i)%y = sqrt( base(i)%y*base(i)%y + base(i)%z*base(i)%z )
r(i)%z = 0
enddo
end subroutine
Remember to wire this into a Makefile so you can read the optimization report from the compiler. Also note the directive clauses: `present(base)` and `copyout(r(:))` specify what is already on the GPU and what gets copied back.
nvfortran -Minfo -Mbounds
At run time you can see the symbols, the source files, and which GPU is being used:
NVCOMPILER_ACC_NOTIFY=1 /root/yyw/cmake-openacc/cmake-build-debug-nvhpc/acc_test
Let's compare it with a hand-written CUDA kernel version. Both are built with -O0 -g.
#include <iostream>
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

// one thread per element; guard against out-of-range indices
__global__ void vecaddgpu(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void) {
    int n = 1000000000;
    size_t size = n * sizeof(int); // use size_t to avoid 32-bit overflow on 4 GB buffers
    int *a = static_cast<int *>(malloc(size));
    int *b = static_cast<int *>(malloc(size));
    int *c = static_cast<int *>(malloc(size)); // host copies of a, b, c
    int *e = static_cast<int *>(malloc(size)); // reference result computed on the host
    int *d_a, *d_b, *d_c; // device copies of a, b, c
    int err = 0;
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = 1000 * i;
        e[i] = a[i] + b[i];
    }
    // Allocate space for device copies of a, b, c
    cudaMalloc((void **) &d_a, size);
    cudaMalloc((void **) &d_b, size);
    cudaMalloc((void **) &d_c, size);
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Launch vecaddgpu() kernel on GPU, one thread per element
    int threads = 1024;
    int blocks = (n + threads - 1) / threads;
    vecaddgpu<<<blocks, threads>>>(d_a, d_b, d_c, n);
    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    // Verify against the host reference
    for (int i = 0; i < n; i++) {
        if (c[i] != e[i])
            err++;
    }
    std::cout << "mismatches: " << err << std::endl;
    // Cleanup
    free(a); free(b); free(c); free(e);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Performance comparison
The pure CUDA kernel is about 1.5x faster.
Compiler options also differ between compilers, which affects the comparison.
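To back numbers like this, kernel time can be measured with CUDA events. A sketch that slots into the `main()` of the listing above (it reuses `blocks`, `threads`, `d_a`, `d_b`, `d_c`, `n` from there):

```cpp
// time the vecaddgpu launch with CUDA events (milliseconds)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vecaddgpu<<<blocks, threads>>>(d_a, d_b, d_c, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);   // wait until the kernel and the stop event finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
std::cout << "kernel time: " << ms << " ms" << std::endl;
cudaEventDestroy(start);
cudaEventDestroy(stop);
```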
Go
Go is very easy to pick up, and because the big companies use it so heavily, the open-source tooling out there is very usable. I learned it during an internship: re-implementing package-management-style tooling (module and friends) on top of a parallel file system with channel + context is very quick; the code is just a few dozen lines of state machine handling concurrent queries. I have also used it to write some eBPF code to get better real-time visibility into file I/O performance.
Some small tools
Rust
Rust is a good thing. Ever since our school started teaching CS100 (programming language design) in Rust in 2016, ShanghaiTech has been a stronghold of Rust advocacy: Zhang Handong, the "father of Rust in China", and various people promoting Rust tend to pick ShanghaiTech as the best place to evangelize it, much as with RISC-V. It is a must for writing small bash-style tools (zero-cost-abstraction cffi) as well as large (100k+ line) systems programs. Because the language enforces a lot of static checks, it pushes you towards a deeper understanding of memory management and asynchronous programming.
Learning resources
- rCore - a teaching operating system maintained by Tsinghua
- Libra - a blockchain database maintained by Facebook
- 飞书 - Tokio code pool
`Sync` and `Send` in async code
https://kaisery.github.io/trpl-zh-cn/ch16-04-extensible-concurrency-sync-and-send.html
Send
use std::rc::Rc;
use std::sync::Mutex;
use std::thread;

fn main() {
    let num = Rc::new(Mutex::new(0));
    let mut handlers = vec![];
    for i in 1..10 {
        let num_copy = num.clone();
        let handle = thread::spawn(move || {
            *num_copy.lock().unwrap() += 1;
        });
        handlers.push(handle);
    }
    for handler in handlers {
        handler.join();
    }
    println!("{}", num.lock().unwrap());
}
This code does not compile: when `num_copy` is moved into the thread, multiple threads could modify the reference count at the same time. That is why `Rc` does not implement the `Send` trait in Rust: its ownership must not be transferred between threads.
error[E0277]: `Rc<Mutex<i32>>` cannot be sent between threads safely
--> src/main.rs:10:22
|
10 | let handle = thread::spawn(move || {
| ______________________^^^^^^^^^^^^^_-
| | |
| | `Rc<Mutex<i32>>` cannot be sent between threads safely
11 | | *num_copy.lock().unwrap() += 1;
12 | | });
| |_________- within this `[closure@src/main.rs:10:36: 12:10]`
|
::: /home/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:624:8
|
624 | F: Send + 'static,
| ---- required by this bound in `spawn`
|
= help: within `[closure@src/main.rs:10:36: 12:10]`, the trait `Send` is not implemented for `Rc<Mutex<i32>>`
= note: required because it appears within the type `[closure@src/main.rs:10:36: 12:10]`
Sync
use std::cell::Cell;
use std::thread;

fn main() {
    let num = Cell::new(0);
    let mut handlers = vec![];
    for i in 1..10 {
        let handle = thread::spawn(|| {
            num.set(i);
        });
        handlers.push(handle);
    }
    for h in handlers {
        h.join();
    }
    println!("{}", num.get());
}
This code does not compile either: if we share `&Cell<T>` across threads, multiple threads can modify the inner value concurrently, which is not safe. That is why `Cell<T>` implements `!Sync`.
error[E0277]: `Cell<i32>` cannot be shared between threads safely
--> src/main.rs:8:22
|
8 | let handle = thread::spawn(|| {
| ^^^^^^^^^^^^^ `Cell<i32>` cannot be shared between threads safely
|
::: /home/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:624:8
|
624 | F: Send + 'static,
| ---- required by this bound in `spawn`
|
= help: the trait `Sync` is not implemented for `Cell<i32>`
= note: required because of the requirements on the impl of `Send` for `&Cell<i32>`
= note: required because it appears within the type `[closure@src/main.rs:8:36: 10:10]`
Libs
This section collects installation and usage notes for commonly used libraries.
SVML
NUMA on Linux is a pain, but it looks fine on EPYC:
Every 2.0s: numastat epyc.node1: Mon Aug 30 07:17:40 2021
node0 node1
numa_hit 11605557098 17090418391
numa_miss 0 0
numa_foreign 0 0
interleave_hit 83929 83526
local_node 11605248266 17089868634
other_node 308832 549757
Boost
Spack
spack info boost
spack install boost
Source
./bootstrap.sh --help
# Select your configuration options and invoke ./bootstrap.sh again without the --help option. Unless you have write permission in your system's /usr/local/ directory, you'll probably want to at least use
./bootstrap.sh --prefix=path/to/installation/prefix
# to install somewhere else. Also, consider using the --show-libraries and --with-libraries=library-name-list options to limit the long wait you'll experience if you build everything. Finally,
./b2 install
# will leave Boost binaries in the lib/ subdirectory of your installation prefix. You will also find a copy of the Boost headers in the include/ subdirectory of the installation prefix, so you can henceforth use that directory as an #include path in place of the Boost root directory.
# and add to PATH and LD and INCLUDE
Version-related issues
This is Version 3 of the Filesystem library. Version 2 is no longer supported; 1.49.0 was the last Boost release to ship Version 2.
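To check which Boost the toolchain actually picks up (and that the Filesystem v3 headers are found), a tiny sketch; the `$PREFIX` paths stand in for your installation prefix, and older Boost may additionally need `-lboost_system`:

```cpp
// g++ boost_check.cpp -I$PREFIX/include -L$PREFIX/lib -lboost_filesystem
#include <boost/version.hpp>
#include <boost/filesystem.hpp>
#include <iostream>

int main() {
    std::cout << "Boost " << BOOST_VERSION / 100000 << "."   // major
              << BOOST_VERSION / 100 % 1000 << "."           // minor
              << BOOST_VERSION % 100 << std::endl;           // patch
    std::cout << boost::filesystem::current_path() << std::endl;
    return 0;
}
```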
ArmForge
Arm Forge is Arm's tool suite for high-performance programs. Its biggest strength is that it covers both CPUs and GPUs, and it bundles Arm DDT and Arm MAP. Arm DDT is an industry-leading parallel debugger supporting MPI, CUDA and OpenMP. Arm MAP is a low-overhead, line-level profiler for MPI, OpenMP and vectorized programs.
- Download Arm Forge (DDT + MAP); it is already deployed via spack
- Arm Forge user guide
uProf
A perf-style tool from AMD. It adds some metrics for AMD's x86 extensions, but the UI is rather ugly.
x86 subset
You can refer to the `CONFIG_X86_AMD_PSTATE` tuning in ZenStates-Linux. The perf subset can be found in the Linux perf tools. Also check `/sys/devices/system/cpu/cpu*/cpufreq/scaling_driver` to see which frequency-scaling driver is in use. The feature is available from Linux 5.17 onwards (Ubuntu 22.04).
Reference
- https://faculty.sites.uci.edu/zhouli/files/2022/01/oakland22.pdf
- https://indico.cern.ch/event/730908/contributions/3153163/attachments/1730954/2810149/epyc.pdf
- https://www.nextplatform.com/2019/08/15/a-deep-dive-into-amds-rome-epyc-architecture/
- https://github.com/FPSG-UIUC/lotr
Vtune
They say computer-architecture people understand profiling best: a good profiling tool is built on a good abstraction of the CPU's real-time performance. The simplest primitive is `rdtsc`, and Arm has a similar mechanism.
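As a reminder of what that primitive looks like at the source level, a sketch using the compiler intrinsic (x86 only; the count is not serialized with fences here, so treat it as rough):

```cpp
// rough cycle count around a code region via the TSC (x86, GCC/Clang)
#include <cstdio>
#include <x86intrin.h>

int main() {
    volatile double acc = 0.0;
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < 1000000; ++i)
        acc += i * 0.5;
    unsigned long long t1 = __rdtsc();
    std::printf("elapsed ~%llu cycles, acc=%f\n", t1 - t0, (double)acc);
    return 0;
}
```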
ABI support
A profiler needs implementations for the various metrics of Intel processors. During installation, VTune compiles support (a shared library/driver) for the current system's PMU, i.e. ABI support for the Intel PMU; the EPYC machines our team uses rely on a modified PMU tools port that works on EPYC. PMU research is currently very active in the architecture-security community, because the PMU leaks part of the CPU's real-time state, from which useful information can be extracted.
The perf parameters x86 has to support are fairly limited; the officially supported kprobe/uprobe microprocessor sampling in Linux is very useful and has been adopted by eBPF.
When optimizing for Broadwell and newer architectures, the Intel compiler mainly does three things that matter a lot for performance:
- Aggressive cross-basic-block optimization + vectorization + loop unrolling
- Aggressive reordering of loads and stores within the limits of TSO, plus aggressive data coalescing, store-buffer bypass and movnt. This is also the main source of icc backend bugs, and one reason the big companies avoid it; outside HPC, people generally treat gcc as the reference.
- Its own TBB thread pool (very fast), its own malloc_align, and its own supporting libraries.
For how to target Intel CPUs well, see the Intel Package in LAMMPS, which uses a visitor pattern over the Intel processor's register resources.
For user-space file systems, VTune provides real-time measurements of PM (persistent memory) bandwidth, a metric that seems hard to obtain otherwise.
How to use it on the cluster
spack load intel-parallel-studio # choose the right version
amplxe-cl
A quick Spack tutorial
❯ spack info gcc
AutotoolsPackage: gcc
Description:
The GNU Compiler Collection includes front ends for C, C++, Objective-C,
Fortran, Ada, and Go, as well as libraries for these languages.
Homepage: https://gcc.gnu.org
Maintainers: @michaelkuhn @alalazo
Externally Detectable:
True (version, variants)
Tags:
None
Preferred version:
11.2.0 https://ftpmirror.gnu.org/gcc/gcc-11.2.0/gcc-11.2.0.tar.xz
Safe versions:
master [git] git://gcc.gnu.org/git/gcc.git on branch master
11.2.0 https://ftpmirror.gnu.org/gcc/gcc-11.2.0/gcc-11.2.0.tar.xz
11.1.0 https://ftpmirror.gnu.org/gcc/gcc-11.1.0/gcc-11.1.0.tar.xz
10.3.0 https://ftpmirror.gnu.org/gcc/gcc-10.3.0/gcc-10.3.0.tar.xz
10.2.0 https://ftpmirror.gnu.org/gcc/gcc-10.2.0/gcc-10.2.0.tar.xz
10.1.0 https://ftpmirror.gnu.org/gcc/gcc-10.1.0/gcc-10.1.0.tar.xz
9.4.0 https://ftpmirror.gnu.org/gcc/gcc-9.4.0/gcc-9.4.0.tar.xz
9.3.0 https://ftpmirror.gnu.org/gcc/gcc-9.3.0/gcc-9.3.0.tar.xz
9.2.0 https://ftpmirror.gnu.org/gcc/gcc-9.2.0/gcc-9.2.0.tar.xz
9.1.0 https://ftpmirror.gnu.org/gcc/gcc-9.1.0/gcc-9.1.0.tar.xz
8.5.0 https://ftpmirror.gnu.org/gcc/gcc-8.5.0/gcc-8.5.0.tar.xz
8.4.0 https://ftpmirror.gnu.org/gcc/gcc-8.4.0/gcc-8.4.0.tar.xz
8.3.0 https://ftpmirror.gnu.org/gcc/gcc-8.3.0/gcc-8.3.0.tar.xz
8.2.0 https://ftpmirror.gnu.org/gcc/gcc-8.2.0/gcc-8.2.0.tar.xz
8.1.0 https://ftpmirror.gnu.org/gcc/gcc-8.1.0/gcc-8.1.0.tar.xz
7.5.0 https://ftpmirror.gnu.org/gcc/gcc-7.5.0/gcc-7.5.0.tar.xz
7.4.0 https://ftpmirror.gnu.org/gcc/gcc-7.4.0/gcc-7.4.0.tar.xz
7.3.0 https://ftpmirror.gnu.org/gcc/gcc-7.3.0/gcc-7.3.0.tar.xz
7.2.0 https://ftpmirror.gnu.org/gcc/gcc-7.2.0/gcc-7.2.0.tar.xz
7.1.0 https://ftpmirror.gnu.org/gcc/gcc-7.1.0/gcc-7.1.0.tar.bz2
6.5.0 https://ftpmirror.gnu.org/gcc/gcc-6.5.0/gcc-6.5.0.tar.bz2
6.4.0 https://ftpmirror.gnu.org/gcc/gcc-6.4.0/gcc-6.4.0.tar.bz2
6.3.0 https://ftpmirror.gnu.org/gcc/gcc-6.3.0/gcc-6.3.0.tar.bz2
6.2.0 https://ftpmirror.gnu.org/gcc/gcc-6.2.0/gcc-6.2.0.tar.bz2
6.1.0 https://ftpmirror.gnu.org/gcc/gcc-6.1.0/gcc-6.1.0.tar.bz2
5.5.0 https://ftpmirror.gnu.org/gcc/gcc-5.5.0/gcc-5.5.0.tar.bz2
5.4.0 https://ftpmirror.gnu.org/gcc/gcc-5.4.0/gcc-5.4.0.tar.bz2
5.3.0 https://ftpmirror.gnu.org/gcc/gcc-5.3.0/gcc-5.3.0.tar.bz2
5.2.0 https://ftpmirror.gnu.org/gcc/gcc-5.2.0/gcc-5.2.0.tar.bz2
5.1.0 https://ftpmirror.gnu.org/gcc/gcc-5.1.0/gcc-5.1.0.tar.bz2
4.9.4 https://ftpmirror.gnu.org/gcc/gcc-4.9.4/gcc-4.9.4.tar.bz2
4.9.3 https://ftpmirror.gnu.org/gcc/gcc-4.9.3/gcc-4.9.3.tar.bz2
4.9.2 https://ftpmirror.gnu.org/gcc/gcc-4.9.2/gcc-4.9.2.tar.bz2
4.9.1 https://ftpmirror.gnu.org/gcc/gcc-4.9.1/gcc-4.9.1.tar.bz2
4.8.5 https://ftpmirror.gnu.org/gcc/gcc-4.8.5/gcc-4.8.5.tar.bz2
4.8.4 https://ftpmirror.gnu.org/gcc/gcc-4.8.4/gcc-4.8.4.tar.bz2
4.7.4 https://ftpmirror.gnu.org/gcc/gcc-4.7.4/gcc-4.7.4.tar.bz2
4.6.4 https://ftpmirror.gnu.org/gcc/gcc-4.6.4/gcc-4.6.4.tar.bz2
4.5.4 https://ftpmirror.gnu.org/gcc/gcc-4.5.4/gcc-4.5.4.tar.bz2
Variants:
Name [Default] Allowed values Description
========================= ==================== ===================================================
binutils [off] on, off Build via binutils
bootstrap [on] on, off Enable 3-stage bootstrap
graphite [off] on, off Enable Graphite loop optimizations (requires ISL)
languages [c,c++,fortran] ada, brig, c, c++, Compilers and runtime libraries to build
fortran, go, java,
jit, lto, objc,
obj-c++
nvptx [off] on, off Target nvptx offloading to NVIDIA GPUs
piclibs [off] on, off Build PIC versions of libgfortran.a and libstdc++.a
strip [off] on, off Strip executables to reduce installation size
Installation Phases:
autoreconf configure build install
Build Dependencies:
binutils cuda diffutils flex gmp gnat iconv isl mpc mpfr zip zlib zstd
Link Dependencies:
binutils cuda gmp gnat iconv isl mpc mpfr zlib zstd
Run Dependencies:
binutils
Virtual Packages:
gcc@7: languages=go provides golang@:1.8
gcc@6: languages=go provides golang@:1.6.1
gcc@5: languages=go provides golang@:1.4
gcc@4.9: languages=go provides golang@:1.2
gcc@4.8.2: languages=go provides golang@:1.1.2
gcc@4.8: languages=go provides golang@:1.1
gcc@4.7.1: languages=go provides golang@:1
gcc@4.6: languages=go provides golang
With the dependencies listed, you can check how Spack would concretize the package if you installed it right now:
❯ spack spec gcc
Input spec
--------------------------------
gcc
Concretized
--------------------------------
gcc@11.2.0%apple-clang@12.0.5~binutils+bootstrap~graphite~nvptx~piclibs~strip languages=c,c++,fortran patches=ecc5ac43951b34cbc5db15f585b4e704c42e2e487f9ed4c24fadef3f3857930b arch=darwin-bigsur-skylake
^diffutils@2.8.1%apple-clang@12.0.5 arch=darwin-bigsur-skylake
^gmp@6.2.1%apple-clang@12.0.5 arch=darwin-bigsur-skylake
^autoconf@2.71%apple-clang@12.0.5 arch=darwin-bigsur-skylake
^automake@1.16.4%apple-clang@12.0.5 arch=darwin-bigsur-skylake
^libtool@2.4.6%apple-clang@12.0.5 arch=darwin-bigsur-skylake
^m4@1.4.6%apple-clang@12.0.5+sigsegv patches=c0a408fbffb7255fcc75e26bd8edab116fc81d216bfd18b473668b7739a4158e arch=darwin-bigsur-skylake
^libiconv@1.16%apple-clang@12.0.5 arch=darwin-bigsur-skylake
^mpc@1.1.0%apple-clang@12.0.5 arch=darwin-bigsur-skylake
^mpfr@4.1.0%apple-clang@12.0.5 arch=darwin-bigsur-skylake
^autoconf-archive@2019.01.06%apple-clang@12.0.5 arch=darwin-bigsur-skylake
^texinfo@4.8%apple-clang@12.0.5 arch=darwin-bigsur-skylake
^zlib@1.2.11%apple-clang@12.0.5+optimize+pic+shared arch=darwin-bigsur-skylake
^zstd@1.5.0%apple-clang@12.0.5~ipo~legacy~lz4~lzma~multithread+programs+shared+static~zlib build_type=RelWithDebInfo arch=darwin-bigsur-skylake
^cmake@3.21.1%apple-clang@12.0.5~doc+ncurses+openssl+ownlibs~qt build_type=Release arch=darwin-bigsur-skylake
If you want to pin specific dependencies or use packages already provided by the system, they end up under `~/.spack/packages.yaml`; see the AMD section for how to use this.
❯ spack external find
❯ cat ~/.spack/packages.yaml
│ File: /Users/victoryang/.spack/packages.yaml
───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ packages:
2 │ autoconf:
3 │ externals:
4 │ - spec: autoconf@2.71
5 │ prefix: /usr/local
6 │ automake:
7 │ externals:
8 │ - spec: automake@1.16.4
9 │ prefix: /usr/local
10 │ bash:
11 │ externals:
12 │ - spec: bash@3.2.57
13 │ prefix: /
14 │ bazel:
15 │ externals:
16 │ - spec: bazel@4.1.0
17 │ prefix: /usr/local
18 │ bison:
19 │ externals:
20 │ - spec: bison@2.3
21 │ prefix: /usr
22 │ bzip2:
23 │ externals:
24 │ - spec: bzip2@1.0.6
25 │ prefix: /usr
26 │ cmake:
27 │ externals:
28 │ - spec: cmake@3.21.1
29 │ prefix: /usr/local
30 │ diffutils:
31 │ externals:
32 │ - spec: diffutils@2.8.1
33 │ prefix: /usr
...
The install options you usually need: `-j N` sets the number of build jobs, `--no-checksum` skips the file checksum, and `--no-restage` continues building from the already modified sources, which typically live under /tmp/root/spack-stage/spack-stage-amdscalapack-3.0-qwvyrumhsizxiaujwdsppcovijr5k5ri/spack-src/. Some packages take cflags, cxxflags and fcflags; some take cuda_arch. For a new package, append whatever parameters it needs.
❯ spack install -j 8 --no-checksum llvm+mlir+flang+all_targets+python+shared_libs cflags="-O3" cxxflags="-O3"
[+] /usr/local (external cmake-3.21.1-cdhzbrts4k5ylrvlpspfl75zgeht4swi)
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/libiconv-1.16-ropgshv657ooz7kfzojv4s6srscgimnw
[+] /usr/local (external pkg-config-0.29.2-4nv7fo7lbjybt2u3xzb2vxzvgvaz5xmw)
[+] /usr/local (external xz-5.2.5-p37wr6fna4ysoh2xn2wnmmzttm3bi37o)
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/zlib-1.2.11-lci2s4zd6x77rmexa3uuarbl5cvneskw
[+] /usr (external perl-5.30.2-4zkfgqml35km4ly7xmxn7ooz44dxtgqp)
[+] /usr/local (external python-3.9.6-shbb7dthsqe4lu26jugddyi2k7pl3jbl)
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/pcre-8.44-g4df4jqpudoxhjsrubrqhv3uwxajofet
[+] /usr/local (external z3-4.8.12-hvhfxnxuachtpi524zf55znqn55vanod)
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/ncurses-6.2-xilcz3bhw4otebvysduddyldezxhxvy6
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/libxml2-2.9.10-mlrnjcbnjt3w7635xrietes7terwhko6
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/perl-data-dumper-2.173-cv4kwshixb7tmk6p7icxrqpicppkx5gr
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/py-setuptools-50.3.2-hwyhyijgi3yjokddm67tb6aulefteudx
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/swig-4.0.2-vajpijk4isacr52dzgk2gqbvyunadwkc
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/libedit-3.1-20210216-6h4xokftdnxe2h3o7tie2cnbzbhfrr4h
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/hwloc-2.5.0-z2brjfcvnend5gorjmeqqgirccqerdwd
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/py-six-1.15.0-c63zkkdjpvegqai2f4jjg4mutsuchoov
[+] /Users/victoryang/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.5/llvm-12.0.1-n6c5z7sqfo7olnaqswu7jqhcdkyyk6nh
Take hdf5 built against the nvhpc compilers as an example. The problem I ran into was that, for the given MPI, the build could not find the matching nvc and nvfortran wrappers. A manual build looks for cc in PATH, or directly for CC, FC and CXX, so you need to define FC yourself:
# sketch inside a Spack package (cmake_args()/setup_build_environment())
if spec.compiler.fc.endswith("nvfortran"):
    env.set("FC", "/path/to/mpifort wrapper")   # point FC at the MPI Fortran wrapper
    args.append("-DCMAKE_Fortran_COMPILER={0}".format(spec["mpi"].mpifc))
The spack on our supercomputer has two upstreams: if you think a change matters, open a PR against the original repo directly; otherwise we usually back things up to the school's internal GitLab.
Spack and Modules
When the system has Environment Modules, Spack automatically adds its module-file directory to MODULEPATH, so you can use `module load` right away.
Notes on Spack errors
$ spack load boost@1.70
==> Error: No compilers for operating system debian10 satisfy spec gcc@10.2.0
When this error shows up, check whether `.spack/linux/compilers.yaml` contains all of the compilers Spack should know about.
Package
The original environment often carries all kinds of environment variables that can interfere when building Spack packages, so this SOP uses docker for environment isolation: you can run Spack packaging in different environments without polluting the original one.
How to run
$ cd <Dockerfile-Folder>
$ docker build -t <image-name> .
$ docker run -it -d -v <spack-folder-your-machine>:<spack-folder-your-machine> --name <docker-name> <image-name>
$ docker exec -it <docker-name> bash
Dockerfile ( Long-Term Support )
FROM ubuntu:20.04
RUN apt-get update
RUN apt-get install -y ca-certificates
RUN sed -i "s/archive.ubuntu.com/mirrors.shanghaitech.edu.cn/g" /etc/apt/sources.list
RUN apt-get update
RUN apt-get install -y python python3 gcc build-essential wget nano vim gfortran curl less libnl-nf-3-200
Debug a package
spack cd <spec>
spack build-env <spec> <shell/command>
Architecture
This section collects material about computer architecture.
Memory Model
Memory Coherence
Memory coherence: a memory system is coherent if any read of a data item returns the most recently written value of that data item.
Coherent memory system:
- For a memory address written by a processor P, P's subsequent reads should return the written value.
- For a memory address written by a processor P1, after enough time another processor P2 will see the value written by P1.
- Writes to a single memory address are serialized: if there are two writes to the same address by any processors, no processor can observe the two results in different orders.
The coherence model does not define when a value written by P1 can be read by P2; the memory consistency model is responsible for that.
Memory Consistency
Memory consistency: a memory consistency model for a shared address space specifies constraints on the order in which memory operations must appear to be performed (i.e. to become visible to the processors) with respect to one another, that is, when a written value will be returned/seen by a read.
The memory consistency model defines the order of operation pairs in different addresses.
Sequential consistency model
- In each processor, the read operation should always get the value of the last write operation in program order.
# Processor 1
Flag1 = 1
if (Flag2 == 0)
do sth
# Processor 2
Flag2 = 1
if (Flag1 == 0)
do sth
For P1, SC guarantees that if the value of Flag2 is 0, the write to Flag1 happened before P2's write and read. So at most one processor is in the do sth
section (it is also possible that neither processor gets into the critical section).
- There is only one order visible to all processors. For two write operations W1, W2 (which can come from different processors), every processor should observe the same sequence.
# Processor 1
A = 1
# Processor 2
if (A == 1)
B = 1
# Processor 3
if (B == 1)
get(A)
If P3 sees B == 1, the value of A must be 1, because the write sequence observed by P2 is A = 1 -> B = 1.
Sequential consistency can still produce non-deterministic results, because the interleaving of operations from different processors can differ between runs of the program. All memory operations need to appear to happen in program order.
Relaxed memory consistency models
Suppose A->B means that, on one processor, operation A is performed before operation B.
If the W->R ordering is relaxed, the model is Total Store Ordering (TSO). It is used by the x86-64 architecture.
If the W->W ordering is also relaxed, the model is Partial Store Ordering (PSO).
...
More memory models can be seen from
https://en.wikipedia.org/wiki/Consistency_model
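The classic store-buffering litmus test shows the effect of relaxing W->R: on TSO hardware the plain stores can sit in store buffers, so both loads may read 0. A sketch with relaxed atomics, which permits the same outcome at the C++ language level (names and counts are illustrative):

```cpp
// store-buffering litmus test: r1 == 0 && r2 == 0 is an allowed outcome
// when the W->R ordering is relaxed (e.g. TSO hardware / relaxed atomics)
#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> X{0}, Y{0};
int r1, r2;

int main() {
    std::thread t1([] {
        X.store(1, std::memory_order_relaxed);
        r1 = Y.load(std::memory_order_relaxed);
    });
    std::thread t2([] {
        Y.store(1, std::memory_order_relaxed);
        r2 = X.load(std::memory_order_relaxed);
    });
    t1.join(); t2.join();
    std::printf("r1=%d r2=%d\n", r1, r2);  // (0,0) is legal here
    return 0;
}
```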
Synchronized-with and happens-before
The synchronizes-with relationship is something you can get only between suitably tagged operations on atomic types (the default, memory_order_seq_cst, is a suitable tag; data structures like mutex contain such atomics). If A writes x and B reads the value A stored, there is a synchronizes-with relationship between A and B.
The happens-before relationship specifies which operations see the effects of which other operations. For a single thread, the happens-before relationship can be easily determined by the program order. For multi-threading, if operation A on one thread inter-thread happens-before operation B on another thread, then A happens-before B. The inter-thread happens-before relies on the synchronizes-with relationship. If operation A in one thread synchronizes-with operation B in another thread, then A inter-thread happens-before B. This relationship is transitive.
These rules mean that if you make changes in one thread, you need only one synchronizes-with relationship for the data to be visible to subsequent operations on other threads.
C++ Memory Order
C++ has 6 memory ordering options on atomic types.
memory_order_relaxed,
memory_order_consume,
memory_order_acquire,
memory_order_release,
memory_order_acq_rel,
memory_order_seq_cst.
They represent 3 memory models:
Sequential Consistency (memory_order_seq_cst)
Relaxed (memory_order_relaxed)
Acquire-Release (memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel)
On the x86-64 architecture, acquire-release ordering does not require additional instructions, and sequentially consistent ordering only adds a small cost on store operations. But these orderings also restrict the compiler's instruction reordering, so all of them have a potential cost except memory_order_relaxed.
In non-sequentially consistent memory orderings, threads don’t have to agree on the order of events on atomic variables. In the absence of other ordering constraints, the only requirement is that all threads agree on the modification order of each individual variable.
std::memory_order_seq_cst
If all operations on instances of atomic types are sequentially consistent, the behavior of a multithreaded program is as if all these operations were performed in some particular sequence by a single thread. This is by far the easiest memory ordering to understand, which is why it’s the default: all threads must see the same order of operations. ... It also means that operations can’t be reordered; if your code has one operation before another in one thread, that ordering must be seen by all other threads.
A sequentially consistent store synchronizes-with a sequentially consistent load of the same variable that reads the value stored.
-- C++ concurrency in action 2nd edition, P124
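Under seq_cst the store-buffering outcome shown in the relaxed-models section above is forbidden; a sketch:

```cpp
// with the default seq_cst ordering, at least one load must observe the other store
#include <atomic>
#include <thread>
#include <cassert>

std::atomic<int> X{0}, Y{0};
int r1, r2;

int main() {
    std::thread t1([] { X.store(1); r1 = Y.load(); });  // seq_cst by default
    std::thread t2([] { Y.store(1); r2 = X.load(); });
    t1.join(); t2.join();
    assert(r1 == 1 || r2 == 1);  // r1 == r2 == 0 cannot happen
    return 0;
}
```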
std::memory_order_relaxed
Operations on atomic types performed with relaxed ordering don’t participate in synchronizes-with relationships. Operations on the same variable within a single thread still obey happens-before relationships, but there’s almost no requirement on ordering relative to other threads.
-- C++ concurrency in action 2nd edition, P127
# Processor 1
x.store(true, std::memory_order_relaxed);
y.store(true, std::memory_order_relaxed);
# Processor 2
while (!y.load(std::memory_order_relaxed));
if (x.load(std::memory_order_relaxed)) ++z;
Here z can be 0, since there are no ordering guarantees relating the visibility of x and y across the two threads.
Relaxed ordering still gives well-defined behavior in multi-threaded code, unlike the volatile keyword or a plain variable. Since the semantics are atomic, the fetch_add method is also atomic, which means you can use it as a counter. On x86, fetch_add with std::memory_order_relaxed is implemented as lock xadd, the same as with std::memory_order_seq_cst (but the former can be reordered by the compiler, and on other architectures the implementations may differ).
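Since the text above suggests using fetch_add as a counter, here is a minimal sketch: the increments themselves are atomic, and the joins provide the ordering needed to read the final total (thread and iteration counts are arbitrary):

```cpp
// relaxed fetch_add is enough for a pure event counter
#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < 8; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < 100000; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);   // lock xadd on x86
        });
    for (auto &w : workers) w.join();                              // join supplies the ordering
    std::printf("%ld\n", counter.load(std::memory_order_relaxed)); // prints 800000
    return 0;
}
```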
A problem you may encounter when using relaxed ordering: out-of-thin-air (OOTA) values.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1217r2.html
std::memory_order_acquire && std::memory_order_release
std::memory_order_release is only used in store(), and std::memory_order_acquire only in load(). Such a store/load pair can form a synchronizes-with relationship.
- Any writes or reads before the store() should not be moved after it.
- Any writes or reads after the load() should not be moved before it.
Typical usage:
# Processor 1
data = 100; // A
ready.store(true, std::memory_order_release); // B
# Processor 2
while (!ready.load(std::memory_order_acquire)) // C
;
assert(data == 100); // never failed // D
TODO: std::memory_order_consume
Double-checking locking
if (!x_init.load(memory_order_acquire)) {
lock_guard<mutex> _(x_init_mutex);
if (!x_init.load(memory_order_relaxed)) { // <- Already hold the lock!
initialize x;
x_init.store(true, memory_order_release);
}
}
Initial load for compare-exchange
unsigned long expected = x.load(memory_order_relaxed);
// <- result does not affect correctness since the CAS will check again.
while (!x.compare_exchange_weak(expected, f(expected))) {}
References:
- [1] C++ concurrency in action, 2nd edition
- [2] 理解 C++ 的 Memory Order
- [3] x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors
- [4] Linux kernel memory barriers document
- [5] A Relaxed Guide to memory_order_relaxed - Paul E. McKenney & Hans Boehm - CppCon 2020
SVE
- Scalable vector length increasing parallelism while allowing implementation choice.
- Rich addressing modes enabling non-linear data accesses.
- Per-lane predication allowing vectorization of loops containing complex control flow.
- Predicate-driven loop control and management reduces vectorization overhead relative to scalar code.
- A rich set of horizontal operations applicable to more types of reducible loop-carried dependencies.
- Vector partitioning and software-managed speculation enabling vectorization of loops with data-dependent exits.
- Scalarized intra-vector sub-loops permitting vectorization of loops with more complex loop-carried dependencies.
Predicates are used to mask operations on the scalable registers.
This state provides thirty-two new scalable vector registers (Z0–Z31). Their width is implementation dependent within the aforementioned range. The new registers extend the thirty-two 128-bit wide Advanced SIMD registers (V0–V31) to provide scalable containers for 64-, 32-, 16-, and 8-bit data elements.
Fujitsu A64FX implementation
References
- https://github.com/fujitsu/A64FX/tree/master/doc
- https://www.youtube.com/watch?v=Qma7UuYifhM
- https://www.youtube.com/watch?v=3TYVqodc8w4
- https://www.youtube.com/watch?v=H3COrJQxBkQ