C++
终于到我校专业学习的语言了。
C++ 17 & 20
有关新特性永远是恒久不变的话题。从 C++11 的左右值到现在已经有很久时间了。在epyc 上一般的编译器性能排序是 AOCC>Intel>GCC>>LLVM. 但是MKL相对于amd的优化库还有一定优势,所以有时候Intel开最基本的x86优化也能和AOCC差不多。C++ 17对并行化做了很多优化,比如pstl、for_each(threading)、pmr、intel 的 sycl、nvidia的thrust/cub。可以很方便的修改namespace来获得无痛性能提升。C++ 20 最重要的特性是 ranges、filesystem等。不过LLVM 对新版本的支持一直都是最慢的。之前 icc 可以兼容一些古老的语法比如 VLA,但到了新版本以后取消了,这种不稳定的特性变更导致大厂不怎么把其当成标准。
这是笔者在其编译原理作业中出的一道题,需要给出 semantic rule 来拒绝 ??? 那行。但貌似只能在icc中通过标准。
template<class T>class array{
int s;
T* elements;
public:
array(int n); // allocate "n" elements and let "elements" refer to them
array(T* p, int n); // make this array refer to p[0..n-1]
operator T*(){return elements;}
int size()const{return s;}
// the usual container operations, such as = and [], much like vector
};
void h(array<double>a); //C++
void g(int m,double vla[m]); //C99
void f(int m,double vla1[m],array<double>a1) {
array<double> a2(vla1,m); // a2 refers to vla1
double*p=a1; //p refers to a1's elements
g(m,vla1);
g(a1.size(),a1); // a bit verbose
g(a1); //???
}
有关玄学
所有编译器可能出现的segfault,很多时候换个 intel 小版本可以通过。
有关编译自闭的时候的想法
建议若是 CMake 打开 make -n
, configure 打开 make VERBOSE=1
。如果需要与展开宏编译器,需要 CMake 开 -e 选项得到展开后的表达式。
需要广泛运用好 man、--help。
编译选项
LTO
为了解决不同库或者跨语言之间调用的开销,这块使用的基本是 LLVM 的 libLTO 和 tblgen。这个是自动开启的,原理是把库都弄成 LLVM bitcode 统一链接,其实并行版的 LTO 也不是很难实现,曾是前队长用 rust 写的并行计算的 project,具体可以参考源码。
PGO
通过分析程序运行时的实际行为,将结果反馈给编译器,使得编译器可以重新安排代码以减少指令缓存问题和分支预测误判,从而获得性能的提升。性能分析引导优化通过实际执行代码统计出运行频率最高的部分,编译器通过这些信息可以更加针对地优化代码。
- 第一阶段:编译参数中加上:
-prof-gen=srcpos
-prof-dir=/tmp/profdata
。其中-prof-dir
是存储性能分析文件的目录。 - 第二阶段:运行编译好的程序,然后运行
profmerge -prof_dir /tmp/profdata
生成汇总文件。 - 第三阶段:重新编译程序,使用参数:
-prof-use=nomerge -prof-func-groups -prof-dir=/tmp/profdata
。
IPO
过程间优化和性能分析引导优化可能会相互影响,性能分析引导优化通常会帮助编译器生成内联函数,这会帮助过程间优化的效率。性能分析引导优化对分支预测效率的提升最有效果,许多分支执行的可能性无法在编译时判断,而通过性能分析引导优化,编译器可以针对经常执行的分支(热代码)和不经常执行的分支(冷代码)生成高效的汇编代码。
HLO
从下文我们知道的 LLVM IR 就已经出现的部分优化,我们知道 icc 实际上在 LLVM IR 之前还拥有 high level 的 ir。根据文档,主要做了
- Loop Permutation or Interchange
- Loop Distribution
- Loop Fusion
- Loop Unrolling
- Data Prefetching
- Scalar Replacement
- Unroll and Jam
- Loop Blocking or Tiling
- Partial-Sum Optimization
- Predicate Optimization
- Loop Reversal
- Profile-Guided Loop Unrolling
- Loop Peeling
- Data Transformation: Malloc Combining and Memset Combining, - Memory Layout Change
- Loop Rerolling
- Memset and Memcpy Recognition
- Statement Sinking for Creating Perfect Loopnests
- Multiversioning: Checks include Dependency of Memory References, - and Trip Counts
- Loop Collapsing
DOP
大家如果写过 UE 这种游戏引擎的程序或者kernel中需要 cache align 的struct,以及最近几年对DB内卷式的优化,会很熟悉这种数据结构。其最重要的思想就是把数据塞在 avx 对齐的 struct 中,所有的操作都是围绕着对struct的加乘运算。详见 https://neil3d.github.io/assets/img/ecs/DOD-Cpp.pdf。
LLVM
DC++/AOCC 都开始使用 LLVM 作为中间层
ICC 大致做了什么新特性
以最新的 2021.3.0 做静态分析,用 saxpy
做标程。意思是 Single-Precision A·X Plus Y。这是一维 BLAS 中的一个函数,经常被写作 kernel 来各种调参各种调寄存器和内存模型。其 C++ 版本如下
void saxpy(int n, float a, float * x, float * y)
{
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
LLVM IR 如下,完整路径是 https://godbolt.org/z/j5rrxhedG,主要hard code 进了各种预优化好的汇编,尤其是mov高地址这种快速取指方式。感觉是把VADDSS __m128 _mm_mask_add_ss (__m128 s, __mmask8 k, __m128 a, __m128 b);
这种 Intel C/C++ Compiler Intrinsic Equivalent 当成库函数一起编译到 IR 上了。
.section .text
.LNDBG_TX:
# mark_description "Intel(R) C Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.3.0 Build 2021";
# mark_description "0609_000000";
# mark_description "-g -o /app/output.s -masm=intel -S -gxx-name=/opt/compiler-explorer/gcc-10.1.0/bin/g++ -emit-llvm";
.intel_syntax noprefix
.file "example.cpp"
.text
..TXTST0:
.L_2__routine_start_saxpy(int, float, float*, float*)_0:
# -- Begin saxpy(int, float, float*, float*)
.text
# mark_begin;
.globl saxpy(int, float, float*, float*)
# --- saxpy(int, float, float *, float *)
saxpy(int, float, float*, float*):
# parameter 1(n): edi
# parameter 2(a): xmm0
# parameter 3(x): rsi
# parameter 4(y): rdx
..B1.1: # Preds ..B1.0
# Execution count [0.00e+00]
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
..___tag_value_saxpy(int, float, float*, float*).2:
..L3:
#2.1
..LN0:
.file 1 "/app/example.cpp"
.loc 1 2 is_stmt 1
push rbp #2.1
.cfi_def_cfa_offset 16
..LN1:
mov rbp, rsp #2.1
.cfi_def_cfa 6, 16
.cfi_offset 6, -16
..LN2:
sub rsp, 48 #2.1
..LN3:
mov DWORD PTR [-40+rbp], edi #2.1
..LN4:
movss DWORD PTR [-32+rbp], xmm0 #2.1
..LN5:
mov QWORD PTR [-24+rbp], rsi #2.1
..LN6:
mov QWORD PTR [-16+rbp], rdx #2.1
..LN7:
.loc 1 3 prologue_end is_stmt 1
mov DWORD PTR [-48+rbp], 0 #3.14
..LN8:
# LOE rbx rbp rsp r12 r13 r14 r15 rip
..B1.2: # Preds ..B1.3 ..B1.1
# Execution count [0.00e+00]
..LN9:
mov eax, DWORD PTR [-48+rbp] #3.19
..LN10:
mov edx, DWORD PTR [-40+rbp] #3.23
..LN11:
cmp eax, edx #3.23
..LN12:
jge ..B1.4 # Prob 50% #3.23
..LN13:
# LOE rbx rbp rsp r12 r13 r14 r15 rip
..B1.3: # Preds ..B1.2
# Execution count [0.00e+00]
..LN14:
.loc 1 4 is_stmt 1
movss xmm0, DWORD PTR [-32+rbp] #4.14
..LN15:
mov eax, DWORD PTR [-48+rbp] #4.18
..LN16:
movsxd rax, eax #4.16
..LN17:
imul rax, rax, 4 #4.16
..LN18:
add rax, QWORD PTR [-24+rbp] #4.16
..LN19:
movss xmm1, DWORD PTR [rax] #4.16
..LN20:
mulss xmm0, xmm1 #4.16
..LN21:
mov eax, DWORD PTR [-48+rbp] #4.25
..LN22:
movsxd rax, eax #4.23
..LN23:
imul rax, rax, 4 #4.23
..LN24:
add rax, QWORD PTR [-16+rbp] #4.23
..LN25:
movss xmm1, DWORD PTR [rax] #4.23
..LN26:
addss xmm0, xmm1 #4.23
..LN27:
mov eax, DWORD PTR [-48+rbp] #4.9
..LN28:
movsxd rax, eax #4.7
..LN29:
imul rax, rax, 4 #4.7
..LN30:
add rax, QWORD PTR [-16+rbp] #4.7
..LN31:
movss DWORD PTR [rax], xmm0 #4.7
..LN32:
.loc 1 3 is_stmt 1
mov eax, 1 #3.28
..LN33:
add eax, DWORD PTR [-48+rbp] #3.28
..LN34:
mov DWORD PTR [-48+rbp], eax #3.28
..LN35:
jmp ..B1.2 # Prob 100% #3.28
..LN36:
# LOE rbx rbp rsp r12 r13 r14 r15 rip
..B1.4: # Preds ..B1.2
# Execution count [0.00e+00]
..LN37:
.loc 1 5 epilogue_begin is_stmt 1
leave #5.1
.cfi_restore 6
..LN38:
ret #5.1
..LN39:
# LOE
..LN40:
.cfi_endproc
# mark_end;
.type saxpy(int, float, float*, float*),@function
.size saxpy(int, float, float*, float*),.-saxpy(int, float, float*, float*)
..LNsaxpy(int, float, float*, float*).41:
.LNsaxpy(int, float, float*, float*):
.data
# -- End saxpy(int, float, float*, float*)
.data
.section .note.GNU-stack, ""
// -- Begin DWARF2 SEGMENT .debug_info
.section .debug_info
.debug_info_seg:
.align 1
.4byte 0x000000be
....
汇编如下,可以看到每一个分支都有概率预测。自动向量化。
saxpy(int, float, float*, float*):
mov r9, rsi #2.1
test edi, edi #3.23
jle ..B1.36 # Prob 50% #3.23
cmp edi, 6 #3.3
jle ..B1.30 # Prob 50% #3.3
movsxd r8, edi #1.6
mov rax, rdx #4.16
sub rax, r9 #4.16
lea rcx, QWORD PTR [r8*4] #3.3
cmp rax, rcx #3.3
jge ..B1.5 # Prob 50% #3.3
neg rax #4.23
cmp rax, rcx #3.3
jl ..B1.30 # Prob 50% #3.3
..B1.5: # Preds ..B1.4 ..B1.3
cmp edi, 8 #3.3
jl ..B1.38 # Prob 10% #3.3
mov r10, rdx #3.3
and r10, 15 #3.3
test r10d, r10d #3.3
je ..B1.9 # Prob 50% #3.3
test r10d, 3 #3.3
jne ..B1.38 # Prob 10% #3.3
neg r10d #3.3
add r10d, 16 #3.3
shr r10d, 2 #3.3
..B1.9: # Preds ..B1.8 ..B1.6
lea eax, DWORD PTR [8+r10] #3.3
cmp edi, eax #3.3
jl ..B1.38 # Prob 10% #3.3
mov esi, edi #3.3
xor ecx, ecx #3.3
sub esi, r10d #3.3
and esi, 7 #3.3
neg esi #3.3
add esi, edi #3.3
mov eax, r10d #3.3
test r10d, r10d #3.3
jbe ..B1.14 # Prob 9% #3.3
..B1.12: # Preds ..B1.10 ..B1.12
movss xmm1, DWORD PTR [r9+rcx*4] #4.16
mulss xmm1, xmm0 #4.16
addss xmm1, DWORD PTR [rdx+rcx*4] #4.23
movss DWORD PTR [rdx+rcx*4], xmm1 #4.7
inc rcx #3.3
cmp rcx, rax #3.3
jb ..B1.12 # Prob 82% #3.3
..B1.14: # Preds ..B1.12 ..B1.10
lea rcx, QWORD PTR [r9+rax*4] #4.16
test rcx, 15 #3.3
je ..B1.18 # Prob 60% #3.3
movaps xmm1, xmm0 #1.6
shufps xmm1, xmm1, 0 #1.6
movsxd rcx, esi #3.3
..B1.16: # Preds ..B1.16 ..B1.15
movups xmm2, XMMWORD PTR [r9+rax*4] #4.16
movups xmm3, XMMWORD PTR [16+r9+rax*4] #4.16
mulps xmm2, xmm1 #4.16
mulps xmm3, xmm1 #4.16
addps xmm2, XMMWORD PTR [rdx+rax*4] #4.23
addps xmm3, XMMWORD PTR [16+rdx+rax*4] #4.23
movups XMMWORD PTR [rdx+rax*4], xmm2 #4.7
movups XMMWORD PTR [16+rdx+rax*4], xmm3 #4.7
add rax, 8 #3.3
cmp rax, rcx #3.3
jb ..B1.16 # Prob 82% #3.3
jmp ..B1.21 # Prob 100% #3.3
..B1.18: # Preds ..B1.14
movaps xmm1, xmm0 #1.6
shufps xmm1, xmm1, 0 #1.6
movsxd rcx, esi #3.3
..B1.19: # Preds ..B1.19 ..B1.18
movups xmm2, XMMWORD PTR [r9+rax*4] #4.16
movups xmm3, XMMWORD PTR [16+r9+rax*4] #4.16
mulps xmm2, xmm1 #4.16
mulps xmm3, xmm1 #4.16
addps xmm2, XMMWORD PTR [rdx+rax*4] #4.23
addps xmm3, XMMWORD PTR [16+rdx+rax*4] #4.23
movups XMMWORD PTR [rdx+rax*4], xmm2 #4.7
movups XMMWORD PTR [16+rdx+rax*4], xmm3 #4.7
add rax, 8 #3.3
cmp rax, rcx #3.3
jb ..B1.19 # Prob 82% #3.3
..B1.21: # Preds ..B1.19 ..B1.16
lea eax, DWORD PTR [1+rsi] #3.3
cmp eax, edi #3.3
ja ..B1.36 # Prob 50% #3.3
sub r8, rcx #3.3
cmp r8, 4 #3.3
jl ..B1.39 # Prob 10% #3.3
mov eax, r8d #3.3
xor r10d, r10d #3.3
and eax, -4 #3.3
lea rdi, QWORD PTR [rdx+rcx*4] #4.23
movsxd rax, eax #3.3
lea rcx, QWORD PTR [r9+rcx*4] #4.16
..B1.24: # Preds ..B1.24 ..B1.23
movups xmm2, XMMWORD PTR [rcx+r10*4] #4.16
mulps xmm2, xmm1 #4.16
addps xmm2, XMMWORD PTR [rdi+r10*4] #4.23
movups XMMWORD PTR [rdi+r10*4], xmm2 #4.7
add r10, 4 #3.3
cmp r10, rax #3.3
jb ..B1.24 # Prob 82% #3.3
..B1.26: # Preds ..B1.24 ..B1.39
cmp rax, r8 #3.3
jae ..B1.36 # Prob 9% #3.3
movsxd rsi, esi #4.7
lea rcx, QWORD PTR [rdx+rsi*4] #4.23
lea rdx, QWORD PTR [r9+rsi*4] #4.16
..B1.28: # Preds ..B1.28 ..B1.27
movss xmm1, DWORD PTR [rdx+rax*4] #4.16
mulss xmm1, xmm0 #4.16
addss xmm1, DWORD PTR [rcx+rax*4] #4.23
movss DWORD PTR [rcx+rax*4], xmm1 #4.7
inc rax #3.3
cmp rax, r8 #3.3
jb ..B1.28 # Prob 82% #3.3
jmp ..B1.36 # Prob 100% #3.3
..B1.30: # Preds ..B1.4 ..B1.2
mov eax, edi #3.3
mov esi, 1 #3.3
xor ecx, ecx #3.3
shr eax, 1 #3.3
je ..B1.34 # Prob 9% #3.3
..B1.32: # Preds ..B1.30 ..B1.32
movss xmm1, DWORD PTR [r9+rcx*8] #4.16
mulss xmm1, xmm0 #4.16
addss xmm1, DWORD PTR [rdx+rcx*8] #4.23
movss DWORD PTR [rdx+rcx*8], xmm1 #4.7
movss xmm2, DWORD PTR [4+r9+rcx*8] #4.16
mulss xmm2, xmm0 #4.16
addss xmm2, DWORD PTR [4+rdx+rcx*8] #4.23
movss DWORD PTR [4+rdx+rcx*8], xmm2 #4.7
inc rcx #3.3
cmp rcx, rax #3.3
jb ..B1.32 # Prob 63% #3.3
lea esi, DWORD PTR [1+rcx+rcx] #4.7
..B1.34: # Preds ..B1.33 ..B1.30
lea eax, DWORD PTR [-1+rsi] #3.3
cmp eax, edi #3.3
jae ..B1.36 # Prob 9% #3.3
movsxd rsi, esi #3.3
movss xmm1, DWORD PTR [-4+r9+rsi*4] #4.16
mulss xmm0, xmm1 #4.16
addss xmm0, DWORD PTR [-4+rdx+rsi*4] #4.23
movss DWORD PTR [-4+rdx+rsi*4], xmm0 #4.7
..B1.36: # Preds ..B1.28 ..B1.21 ..B1.34 ..B1.38 ..B1.1
ret #5.1
..B1.38: # Preds ..B1.5 ..B1.7 ..B1.9
xor esi, esi #3.3
cmp edi, 1 #3.3
jb ..B1.36 # Prob 50% #3.3
..B1.39: # Preds ..B1.22 ..B1.38
xor eax, eax #3.3
jmp ..B1.26 # Prob 100% #3.3
下面是 AOCC LLVM IR emit 出来的代码,并没有在 IR 上做什么文章,和 clang emit 的基本一样。
; ModuleID = './a.c'
source_filename = "./a.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
; Function Attrs: noinline nounwind optnone uwtable
define dso_local void @saxpy(i32 %n, float %a, float* %x, float* %y) #0 {
entry:
%n.addr = alloca i32, align 4
%a.addr = alloca float, align 4
%x.addr = alloca float*, align 8
%y.addr = alloca float*, align 8
%i = alloca i32, align 4
store i32 %n, i32* %n.addr, align 4
store float %a, float* %a.addr, align 4
store float* %x, float** %x.addr, align 8
store float* %y, float** %y.addr, align 8
store i32 0, i32* %i, align 4
br label %for.cond
for.cond: ; preds = %for.inc, %entry
%0 = load i32, i32* %i, align 4
%1 = load i32, i32* %n.addr, align 4
%cmp = icmp slt i32 %0, %1
br i1 %cmp, label %for.body, label %for.end
for.body: ; preds = %for.cond
%2 = load float, float* %a.addr, align 4
%3 = load float*, float** %x.addr, align 8
%4 = load i32, i32* %i, align 4
%idxprom = sext i32 %4 to i64
%arrayidx = getelementptr inbounds float, float* %3, i64 %idxprom
%5 = load float, float* %arrayidx, align 4
%mul = fmul float %2, %5
%6 = load float*, float** %y.addr, align 8
%7 = load i32, i32* %i, align 4
%idxprom1 = sext i32 %7 to i64
%arrayidx2 = getelementptr inbounds float, float* %6, i64 %idxprom1
%8 = load float, float* %arrayidx2, align 4
%add = fadd float %mul, %8
%9 = load float*, float** %y.addr, align 8
%10 = load i32, i32* %i, align 4
%idxprom3 = sext i32 %10 to i64
%arrayidx4 = getelementptr inbounds float, float* %9, i64 %idxprom3
store float %add, float* %arrayidx4, align 4
br label %for.inc
for.inc: ; preds = %for.body
%11 = load i32, i32* %i, align 4
%inc = add nsw i32 %11, 1
store i32 %inc, i32* %i, align 4
br label %for.cond
for.end: ; preds = %for.cond
ret void
}
attributes #0 = { noinline nounwind optnone uwtable "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="false" "use-soft-float"="false" }
!llvm.module.flags = !{!0}
!llvm.ident = !{!1}
!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"AMD clang version 12.0.0 (CLANG: AOCC_3.0.0-Build#78 2020_12_10) (based on LLVM Mirror.Version.12.0.0)"}
和 icc 编译的向量化部分基本一样,但没有概率模型,可惜上面的概率模型的 cost model 是 intel processor 的,所以最终结果icc和aocc不分伯仲。
.text
.file "a.c"
.globl saxpy # -- Begin function saxpy
.p2align 4, 0x90
.type saxpy,@function
saxpy: # @saxpy
.cfi_startproc
# %bb.0: # %entry
testl %edi, %edi
jle .LBB0_16
# %bb.1: # %for.body.preheader
movl %edi, %r9d
cmpl $7, %edi
jbe .LBB0_2
# %bb.7: # %vector.memcheck
leaq (%rsi,%r9,4), %rax
cmpq %rdx, %rax
jbe .LBB0_9
# %bb.8: # %vector.memcheck
leaq (%rdx,%r9,4), %rax
cmpq %rsi, %rax
jbe .LBB0_9
.LBB0_2:
xorl %ecx, %ecx
.LBB0_3: # %for.body.preheader23
movq %rcx, %rax
notq %rax
testb $1, %r9b
je .LBB0_5
# %bb.4: # %for.body.prol
movss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero
mulss %xmm0, %xmm1
addss (%rdx,%rcx,4), %xmm1
movss %xmm1, (%rdx,%rcx,4)
orq $1, %rcx
.LBB0_5: # %for.body.prol.loopexit
addq %r9, %rax
je .LBB0_16
.p2align 4, 0x90
.LBB0_6: # %for.body
# =>This Inner Loop Header: Depth=1
movss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero
mulss %xmm0, %xmm1
addss (%rdx,%rcx,4), %xmm1
movss %xmm1, (%rdx,%rcx,4)
movss 4(%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero
mulss %xmm0, %xmm1
addss 4(%rdx,%rcx,4), %xmm1
movss %xmm1, 4(%rdx,%rcx,4)
addq $2, %rcx
cmpq %rcx, %r9
jne .LBB0_6
jmp .LBB0_16
.LBB0_9: # %vector.ph
movl %r9d, %ecx
andl $-8, %ecx
movaps %xmm0, %xmm1
shufps $0, %xmm0, %xmm1 # xmm1 = xmm1[0,0],xmm0[0,0]
leaq -8(%rcx), %rax
movq %rax, %r8
shrq $3, %r8
addq $1, %r8
testq %rax, %rax
je .LBB0_10
# %bb.11: # %vector.ph.new
movq %r8, %rax
andq $-2, %rax
negq %rax
xorl %edi, %edi
.p2align 4, 0x90
.LBB0_12: # %vector.body
# =>This Inner Loop Header: Depth=1
movups (%rsi,%rdi,4), %xmm2
movups 16(%rsi,%rdi,4), %xmm3
mulps %xmm1, %xmm2
mulps %xmm1, %xmm3
movups (%rdx,%rdi,4), %xmm4
addps %xmm2, %xmm4
movups 16(%rdx,%rdi,4), %xmm2
addps %xmm3, %xmm2
movups 32(%rdx,%rdi,4), %xmm3
movups 48(%rdx,%rdi,4), %xmm5
movups %xmm4, (%rdx,%rdi,4)
movups %xmm2, 16(%rdx,%rdi,4)
movups 32(%rsi,%rdi,4), %xmm2
movups 48(%rsi,%rdi,4), %xmm4
mulps %xmm1, %xmm2
addps %xmm3, %xmm2
mulps %xmm1, %xmm4
addps %xmm5, %xmm4
movups %xmm2, 32(%rdx,%rdi,4)
movups %xmm4, 48(%rdx,%rdi,4)
addq $16, %rdi
addq $2, %rax
jne .LBB0_12
# %bb.13: # %middle.block.unr-lcssa
testb $1, %r8b
je .LBB0_15
.LBB0_14: # %vector.body.epil
movups (%rsi,%rdi,4), %xmm2
movups 16(%rsi,%rdi,4), %xmm3
mulps %xmm1, %xmm2
mulps %xmm1, %xmm3
movups (%rdx,%rdi,4), %xmm1
addps %xmm2, %xmm1
movups 16(%rdx,%rdi,4), %xmm2
addps %xmm3, %xmm2
movups %xmm1, (%rdx,%rdi,4)
movups %xmm2, 16(%rdx,%rdi,4)
.LBB0_15: # %middle.block
cmpq %r9, %rcx
jne .LBB0_3
.LBB0_16: # %for.cond.cleanup
retq
.LBB0_10:
xorl %edi, %edi
testb $1, %r8b
jne .LBB0_14
jmp .LBB0_15
.Lfunc_end0:
.size saxpy, .Lfunc_end0-saxpy
.cfi_endproc
# -- End function
.ident "AMD clang version 12.0.0 (CLANG: AOCC_3.0.0-Build#78 2020_12_10) (based on LLVM Mirror.Version.12.0.0)"
.section ".note.GNU-stack","",@progbits
.addrsig
Another test on NVHPC, actually you can hack the backend CPU part using AOCC with nvc -march=zen2 -Mvect=simd:256 -Mcache_align -fma -S a.c
.
; ModuleID = 'a.c'
target datalayout = "e-p:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"
define internal void @pgCplus_compiled.() noinline {
L.entry:
ret void
}
define void @saxpy(i32 signext %n.arg, float %a.arg, float* %x.arg, float* %y.arg) #0 !dbg !17 {
L.entry:
%n.addr = alloca i32, align 4
%a.addr = alloca float, align 4
%x.addr = alloca float*, align 8
%y.addr = alloca float*, align 8
%.ndi0002.addr = alloca i32, align 4
%.ndi0003.addr = alloca i32, align 4
%.vv0000.addr = alloca i8*, align 8
%.vv0001.addr = alloca i8*, align 8
%.vv0002.addr = alloca i8*, align 8
%.r1.0148.addr = alloca <8 x float>, align 4
%.lcr010001.addr = alloca i32, align 4
store i32 %n.arg, i32* %n.addr, align 4, !tbaa !29
store float %a.arg, float* %a.addr, align 4, !tbaa !29
store float* %x.arg, float** %x.addr, align 8, !tbaa !30
store float* %y.arg, float** %y.addr, align 8, !tbaa !30
%0 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23
%1 = icmp sle i32 %0, 0, !dbg !23
br i1 %1, label %L.B0005, label %L.B0014, !dbg !23
L.B0014:
%2 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23
%3 = bitcast float* %2 to i8*, !dbg !23
%4 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23
%5 = bitcast float* %4 to i8*, !dbg !23
%6 = ptrtoint i8* %5 to i64, !dbg !23
%7 = sub i64 0, %6, !dbg !23
%8 = getelementptr i8, i8* %3, i64 %7, !dbg !23
%9 = icmp ule i8* %8, null, !dbg !23
br i1 %9, label %L.B0008, label %L.B0015, !dbg !23
L.B0015:
%10 = bitcast float* %2 to i8*, !dbg !23
%11 = bitcast float* %4 to i8*, !dbg !23
%12 = ptrtoint i8* %11 to i64, !dbg !23
%13 = sub i64 0, %12, !dbg !23
%14 = getelementptr i8, i8* %10, i64 %13, !dbg !23
%15 = inttoptr i64 32 to i8*, !dbg !23
%16 = icmp ult i8* %14, %15, !dbg !23
br i1 %16, label %L.B0007, label %L.B0008, !dbg !23
L.B0008:
store i32 0, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
%17 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23
store i32 %17, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%18 = icmp slt i32 %17, 8, !dbg !23
br i1 %18, label %L.B0011, label %L.B0016, !dbg !23
L.B0016:
store i8* null, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23
%19 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23
%20 = bitcast float* %19 to i8*, !dbg !23
store i8* %20, i8** %.vv0001.addr, align 8, !tbaa !30, !dbg !23
%21 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23
%22 = bitcast float* %21 to i8*, !dbg !23
store i8* %22, i8** %.vv0002.addr, align 8, !tbaa !30, !dbg !23
%23 = sub i32 %17, 7, !dbg !23
store i32 %23, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%24 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23
%25 = insertelement <8 x float> undef, float %24, i32 0, !dbg !23
%26 = shufflevector <8 x float> %25, <8 x float> undef, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>, !dbg !23
store <8 x float> %26, <8 x float>* %.r1.0148.addr, align 1, !tbaa !29, !dbg !23
br label %L.B0012
L.B0012:
%27 = load <8 x float>, <8 x float>* %.r1.0148.addr, align 4, !tbaa !29, !dbg !23
%28 = load i8*, i8** %.vv0002.addr, align 8, !tbaa !30, !dbg !23
%29 = load i8*, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23
%30 = ptrtoint i8* %29 to i64, !dbg !23
%31 = getelementptr i8, i8* %28, i64 %30, !dbg !23
%32 = bitcast i8* %31 to <8 x float>*, !dbg !23
%33 = load <8 x float>, <8 x float>* %32, align 4, !tbaa !29, !dbg !23
%34 = load i8*, i8** %.vv0001.addr, align 8, !tbaa !30, !dbg !23
%35 = getelementptr i8, i8* %34, i64 %30, !dbg !23
%36 = bitcast i8* %35 to <8 x float>*, !dbg !23
%37 = load <8 x float>, <8 x float>* %36, align 4, !tbaa !29, !dbg !23
%38 = call <8 x float> @llvm.fma.v8f32 (<8 x float> %27, <8 x float> %33, <8 x float> %37), !dbg !23
store <8 x float> %38, <8 x float>* %36, align 1, !tbaa !29, !dbg !23
%39 = getelementptr i8, i8* %29, i64 32, !dbg !23
store i8* %39, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23
%40 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%41 = sub i32 %40, 8, !dbg !23
store i32 %41, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%42 = icmp sgt i32 %41, 0, !dbg !23
br i1 %42, label %L.B0012, label %L.B0017, !llvm.loop !24, !dbg !23
L.B0017:
%43 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%44 = add i32 %43, 7, !dbg !23
store i32 %44, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%45 = icmp eq i32 %44, 0, !dbg !23
br i1 %45, label %L.B0013, label %L.B0018, !dbg !23
L.B0018:
%46 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23
%47 = and i32 %46, -8, !dbg !23
store i32 %47, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
br label %L.B0011
L.B0011:
%48 = load i32, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
%49 = sext i32 %48 to i64, !dbg !23
%50 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23
%51 = getelementptr float, float* %50, i64 %49, !dbg !23
%52 = load float, float* %51, align 4, !tbaa !29, !dbg !23
%53 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23
%54 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23
%55 = getelementptr float, float* %54, i64 %49, !dbg !23
%56 = load float, float* %55, align 4, !tbaa !29, !dbg !23
%57 = call float @llvm.fma.f32 (float %53, float %56, float %52), !dbg !23
store float %57, float* %51, align 4, !tbaa !29, !dbg !23
%58 = add i32 %48, 1, !dbg !23
store i32 %58, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
%59 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%60 = sub i32 %59, 1, !dbg !23
store i32 %60, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
%61 = icmp sgt i32 %60, 0, !dbg !23
br i1 %61, label %L.B0011, label %L.B0013, !llvm.loop !24, !dbg !23
L.B0013:
br label %L.B0009, !dbg !23
L.B0007:
store i32 0, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
%62 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23
store i32 %62, i32* %.lcr010001.addr, align 4, !tbaa !32, !dbg !23
br label %L.B0010
L.B0010:
%63 = load i32, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
%64 = sext i32 %63 to i64, !dbg !23
%65 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23
%66 = getelementptr float, float* %65, i64 %64, !dbg !23
%67 = load float, float* %66, align 4, !tbaa !29, !dbg !23
%68 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23
%69 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23
%70 = getelementptr float, float* %69, i64 %64, !dbg !23
%71 = load float, float* %70, align 4, !tbaa !29, !dbg !23
%72 = call float @llvm.fma.f32 (float %68, float %71, float %67), !dbg !23
store float %72, float* %66, align 4, !tbaa !29, !dbg !23
%73 = add i32 %63, 1, !dbg !23
store i32 %73, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
%74 = load i32, i32* %.lcr010001.addr, align 4, !tbaa !32, !dbg !23
%75 = icmp slt i32 %73, %74, !dbg !23
br i1 %75, label %L.B0010, label %L.B0009, !dbg !23
L.B0009:
br label %L.B0005
L.B0005:
ret void, !dbg !26
}
declare float @llvm.fma.f32(float, float, float)
declare <8 x float> @llvm.fma.v8f32(<8 x float>, <8 x float>, <8 x float>)
declare i32 @__gxx_personality_v0(...)
; Named metadata
!llvm.module.flags = !{ !1, !2 }
!llvm.dbg.cu = !{ !10 }
; Metadata
!1 = !{ i32 2, !"Dwarf Version", i32 4 }
!2 = !{ i32 2, !"Debug Info Version", i32 3 }
!3 = !DIFile(filename: "a.c", directory: "/home/victoryang")
; !4 = !DIFile(tag: DW_TAG_file_type, pair: !3)
!4 = !{ i32 41, !3 }
!5 = !{ }
!6 = !{ }
!7 = !{ !17 }
!8 = !{ }
!9 = !{ }
!10 = distinct !DICompileUnit(file: !3, language: DW_LANG_C_plus_plus, producer: " NVC++ 21.5-0", enums: !5, retainedTypes: !6, globals: !8, emissionKind: FullDebug, imports: !9)
!11 = !DIBasicType(tag: DW_TAG_base_type, name: "int", size: 32, align: 32, encoding: DW_ATE_signed)
!12 = !DIBasicType(tag: DW_TAG_base_type, name: "float", size: 32, align: 32, encoding: DW_ATE_float)
!13 = !DIDerivedType(tag: DW_TAG_pointer_type, size: 64, align: 64, baseType: !12)
!14 = !{ null, !11, !12, !13, !13 }
!15 = !DISubroutineType(types: !14)
!16 = !{ }
!17 = distinct !DISubprogram(file: !3, scope: !10, name: "saxpy", line: 2, type: !15, spFlags: 8, unit: !10, scopeLine: 2)
!18 = !DILocation(line: 2, column: 1, scope: !17)
!19 = !DILexicalBlock(file: !3, scope: !17, line: 2, column: 1)
!20 = !DILocation(line: 2, column: 1, scope: !19)
!21 = !DILexicalBlock(file: !3, scope: !19, line: 2, column: 1)
!22 = !DILocation(line: 2, column: 1, scope: !21)
!23 = !DILocation(line: 3, column: 1, scope: !21)
!24 = !{ !24, !25 }
!25 = !{ !"llvm.loop.vectorize.enable", i1 0 }
!26 = !DILocation(line: 5, column: 1, scope: !19)
!27 = !{ !"PGI C[++] TBAA" }
!28 = !{ !"omnipotent char", !27, i64 0 }
!29 = !{ !28, !28, i64 0 }
!30 = !{ !"<T>*", !28, i64 0 }
!31 = !{ !"int", !28, i64 0 }
!32 = !{ !31, !31, i64 0 }
!33 = !{ !"float", !28, i64 0 }
!34 = !{ !33, !33, i64 0 }
and
.text
.file "a.ll"
.globl saxpy # -- Begin function saxpy
.p2align 4, 0x90
.type saxpy,@function
saxpy: # @saxpy
.Lfunc_begin0:
.file 1 "/home/victoryang/a.c"
.loc 1 2 0 # a.c:2:0
.cfi_sections .debug_frame
.cfi_startproc
# %bb.0: # %L.entry
.loc 1 3 1 prologue_end # a.c:3:1
testl %edi, %edi
jle .LBB0_19
# %bb.1: # %L.B0014
movq %rdx, %rax
subq %rsi, %rax
je .LBB0_11
# %bb.2: # %L.B0014
cmpq $31, %rax
ja .LBB0_11
# %bb.3: # %L.B0010.preheader
movl %edi, %eax
cmpl $31, %edi
jbe .LBB0_4
# %bb.5: # %vector.memcheck
leaq (%rsi,%rax,4), %rcx
cmpq %rdx, %rcx
jbe .LBB0_7
# %bb.6: # %vector.memcheck
.loc 1 0 1 is_stmt 0 # a.c:0:1
leaq (%rdx,%rax,4), %rcx
.loc 1 3 1 # a.c:3:1
cmpq %rsi, %rcx
jbe .LBB0_7
.LBB0_4:
.loc 1 0 1 # a.c:0:1
xorl %ecx, %ecx
.p2align 4, 0x90
.LBB0_10: # %L.B0010
# =>This Inner Loop Header: Depth=1
.loc 1 3 1 # a.c:3:1
vmovss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero
vfmadd213ss (%rdx,%rcx,4), %xmm0, %xmm1 # xmm1 = (xmm0 * xmm1) + mem
vmovss %xmm1, (%rdx,%rcx,4)
incq %rcx
cmpq %rcx, %rax
jne .LBB0_10
jmp .LBB0_19
.LBB0_11: # %L.B0008
.loc 1 0 1 # a.c:0:1
xorl %ecx, %ecx
.loc 1 3 1 # a.c:3:1
cmpl $8, %edi
jge .LBB0_13
# %bb.12:
.loc 1 0 1 # a.c:0:1
movl %edi, %eax
jmp .LBB0_17
.LBB0_13: # %L.B0016
.loc 1 3 1 # a.c:3:1
vbroadcastss %xmm0, %ymm1
xorl %ecx, %ecx
movl %edi, %eax
.p2align 4, 0x90
.LBB0_14: # %L.B0012
# =>This Inner Loop Header: Depth=1
vmovups (%rsi,%rcx), %ymm2
movl %eax, %r8d
vfmadd213ps (%rdx,%rcx), %ymm1, %ymm2 # ymm2 = (ymm1 * ymm2) + mem
leal -8(%r8), %eax
addl $-7, %r8d
vmovups %ymm2, (%rdx,%rcx)
addq $32, %rcx
cmpl $8, %r8d
jg .LBB0_14
# %bb.15: # %L.B0017
testl %eax, %eax
je .LBB0_19
# %bb.16: # %L.B0018
andl $-8, %edi
movl %edi, %ecx
.LBB0_17: # %L.B0011.preheader
incl %eax
.p2align 4, 0x90
.LBB0_18: # %L.B0011
# =>This Inner Loop Header: Depth=1
movslq %ecx, %rcx
decl %eax
vmovss (%rsi,%rcx,4), %xmm1 # xmm1 = mem[0],zero,zero,zero
vfmadd213ss (%rdx,%rcx,4), %xmm0, %xmm1 # xmm1 = (xmm0 * xmm1) + mem
vmovss %xmm1, (%rdx,%rcx,4)
incl %ecx
cmpl $1, %eax
jg .LBB0_18
.Ltmp0:
.LBB0_19: # %L.B0005
.loc 1 5 1 is_stmt 1 # a.c:5:1
vzeroupper
retq
.LBB0_7: # %vector.ph
.Ltmp1:
.loc 1 3 1 # a.c:3:1
vbroadcastss %xmm0, %ymm1
movl %eax, %ecx
xorl %edi, %edi
andl $-32, %ecx
.p2align 4, 0x90
.LBB0_8: # %vector.body
# =>This Inner Loop Header: Depth=1
vmovups (%rsi,%rdi,4), %ymm2
vmovups 32(%rsi,%rdi,4), %ymm3
vmovups 64(%rsi,%rdi,4), %ymm4
vmovups 96(%rsi,%rdi,4), %ymm5
vfmadd213ps (%rdx,%rdi,4), %ymm1, %ymm2 # ymm2 = (ymm1 * ymm2) + mem
vfmadd213ps 32(%rdx,%rdi,4), %ymm1, %ymm3 # ymm3 = (ymm1 * ymm3) + mem
vfmadd213ps 64(%rdx,%rdi,4), %ymm1, %ymm4 # ymm4 = (ymm1 * ymm4) + mem
vfmadd213ps 96(%rdx,%rdi,4), %ymm1, %ymm5 # ymm5 = (ymm1 * ymm5) + mem
vmovups %ymm2, (%rdx,%rdi,4)
vmovups %ymm3, 32(%rdx,%rdi,4)
vmovups %ymm4, 64(%rdx,%rdi,4)
vmovups %ymm5, 96(%rdx,%rdi,4)
addq $32, %rdi
cmpq %rdi, %rcx
jne .LBB0_8
# %bb.9: # %middle.block
cmpq %rax, %rcx
jne .LBB0_10
jmp .LBB0_19
.Ltmp2:
.Lfunc_end0:
.size saxpy, .Lfunc_end0-saxpy
.cfi_endproc
# -- End function
.section .debug_abbrev,"",@progbits
.byte 1 # Abbreviation Code
.byte 17 # DW_TAG_compile_unit
.byte 1 # DW_CHILDREN_yes
.byte 37 # DW_AT_producer
.byte 14 # DW_FORM_strp
.byte 19 # DW_AT_language
.byte 5 # DW_FORM_data2
.byte 3 # DW_AT_name
.byte 14 # DW_FORM_strp
.byte 16 # DW_AT_stmt_list
.byte 23 # DW_FORM_sec_offset
.byte 27 # DW_AT_comp_dir
.byte 14 # DW_FORM_strp
.ascii "\264B" # DW_AT_GNU_pubnames
.byte 25 # DW_FORM_flag_present
.byte 17 # DW_AT_low_pc
.byte 1 # DW_FORM_addr
.byte 18 # DW_AT_high_pc
.byte 6 # DW_FORM_data4
.byte 0 # EOM(1)
.byte 0 # EOM(2)
.byte 2 # Abbreviation Code
.byte 46 # DW_TAG_subprogram
.byte 0 # DW_CHILDREN_no
.byte 17 # DW_AT_low_pc
.byte 1 # DW_FORM_addr
.byte 18 # DW_AT_high_pc
.byte 6 # DW_FORM_data4
.byte 64 # DW_AT_frame_base
.byte 24 # DW_FORM_exprloc
.byte 3 # DW_AT_name
.byte 14 # DW_FORM_strp
.byte 58 # DW_AT_decl_file
.byte 11 # DW_FORM_data1
.byte 59 # DW_AT_decl_line
.byte 11 # DW_FORM_data1
.byte 63 # DW_AT_external
.byte 25 # DW_FORM_flag_present
.byte 0 # EOM(1)
.byte 0 # EOM(2)
.byte 0 # EOM(3)
.section .debug_info,"",@progbits
.Lcu_begin0:
.long .Ldebug_info_end0-.Ldebug_info_start0 # Length of Unit
.Ldebug_info_start0:
.short 4 # DWARF version number
.long .debug_abbrev # Offset Into Abbrev. Section
.byte 8 # Address Size (in bytes)
.byte 1 # Abbrev [1] 0xb:0x35 DW_TAG_compile_unit
.long .Linfo_string0 # DW_AT_producer
.short 4 # DW_AT_language
.long .Linfo_string1 # DW_AT_name
.long .Lline_table_start0 # DW_AT_stmt_list
.long .Linfo_string2 # DW_AT_comp_dir
# DW_AT_GNU_pubnames
.quad .Lfunc_begin0 # DW_AT_low_pc
.long .Lfunc_end0-.Lfunc_begin0 # DW_AT_high_pc
.byte 2 # Abbrev [2] 0x2a:0x15 DW_TAG_subprogram
.quad .Lfunc_begin0 # DW_AT_low_pc
.long .Lfunc_end0-.Lfunc_begin0 # DW_AT_high_pc
.byte 1 # DW_AT_frame_base
.byte 87
.long .Linfo_string3 # DW_AT_name
.byte 1 # DW_AT_decl_file
.byte 2 # DW_AT_decl_line
# DW_AT_external
.byte 0 # End Of Children Mark
.Ldebug_info_end0:
.section .debug_str,"MS",@progbits,1
.Linfo_string0:
.asciz " NVC++ 21.5-0" # string offset=0
.Linfo_string1:
.asciz "a.c" # string offset=14
.Linfo_string2:
.asciz "/home/victoryang" # string offset=18
.Linfo_string3:
.asciz "saxpy" # string offset=35
.section .debug_pubnames,"",@progbits
.long .LpubNames_end0-.LpubNames_begin0 # Length of Public Names Info
.LpubNames_begin0:
.short 2 # DWARF Version
.long .Lcu_begin0 # Offset of Compilation Unit Info
.long 64 # Compilation Unit Length
.long 42 # DIE offset
.asciz "saxpy" # External Name
.long 0 # End Mark
.LpubNames_end0:
.section .debug_pubtypes,"",@progbits
.long .LpubTypes_end0-.LpubTypes_begin0 # Length of Public Types Info
.LpubTypes_begin0:
.short 2 # DWARF Version
.long .Lcu_begin0 # Offset of Compilation Unit Info
.long 64 # Compilation Unit Length
.long 0 # End Mark
.LpubTypes_end0:
.section ".note.GNU-stack","",@progbits
.section .debug_line,"",@progbits
.Lline_table_start0:
GCC arm SVE
对于超算来说应该介绍 Arm FX 64 的。但笔者觉得还是先科普一下 SVE 比较好,说不定下一次 ISC 就有了。
saxpy with neon
// x0 = &x[0], x1 = &y[0], x2 = &a, x3 = &n
saxpy_:
ldrswx3, [x3] // x3=*n
movx4, #0 // x4=i=0
ldrd0, [x2] // d0=*a
b .latch
.loop:
ldrd1, [x0, x4, lsl #3]// d1=x[i]
ldrd2, [x1, x4, lsl #3]// d2=y[i]
fmaddd2, d1, d0, d2. // d2+=x[i]*a
strd2, [x1, x4, lsl #3]// y[i]=d2
addx4, x4, #1 // i+=1
.latch:
cmpx4, x3 // i < n
b.lt .loop// more to do?
ret
saxpy with sve
// x0 = &x[0], x1 = &y[0], x2 = &a, x3 = &n
saxpy_:
ldrswx3, [x3] // x3=*n
movx4, #0 // x4=i=0
whileltp0.d, x4, x3 // p0=while(i++<n)
ld1rdz0.d, p0/z, [x2]// p0:z0=bcast(*a)
.loop:
ld1dz1.d, p0/z, [x0, x4, lsl #3]// p0:z1=x[i]
ld1dz2.d, p0/z, [x1, x4, lsl #3]// p0:z2=y[i]
fmlaz2.d, p0/m, z1.d, z0.d // p0?z2+=x[i]*a
st1dz2.d, p0, [x1, x4, lsl #3] // p0?y[i]=z2
incdx4 // i+=(VL/64)
.latch:
whileltp0.d, x4, x3 // p0=while(i++<n)
b.first .loop // more to do?
ret
There is no overhead in instruction count for the SVE version when compared to the equivalent scalar code, which allows a compiler to opportunistically vectorize loops withan unknown trip count.
-
16个可伸缩预测寄存器(P0-P15):普通的内存和算数操作的控制仅限于P0-P7。但是生成predicate的指令(向量比较)和依赖predicate的指令(逻辑操作)会使用全部寄存器P0-P15。通过分析编译和手动优化,这样的分配方案被验证有效并且减轻了predicate寄存器在其它架构上被观察到的压力
-
mixed element尺寸控制:每个predicate寄存器允许将最低粒度降低到字节水平,所以每个bit位对应8blt的数据宽度。
-
predicate条件:在SVE中产生predication的指令是复用NZCV条件码flags,这个NZCV有不同的解释
-
隐式顺序:predicate有一个隐式地从最低到最高位元素顺序解释,对应于一个等效的序列顺序。我们引用与此顺序有关的第一个和最后一个predicate elements以及它们的关联条件。
whileltp0
is to predict before the last max alignment which may cause throughput drain. OoO may shot the last element with low occpancy which lead to waste of this shot, alternatively, it could shot lower other (2^n) aligned instruction.
For hazard execution and speculation, you could easily doing gather load with z3 reg fault trap and reload.
OpenMP
The compiler support the openmp by default. The OpenMP standard for specifying threading in programming languages like C and Fortran is implemented in the compiler itself and as such is an integral part of the compiler in question. The OMP and POSIX thread library underneath can vary, but this is normally hidden from the user. OpenMP makes use of POSIX threads so both an OpenMP library and a POSIX thread library is needed. The POSIX thread library is normally supplied with the distribution (typically /usr/lib64/libpthread.so).
\[ \begin{array}{|l|c|c|} \hline \text { Compiler } & \text { Flag to select OpenMP } & \text { OpenMP version supported } \\ \hline \text { Intel compilers } & \text {-qopenmp } & \text { From } 17.0 \text { on : } 4.5 \\ \hline \text { GNU compilers } & \text {-fopenmp } & \text { From GCC 6.1 on : 4.5 } \\ \hline \text { PGI compilers } & -\mathrm{mp} & 4.5 \\ \hline \end{array} \]
You definitely need to watch Fanrui's PPT and understand the implementation of OpenMP in Clang.
Ref
- https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/lecture-slides/MIT6_172F18_lec9.pdf
- 程序员的自我修养
- 不同编译器的编译行为比较
- The ARM Scalable Vector Extension
- https://www.stonybrook.edu/commcms/ookami/support/_docs/1%20-%20Intro%20to%20A64FX.pdf
- https://llvm.org/devmtg/2017-10/slides/Ferguson-Enabling%20Parallel%20Computing%20in%20Chapel.pdf