ARM NEON optimization tips

haidaowang · posted 2021-4-29 17:57

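1. Remove data dependencies
Do not use the destination register of the current instruction as a source register of the next instruction!
Reason: ARM cores use a multi-stage pipeline. If the next instruction reads a register that the current instruction writes, it cannot execute until that result is available, so the pipeline stalls and performance suffers.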
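
One common way to apply this from C is to keep several independent accumulators, so that back-to-back instructions do not wait on each other's results. The sketch below is an illustration rather than code from the original post: the function name `sum_i32`, the choice of two accumulators, and the requirement that `n` be a multiple of 8 are all assumptions made to keep it short. A scalar fallback with the same two-chain structure is compiled when NEON is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Sums n int32 values. n must be a multiple of 8 (simplifying assumption). */
int32_t sum_i32(const int32_t *src, size_t n)
{
    assert(n % 8 == 0);
#if defined(__ARM_NEON)
    /* Two independent accumulators: the add into acc1 does not depend on
     * the result of the add into acc0, so the two can overlap. */
    int32x4_t acc0 = vdupq_n_s32(0);
    int32x4_t acc1 = vdupq_n_s32(0);
    for (size_t i = 0; i < n; i += 8) {
        acc0 = vaddq_s32(acc0, vld1q_s32(src + i));
        acc1 = vaddq_s32(acc1, vld1q_s32(src + i + 4));
    }
    /* Combine the two chains once, after the loop. */
    int32x4_t acc = vaddq_s32(acc0, acc1);
    return vgetq_lane_s32(acc, 0) + vgetq_lane_s32(acc, 1)
         + vgetq_lane_s32(acc, 2) + vgetq_lane_s32(acc, 3);
#else
    /* Scalar fallback with the same two-chain structure. */
    int32_t acc0 = 0, acc1 = 0;
    for (size_t i = 0; i < n; i += 8) {
        for (size_t k = 0; k < 4; ++k) {
            acc0 += src[i + k];
            acc1 += src[i + 4 + k];
        }
    }
    return acc0 + acc1;
#endif
}
```

How many accumulators pay off depends on the core's add latency and issue width, so the count is worth checking by measurement on the target device.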

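2. Reduce branching
ARM processors make heavy use of branch prediction, but a mispredicted branch costs a lot of performance. So
avoid branches where you can, and replace them with compare and bitwise-select instructions!
For example:
VCEQ, VCGE, VCGT, VCLE, VCLT...
VBIT, VBIF, VBSL...
You can also use conditionally executed instructions such as addgt and suble to reduce branching!
It also helps to process several rows of data per iteration, which cuts the number of loop branches and improves performance.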
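
As a minimal sketch of the compare-then-select pattern, the function below replaces `if (x > lo) ... else ...` with a VCGT-style compare that produces a mask and a VBSL-style bitwise select, handling four elements per iteration. The name `select_max_i32`, the parameter `lo`, and the multiple-of-4 length requirement are assumptions for illustration. When NEON is unavailable, a scalar fallback builds the same all-ones/all-zeros mask by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* dst[i] = src[i] if src[i] > lo, otherwise lo. No data-dependent branch.
 * n must be a multiple of 4 (simplifying assumption). */
void select_max_i32(int32_t *dst, const int32_t *src, size_t n, int32_t lo)
{
    assert(n % 4 == 0);
#if defined(__ARM_NEON)
    const int32x4_t vlo = vdupq_n_s32(lo);
    for (size_t i = 0; i < n; i += 4) {
        int32x4_t v = vld1q_s32(src + i);
        /* Compare: each lane becomes all ones where v > lo, else zero. */
        uint32x4_t gt = vcgtq_s32(v, vlo);
        /* Bitwise select: take bits of v where the mask is set, else vlo. */
        vst1q_s32(dst + i, vbslq_s32(gt, v, vlo));
    }
#else
    for (size_t i = 0; i < n; ++i) {
        /* 0xFFFFFFFF when src[i] > lo, otherwise 0. */
        uint32_t mask = 0u - (uint32_t)(src[i] > lo);
        uint32_t v = (uint32_t)src[i];
        uint32_t f = (uint32_t)lo;
        dst[i] = (int32_t)((v & mask) | (f & ~mask));
    }
#endif
}
```

For this particular operation a single vector max would also work; the compare-plus-select form is shown because it carries over to conditions that have no dedicated instruction.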

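3. Consider using the preload instruction PLD
PLD lets the processor tell the memory system that data will be read from a given address in the near future. If the data has already been loaded into the cache by the time it is needed, the cache hit rate goes up and performance improves.
Risk: the newest architectures do not support the PLD instruction well, so using it when you are not sure it helps can cost performance!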

PLD syntax:
PLD{cond} [Rn {, #offset}]
PLD{cond} [Rn, +/-Rm {, shift}]
PLD{cond} label
where:
cond - is an optional condition code.
Rn - is the register on which the memory address is based.
offset - is an immediate offset. If offset is omitted, the address is the value in Rn.
Rm - contains an offset value and must not be PC (or SP, in Thumb state).
shift - is an optional shift.
label - is a PC-relative expression.
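
From C, a prefetch hint is usually requested through the GCC/Clang builtin `__builtin_prefetch`, which compilers targeting ARM typically lower to a preload instruction such as PLD. The sketch below is only an illustration: the function name `sum_u8_prefetch` and the distance `PREFETCH_AHEAD` are hypothetical, and the distance in particular is a placeholder that would need tuning by measurement on the target core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefetch distance in bytes; tune by measurement. */
#define PREFETCH_AHEAD 64

/* Sums n bytes, issuing a read-prefetch hint one block ahead. */
uint32_t sum_u8_prefetch(const uint8_t *src, size_t n)
{
    assert(src != NULL || n == 0);
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__)
        /* Once per block, hint that the next block will be read soon.
         * The bounds check keeps the address expression inside the array. */
        if ((i % PREFETCH_AHEAD) == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(src + i + PREFETCH_AHEAD, 0, 3);
#endif
        total += src[i];
    }
    return total;
}
```

The hint never changes the result, only the timing, so it is easy to compare builds with and without it before deciding whether to keep it.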

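4. Watch instruction latency
VMLA can replace VMUL + VADD, but VMLA has a relatively long latency. When no independent instructions follow it to run in parallel, it may actually perform worse than VMUL + VADD.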
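
To make the two alternatives concrete, the sketch below writes the same multiply-accumulate in both forms with NEON intrinsics: `vmlaq_f32` for the VMLA form, and `vmulq_f32` followed by `vaddq_f32` for the split form. The function names `mac_vmla` and `mac_vmul_vadd` and the multiple-of-4 length requirement are assumptions for illustration, and a scalar fallback is compiled when NEON is unavailable. The compiler remains free to choose the actual instructions, so which form is faster can only be settled by inspecting the generated code and measuring.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* acc[i] += a[i] * b[i], written with the multiply-accumulate intrinsic. */
void mac_vmla(float *acc, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
#if defined(__ARM_NEON)
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t r = vmlaq_f32(vld1q_f32(acc + i),
                                  vld1q_f32(a + i), vld1q_f32(b + i));
        vst1q_f32(acc + i, r);
    }
#else
    for (size_t i = 0; i < n; ++i)
        acc[i] += a[i] * b[i];
#endif
}

/* Same computation, written as a separate multiply and add. */
void mac_vmul_vadd(float *acc, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
#if defined(__ARM_NEON)
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t prod = vmulq_f32(vld1q_f32(a + i), vld1q_f32(b + i));
        vst1q_f32(acc + i, vaddq_f32(vld1q_f32(acc + i), prod));
    }
#else
    for (size_t i = 0; i < n; ++i) {
        float prod = a[i] * b[i];
        acc[i] = acc[i] + prod;
    }
#endif
}
```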

5. NEON assembly vs. NEON intrinsics: performance comparison
NEON assembly:
In the hands of an experienced developer, always gives the best performance on the target platform.
NEON intrinsics:
Performance depends heavily on the toolchain used.

Anda · posted 2021-4-30 09:48

Reduce branching
