Ascend Triton 新手开发指南：手把手实现第一个 NPU 算子

May 1, 2026

作者：Harry Fan | 技术栈：Triton · Ascend NPU · CANN · torch_npu

1. 技术栈全景回顾

在开始之前，先建立整体概念。从 Python 代码到昇腾 NPU 可执行文件，完整流水线如下：

Python/Triton 代码
       |
       v
Triton IR (TTIR)       <-- 你写的 @triton.jit 内核
       |  triton-ascend 编译
       v
TTGIR (Triton GPU IR)
       |  GPUToHFusion Pass
       v
HFusion Dialect         <-- 昇腾高级 IR（融合操作）
       |  HFusionToHIVM Pass
       v
HIVM Dialect            <-- 昇腾中级 IR（向量化核心指令）
       |  bishengir-compile
       v
Ascend 指令 (二进制)
       |
       v
NPU Runtime (CANN)     <-- 昇腾驱动 + 工具链
       |
       v
昇腾物理硬件

涉及的核心项目：

项目	作用	定位
Triton-Ascend	Triton 的昇腾后端	Python PyPI 包
triton-ascend-ops	昇腾定制算子库	Triton 扩展
AscendNPU-IR (bishengir)	TTGIR → 昇腾 IR 编译器	GitCode 源码
CANN	昇腾异构计算架构	驱动 + 工具链

2. 环境准备

2.1 硬件与操作系统要求

昇腾 910 系列或 310 系列 NPU
Ubuntu 18.04+ / CentOS 7.6+
Python 3.9 ~ 3.11（不支持 3.12）

2.2 安装 CANN（昇腾计算架构）

CANN 是所有昇腾开发的基础环境，必须首先安装。

# 1. 下载 CANN 工具包（从昇腾社区）
# 访问 https://www.hiascend.com/cann/download
# 下载 toolkit 包 + ops 包（对应你的硬件型号）

# 2. 安装（以 9.0.0 版本为例）
chmod +x Ascend-cann_9.0.0_linux-x86_64.run
chmod +x Ascend-cann-A3-ops_9.0.0_linux-x86_64.run

# 安装工具链
./Ascend-cann_9.0.0_linux-x86_64.run --full --install-path=/opt/ascend

# 安装 ops 包
./Ascend-cann-A3-ops_9.0.0_linux-x86_64.run --install --install-path=/opt/ascend

# 3. 设置环境变量
source /opt/ascend/cann/set_env.sh

# 4. 验证
echo $ASCEND_HOME_PATH  # 应输出 /opt/ascend

注意：CANN 版本与 torch_npu 版本必须匹配，跨版本组合可能导致内核编译失败。

2.3 安装 torch_npu 和 triton-ascend

# 确认 Python 版本
python --version  # 必须是 3.9-3.11

# 安装 PyTorch（昇腾定制版）
pip install torch==2.7.1

# 安装 torch_npu（核心依赖）
pip install torch_npu==2.7.1

# 安装 triton-ascend（昇腾后端）
pip install triton-ascend

# 验证安装
python -c "import torch; import torch_npu; print('torch_npu:', torch_npu.__version__)"
python -c "import triton; print('triton:', triton.__version__)"
python -c "import triton.language.extra.cann.extension as al; print('Ascend extension OK')"

2.4 环境变量汇总

将以下内容加入 ~/.bashrc：

# CANN 环境
export ASCEND_HOME_PATH=/opt/ascend
source $ASCEND_HOME_PATH/cann/set_env.sh

# Python 路径
export PATH=$PATH:$HOME/.local/bin

3. 第一个 Triton 算子：矢量加法

3.1 为什么从矢量加法开始？

矢量加法（y = a + b）是所有并行计算的基石：

计算模式简单，易于理解
完美契合 NPU SIMT 执行模型
调试工具齐全，容易验证正确性
后续复杂算子（MatMul、Softmax）都是它的扩展

3.2 完整代码

创建文件 vector_add.py：

import torch
import triton
import triton.language as tl

# ============================================================
# 第一部分：Triton Kernel 定义
# ============================================================

@triton.jit                           # JIT 编译装饰器
def vector_add_kernel(
    a_ptr,                            # 输入指针（可以是任意类型）
    b_ptr,
    output_ptr,
    n_elements: tl.constexpr,         # constexpr = 编译时常量
    BLOCK_SIZE: tl.constexpr,         # 每个 thread block 处理的数据量
):
    # program_id：当前 thread block 的全局索引
    block_start = tl.program_id(0) * BLOCK_SIZE
    
    # offsets：当前 block 内所有线程的偏移量
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    
    # mask：防止越界访问
    mask = offsets < n_elements
    
    # 从 global memory 加载数据
    a = tl.load(a_ptr + offsets, mask=mask, other=0.0)
    b = tl.load(b_ptr + offsets, mask=mask, other=0.0)
    
    # 计算
    output = a + b
    
    # 将结果写回 global memory
    tl.store(output_ptr + offsets, output, mask=mask)


# ============================================================
# 第二部分：PyTorch 封装（Python API）
# ============================================================

def vector_add(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    assert a.shape == b.shape, "a 和 b 必须有相同的 shape"
    
    # 确保数据在 NPU 上连续排列
    a = a.contiguous()
    b = b.contiguous()
    
    # 分配输出张量
    n_elements = a.numel()
    output = torch.empty_like(a)
    
    # 定义 Block 大小（编译时常量）
    BLOCK_SIZE = 128
    
    # 定义 grid（thread block 的数量）
    grid = lambda meta: ((n_elements + meta['BLOCK_SIZE'] - 1) // meta['BLOCK_SIZE'],)
    
    # 启动 kernel
    vector_add_kernel[grid](
        a, b, output, n_elements, BLOCK_SIZE=BLOCK_SIZE
    )
    
    return output


# ============================================================
# 第三部分：测试代码
# ============================================================

if __name__ == "__main__":
    N = 1024
    a = torch.randn(N, device="npu", dtype=torch.float32)
    b = torch.randn(N, device="npu", dtype=torch.float32)
    
    output_triton = vector_add(a, b)
    output_torch = a + b
    
    torch.testing.assert_close(output_triton, output_torch, rtol=1e-4, atol=1e-4)
    print("PASS: 矢量加法测试通过！")

3.3 运行

python vector_add.py

正常输出：

PASS: 矢量加法测试通过！
   输入 shape: (1024,)
   Block 大小: 128
   Grid 大小: 8

4. 编译流程详解

4.1 四层 IR 变换

当你调用 vector_add_kernel[grid](...) 时，Triton-Ascend 实际上执行了以下编译流水线：

步骤	IR 层	说明
1	TTIR	Triton IR -- 你的 Python @triton.jit 代码
2	TTGIR	Triton GPU IR -- device-specific 类型推导
3	HFusion	高级融合 IR -- GPUToHFusion Pass
4	HIVM	向量化核心 IR -- HFusionToHIVM Pass
5	Binary	Ascend 指令 -- bishengir-compile 编译

4.2 捕获中间 IR（调试用）

import triton.language.extra.cann.extension as al

@triton.jit
def vector_add_kernel_debug(a_ptr, b_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    al.print(a, "a before add")
    al.debug_barrier(al.SYNC_IN_VF.VV_ALL)
    # ...

5. PyTorch 集成：torch.autograd 兼容

5.1 带梯度的矢量加法

from torch.autograd import Function

@triton.jit
def vector_add_fwd_kernel(a_ptr, b_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    block_start = tl.program_id(0) * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    a = tl.load(a_ptr + offsets, mask=mask, other=0.0)
    b = tl.load(b_ptr + offsets, mask=mask, other=0.0)
    tl.store(output_ptr + offsets, a + b, mask=mask)

@triton.jit
def vector_add_bwd_kernel(grad_out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    block_start = tl.program_id(0) * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    grad = tl.load(grad_out_ptr + offsets, mask=mask, other=0.0)
    tl.store(grad_out_ptr + offsets, grad, mask=mask)

class VectorAddFn(Function):
    @staticmethod
    def forward(ctx, a, b):
        output = torch.empty_like(a)
        n_elements = a.numel()
        grid = lambda meta: ((n_elements + meta['BLOCK_SIZE'] - 1) // meta['BLOCK_SIZE'],)
        vector_add_fwd_kernel[grid](a, b, output, n_elements, BLOCK_SIZE=128)
        ctx.save_for_backward(a, b)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, grad_output

class VectorAdd(torch.nn.Module):
    def forward(self, a, b):
        return VectorAddFn.apply(a, b)

5.2 验证梯度

from torch.autograd import gradcheck

model = VectorAdd().to("npu")
a = torch.randn(256, device="npu", dtype=torch.float64, requires_grad=True)
b = torch.randn(256, device="npu", dtype=torch.float64, requires_grad=True)

assert gradcheck(model, (a, b), rtol=1e-4, atol=1e-4)
print("PASS: 梯度验证通过！")

6. 进阶案例：融合矢量加法

在真实模型中，y = (a + b) * scale 场景：

@triton.jit
def fused_add_scale_kernel(
    a_ptr, b_ptr, scale_val,
    output_ptr, n_elements, BLOCK_SIZE: tl.constexpr
):
    block_start = tl.program_id(0) * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    a = tl.load(a_ptr + offsets, mask=mask, other=0.0)
    b = tl.load(b_ptr + offsets, mask=mask, other=0.0)
    output = (a + b) * scale_val
    tl.store(output_ptr + offsets, output, mask=mask)

def fused_add_scale(a, b, scale):
    output = torch.empty_like(a)
    n_elements = a.numel()
    grid = lambda meta: ((n_elements + meta['BLOCK_SIZE'] - 1) // meta['BLOCK_SIZE'],)
    fused_add_scale_kernel[grid](a, b, scale, output, n_elements, BLOCK_SIZE=128)
    return output

为什么融合更好？

减少 global memory 访问（只 load 两次，store 一次）
减少 kernel launch 开销
编译器可以在 block 级别做更好的指令调度

7. 调试与性能分析

7.1 常见报错与解决方案

报错信息	原因	解决方案
No module named 'triton'	未安装 triton-ascend	pip install triton-ascend
device-side assert triggered	内存越界访问	检查 mask 和 BLOCK_SIZE
浮点异常	除零或 NaN	检查输入数据
NPU not found	CANN 未正确安装	source $ASCEND_HOME_PATH/cann/set_env.sh
block size too large	BLOCK_SIZE 超过硬件限制	减小到 128 或 256

7.2 msprof 性能分析

# 设置 msprof 环境
export TOOLCHAIN_HOME=$ASCEND_HOME_PATH/latest
source $ASCEND_HOME_PATH/latest/set_env.sh

# 运行带 profiling 的测试
python -m torch_npu.npu.profile --save-profile=./profile_output vector_add.py

# 查看 profile 结果
msprof --export ./profile_output --output-dir ./profile_report

7.3 Triton 调试技巧

# 1. 打印中间值
@triton.jit
def debug_kernel(...):
    al.print(a, "a values:")
    al.debug_barrier(al.SYNC_IN_VF.VV_ALL)
    
# 2. 断言条件
@triton.jit
def safe_kernel(...):
    assert n_elements > 0, "n_elements must be positive"
    assert BLOCK_SIZE <= 1024, "BLOCK_SIZE too large"

# 3. 快速验证
BLOCK_SIZE = 4
N = 16

8. 从 Triton 算子到 PyTorch 自动化转换

当你有现成的 PyTorch 算子想迁移到 Triton NPU 时，可以使用 triton-agent 工具链自动化转换：

# 克隆 triton-agent
cd ~/code
git clone https://github.com/your-org/triton-agent.git

# 转换 PyTorch 算子
python -m triton_agent convert \
    --input ~/path/to/your/pytorch_operator.py \
    --output ~/path/to/output.py \
    --backend ascend

# 生成差异测试
python -m triton_agent gen-test \
    --operator ~/path/to/output.py \
    --reference ~/path/to/your/pytorch_operator.py

# 运行测试验证
python -m triton_agent run-eval \
    --test tests/differential_test.py

9. 完整项目结构建议

对于实际开发，推荐以下目录结构：

my_triton_project/
|-- kernels/                  # Triton kernel 定义
|   |-- __init__.py
|   |-- vector_ops.py        # 矢量操作（add, sub, scale...）
|   |-- matmul.py            # 矩阵乘法
|   |-- reduction.py         # 归约操作
|-- ops/                      # PyTorch 封装
|   |-- __init__.py
|   |-- triton_ops.py
|-- tests/                    # 测试
|   |-- test_vector_ops.py
|   |-- test_matmul.py
|-- profiles/                 # 性能分析结果
|   |-- .gitkeep
|-- scripts/
|   |-- profile_ops.py       # 性能测试脚本
|   |-- benchmark.py
|-- requirements.txt

10. 下一步学习路径

学完本文后，建议按以下顺序继续深入：

阶段	主题	学习目标
进阶 1	Triton MatMul 算子	理解 Tile、2D Block、shared memory 优化
进阶 2	Softmax / Layernorm	掌握 warp-reduce、online softmax 算法
进阶 3	Attention 算子	融合 QKV 投影 + FlashAttention 模式
进阶 4	Ascend CustomOp	使用 al.register_custom_op 注册设备端函数
进阶 5	IR 调优	分析 TTGIR -> HFusion -> HIVM 中间产物
进阶 6	msprof 调优	使用硬件计数器定位瓶颈