Ascend Triton Stack Overview: The Triton Compiler Ecosystem for Huawei Ascend NPUs

May 1, 2026

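Author: Harry Fan | Stack: Triton · Ascend NPU · MLIR · CANN

1. Why Is an Ascend Triton Stack Needed?

Triton is a DSL (domain-specific language) from OpenAI for writing efficient GPU/NPU kernels. Its core idea is that researchers and engineers write parallel compute kernels in a Pythonic style, while the compiler takes care of generating hardware-friendly low-level code.

Huawei's Ascend NPUs are widely deployed in China, but upstream Triton only supports NVIDIA and AMD GPUs out of the box. To plug into the Triton ecosystem, Huawei has to maintain its own complete compiler stack, from the Triton frontend down to Ascend hardware.

That stack is what this article maps out: on the way from a user-written @triton.jit function to an executable for the Ascend NPU, which projects are involved, how they relate to one another, and how their dependencies are organized.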

2. Architecture Overview

User Layer (User Space)
================================================================================

  @triton.jit kernel                          PyTorch (@torch.npu)
        |                                           |
================== Framework Integration Layer ==================

  Triton-Ascend                             torch_npu
  gitcode.com/Ascend/triton-ascend          pip install torch_npu
         |                                      |
         |         +---------------------------+
         |         |
         |    TileLang-Ascend (TVM-based DSL)
         |    gitcode.com/Ascend/tilelang-ascend
         |         |
========+=========+=============================================================
        |         |
        v         v
================== IR Lowering Layer (MLIR Compilation) ==================

  TTAdapter (.ttadapter.mlir)        Torch IR    Linalg/HFusion IR
         |                              |              |
         +---------------+--------------+              |
                     v                              |
         +------------------------------------+---+
         |     AscendNPU-IR (bishengir)         |
         |  gitcode.com/Ascend/AscendNPU-IR     |
         |  github.com/Ascend/AscendNPU-IR      |
         |                                        |
         |  TTAdapter -> Linalg -> HFusion -> HIVM -> .o
         |                                        |
         |  HFusion Dialect | HIVM Dialect | HACC
         |                                        |
         |  bishengir-opt / bishengir-compile    |
         |  --> Ascend NPU Binary (.o)        |
         +----------------------------------------+
================== Hardware Runtime Layer (Runtime) ==================

  CANN (Compute Architecture for Neural Networks)
  AscendCL (operator invocation abstraction) | ACLNN (neural network operator library) | HCCL (multi-device communication)

  Ascend 910B (A3 ops) | Atlas A2 (A3 ops) | Atlas 300I Pro (A2 ops)

3. Core Projects, One by One

3.1 Triton-Ascend

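Repository: gitcode.com/Ascend/triton-ascend

Role: The Ascend branch of Triton. Huawei forked it from upstream Triton and added an Ascend backend while keeping the Triton Python API compatible.

Key responsibilities:

Provide the Ascend build of the Triton compiler (triton.ascend)
Compile user-written @triton.jit functions to TTIR / TTGIR
Convert TTGIR to the TTAdapter format (.ttadapter.mlir)
Add Ascend-specific Python extension APIs (triton.language.extra.cann.extension)

Version note: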

triton-ascend before 3.2.0:  tl.compile_hint(...)          # old API
triton-ascend after 3.4.0:   tl.extra.cann.extension.compile_hint(...)  # new API
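
Code that has to run against both layouts can try the new location first and fall back to the old one. The following is a minimal, standard-library-only sketch of that pattern. The helper name `first_available` is hypothetical and not part of triton-ascend; the two module paths are simply the ones listed in the version note above, and whether `compile_hint` actually resolves at either path depends on the installed release.

```python
import importlib
from typing import Any, Iterable, Optional, Tuple


def first_available(candidates: Iterable[Tuple[str, str]]) -> Optional[Any]:
    """Return the first importable attribute, or None if nothing resolves.

    Each candidate is a (module_path, attribute_name) pair.
    Hypothetical helper for illustration; not part of triton-ascend.
    """
    for module_path, attr_name in candidates:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue  # module absent in this environment, try the next one
        attr = getattr(module, attr_name, None)
        if attr is not None:
            return attr
    return None


# Locations taken from the version note above: new API first, old API second.
COMPILE_HINT_LOCATIONS = [
    ("triton.language.extra.cann.extension", "compile_hint"),
    ("triton.language", "compile_hint"),
]

# None when neither location exists (for example, on a machine without
# triton-ascend installed).
compile_hint = first_available(COMPILE_HINT_LOCATIONS)
```

Resolving the function once at import time keeps the kernel code itself free of version checks; callers only need to handle the `None` case.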

Ascend extension API example:

import triton
import triton.language.extra.cann.extension as al

# API tour only: event_id, x, dst, src, offsets, sizes, and strides are placeholders.
@triton.jit
def kernel():
    # Synchronization control
    al.debug_barrier(al.SYNC_IN_VF.VV_ALL)
    al.sync_block_set("cube", "vector", event_id, ...)
    al.sync_block_all("all_cube", event_id)

    # Hardware queries
    sub_vec_id = al.sub_vec_id()
    sub_vec_num = al.sub_vec_num()

    # Scope control
    with al.scope(core_mode="cube"):    # computation on the Cube core
        ...
    with al.scope(core_mode="vector"):  # computation on the Vector core
        ...

    # Slice operations
    x_sub = al.extract_slice(x, offsets, sizes, strides)
    al.insert_slice(dst, src, offsets, sizes, strides)

3.2 AscendNPU-IR (bishengir)

Repository: gitcode.com/Ascend/AscendNPU-IR / github.com/Ascend/AscendNPU-IR

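Role: Huawei's MLIR-based compiler infrastructure for Ascend. bishengir is the name of the compiler inside the project; "BiSheng" is the romanization of 毕昇 (Bi Sheng), the name Huawei also uses for its BiSheng compiler.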

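Key responsibilities:

Accept the TTAdapter output (.ttadapter.mlir) from Triton-Ascend
Run multi-level MLIR dialect lowering: Linalg -> HFusion -> HIVM
Perform automatic operator fusion, tiling, and scheduling
Emit an Ascend NPU executable binary (.o)

Core dialects:

Dialect	Level	Responsibility
HFusion	High level	Hardware-aware fusion and scheduling layer; automatically fuses multiple operators into Ascend-friendly kernels
HIVM	Low level	NPU instruction layer; explicit control of GM/UB/L1/L0 memory and the Vector/Cube/MTE pipelines
HACC	Hardware access	Hardware access category annotations
MemRefExt	Memory operations	Extended memory operations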

3.3 triton-ascend-ops

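Repository: github.com/Ascend/triton-ascend-ops

Role: A library of custom Triton operators for Ascend, containing Triton-Ascend-specific operator implementations (sort, flip, gather, and others) along with development tutorials.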

3.4 TileLang-Ascend

Repository: gitcode.com/Ascend/tilelang-ascend

Role: A TVM-based DSL targeting Ascend NPUs; it supports developing operators such as GEMM, vector operations, and attention.

3.5 torch_npu

Installation: pip install torch_npu==2.7.1

Role: The plugin that connects PyTorch to Ascend NPUs. It provides the .npu() method for moving PyTorch tensors onto an Ascend device and bridges to the CANN operator library (ACLNN).

3.6 CANN

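Download: the CANN download page on the Ascend community site

Role: Compute Architecture for Neural Networks, the driver and runtime layer for Ascend NPUs.

Core CANN components:

Component	Full name	Responsibility
AscendCL	Ascend Computing Language	Operator invocation abstraction; host-side API
ACLNN	ACL Neural Network	Library of frequently used neural network operators
HCCL	Huawei Collective Communication Library	Multi-device collective communication (AllReduce, etc.)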

3.7 AscendNPU-IR-Dev

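Repository: gitcode.com/Ascend/AscendNPU-IR-Dev

Role: The bundle of build dependencies for AscendNPU-IR, containing snapshots of the LLVM and Torch-MLIR submodules pinned to specific commits.

4. The IR Lowering Pipeline in Detail

This is the core of the whole stack. When a user runs a Triton kernel, the code goes through the following compilation phases:

Phase 1: Triton frontend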
================================================================================
Python @triton.jit
        |
        v  [Triton-Ascend] triton compile
TTIR (Triton IR)
        |
        v  [Triton-Ascend] backend compile
TTGIR (Triton GPU IR)
        |
        v  [Triton-Ascend] output
.ttadapter.mlir  --- TTAdapter format, the Ascend-specific adapter

Phase 2: MLIR lowering
================================================================================
.ttadapter.mlir
        |
        +--- bishengir-opt (torch-backend-to-named-op-backend-pipeline)
        |         |
        |         v
        |   Linalg / Tensor Dialect
        |         |
        +--- bishengir-opt (-enable-triton-kernel-compile)
                  |
                  v
            HFusion Dialect  --- automatic fusion / tiling / scheduling
                  |
                  v  [HFusionToHIVM Pass]
            HIVM Dialect  --- explicit GM/UB/L1/L0 memory + Vector/Cube/MTE pipelines
                  |
                  v  [bishengir-compile]
            .o (NPU Binary)

Phase 3: Runtime
================================================================================
.o (NPU Binary)
        |
        v  [CANN Runtime: AscendCL / ACLNN]
Ascend NPU Hardware
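
To make the hand-off between the two projects explicit, the three phases can be written down as plain data. This is only an illustrative sketch: the stage names and the project that produces each one are transcribed from the diagram above, and the helper names `remaining_stages` and `producer_of` are hypothetical.

```python
from typing import List, Tuple

# (stage, project that produces it), transcribed from the diagram above.
PIPELINE: List[Tuple[str, str]] = [
    ("TTIR", "Triton-Ascend"),
    ("TTGIR", "Triton-Ascend"),
    ("TTAdapter", "Triton-Ascend"),
    ("Linalg/Tensor", "AscendNPU-IR"),
    ("HFusion", "AscendNPU-IR"),
    ("HIVM", "AscendNPU-IR"),
    (".o", "AscendNPU-IR"),
]


def remaining_stages(entry: str) -> List[str]:
    """Return the stages still ahead when compilation enters at `entry`."""
    names = [name for name, _ in PIPELINE]
    if entry not in names:
        raise ValueError(f"unknown stage: {entry!r}")
    return names[names.index(entry) + 1:]


def producer_of(stage: str) -> str:
    """Return the project that emits `stage`."""
    for name, project in PIPELINE:
        if name == stage:
            return project
    raise ValueError(f"unknown stage: {stage!r}")


if __name__ == "__main__":
    # Entering at HFusion skips everything the Triton frontend would do.
    print(remaining_stages("HFusion"))
    print(producer_of("TTAdapter"))
```

Reading the list top to bottom, TTAdapter is the last artifact owned by Triton-Ascend and the first input owned by AscendNPU-IR, which is why it is the natural file to capture when debugging.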

Triton Op -> AscendNPU-IR Op mapping table

This table is the key to understanding how the two sides meet. During lowering, Triton's high-level abstractions are mapped to concrete AscendNPU-IR operations:

Triton Op	AscendNPU-IR Op	Description
`triton::DotOp`	`linalg::MatmulOp`	Matrix multiplication
`triton::LoadOp`	`memref::copy`	Memory load
`triton::StoreOp`	`memref::copy`	Memory store
`triton::BroadcastOp`	`linalg::BroadcastOp`	Broadcast
`triton::TransOp`	`linalg::TransposeOp`	Transpose
`triton::ReduceOp`	`linalg::ReduceOp`	Reduction
`triton::GatherOp`	`hfusion::GatherOp`	Gather
`triton::AtomicRMWOp`	`hivm::StoreOp`	Atomic operation
`triton::AddPtrOp`	`memref::ReinterpretCast`	Pointer offset
`triton::ReshapeOp`	`tensor::ReshapeOp`	Reshape
`triton::SplitOp`	`tensor::ExtractSliceOp`	Tensor split
`triton::SplatOp`	`linalg::FillOp`	Scalar fill (splat)
`triton::SortOp`	`hfusion::SortOp`	Sort
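
When reading lowered IR, it helps to know which target dialect a given Triton op should end up in. The sketch below transcribes the table above into a dictionary and groups the source ops by target dialect. It is a reading aid under the assumption that the table is accurate, not something extracted from the compiler; the function name `ops_by_target_dialect` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Source op -> target op, transcribed row by row from the table above.
TRITON_TO_ASCEND: Dict[str, str] = {
    "triton::DotOp": "linalg::MatmulOp",
    "triton::LoadOp": "memref::copy",
    "triton::StoreOp": "memref::copy",
    "triton::BroadcastOp": "linalg::BroadcastOp",
    "triton::TransOp": "linalg::TransposeOp",
    "triton::ReduceOp": "linalg::ReduceOp",
    "triton::GatherOp": "hfusion::GatherOp",
    "triton::AtomicRMWOp": "hivm::StoreOp",
    "triton::AddPtrOp": "memref::ReinterpretCast",
    "triton::ReshapeOp": "tensor::ReshapeOp",
    "triton::SplitOp": "tensor::ExtractSliceOp",
    "triton::SplatOp": "linalg::FillOp",
    "triton::SortOp": "hfusion::SortOp",
}


def ops_by_target_dialect(mapping: Dict[str, str]) -> Dict[str, List[str]]:
    """Group Triton ops by the dialect prefix of the op they lower to."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for triton_op, target_op in mapping.items():
        dialect = target_op.split("::", 1)[0]
        grouped[dialect].append(triton_op)
    return dict(grouped)


if __name__ == "__main__":
    for dialect, ops in ops_by_target_dialect(TRITON_TO_ASCEND).items():
        print(f"{dialect}: {', '.join(ops)}")
```

Grouped this way, the table shows that most ops land in the standard MLIR dialects (linalg, memref, tensor), and only gather, sort, and atomics go straight to the Ascend-specific HFusion and HIVM dialects.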

5. Project Dependencies

User code (@triton.jit)
        |
        v  pip install triton-ascend
Triton-Ascend (compiler frontend)
        |  -----------------------------------------------------------------
        |  Depends on: triton (upstream)                                    |
        |  Depends on: torch_npu (PyTorch integration)                      |
        |  Depends on: CANN (runtime)                                       |
        v
.ttadapter.mlir (TTAdapter)
        |
        v  bishengir-compile
AscendNPU-IR / bishengir (compiler backend)
        |  -----------------------------------------------------------------
        |  Depends on: AscendNPU-IR-Dev (build time)                        |
        |  Depends on: LLVM (submodule)                                     |
        |  Depends on: Torch-MLIR (submodule)                               |
        |  Depends on: CANN (runtime)                                       |
        v
.o (NPU Binary)
        |
        v  CANN Runtime
Ascend NPU Hardware --- torch_npu hooks in at runtime

6. Minimal Dependency Sets

Scenario 1: you only want to write Triton kernels and run them on Ascend

1. pip install triton-ascend       # Triton-Ascend frontend
2. pip install torch_npu==2.7.1     # PyTorch NPU plugin
3. CANN (Ascend driver + toolchain + ops package)  # Ascend runtime environment

Scenario 2: you need to debug IR lowering or analyze compilation problems

4. AscendNPU-IR                     # MLIR compiler (bishengir-opt/compile)
5. msprof                            # Huawei performance analysis tool

Scenario 3: you need to modify the AscendNPU-IR compiler itself

6. AscendNPU-IR-Dev                 # build dependencies (LLVM/Torch-MLIR submodules)
7. CMake >= 3.28, Ninja >= 1.12.0  # build tools
8. Clang >= 10, LLD >= 10          # compiler and linker
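
A quick way to see which of the three scenarios a machine is ready for is to probe for the packages and command-line tools listed above. The following sketch uses only the Python standard library. It assumes that triton-ascend installs under the import name `triton` (as the extension import path earlier in this article suggests) and that the tools are on `PATH` under the names used here; it checks presence only, not versions or a working CANN installation.

```python
import importlib.util
import shutil
from typing import Dict

# Import names and executables taken from the scenarios above.
PYTHON_PACKAGES = ["triton", "torch_npu"]
CLI_TOOLS = [
    "npu-smi",            # device query tool
    "bishengir-opt",      # scenario 2
    "bishengir-compile",  # scenario 2
    "msprof",             # scenario 2
    "cmake",              # scenario 3
    "ninja",              # scenario 3
    "clang",              # scenario 3
]


def probe_environment() -> Dict[str, bool]:
    """Report which packages are importable and which tools are on PATH."""
    found: Dict[str, bool] = {}
    for name in PYTHON_PACKAGES:
        # find_spec locates a top-level package without importing it.
        found[name] = importlib.util.find_spec(name) is not None
    for tool in CLI_TOOLS:
        found[tool] = shutil.which(tool) is not None
    return found


if __name__ == "__main__":
    for name, ok in probe_environment().items():
        print(f"{name:20s} {'found' if ok else 'missing'}")
```

A package named `triton` being importable does not prove it is the Ascend build, so treat a "found" result as a first filter rather than a guarantee.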

7. Compilation Command Cheat Sheet

######## Triton kernel compilation ########
# Triton-Ascend handles this automatically; the user only writes:
@triton.jit
def kernel(x, y, z, ...):
    ...
# bishengir-compile is invoked automatically at run time

######## End-to-end compilation (TTAdapter -> .o) ########
bishengir-compile kernel.ttadapter.mlir -o kernel.o

######## Torch IR end-to-end compilation ########
bishengir-compile -enable-torch-compile=true \
  -enable-hfusion-compile=true \
  -enable-hivm-compile=true \
  -target=Ascend910B1 torch.mlir -o torch_kernel.o

######## HFusion-level compilation ########
bishengir-compile -enable-hfusion-compile=true \
  -enable-hivm-compile=true \
  -target=Ascend910B1 hfusion.mlir -o hfusion_kernel.o

######## HIVM-level compilation (lowest level) ########
bishengir-compile -enable-hfusion-compile=false \
  -enable-hivm-compile=true \
  -target=Ascend910B1 hivm.mlir -o hivm_kernel.o

######## Inspect IR lowering results level by level ########
bishengir-opt -torch-backend-to-named-op-backend-pipeline torch.mlir \
  -o torch_to_hfusion.mlir
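
When these commands are driven from a script, building the argument list in one place avoids copy-paste drift between entry levels. The sketch below is a hypothetical wrapper: the flag spellings are copied verbatim from the cheat sheet above and have not been checked against any particular bishengir-compile release, and the function only constructs the command without running it.

```python
from typing import Dict, List, Optional

# Flags per entry level, copied from the cheat sheet above.
ENTRY_FLAGS: Dict[str, List[str]] = {
    "ttadapter": [],
    "torch": [
        "-enable-torch-compile=true",
        "-enable-hfusion-compile=true",
        "-enable-hivm-compile=true",
    ],
    "hfusion": [
        "-enable-hfusion-compile=true",
        "-enable-hivm-compile=true",
    ],
    "hivm": [
        "-enable-hfusion-compile=false",
        "-enable-hivm-compile=true",
    ],
}


def build_compile_cmd(entry: str, src: str, out: str,
                      target: Optional[str] = None) -> List[str]:
    """Build a bishengir-compile argument list for the given entry level."""
    if entry not in ENTRY_FLAGS:
        raise ValueError(f"unknown entry level: {entry!r}")
    cmd = ["bishengir-compile", *ENTRY_FLAGS[entry]]
    if target is not None:
        cmd.append(f"-target={target}")
    cmd += [src, "-o", out]
    return cmd


if __name__ == "__main__":
    # Reproduces the HIVM-level line from the cheat sheet.
    print(" ".join(build_compile_cmd("hivm", "hivm.mlir", "hivm_kernel.o",
                                     target="Ascend910B1")))
```

On a machine where bishengir-compile is installed, the returned list can be passed directly to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with file paths.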

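8. The Multi-Level IR Abstractions in Detail

The core value of AscendNPU-IR is that it offers several abstraction levels, so developers with different needs can pick the entry level that fits:

8.1 Torch IR level

Ingests PyTorch ATen operators directly. Suited to PyTorch users who do not want to write Triton; the convert-torch-to-hfusion pass lowers it to Linalg/HFusion.

8.2 Linalg / HFusion level

The general tensor-algebra level. HFusion provides Ascend-aware named ops and automatically performs operator fusion, tiling, and scheduling.

8.3 HIVM level

The lowest level, mapping directly to NPU hardware instructions. Memory levels (gm/ub/l1/l0a/l0b/l0c) are annotated with #hivm.address_space, and DMA transfers and Vector/Cube/MTE pipeline synchronization must be managed explicitly.

9. Development and Debugging Toolchain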

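Tool	Purpose
`TRITON_ALL_PARALLEL`	Environment variable that activates the Auto Blockify logic (automatic tiling)
`bishengir-opt`	MLIR pass debugging; inspect IR lowering level by level
`bishengir-compile`	End-to-end compilation, TTAdapter -> .o
`npu-smi info`	Query Ascend device information
`msprof`	Huawei profiling tool
DFX (Debug Feature eXtract)	Print-based debugging
`pytest -sv test_xxx.py`	Built-in validation for triton-ascend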

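10. Summary: The Stack at a Glance

The Huawei Ascend Triton stack can be summed up as "two layers + two levels":

Two layers: Triton-Ascend (frontend integration) + AscendNPU-IR (backend compilation)
Two levels: HFusion (high-level fusion and scheduling) + HIVM (low-level hardware instructions)

The division of labor is clear. Triton-Ascend connects to the Triton ecosystem so that users can keep writing kernels the familiar @triton.jit way; AscendNPU-IR progressively lowers Triton's high-level abstractions to low-level IR that suits Ascend hardware, and the result finally runs on the Ascend NPU through CANN.

For engineers who want to develop high-performance operators on Ascend NPUs, a good starting point is the minimal set of Triton-Ascend + torch_npu + CANN, which is enough to get a feel for the end-to-end workflow. When deeper optimization is needed, bring in the IR debugging capabilities of AscendNPU-IR and performance analysis with msprof.

Project links: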

Triton-Ascend: gitcode.com/Ascend/triton-ascend
AscendNPU-IR: gitcode.com/Ascend/AscendNPU-IR / github.com/Ascend/AscendNPU-IR
triton-ascend-ops: github.com/Ascend/triton-ascend-ops
CANN download: the CANN download page on the Ascend community site