A summary in my own words after reading the paper:
Example 1:
Picture NSA as a "super brain" that stays both smart and efficient while handling huge amounts of information. Imagine you are a detective trying to find one specific piece of information in a library. In a traditional library you would have to leaf through every book, hoping to stumble on the key clue, which is slow and extremely inefficient. With a "super brain", you only look at the books most likely to contain the clue, and may even jump straight to the few most important pages.
NSA works the way this "super brain" detective does:
Token compression: just as the detective first skims the shelves, NSA first "compresses" the information and keeps only the most important parts, like picking out only the most suspicious-looking books for a quick scan.
Token selection: next, the detective examines a few books in order of how promising the clues are. NSA likewise selectively keeps the blocks of information most useful for the current task and ignores the less important ones.
Sliding window: finally, the detective may read a few key passages of one book closely. NSA's sliding-window mechanism makes sure it still captures the important details of the local context, just like the detective poring over those few pages.
In this way NSA, like the detective, finds the information it needs quickly and accurately without wasting time on every irrelevant detail. That not only improves efficiency but also lets it make the right decision at the critical moment, like the detective turning up the key clue at the climax of the mystery.
Example 2:
Imagine you are reading a very thick book. You cannot hold every page in memory at once, but you can pick a few key chapters to concentrate on. NSA handles large amounts of data in a similar way.
Why do we need NSA?
Higher efficiency: the traditional attention mechanism is like trying to memorize every page of the book, which is slow and costly for the computer. NSA attends only to the important information, the way you remember only the key chapters.
Saving resources: NSA saves compute by reducing the amount of data that has to be processed, the way you keep only the most important items on a shopping list instead of buying everything in the store.
How does NSA work?
Hierarchical sparse strategy: NSA processes information in layers: it first compresses the data, then selects the most important parts, and finally handles local information with a sliding window. It is a bit like scanning the whole supermarket first (compression), then picking only the goods you really need (selection), and finally checking your cart (sliding window). A rough code sketch of these three steps follows below.
Hardware alignment: NSA is designed with the constraints of computer hardware in mind, so that it uses the machine's resources effectively, the way you pick a fridge sized to fit your kitchen.
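To make those three steps concrete, here is a minimal PyTorch sketch (my own toy illustration, not the paper's implementation; the block size, top-k count, window length, and the mean-pooling compressor are all invented for this example):

```python
import torch

def reduce_kv(keys, block=32, top_k=4, window=64):
    """Shrink a (seq_len, dim) key matrix along the three NSA-style paths."""
    n = (keys.shape[0] // block) * block
    blocks = keys[:n].reshape(-1, block, keys.shape[-1])

    # 1) Compression: summarize each block into one coarse token.
    #    (Mean pooling is a stand-in; NSA learns this mapping.)
    compressed = blocks.mean(dim=1)

    # 2) Selection: keep only the blocks that score highest against the current
    #    query (a dummy query here; the paper derives block importance from the
    #    compressed path's attention scores instead).
    query = keys[-1]
    top = (compressed @ query).topk(top_k).indices
    selected = blocks[top].reshape(-1, keys.shape[-1])

    # 3) Sliding window: always keep the most recent tokens for local context.
    local = keys[-window:]
    return compressed, selected, local

keys = torch.randn(4096, 64)
for part in reduce_kv(keys):
    print(part.shape)  # each part has far fewer rows than the original 4096
```

The point is only that each query ends up attending to a few hundred rows instead of all 4096; the same reduction is applied to the values.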
Experimental results
Performance gains: experiments show that NSA outperforms the traditional full-attention mechanism on a range of tasks, especially on long texts.
Speedup: when processing 64k-length sequences, NSA is 11.6x faster than full attention, like driving on a highway instead of through a congested city center.
An everyday example
Imagine you are playing a large online game such as World of Warcraft. You have to track many regions of the map at once: enemy positions, quest objectives, your teammates' status, and so on. If the game's "attention mechanism" were traditional full attention, it would have to process the information from all of these regions at the same time; the game would stutter, latency would climb, and you might be taken down before you could even react.
If the game instead used a sparse attention mechanism like NSA, it could focus only on the most critical information, such as an incoming enemy attack or a quest about to complete, and ignore the less important background. The game runs more smoothly, your reactions are faster, and the experience is naturally better.
In short, by selecting and processing information intelligently, NSA improves the efficiency and performance of handling large amounts of data while reducing the compute it consumes.
Paper text:
Page 1:
Abstract
Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.
Introduction
The research community increasingly recognizes long-context modeling as a crucial capability for next-generation large language models, driven by diverse real-world applications ranging from in-depth reasoning (DeepSeek-AI 2025; Zelikman et al. 2022) and repository-level code generation (Zhang et al. 2023a; Zhang et al.) to multi-turn autonomous agent systems (Park et al. 2023). Recent breakthroughs, including OpenAI's o-series models, DeepSeek-R1 (DeepSeek-AI 2025), and Gemini 1.5 Pro (Google et al. 2024), enable models to process entire codebases and lengthy documents, maintain coherent multi-turn conversations over thousands of tokens, and perform complex reasoning across long-range dependencies. However, the high complexity (Zaheer et al. 2020) of vanilla Attention (Vaswani et al. 2017) mechanisms emerges as a critical latency bottleneck as sequence length increases. Theoretical estimates indicate that attention complexity grows quadratically with sequence length, leading to an intractable computational load for long sequences.
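A back-of-the-envelope calculation (my own numbers, not figures from the paper) makes the quadratic growth concrete: with full attention every query scores against every earlier key, so the score matrix has on the order of n^2 entries. At n = 65,536 (64k tokens), n^2 is roughly 4.3 billion query-key pairs per head per layer; going from a 4k to a 64k context multiplies sequence length by 16 but attention compute by roughly 16^2 = 256.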
Page 2:
Figure 1 | Comparison of performance and efficiency between the Full Attention model and our NSA. Left: Despite being sparse, NSA surpasses the Full Attention baseline on average across general benchmarks, long-context tasks, and reasoning evaluation. Right: For 64k-length sequence processing, NSA achieves substantial computational speedup compared to Full Attention in all stages: decoding, forward propagation, and backward propagation.
A natural approach to efficient long-context modeling is to take advantage of the inherent sparsity of softmax attention (Ge et al. 2023; Jiang et al. 2023), where selectively computing critical query-key pairs can significantly reduce computational overhead while preserving performance. Recent advances demonstrate this potential through diverse strategies: KV-cache eviction methods (Li et al., 2024; Zhang et al., 2023b; Zhou et al., 2024), blockwise KV-cache selection methods (Tang et al., 2024; Xiao et al., 2024), and sampling, clustering or hashing-based selection methods (Chen et al., 2024; Desai et al., 2024; Liu et al., 2024). Despite these promising strategies, existing sparse attention methods often fall short in practical deployments. Many approaches fail to achieve speedups comparable to their theoretical gains; moreover, most methods focus mainly on the inference stage, lacking effective training-time support to fully exploit the sparsity patterns of attention.
To address these limitations, the deployment of effective sparse attention must tackle two key challenges: (1) Hardware-aligned inference speedup: converting theoretical computation reductions into actual speed improvements requires hardware-friendly algorithm design during both pre-filling and decoding stages to mitigate memory access and hardware scheduling bottlenecks; (2) Training-aware algorithm design: enabling end-to-end computation with trainable operators to reduce training costs while maintaining model performance. These requirements are crucial for real-world applications to achieve fast long-context inference or training. When considering both aspects, existing methods still exhibit a noticeable gap.
To achieve more effective and efficient sparse attention, we present NSA, a Natively trainable Sparse Attention architecture that integrates hierarchical token modeling. As shown in Figure 2, NSA reduces per-query computation by organizing keys and values into temporal blocks and processing them through three attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information.
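To make the three-path idea concrete, here is a minimal PyTorch sketch of how the branch outputs could be combined for a single query (my own illustration; the shapes, the fixed gate values, and the way each branch's keys and values are produced are assumptions, not the paper's exact formulation, which learns the gates and runs custom kernels):

```python
import torch
import torch.nn.functional as F

def attend(q, k, v):
    # Plain scaled dot-product attention of one query over a branch's keys/values.
    scores = (k @ q) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=0) @ v

def nsa_style_output(q, branches, gates):
    # branches: {name: (keys, values)} for the compressed, selected and window paths.
    # gates: per-branch weights; the paper produces them with a small learned gate
    # network, the fixed numbers below are only for illustration.
    out = None
    for name, (k, v) in branches.items():
        o = gates[name] * attend(q, k, v)
        out = o if out is None else out + o
    return out

dim = 64
q = torch.randn(dim)
branches = {
    "compressed": (torch.randn(128, dim), torch.randn(128, dim)),  # coarse block summaries
    "selected":   (torch.randn(256, dim), torch.randn(256, dim)),  # kept fine-grained blocks
    "window":     (torch.randn(64, dim),  torch.randn(64, dim)),   # recent local tokens
}
gates = {"compressed": 0.3, "selected": 0.5, "window": 0.2}
print(nsa_style_output(q, branches, gates).shape)  # torch.Size([64])
```

The compressed path gives a cheap global summary, the selected path recovers fine detail where it matters, and the window path keeps recent local context; the gates let the model trade them off per query.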
Then we implement specialized kernels to maximize its practical efficiency.
Page 3:
Figure 2 | Overview of NSA's architecture. Left: The framework processes input sequences through three parallel attention branches: for a given query, preceding keys and values are processed into compressed attention for coarse-grained patterns, selected attention for important token blocks, and sliding attention for local context. Right: Visualization of the different attention patterns produced by each branch. Green areas indicate regions where attention scores need to be computed, while white areas represent regions that can be skipped.
NSA introduces two core innovations corresponding to the key requirements above: (1) Hardware-aligned system: optimize blockwise sparse attention for Tensor Core utilization and memory access, ensuring balanced arithmetic intensity. (2) Training-aware design: enable stable end-to-end training through efficient algorithms and backward operators. This optimization enables NSA to support both efficient deployment and end-to-end training.
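One way to see why the blockwise design in point (1) matters for hardware (a simplified sketch with invented sizes, not the paper's Triton kernels): selecting whole contiguous blocks turns KV-cache reads into long contiguous slices, whereas token-level selection scatters them across memory.

```python
import torch

seq_len, dim, block = 8192, 128, 64
kv = torch.randn(seq_len, dim)

# Token-level selection: 1024 scattered indices, so every row is a separate,
# non-contiguous read from the KV cache.
token_idx = torch.randperm(seq_len)[:1024]
scattered = kv[token_idx]

# Block-level selection: the same budget (16 blocks * 64 tokens = 1024 tokens),
# but each chosen block is one contiguous slice of the cache, which maps onto
# coalesced memory access and Tensor Core sized matrix tiles much more cleanly.
block_ids = torch.randperm(seq_len // block)[:16]
contiguous = torch.cat([kv[b * block:(b + 1) * block] for b in block_ids.tolist()])

print(scattered.shape, contiguous.shape)  # both torch.Size([1024, 128])
```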
We evaluate NSA through comprehensive experiments on real-world language corpora. Pretraining on a 27B-parameter transformer backbone with 260B tokens, we assess NSA's performance across general language evaluations, long-context evaluations, and chain-of-thought reasoning evaluation. We further compare the kernel speed on A100 GPUs with optimized Triton (Tillet et al. 2019) implementations. Experimental results demonstrate that NSA achieves comparable or superior performance to the full attention baseline, while outperforming existing sparse attention approaches. Additionally, NSA delivers substantial speedups across decoding, forward, and backward stages compared to Full Attention, with the speedup ratio increasing for longer sequences. These results validate that our hierarchical sparse attention design effectively balances model capability and computational efficiency.
2. Rethinking Sparse Attention Methods
Modern sparse attention methods have made significant strides in reducing the theoretical computational complexity of transformer models. However, most approaches predominantly apply sparsity during inference while retaining a pretrained Full Attention backbone, potentially introducing architectural bias that limits their ability to fully exploit sparse attention's advantages. Before introducing our native sparse architecture, we systematically analyze these limitations through two critical lenses.
2.1. The Illusion of Efficient Inference
Despite achieving sparsity in attention computation, many methods fail to achieve corresponding reductions in inference latency, primarily due to the following challenges:
Phase-Restricted Sparsity. Methods such as H2O (Zhang et al. 2023b) apply sparsity only during the decoding phase, ignoring the inference latency of the forward-propagation (prefilling) phase. This leads to suboptimal speedups, as the majority of computation still occurs in the forward phase.
Memory Access Patterns. Sparse attention methods often involve irregular memory access patterns that can lead to inefficient memory utilization and increased latency. Methods like KV-cache eviction (Li et al., 2024) and blockwise attention (Zhang et al., 2023a) can result in non-contiguous memory accesses, which modern hardware struggles to optimize.
Page 4:
These decoding-phase methods apply sparsity during autoregressive decoding while requiring computationally intensive pre-processing (e.g. attention map calculation, index building) during prefilling. In contrast, approaches like MInference (Jiang et al., 2024) focus solely on prefilling sparsity. These methods fail to achieve acceleration across all inference stages, as at least one phase retains computational costs comparable to Full Attention. The phase specialization reduces the speedup ability of these methods in prefilling-dominated workloads like book summarization and code completion, or decoding-dominated workloads like long chain-of-thought (Wei et al., 2022) reasoning.
Incompatibility with Advanced Attention Architecture. Some sparse attention methods fail to adapt to modern decoding-efficient architectures like Multi-Query Attention (MQA) (Shazeer 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023), which significantly reduce the memory-access bottleneck during decoding by sharing KV across multiple query heads. For instance, in approaches like Quest (Tang et al., 2024), each attention head independently selects its KV-cache subset. Although this demonstrates consistent computation sparsity and memory-access sparsity in Multi-Head Attention (MHA) models, it presents a different scenario in models based on architectures like GQA, where the memory-access volume of the KV-cache corresponds to the union of selections from all query heads within the same GQA group. This architectural characteristic means that while these methods can reduce computation operations, the required KV-cache memory access remains relatively high.
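A small sketch of that effect (my own illustration with invented numbers): even if each query head in a GQA group selects only a few KV blocks for itself, the group as a whole must load the union of all its heads' selections, so memory traffic shrinks far less than per-head compute does.

```python
import random

heads_per_group = 8     # query heads sharing one KV head under GQA
num_blocks = 128        # KV-cache blocks available
blocks_per_head = 16    # blocks each head selects independently (Quest-style)

random.seed(0)
selections = [set(random.sample(range(num_blocks), blocks_per_head))
              for _ in range(heads_per_group)]

union = set().union(*selections)
print(f"compute per head: {blocks_per_head}/{num_blocks} blocks")
print(f"memory the group must read (union): {len(union)}/{num_blocks} blocks")
# The union is usually several times larger than any single head's selection,
# so KV-cache traffic shrinks far less than the per-head computation does.
```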