Posted by 美丽的神话 on 2024-10-15 16:28:31

AGI Safety and Alignment: Recent Work from DeepMind

by Rohin Shah, Seb Farquhar, Anca Dragan
21st Aug 2024
AI Alignment Forum


We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours.

Who are we?
We’re the main team at Google DeepMind working on technical approaches to existential risk from AI systems. Since our last post, we’ve evolved into the AGI Safety & Alignment team, which we think of as AGI Alignment (with subteams like mechanistic interpretability, scalable oversight, etc.), and Frontier Safety (working on the Frontier Safety Framework, including developing and running dangerous capability evaluations). We’ve also been growing since our last post: by 39% last year, and by 37% so far this year. The leadership team is Anca Dragan, Rohin Shah, Allan Dafoe, and Dave Orr, with Shane Legg as executive sponsor. We’re part of the overall AI Safety and Alignment org led by Anca, which also includes Gemini Safety (focusing on safety training for the current Gemini models), and Voices of All in Alignment, which focuses on alignment techniques for value and viewpoint pluralism.

What have we been up to?
It’s been a while since our last update, so below we list out some key work published in 2023 and the first part of 2024, grouped by topic / sub-team.
Our big bets for the past 1.5 years have been 1) amplified oversight, to enable the right learning signal for aligning models so that they don’t pose catastrophic risks, 2) frontier safety, to analyze whether models are capable of posing catastrophic risks in the first place, and 3) (mechanistic) interpretability, as a potential enabler for both frontier safety and alignment goals. Beyond these bets, we experimented with promising areas and ideas that help us identify new bets we should make.

Frontier Safety
The mission of the Frontier Safety team is to ensure safety from extreme harms by anticipating, evaluating, and helping Google prepare for powerful capabilities in frontier models. While the focus so far has been primarily around misuse threat models, we are also working on misalignment threat models.

FSF

We recently published our Frontier Safety Framework, which, in broad strokes, follows the approach of responsible capability scaling, similar to Anthropic’s Responsible Scaling Policy and OpenAI’s Preparedness Framework. The key difference is that the FSF applies to Google: there are many different frontier LLM deployments across Google, rather than just a single chatbot and API (this in turn affects stakeholder engagement, policy implementation, mitigation plans, etc).
We’re excited that our small team led the Google-wide strategy in this space, and demonstrated that responsible capability scaling can work for large tech companies in addition to small startups.
A key area of the FSF we’re focusing on as we pilot the Framework is how to map between the critical capability levels (CCLs) and the mitigations we would take. This is high on our list of priorities as we iterate on future versions.
Some commentary (e.g. here) also highlighted (accurately) that the FSF doesn’t include commitments. This is because the science is in early stages and best practices will need to evolve. But ultimately, what we care about is whether the work is actually done. In practice, we did run and report dangerous capability evaluations for Gemini 1.5 that we think are sufficient to rule out extreme risk with high confidence.

Dangerous Capability Evaluations

Our paper on Evaluating Frontier Models for Dangerous Capabilities is the broadest suite of dangerous capability evaluations published so far, and to the best of our knowledge has informed the design of evaluations at other organizations. We regularly run and report these evaluations on our frontier models, including Gemini 1.0 (original paper), Gemini 1.5 (see Section 9.5.2), and Gemma 2 (see Section 7.4). We’re especially happy to have helped develop open sourcing norms through our Gemma 2 evals. We take pride in currently setting the bar on transparency around evaluations and implementation of the FSF, and we hope to see other labs adopt a similar approach.
Prior to that we set the stage with Model evaluation for extreme risks, which set out the basic principles behind dangerous capability evaluation, and also talked more holistically about designing evaluations across present day harms to extreme risks in Holistic Safety and Responsibility Evaluations of Advanced AI Models.

Mechanistic Interpretability
Mechanistic interpretability is an important part of our safety strategy, and lately we’ve focused deeply on Sparse AutoEncoders (SAEs). We released Gated SAEs and JumpReLU SAEs, new architectures for SAEs that substantially improved the Pareto frontier of reconstruction loss vs sparsity. Both papers rigorously evaluate the architecture change by running a blinded study evaluating how interpretable the resulting features are, showing no degradation. Incidentally, Gated SAEs was the first public work that we know of to scale and rigorously evaluate SAEs on LLMs with over a billion parameters (Gemma-7B).
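To give a concrete sense of the architecture, here is a minimal JumpReLU-style SAE sketch, assuming a PyTorch setting; the dimensions and initialization are placeholders, and the training-time L0 sparsity penalty (estimated with straight-through gradients for the threshold) is only noted in a comment. This is an illustration, not the papers' released implementation.

```python
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    """Minimal JumpReLU sparse autoencoder sketch (illustrative, not the released code)."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Per-feature threshold; JumpReLU zeroes out any pre-activation below it.
        self.log_threshold = nn.Parameter(torch.zeros(d_sae))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        # JumpReLU: keep the raw pre-activation wherever it clears the learned threshold.
        return pre * (pre > self.log_threshold.exp())

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        x_hat = self.decode(f)
        # Training adds an L0 penalty on f, with straight-through gradient estimates
        # for the discontinuous threshold; both are omitted in this sketch.
        recon_loss = ((x - x_hat) ** 2).sum(dim=-1).mean()
        return x_hat, f, recon_loss
```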
We’ve also been really excited to train and release Gemma Scope, an open, comprehensive suite of SAEs for Gemma 2 2B and 9B (every layer and every sublayer). We believe Gemma 2 sits at the sweet spot of “small enough that academics can work with them relatively easily” and “large enough that they show interesting high-level behaviors to investigate with interpretability techniques”. We hope this will make Gemma 2 the models of choice for academic/external mech interp research, and enable more ambitious interpretability research outside of industry labs. You can access Gemma Scope here, and there’s an interactive demo of Gemma Scope, courtesy of Neuronpedia.
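As a usage sketch, one way to run such an SAE over Gemma 2 activations is to read the residual stream from the hidden states returned by Hugging Face transformers. The checkpoint name, layer index, and SAE width below are illustrative; in practice you would load the released Gemma Scope weights (linked above) into the SAE rather than using random parameters.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-2b"  # illustrative: any Gemma 2 checkpoint you have access to
LAYER = 12                        # illustrative layer index

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# `JumpReLUSAE` is the sketch class from the previous snippet; load real Gemma Scope
# weights into it in practice instead of keeping the random initialization.
sae = JumpReLUSAE(d_model=model.config.hidden_size, d_sae=16_384)

inputs = tokenizer("Mechanistic interpretability is", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states[LAYER]  # (batch, seq, d_model) residual stream
    features = sae.encode(hidden)                  # sparse feature activations
print("active features per token:", (features != 0).sum(dim=-1).tolist())
```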
You can also see a series of short blog posts on smaller bits of research in the team’s progress update in April.
Prior to SAEs, we worked on:


[*]       Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla: The key contribution here was to show that the circuit analysis techniques used in smaller models scaled: we gained significant understanding about how, after Chinchilla (70B) “knows” the answer to a multiple choice question, it maps that to the letter corresponding to that answer.
[*]       Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level: While this work didn’t reach its ambitious goal of mechanistically understanding how facts are computed in superposition in early MLP layers, it did provide further evidence that superposition is happening, and falsified some simple hypotheses about how factual recall might work. It also provided guidelines for future work in the area, such as viewing the early layers as producing a “multi-token embedding” that is relatively independent of prior context.
[*]       AtP∗: An efficient and scalable method for localizing LLM behaviour to components: A crucial aspect of circuit discovery is finding which components of the model are important for the behavior under investigation. Activation patching is the principled approach, but requires a separate pass for each component (comparable to training a model), whereas attribution patching is an approximation, but can be done for every component simultaneously with two forward & one backward pass. This paper investigated attribution patching, diagnosed two problems and fixed them, and showed that the resulting AtP* algorithm is an impressively good approximation to full activation patching. (A minimal sketch of plain attribution patching is given after this list.)
[*]       Tracr: Compiled Transformers as a Laboratory for Interpretability: Enabled us to create Transformer weights where we know the ground truth answer about what the model is doing, allowing it to serve as a test case for our interpretability tools. We’ve seen a few cases where people used Tracr, but it hasn’t had as much use as we’d hoped for, because Tracr-produced models are quite different from models trained in the wild. (This was a known risk at the time the work was done, but we hoped it wouldn’t be too large a downside.)
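The attribution patching idea from the AtP∗ item above fits in a few lines: cache activations from a corrupted run, take gradients of the metric on the clean run, and form a first-order estimate of every component's patching effect at once. The sketch below is plain attribution patching under the assumption that each hooked module returns a single tensor; it is not the AtP∗ algorithm itself, which adds the fixes and scalability tricks described in the paper.

```python
import torch

def attribution_patching(model, modules, clean_inputs, corrupt_inputs, metric_fn):
    """First-order estimate of how `metric_fn` would change if each module's output
    were patched from the corrupted run into the clean run.

    modules: dict {name: submodule of model}, each assumed to return a single tensor.
    metric_fn: maps the model's output to a scalar tensor (e.g. a logit difference)."""
    corrupt_acts, clean_acts = {}, {}

    def make_hook(name, store, keep_graph):
        def hook(module, inputs, output):
            if keep_graph:
                output.retain_grad()          # so output.grad is populated by backward()
                store[name] = output
            else:
                store[name] = output.detach()
        return hook

    # Pass 1: forward on the corrupted inputs, caching activation values only.
    handles = [m.register_forward_hook(make_hook(n, corrupt_acts, keep_graph=False))
               for n, m in modules.items()]
    with torch.no_grad():
        model(**corrupt_inputs)
    for h in handles:
        h.remove()

    # Pass 2: forward + backward on the clean inputs, caching activations and their gradients.
    handles = [m.register_forward_hook(make_hook(n, clean_acts, keep_graph=True))
               for n, m in modules.items()]
    metric_fn(model(**clean_inputs)).backward()
    for h in handles:
        h.remove()

    # Taylor estimate (a_corrupt - a_clean) * dmetric/da_clean, summed per component.
    return {n: ((corrupt_acts[n] - clean_acts[n]) * clean_acts[n].grad).sum().item()
            for n in modules}
```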

Amplified Oversight
Our amplified oversight work aims to provide supervision on any single situation that is as close as possible to that of a human with complete understanding of all of the reasons that the AI system produced its output - including when the AI has a very broad range of superhuman capabilities. (The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)

Theoretical Work on Debate

On the theoretical side, the original debate protocol enables a polynomial-time verifier to decide any problem in PSPACE given debates between optimal debaters. But our AI systems are not optimal, and we should not assume they are! It doesn't matter if an optimal AI could refute lies, if the AI systems we train in practice cannot do so. The problem of obfuscated arguments is exactly when a dishonest debater lies by breaking an easy problem down into hard subproblems that an optimal honest AI could answer but a bounded one could not.
Doubly-efficient debate provides a new debate protocol that enables a polynomial-time honest strategy to prove facts to an even more limited judge, even against an unbounded dishonest strategy. This is not quite yet what we want: the honest strategy is only polynomial-time in the length of the human-judgeable argument, whereas we would like it to be efficient in terms of the length of the AI-judgeable argument. We have some work in progress that we hope will cross this gap, and we expect that if we do cross the gap this will influence which protocols we study in our empirical work.
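To keep the two theoretical claims above in one place, here is an informal paraphrase (ours, not the papers' precise theorem statements):

```latex
% Informal paraphrase (not the papers' precise statements).
% Original debate: a polynomial-time judge plus optimal debaters decide PSPACE:
L \in \mathrm{PSPACE} \;\Longrightarrow\; \exists \text{ a debate game: under optimal play, the poly-time judge accepts } x \text{ exactly when } x \in L.
% Doubly-efficient debate: the honest strategy needs only poly time in the human-judgeable
% argument length n_H, even against an unbounded dishonest strategy; the goal is poly time
% in the AI-judgeable argument length n_AI:
T_{\mathrm{honest}} = \mathrm{poly}(n_H) \quad \text{(shown)}, \qquad T_{\mathrm{honest}} = \mathrm{poly}(n_{AI}) \quad \text{(wanted)}.
```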

Empirical Work on Debate

On the empirical side, we ran inference-only experiments with debate that help challenge what the community expects. First, on tasks with information asymmetry, theory suggests that debate should be close to as good as (or even better than) giving the judge access to the full information, whereas in these inference-only experiments debate performs significantly worse. Second, on tasks without information asymmetry, weak judge models with access to debates don’t outperform weak judge models without debate. Third, we find only limited evidence that stronger debaters lead to much higher judge accuracy – and we really need to make this the case for debate to succeed in the long run.
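To make the setup concrete, an inference-only debate run looks roughly like the sketch below: two fixed debater models argue for opposing answers over a few turns, and a separate (typically weaker) judge model picks a side, with no training anywhere in the loop. The prompts, turn structure, and interfaces here are placeholders, not the protocol used in the paper.

```python
def run_debate(debater_a, debater_b, judge, question, answer_a, answer_b, n_turns=3):
    """Inference-only debate sketch. `debater_a`, `debater_b`, and `judge` are any
    callables mapping a prompt string to a text completion (e.g. wrappers around an LLM API)."""
    transcript = (f"Question: {question}\n"
                  f"Debater A defends: {answer_a}\n"
                  f"Debater B defends: {answer_b}\n")
    for turn in range(n_turns):
        arg_a = debater_a(transcript + "\nDebater A, give your next argument:")
        transcript += f"\n[A, turn {turn + 1}] {arg_a}"
        arg_b = debater_b(transcript + "\nDebater B, give your next argument:")
        transcript += f"\n[B, turn {turn + 1}] {arg_b}"
    verdict = judge(transcript + "\nJudge: which answer is correct, A or B? Answer with a single letter.")
    return "A" if verdict.strip().upper().startswith("A") else "B"

# Judge accuracy is then aggregated over many questions and compared against baselines,
# e.g. giving the judge the full source text (information-asymmetry tasks) or no debate at all.
```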
Qualitatively, our sense is that these issues occur because the models are not very good at judging debates: the actual debater arguments seem quite good. Our current work is looking into training our LLM judges to be better proxies of human judges, after which we plan to try finetuning the debaters using the debate protocol, and checking that this closes the gaps we’ve observed.

Causal Alignment
A long-running stream of research in our team explores how understanding causal incentives can contribute to designing safe AI systems. Causality gives us pretty general tools for understanding what agents that are ‘trying’ to achieve goals will do, and provides explanations for how they act. We developed algorithms for discovering agents, which can help us identify which parts of systems can be understood through an agent-lens. In principle, this could allow us to empirically discover goal-directed agents, and determine what they are optimizing for.
We have also shown that causal world models are a key aspect of agent robustness, suggesting that some causal tools are likely to apply to any sufficiently powerful agent. The paper got an Honorable Mention for Best Paper at ICLR 2024. This work continues to inform the development of safety mitigations that work by managing an agent’s incentives, such as methods based on process supervision. It can also be used to design consistency checks that look at long-run behavior of agents in environments, extending the more short-horizon consistency checks we have today.

Emerging Topics
We also do research that isn’t necessarily part of a years-long agenda, but is instead tackling one particular question, or investigating an area to see whether it should become one of our longer-term agendas. This has led to a few different papers:
One alignment hope that people have (or at least had in late 2022) is that there are only a few “truth-like” features in LLMs, and that we can enumerate them all and find the one that corresponds to the “model’s beliefs”, and use that to create an honest AI system. In Challenges with unsupervised LLM knowledge discovery, we aimed to convincingly rebut this intuition by demonstrating a large variety of “truth-like” features (particularly features that model the beliefs of other agents). We didn’t quite hit that goal, likely because our LLM wasn’t strong enough to show such features, but we did show the existence of many salient features that had at least the negation consistency and confidence properties of truth-like features, which “tricked” several unsupervised knowledge discovery approaches.
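For context, "negation consistency" and "confidence" here refer to the two properties that contrast-consistent search (CCS), one of the unsupervised methods tested, optimizes for. Roughly, writing p(x+) and p(x-) for a probe's probabilities that a statement and its negation are true:

```latex
\mathcal{L}_{\mathrm{consistency}} = \bigl(p(x^+) - (1 - p(x^-))\bigr)^2, \qquad
\mathcal{L}_{\mathrm{confidence}} = \min\bigl(p(x^+),\, p(x^-)\bigr)^2, \qquad
\mathcal{L}_{\mathrm{CCS}} = \mathcal{L}_{\mathrm{consistency}} + \mathcal{L}_{\mathrm{confidence}}.
```

Any feature with both properties, such as "what some other character believes", scores well on this objective, which is how such features can end up fooling unsupervised discovery methods.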
通过"剖析grokking的电路效率"(arxiv.org/abs/2309.02390),我们深入探究了"深度学习科学"(["深度学习的影响理论",alignmentforum.org/posts/tKYGvA9dKHa3GWBBk/theories-of-impact-for-science-of-deep-learning](""深度学习的影响理论",alignmentforum.org/posts/tKYGvA9dKHa3GWBBk/theories-of-impact-for-science-of-deep-learning" ""深度学习的影响理论",alignmentforum.org/posts/tKYGvA9dKHa3GWBBk/theories-of-impact-for-science-of-deep-learning" ""深度学习的影响理论",alignmentforum.org/posts/tKYGvA9dKHa3GWBBk/theories-of-impact-for-science-of-deep-learning" ""深度学习的影响理论",alignmentforum.org/posts/tKYGvA9dKHa3GWBBk/theories-of-impact-for-science-of-deep-learning"))。本文试图解答以下问题:在grokking征象中,为何网络的测试性能在持续训练后急剧提高?只管网络已经在训练阶段得到了几乎完美的表现水平。
文中提出了一个令人信服的答案,并通过推测雷同情况中的多个新颖征象验证了这一答案。我们本渴望通过更深入明确训练动态来提升安全性,但遗憾的是,这个渴望并没有得到实现(不过仍有潜力通过这些看法检测到新能力)。因此,我们决定不再在“深度学习科学”范畴投入更多资源,因为另有其他更加有前景的研究方向。只管云云,我们对这一范畴的研究仍然充满热情,并等待看到更多的研究工作。
请注意:这里的翻译保留了原文中的链接和格式化标识符。
Explaining grokking through circuit efficiency was a foray into “science of deep learning”. It tackles the question: in grokking, why does the network’s test performance improve dramatically upon continued training, despite having already achieved nearly perfect training performance? It gives a compelling answer to this question, and validates this answer by correctly predicting multiple novel phenomena in a similar setting. We hoped that better understanding of training dynamics would enable improved safety, but unfortunately that hope has mostly not panned out (though it is still possible that the insights would help with detection of new capabilities). We’ve decided not to invest more in “science of deep learning”, because there are other more promising things to do, but we remain excited about it and would love to see more research on it.
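For readers who haven't seen grokking: the phenomenon is usually reproduced on small algorithmic tasks such as addition mod p, training on a fraction of all input pairs with weight decay. The sketch below shows that canonical setup with a small MLP stand-in for the usual one-layer transformer; it illustrates the phenomenon being studied rather than the paper's experimental configuration, and whether or when this exact model groks depends on the hyperparameters.

```python
import torch
import torch.nn as nn

P = 113                                                    # modulus for the a + b (mod P) task
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(perm) // 3], perm[len(perm) // 3:]   # train on a small fraction

model = nn.Sequential(                                     # stand-in for a small transformer
    nn.Embedding(P, 128), nn.Flatten(), nn.Linear(2 * 128, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)  # regularization matters here
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            test_acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        # In the grokking regime, train loss drops to near zero early, while test
        # accuracy only jumps much later in training.
        print(step, loss.item(), test_acc.item())
```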
Power-seeking can be probable and predictive for trained agents is a short paper building on the power-seeking framework that shows how the risk argument would be made from the perspective of goal misgeneralization of a learned agent. It still assumes that the AI system is pursuing a goal, but specifies that the goal comes from a set of goals that are consistent with the behavior learned during training.

What are we planning next?
Perhaps the most exciting and important project we are working on right now is revising our own high level approach to technical AGI safety. While our bets on frontier safety, interpretability, and amplified oversight are key aspects of this agenda, they do not necessarily add up to a systematic way of addressing risk. We’re mapping out a logical structure for technical misalignment risk, and using it to prioritize our research so that we better cover the set of challenges we need to overcome.
As part of that, we’re drawing attention to important areas that require addressing. Even if amplified oversight worked perfectly, that is not clearly sufficient to ensure alignment. Under distribution shift, the AI system could behave in ways that amplified oversight wouldn’t endorse, as we have previously studied in goal misgeneralization. Addressing this will require investments in adversarial training, uncertainty estimation, monitoring, and more; we hope to evaluate these mitigations in part through the control framework.
We’re looking forward to sharing more of our thoughts with you when they are ready for feedback and discussion. Thank you for engaging and for holding us to high standards for our work, epistemics, and actions.
   References
   last post: https://www.alignmentforum.org/posts/nzmCvRvPm4xJuqztv/deepmind-is-hiring-for-the-scalable-alignment-and-alignment
   Frontier Safety Framework: https://deepmind.google/discover/blog/introducing-the-frontier-safety-framework/
   responsible capability scaling: https://www.gov.uk/government/publications/emerging-processes-for-frontier-ai-safety/emerging-processes-for-frontier-ai-safety#responsible-capability-scaling
   Responsible Scaling Policy: https://www.anthropic.com/news/anthropics-responsible-scaling-policy
   Preparedness Framework: https://openai.com/preparedness/
   here: https://www.lesswrong.com/posts/y8eQjQaCamqdc842k/deepmind-s-frontier-safety-framework-is-weak-and-unambitious
   Evaluating Frontier Models for Dangerous Capabilities: https://arxiv.org/pdf/2403.13793
   Gemini 1.5: https://arxiv.org/abs/2403.05530
   Gemma 2: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf
   Model evaluation for extreme risks: https://arxiv.org/pdf/2305.15324
   Holistic Safety and Responsibility Evaluations of Advanced AI Models: https://arxiv.org/pdf/2404.14068
   Gated SAEs: https://arxiv.org/abs/2404.16014
   JumpReLU SAEs: https://arxiv.org/abs/2407.14435
   Gemma Scope: https://deepmind.google/discover/blog/gemma-scope-helping-the-safety-community-shed-light-on-the-inner-workings-of-language-models/
   here: https://huggingface.co/google/gemma-scope
   interactive demo of Gemma Scope: https://www.neuronpedia.org/gemma-scope
   Neuronpedia: https://www.neuronpedia.org/
   progress update: https://www.alignmentforum.org/posts/HpAr8k74mW4ivCvCu/progress-update-from-the-gdm-mech-interp-team-summary
   Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla: https://arxiv.org/pdf/2307.09458
   Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level: https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
   AtP∗: An efficient and scalable method for localizing LLM behaviour to components: https://arxiv.org/pdf/2403.00745
   Tracr: Compiled Transformers as a Laboratory for Interpretability: https://proceedings.neurips.cc/paper_files/paper/2023/file/771155abaae744e08576f1f3b4b7ac0d-Paper-Conference.pdf
   original debate protocol: https://arxiv.org/abs/1805.00899
   obfuscated arguments: https://www.alignmentforum.org/posts/PJLABqQ962hZEqhdB/debate-update-obfuscated-arguments-problem
   Doubly-efficient debate: https://arxiv.org/pdf/2311.14125
   inference-only experiments with debate: https://arxiv.org/pdf/2407.04622
   discovering agents: https://arxiv.org/abs/2208.08345
   causal world models are a key aspect of agent robustness: https://arxiv.org/abs/2402.10877
   alignment hope: https://www.alignmentforum.org/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without
   Challenges with unsupervised LLM knowledge discovery: https://arxiv.org/pdf/2312.10029
   Explaining grokking through circuit efficiency: https://arxiv.org/abs/2309.02390
   science of deep learning: https://www.alignmentforum.org/posts/tKYGvA9dKHa3GWBBk/theories-of-impact-for-science-of-deep-learning
   Power-seeking can be probable and predictive for trained agents: https://arxiv.org/pdf/2304.06528
   power-seeking framework: https://proceedings.neurips.cc/paper_files/paper/2022/file/cb3658b9983f677670a246c46ece553d-Paper-Conference.pdf
   goal misgeneralization: https://arxiv.org/abs/2210.01790
   control framework: https://www.alignmentforum.org/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled
