ToB企服应用市场:ToB评测及商务社交产业平台

标题: 面临威胁的人工智能代理综述(AI Agent):关键安全挑衅与未来途径综述 [打印本页]

作者: 鼠扑    时间: 2025-1-4 05:20
标题: 面临威胁的人工智能代理综述(AI Agent):关键安全挑衅与未来途径综述
今天带来一篇关于AI Agent的威胁综述。近期在Arxiv上出现的,这里翻译过来学习记录。
原文地址:https://arxiv.org/abs/2406.02630
作者及单位:
ZEHANG DENG∗, Swinburne University of Technology, Australia YONGJIAN GUO∗, Tianjin Univeristy, China CHANGZHOU HAN, Swinburne University of Technology, Australia WANLUN MA†, Swinburne University of Technology, Australia JUNWU XIONG, Ant Group, China SHENG WEN, Swinburne University of Technology, Australia YANG XIANG, Swinburne University of Technology, Australia
正文:

An Artificial Intelligence (AI) agent is a software entity that autonomously performs tasks or makes decisions based on pre-defined objectives and data inputs. AI agents, capable of perceiving user inputs, reasoning and planning tasks, and executing actions, have seen remarkable advancements in algorithm development and task performance. However, the security challenges they pose remain under-explored and unresolved. This survey delves into the emerging security threats faced by AI agents, categorizing them into four critical knowledge gaps: unpredictability of multi-step user inputs, complexity in internal executions, variability of operational environments, and interactions with untrusted external entities. By systematically reviewing these threats, this paper highlights both the progress made and the existing limitations in safeguarding AI agents. The insights provided aim to inspire further research into addressing the security threats associated with AI agents, thereby fostering the development of more robust and secure AI agent applications.
人工智能(AI)代理是一个软件实体,它可以自主实行使命或根据预界说的目的和数据输入做出决策。人工智能代理可以或许感知用户输入,推理和规划使命以及实行动作,在算法开发和使命性能方面取得了显着进步。然而,它们构成的安全挑衅仍旧没有得到充分探究和解决。 该调查深入研究了人工智能代理面临的新出现的安全威胁,将其分为四个关键的知识差距:多步用户输入的不可预测性,内部实行的复杂性,操作情况的可变性以及与不受信托的外部实体的交互。通过系统地回顾这些威胁,本文夸大了在保护AI代理方面取得的进展和存在的局限性。提供的看法旨在引发进一步研究解决与AI代理相干的安全威胁,从而促进更强大和安全的AI代理应用步伐的开发。
1 Introduction 1引言

AI agents are computational entities that demonstrate intelligent behavior through autonomy, reactivity, proactiveness, and social ability. They interact with their environment and users to achieve specific goals by perceiving inputs, reasoning about tasks, planning actions, and executing tasks using internal and external tools. AI agents, powered by large language models (LLMs) such ∗Both authors contributed equally to this research. †Corresponding author.
AI代理是通过自主性,反应性,自动性和社交本事展示智能行为的计算实体。他们与情况和用户交互,通过感知输入、推理使命、规划行动以及使用内部和外部工具实行使命来实现特定目的。
Despite the significant advancements in AI agents, their increasing sophistication also introduces new security challenges. Ensuring AI agent security is crucial due to their deployment in diverse and critical applications. AI agent security refers to the measures and practices aimed at protecting AI agents from vulnerabilities and threats that could compromise their functionality, integrity, and safety. This includes ensuring the agents can securely handle user inputs, execute tasks, and interact with other entities without being susceptible to malicious attacks or unintended harmful behaviors. These security challenges stem from four knowledge gaps that, if unaddressed, can lead to vulnerabilities [27, 97, 112, 192] and potential misuse [132].
尽管人工智能代理取得了显着进步,但其日益复杂也带来了新的安全挑衅。确保人工智能代理的安全性至关重要,因为它们部署在各种关键应用步伐中。AI代理安全是指旨在保护AI代理免受可能危及其功能,完备性和安全性的毛病和威胁的措施和实践。这包括确保代理可以安全地处置处罚用户输入,实行使命,并与其他实体进行交互,而不会受到恶意攻击或不测有害行为的影响。这些安全挑衅源于四个知识差距,假如不加以解决,可能导致毛病[27,97,112,192]和潜在的滥用[132]。

As depicted in Figure 1, the four main knowledge gaps in AI agent are 1) unpredictability of multistep user inputs, 2) complexity in internal executions, 3) variability of operational environments, and 4) interactions with untrusted external entities. The following points delineate the knowledge gaps in detail.
如图1所示,人工智能代理中的四个主要知识缺口是:1)多步用户输入的不可预测性,2)内部实行的复杂性,3)操作情况的可变性,以及4)与不受信托的外部实体的交互。以下几点详细说明了知识差距。

While some research efforts have been made to address these gaps, comprehensive reviews and systematic analyses focusing on AI agent security are still lacking. Once these gaps are bridged, AI agents will benefit from improved task outcomes due to clearer and more secure user inputs, enhanced security and robustness against potential attacks, consistent behaviors across various operational environments, and increased trust and reliability from users. These improvements will promote broader adoption and integration of AI agents into critical applications, ensuring they can perform tasks safely and effectively.
虽然已经做出了一些研究积极来解决这些差距,但仍旧缺乏针对人工智能代理安全的全面审查和系统分析。一旦这些差距被弥合,人工智能代理将受益于更清晰和更安全的用户输入,增强的安全性和针对潜在攻击的鲁棒性,在各种操作情况中的一致行为,以及用户的信托和可靠性。这些改进将促进人工智能代理更广泛的接纳和集成到关键应用步伐中,确保它们可以或许安全有效地实行使命。
Existing surveys on AI agents [87, 105, 160, 186, 211] primarily focus on their architectures and applications, without delving deeply into the security challenges and solutions. Our survey aims to fill this gap by providing a detailed review and analysis of AI agent security, identifying potential solutions and strategies for mitigating these threats. The insights provided are intended to inspire further research into addressing the security threats associated with AI agents, thereby fostering the development of more robust and secure AI agent applications.
现有的关于人工智能代理的调查[87,105,160,186,211]主要集中在它们的架构和应用步伐上,而没有深入研究安全挑衅和解决方案。我们的调查旨在弥补这一空白,提供对人工智能代理安全的详细审查和分析,确定缓解这些威胁的潜在解决方案和策略。所提供的看法旨在引发进一步研究解决与AI代理相干的安全威胁,从而促进更强大和安全的AI代理应用步伐的开发。
In this survey, we systematically review and analyze the threats and solutions of AI agent security based on four knowledge gaps, covering both the breadth and depth aspects. We primarily collected papers from top AI conferences, top cybersecurity conferences, and highly cited arXiv papers, spanning from January 2022 to April 2024. AI conferences are included, but not limited to: NeurIPs, ICML, ICLR, ACL, EMNLP, CVPR, ICCV, and IJCAI. Cybersecurity conferences are included but not limited: IEEE S&, USENIX Security, NDSS, ACM CCS.
在本次调查中,我们基于四个知识缺口,从广度和深度两个方面,系统地回顾和分析了人工智能主体安全的威胁和解决方案。我们主要网络了2022年1月至2024年4月期间的顶级人工智能集会、顶级网络安全集会和高引用arXiv论文。AI集会包括但不限于:NeurIPs,ICML,ICLR,ACL,EMNLP,CVPR,ICCV和IJCAI。网络安全集会包括但不限于:IEEE S&,USENIX Security,NDSS,ACM CCS。
The paper is organized as follows. Section 2 introduces the overview of AI agents. Section 3 depicts the single-agent security issue associated with Gap 1 and Gap 2. Section 4 analyses multi-agent security associated with Gap 3 and Gap 4. Section 5 offers future directions for the development of this field.
本文的结构如下。第2节先容了AI代理的概述。第3节描述了与Gap 1Gap 2相干的单代理安全问题。第4节分析了与Gap 3Gap 4相干的多代理安全性。第5节为这一领域的发展提供了未来的方向。
2 Overview Of Ai Agent AI Agent 同一概念框架下的AI Agent概述

Terminologies. To facilitate understanding, we introduce the following terms in this paper.
术语。为了便于明白,我们在本文中先容了以下术语。

Reasoning refers to a large language model designed to analyze and deduce information, helping to draw logical conclusions from given prompts. Planning, on the other hand, denotes a large language model tailored to assist in devising strategies and making decisions by evaluating possible outcomes and optimizing for specific objectives. The combination of LLMs for planning and reasoning is called the brain. External Tool callings are together named as the action. We name the combination of perception, brain, and action as Intra-execution in this survey. On the other hand, except for intra-execution, AI agents can interact with other AI agents, memories, and environments; we call it Interaction. These terminologies also could be explored in detail at [186].
推理是指一种大型语言模型,旨在分析和推断信息,资助从给定的提示中得出逻辑结论。另一方面,规划表现一个大型语言模型,用于通过评估可能的结果和优化特定目的来资助设计策略和做出决策。用于计划和推理的LLMs的组合被称为大脑外部工具调用一起定名为操作。我们将感知、大脑和行动的结合称为内部实行。另一方面,除了内部实行,AI代理可以与其他AI代理,影象和情况交互;我们称之为交互。这些术语也可以在[186]中详细探究。
In 1986, a study by Mukhopadhyay et al. [116] proposed multiple intelligent node document servers to efficiently retrieve knowledge from multimedia documents through user queries. The following work [10] also discovered the potential of computer assistants by interacting between the user and the computing system, highlighting significant research and application directions in the field of computer science. Subsequently, Wooldridge et al. [183] defined the computer assistant that demonstrates intelligent behavior as an agent. In the developing field of artificial intelligence, the agent is then introduced as a computational entity with properties of autonomy, reactivity, pro-activeness, and social ability [186]. Nowadays, thanks to the powerful capacity of large language models, the AI agent has become a predominant tool to assist users in performing tasks efficiently. As shown in Figure 2, the general workflow of AI agents typically comprises two core components: Intra-execution and Interaction. Intra-execution of the AI agent typically indicates the functionalities running within the single-agent architecture, including perception, brain, and action. Specifically, the perception provides brain with effective inputs, and the action deals with these inputs in subtasks by the LLM reasoning and planning capacities. Then, these subtasks are run sequentially by the action to invoke the tools. ① and ② indicates the iteration processes of the intra-execution. Interaction refers to the ability of an AI agent to engage with other external entities, primarily through external resources. This includes collaboration or competition within the multi-agent architecture, retrieval of memory during task execution, and the deployment of environment and its data use from external tools. Note that in this survey, we define memory as an external resource because the majority of memory-related security risks arise from the retrieval of external resources.
1986年,Mukhopadhyay et al. [116]提出了多个智能节点文档服务器,以通过用户查询从多媒体文档中有效地检索知识。接下来的工作[10]也发现了计算机助理通过用户和计算系统之间的交互的潜力,突出了计算机科学领域的重要研究和应用方向。随后,Wooldridge et al. [183]界说了一个计算机助理,它将智能行为体现为一个代理。在人工智能的发展领域中,智能体被引入作为具有自主性,反应性,自动性和社会本事的计算实体[186]。如今,由于大型语言模型的强大功能,人工智能代理已成为资助用户高效实行使命的主要工具。 如图2所示,AI代理的一般工作流程通常包括两个核心组件:内部实行和交互。AI代理的内部实行通常表现在单代理架构内运行的功能,包括感知,大脑和动作。详细而言,感知为大脑提供有效的输入,行动通过LLM推理和规划本事在子使掷中处置处罚这些输入。LLM然后,这些子使命由操作顺序运行以调用工具。①和②表现内部实行的迭代过程。交互是指人工智能主体与其他外部实体进行交互的本事,主要是通过外部资源。 这包括多代理架构中的协作或竞争,使命实行期间的内存检索,以及情况的部署及其外部工具的数据使用。请注意,在本调查中,我们将内存界说为外部资源,因为大多数与内存相干的安全风险都来自外部资源的检索。
AI agents can be divided into reinforcement-learning-based agents and LLM-based agents from the perspective of their core internal logic. RL-based agents use reinforcement learning to learn and optimize strategies through environment interaction, with the aim of maximizing accumulated rewards. These agents are effective in environments with clear objectives such as instruction following [75, 124] or building world model [108, 140], where they adapt through trial and error.
人工智能主体从其核心内部逻辑的角度可以分为基于重复学习的主体和基于LLM主体。基于RL的代理使用强化学习来学习和优化策略,通过情况交互,以最大化累积的奖励为目的。这些代理在具有明确目的的情况中是有效的,比方遵循指令[75,124]或建立天下模型[108,140],在那里他们通过试错来适应。
In contrast, LLM-based agents rely on large-language models [92, 173, 195]. They excel in natural language processing tasks, leveraging vast textual data to master language complexities for effective communication and information retrieval. Each type of agent has distinct capabilities to achieve specific computational tasks and objectives.
相比之下,基于LLM的代理依靠于大型语言模型[92,173,195]。他们擅长自然语言处置处罚使命,使用大量的文本数据来掌握语言的复杂性,以实现有效的沟通和信息检索。每种范例的代理都有不同的本事来实现特定的计算使命和目的。
2.2 Overview Of Ai Agent On Threats . AI Agent对威胁的研究综述

As of now, there are several surveys on AI agents [87, 105, 160, 186, 211]. For instance, Xi et al. [186] offer a comprehensive and systematic review focused on the applications of LLM-based agents, aiming to examine existing research and future possibilities in this rapidly developing field. The literature [105] summarized the current AI agent architecture. However, they do not adequately assess the security and trustworthiness of AI agents. Li et al. [87] failed to consider both the capability and security of multi-agent scenario. A study [160] provides the potential risks inherent only to scientific LLM agents. Zhang et al. [211] only survey on the memory mechanism of AI agents.
到目前为止,有几项关于AI代理的调查[87,105,160,186,211]。比方,Xi et al. [186]提供了一个全面和系统的审查集中在LLM为底子的代理商,旨在检查现有的研究和未来的可能性,在这个迅速发展的领域。文献[105]总结了当前的AI代理架构。然而,他们没有充分评估人工智能代理的安全性和可信度。Li等人[87]未能同时考虑多代理场景的本事和安全性。一项研究[160]提供了仅科学LLM试剂固有的潜在风险。Zhang等人[211]仅对人工智能主体的影象机制进行了综述。

Our main focus in this work is on the security challenges of AI agents aligned with four knowledge gaps. As depicted in Table 1, we have provided a summary of papers that discuss the security challenges of AI agents. Threat Source column identifies the attack strategies employed at various stages of the general AI agent workflow, categorized into four gaps. Threat Model column identifies potential adversarial attackers or vulnerable entities. Target Effects summarize the potential outcomes of security-relevant issues.
我们在这项工作中的主要重点是AI代理的安全挑衅与四个知识差距。如表1所示,我们提供了讨论AI代理安全挑衅的论文择要。威胁来源列确定了一般AI代理工作流程的各个阶段所接纳的攻击策略,分为四个差距。威胁模型列辨认潜在的敌对攻击者或脆弱实体。目的效应总结了安全相干问题的潜在结果。
We also provide a novel taxonomy of threats to the AI agent (See Figure 3). Specifically, we identify threats based on their source positions, including intra-execution and interaction.
我们还为AI代理提供了一种新的威胁分类(见图3)。详细而言,我们根据其来源位置(包括内部实行和**交互)**辨认威胁。

3 Intra-Execution Security

3内部实行安全

As mentioned in Gap 1 and 2, the single agent system has unpredictable multi-step user inputs and complex internal executions. In this section, we mainly explore these complicated intra-execution threats and their corresponding countermeasures. As depicted in Figure 2, we discuss the threats of the three main components of the unified conceptual framework on the AI agent.
如差距1和2中所述,单代理系统具有不可预测的多步用户输入和复杂的内部实行。在这一部门中,我们主要探究这些复杂的内部实行威胁及其相应的对策。如图2所示,我们讨论了同一概念框架的三个主要组件对AI代理的威胁。
3.1 Threats On Perception

3.1感知威胁

As illustrated in Figure 2 and Gap 1, to help the brain module understand system instruction, user input, and external context, the perception module includes multi-modal (i.e., textual, visual, and auditory inputs) and multi-step (i.e., initial user inputs, intermediate sub-task prompts, and human feedback) data processing during the interaction between humans and agents. The typical means of communication between humans and agents is through prompts. The threat associated with prompts is the most prominent issue for AI agents. This is usually named adversarial attacks. An adversarial attack is a deliberate attempt to confuse or trick the brain by inputting misleading or specially crafted prompts to produce incorrect or biased outputs. Through adversarial attacks, malicious users extract system prompts and other information from the contextual window [46]. Liu et al. [94] were the first to investigate adversarial attacks against the embodied AI agent, introducing spatiotemporal perturbations to create 3D adversarial examples that result in agents providing incorrect answers. Mo et al. [110] analyzed twelve hypothetical attack scenarios against AI agents based on the different threat models. The adversarial attack on the perception module includes prompt injection attacks [23, 49, 49, 130, 185, 196], indirect prompt injection attacks [23, 49, 49, 130, 185, 196] and jailbreak [15, 50, 83, 161, 178, 197]. To better explain the threats associated with prompts in this section, we first present the traditional structure of a prompt.
如图2和差距1所示,为了资助大脑模块明白系统指令、用户输入和外部上下文,感知模块包括多模态(文本、视觉和听觉输入)和多步骤(即,初始用户输入、中央子使命提示和人反馈)在人和代理之间的交互期间的数据处置处罚。人类和代理之间的典范通信方式是通过提示。与提示相干的威胁是AI代理最突出的问题。这通常被称为对抗性攻击。对抗性攻击是一种故意试图通过输入误导性或特制的提示来肴杂或欺骗大脑,以产生不精确或有偏见的输出。通过对抗性攻击,恶意用户从上下文窗口中提取系统提示和其他信息[46]。Liu等人[94]是第一个研究针对详细AI代理的对抗性攻击的人,引入时空扰动来创建3D对抗性示例,导致代理提供不精确的答案Mo等人[110]根据不同的威胁模型分析了12种针对AI代理的假设攻击场景。对感知模块的对抗性攻击包括提示注入攻击[23,49,49,130,185,196]、间接提示注入攻击[23,49,49,130,185,196]和越狱[15,50,83,161,178,197]。为了更好地解释本节中与提示相干的威胁,我们起首先容提示的传统结构。
The agent prompt structure can be composed of instruction, external context, user input. Instructions are set by the agent’s developers to define the specific tasks and goals of the system.
代理提示结构可以由指令、外部上下文、用户输入构成。指令由代理的开发职员设置,以界说系统的特定使命和目的。
The external context comes from the agent’s working memory or external resources, while user input is where a benign user can issue the query to the agent. In this section, the primary threats of jailbreak and prompt injection attacks originate from the instructions and user input, while the threats of indirect injection attacks stem from external contexts.
外部上下文来自代理的工作内存或外部资源,而用户输入是良性用户可以向代剃头出查询的地方。在本节中,越狱和提示注入攻击的主要威胁来自指令和用户输入,而间接注入攻击的威胁来自外部上下文。
3.1.1 Prompt Injection Attack.

3.1.1即时注入攻击。

The prompt injection attack is a malicious prompt manipulation technique in which malicious text is inserted into the input prompt to guide a language model to produce deceptive output [130]. Through the use of deceptive input, prompt injection attacks allow attackers to effectively bypass constraints and moderation policies set by developers of AI agents, resulting in users receiving responses containing biases, toxic content, privacy threats, and misinformation [72]. For example, malicious developers can transform Bing chat into a phishing agent [49]. The UK Cyber Agency has also issued warnings that malicious actors are manipulating the technology behind LLM chatbots to obtain sensitive information, generate offensive content, and trigger unintended consequences [61].
提示注入攻击是一种恶意提示操作技能,此中恶意文本被插入到输入提示中,以引导语言模型产生欺骗性输出[130]。通过使用欺骗性输入,即时注入攻击允许攻击者有效地绕过AI代理开发职员设置的约束和考核策略,导致用户收到包含偏见,有毒内容,隐私威胁和错误信息的相应。比方,恶意开发职员可以将Bing聊天转换为钓鱼代理[49]。英国网络管理局还发出警告,恶意行为者正在操纵LLM聊天机器人背后的技能,以获取敏感信息,生成攻击性内容,并引发意想不到的结果[61]。
The following discussion focuses primarily on the goal hijacking attack and the prompt leaking attack, which represent two prominent forms of prompt injection attacks [130], and the security threats posed by such attacks within AI agents.
下面的讨论主要集中在目的劫持攻击和即时泄漏攻击,这是即时注入攻击的两种主要形式[130],以及这些攻击在AI代理中构成的安全威胁。

3.1.1 Prompt injection attacks within agent-integrated frameworks.

3.1.1 代理集成框架内的快速注入攻击。

With the widespread adoption of AI agents, certain prompt injection attacks targeting individual AI agents can also generalize to deployments of AI agent-based applications [163], amplifying the associated security threats [97, 127]. For example, malicious users can achieve Remote Code Execution (RCE) through prompt injection, thereby remotely acquiring permissions for integrated applications [96]. Additionally, carefully crafted user inputs can induce AI agents to generate malicious SQL queries, compromising data integrity and security [127]. Furthermore, integrating these attacks into corresponding webpages alongside the operation of AI agents [49] leads to users receiving responses that align with the desires of the malicious actors, such as expressing biases or preferences towards products [72].
随着人工智能代理的广泛接纳,针对单个人工智能代理的某些即时注入攻击也可以推广到基于人工智能代理的应用步伐的部署[163],放大了相干的安全威胁[97,127]。比方,恶意用户可以通过提示注入实现远程代码实行(RCE),从而远程获取集成应用步伐的权限[96]。别的,精心制作的用户输入可能会诱导AI代理生成恶意SQL查询,从而损害数据完备性和安全性[127]。别的,将这些攻击与AI代理的操作一起集成到相应的网页中[49]会导致用户收到与恶意行为者的期望相一致的相应,比方表达对产品的偏见或偏好[72]。
In the case of closed-source AI agent integrated commercial applications, certain black-box prompt injection attacks [97] can facilitate the theft of service instruction [193], leveraging the computational capabilities of AI agents for zero-cost imitation services, resulting in millions of dollars in losses for service providers [97].
在闭源AI代理集成贸易应用的情况下,某些黑盒提示注入攻击[97]可以促进服务指令的盗取[193],使用AI代理的计算本事进行零本钱模拟服务,导致服务提供商损失数百万美元。
AI agents are susceptible to meticulously crafted prompt injection attacks [193], primarily due to conflicts between their security training and user instruction objectives [212]. Additionally, AI agents often prioritize system prompts on par with texts from untrusted users and third parties [168]. Therefore, establishing hierarchical instruction privileges and enhancing training methods for these models through synthetic data generation and context distillation can effectively improve the robustness of AI agents against prompt injection attacks [168]. Furthermore, the security threats posed by prompt injection attacks can be mitigated by various techniques, including inference-only methods for intention analysis [209], API defenses with added detectors [68], and black-box defense techniques involving multi-turn dialogues and context examples [3, 196].
AI代理容易受到精心制作的即时注入攻击[193],主要是由于其安全培训和用户指令目的之间的辩论[212]。别的,人工智能代理通常将系统提示与来自不受信托的用户和第三方的文本进行优先级排序[168]。因此,通过合成数据生成和上下文蒸馏建立分层指令特权并增强这些模型的练习方法可以有效提高AI代理对即时注入攻击的鲁棒性[168]。别的,可以通过各种技能来减轻即时注入攻击带来的安全威胁,包括用于意图分析的仅推理方法[209],添加检测器的API防御[68]以及涉及多轮对话和上下文示例的黑盒防御技能[3,196]。
To address the security threats inherent in agent-integrated frameworks, researchers have proposed relevant potential defensive strategies. Liu et al. [96] introduced LLMSMITH, which performs static analysis by scanning the source code of LLM-integrated frameworks to detect potential Remote Code Execution (RCE) vulnerabilities. Jiang et al. [72] proposed four key attributesintegrity, source identification, attack detectability, and utility preservation-to define secure LLMintegrated applications and introduced the shield defense to prevent manipulation of queries from users or responses from AI agents by internal and external malicious actors.
为相识决代理集成框架中固有的安全威胁,研究职员提出了相干的潜在防御策略。Liu等人[96]引入了LLMSMITH,它通过扫描LLM集成框架的源代码来实行静态分析,以检测潜在的远程代码实行(RCE)毛病。Jiang等人[72]提出了四个关键属性完备性,源辨认,攻击可检测性和实用步伐验证-以界说安全的LLM集成应用步伐,并引入了屏蔽防御以防止内部和外部恶意行为者操纵来自用户的查询或来自AI代理的相应。
3.1.2 Indirect Prompt Injection Attack.

3.1.2 间接快速注入攻击。

Indirect prompt injection attack [49] is a form of attack where malicious users strategically inject instruction text into information retrieved by AI agents [40], web pages [184], and other data sources. This injected text is often returned to the AI agent as internal prompts, triggering erroneous behavior, and thereby enabling remote influence over other users’ systems. Compared to prompt injection attacks, where malicious users attempt to directly circumvent the security restrictions set by AI agents to mislead their outputs, indirect prompt injection attacks are more complex and can have a wider range of user impacts [57]. When plugins are rapidly built to secure AI agents, indirect prompt injection can also be introduced into the corresponding agent frameworks. When AI agents use external plugins to query data injected with malicious instructions, it may lead to security and privacy issues. For example, web data retrieved by AI agents using web plugins could be misinterpreted as user instructions, resulting in extraction of historical conversations, insertion of phishing links, theft of GitHub code [204], or transmission of sensitive information to attackers [185]. More detailed information can also be found in Section 3.3.2. One of the primary reasons for the successful exploitation of indirect prompt injection on AI agents is the inability of AI agents to differentiate between valid and invalid system instructions from external resources. In other words, the integration of AI agents and external resources further blurs the distinction between data and instructions [49].
间接提示注入攻击[49]是一种攻击形式,恶意用户策略性地将指令文本注入到AI代理[40],网页[184]和其他数据源检索的信息中。这种注入的文本通常会作为内部提示返回给AI代理,触发错误行为,从而对其他用户的系统产生远程影响。与即时注入攻击相比,恶意用户试图直接绕过人工智能代理设置的安全限定以误导其输出,间接即时注入攻击更复杂,可以产生更广泛的用户影响[57]。当快速构建插件以保护AI代理时,也可以将间接提示注入引入相应的代理框架。当AI代理使用外部插件来查询注入恶意指令的数据时,可能会导致安全和隐私问题。 比方,AI代理使用Web插件检索的Web数据可能会被误解为用户指令,导致提取汗青对话,插入钓鱼链接,盗取GitHub代码[204]或将敏感信息传输给攻击者[185]。更详细的信息也可以在第3.3.2节中找到。在AI代理上成功使用间接提示注入的主要原因之一是AI代理无法区分来自外部资源的有效和无效系统指令。换句话说,人工智能代理和外部资源的集成进一步含糊了数据和指令之间的区别[49]。
To defend against indirect prompt attacks, developers can impose explicit constraints on the interaction between AI agents and external resources to prevent AI agents from executing external malicious data [185]. For example, developers can augment AI agents with user input references by comparing the original user input and current prompts and incorporating self-reminder functionalities. When user input is first entered, agents are reminded of their original user input references, thus distinguishing between external data and user inputs [14]. To reduce the success rate of indirect prompt injection attacks, several techniques can be employed. These include enhancing AI agents’ ability to recognize external input sources through data marking, encoding, and distinguishing between secure and insecure token blocks [57]. Additionally, the other effective measures can be applied, such as fine-tuning AI agents specifically for indirect prompt injection [196, 204], alignment [121], and employing methods such as prompt engineering and post-training classifier-based security approaches [68].
为了防御间接的即时攻击,开发职员可以对AI代理与外部资源之间的交互施加显式约束,以防止AI代理实行外部恶意数据[185]。比方,开发职员可以通过比较原始用户输入和当前提示并结合自我提示功能来增强AI代理的用户输入参考。当用户输入首次输入时,代理被提示其原始用户输入参考,从而区分外部数据和用户输入[14]。为了降低间接提示注入攻击的成功率,可以接纳多种技能。这些措施包括通过数据标记,编码和区分安全和不安全的令牌块来增强AI代理辨认外部输入源的本事[57]。 别的,可以应用其他有效措施,比方专门针对间接提示注入[196,204],对齐[121]进行微调AI代理,以及接纳提示工程和基于练习后分类器的安全方法[68]等方法。
Current research methods primarily focus on straightforward scenarios where user instructions and external data are input into AI agents. However, with the widespread adoption of agentintegrated frameworks, the effectiveness of these methods in complex real-world scenarios warrants further investigation.
目前的研究方法主要集中在用户指令和外部数据输入到AI代理的简单场景。然而,随着agent integrated框架的广泛接纳,这些方法在复杂的现实天下中的有效性值得进一步研究。
3.1.3 Jailbreak.

3.1.3越狱。

Jailbreak[26] refers to scenarios where users deliberately attempt to deceive or manipulate AI agents to bypass their built-in security, ethical, or operational guidelines, resulting in the generation of harmful responses. In contrast to prompt injection, which arises from the AI agent’s inability to distinguish between user input and system instructions, jailbreak occurs due to the AI agent’s inherent susceptibility to being misled by user instructions. Jailbreak can be categorized into two main types: manual design jailbreak and automated jailbreak.
越狱是指用户故意试图欺骗或操纵人工智能代理以绕过其内置的安全,道德或操作准则,从而产生有害相应的情况。与由于AI代理无法区分用户输入和系统指令而引起的提示注入相反,越狱由于AI代理固有的易受用户指令误导而发生。越狱可以分为两种主要范例:手动设计越狱和自动越狱。

3.2 Threats On Brain 3.2威胁大脑

As described in Figure 2, the brain module undertakes reasoning and planning to make decisions by using LLM. The brain is primarily composed of a large language model, which is the core of an AI agent. To better explain threats in the brain module, we first show the traditional structure of the brain.
如图2所示,大脑模块通过使用LLM进行推理和规划以做出决策。大脑主要由大型语言模型构成,这是AI代理的核心。为了更好地解释大脑模块中的威胁,我们起首展示了大脑的传统结构。
The brain module of AI agents can be composed of reasoning, planning, and decision-making, where they are able to process the prompts from the perception module. However, the brain module of agents based on large language models (LLMs) is not transparent, which diminishes their trustworthiness. The core component, LLMs, is susceptible to backdoor attacks. Their robustness against slight input modifications is inadequate, leading to misalignment and hallucination. Additionally, concerning the reasoning structures of the brain, chain-of-thought (CoT), they are prone to formulating erroneous plans, especially when tasks are complex and require long-term planning, thereby exposing planning threats. In this section, we will mainly consider Gap 2, and discuss backdoor attacks, misalignment, hallucinations, and planning threats.
AI代理的大脑模块可以由推理,规划和决策构成,在那里他们可以或许处置处罚来自感知模块的提示。然而,基于大型语言模型(LLMs)的智能体的大脑模块是不透明的,这降低了它们的可信度。核心组件LLMs容易受到后门攻击。它们对稍微输入修改的鲁棒性不足,导致不对准和幻觉。别的,**关于大脑的推理结构,即思维链(CoT),他们很容易制定错误的计划,**特别是当使命复杂且必要长期计划时,从而暴露出计划威胁。在本节中,我们将主要考虑差距2,并讨论后门攻击,错位,幻觉和规划威胁。
3.2.1 Backdoor Attacks.

3.2.1后门攻击

Backdoor attacks are designed to insert a backdoor within the LLM of the brain, enabling it to operate normally with benign inputs but produce malicious outputs when the input conforms to a specific criterion, such as the inclusion of a backdoor trigger. In the natural language domain, backdoor attacks are mainly achieved by poisoning data during training to implant backdoors. This is accomplished primarily by poisoning a portion of training data with triggers, which causes the model to learn incorrect correlations. Previous research [78, 169] has illustrated the severe outcomes of backdoor attacks on LLMs. Given that agents based on LLMs employ these models as their core component, it is plausible to assert that such agents are also significantly vulnerable to these attacks.
后门攻击旨在在大脑的LLM中插入后门,**使其可以或许在良性输入的情况下正常运行,但当输入符合特定尺度(比方包含后门触发器)时会产生恶意输出。**在自然语言领域,后门攻击主要是通过在练习过程中对数据下毒来植入后门。这主要是通过用触发器毒害一部门练习数据来实现的,这会导致模型学习不精确的相干性。以前的研究[78,169]已经说明了对LLMs后门攻击的严峻结果。鉴于基于LLMs的代理将这些模型作为其核心组件,可以合理地断言这些代理也很容易受到这些攻击。
In contrast to conventional LLMs that directly produce final outputs, agents accomplish tasks through executing multi-step intermediate processes and optionally interacting with the environment to gather external context prior to output generation. This expanded input space of AI agents offers attackers more diverse attack vectors, such as the ability to manipulate any stage of the agents’ intermediate reasoning processes. Yang et al. [192] categorized two types of backdoor attacks against agents.
与直接产生终极输出的传统LLMs相比,代理通过实行多步骤中央过程并可选地与情况交互以在输出生成之前网络外部上下文来完成使命。AI代理的这种扩展的输入空间为攻击者提供了更多样化的攻击向量,比方操纵代理中央推理过程的任何阶段的本事。Yang等人[192]将针对代理的后门攻击分为两种范例。
First, the distribution of the final output is altered. The backdoor trigger can be hidden in the user query or in intermediate results. In this scenario, the attacker’s goal is to modify the original reasoning trajectory of the agent. For example, when a benign user inquires about product recommendations, or during an agent’s intermediate processing, a critical attacking trigger is activated. Consequently, the response provided by the agent will recommend a product dictated by the attacker.
**第一,**改变了终极产出的分布。后门触发器可以隐蔽在用户查询或中央结果中。在这种情况下,攻击者的目的是修改代理的原始推理轨迹。比方,当良性用户询问产品推荐时,或者在代理的中央处置处罚期间,激活关键攻击触发器。因此,代理提供的相应将推荐攻击者指定的产品。
Secondly, the distribution of the final output remains unchanged. Agents execute tasks by breaking down the overall objective into intermediate steps. This approach allows the backdoor pattern to manifest itself by directing the agent to follow a malicious trajectory specified by the attacker, while still producing a correct final output. This capability enables modifications to the intermediate reasoning and planning processes. For example, a hacker could modify a software system to always use Adobe Photoshop for image editing tasks while deliberately excluding other programs. Dong et al. [34] developed an email assistant agent containing a backdoor. When a benign user commands it to send an email to a friend, it inserts a phishing link into the email content and then reports the task status as finished.
**其次,**终极产出的分配保持不变。代理通过将总体目的分解为中央步骤来实行使命。这种方法允许后门模式通过引导代理遵循攻击者指定的恶意轨迹来体现自己,同时仍旧产生精确的终极输出。此功能允许对中央推理和规划过程进行修改。比方,黑客可以修改软件系统,使其始终使用Adobe Photoshop进行图像编辑使命,同时故意扫除其他步伐。Dong等人[34]开发了一个包含后门的电子邮件助理代理。当一个良性用户下令它向朋友发送电子邮件时,它会在电子邮件内容中插入一个钓鱼链接,然后报告使命状态为已完成。
Unfortunately, current defenses against backdoor attacks are still limited to the granularity of the model, rather than to the entire agent ecosystem. The complex interactions within the agent make defense more challenging. These model-based backdoor defense measures mainly include eliminating triggers in poison data [33], removing backdoor-related neurons [76], or trying to recover triggers [18]. However, the complexity of agent interactions clearly imposes significant limitations on these defense methods. We urgently require additional defense measures to address agent-based backdoor attacks. 3.2.2 Misalignment. Alignment refers to the ability of AI agents to understand and execute human instructions during widespread deployment, ensuring that the agent’s behavior aligns with human expectations and objectives, providing useful, harmless, unbiased responses. Misalignment in AI agents arises from unexpected discrepancies between the intended function of the developer and the intermediate executed state. This misalignment can lead to ethical and social threats associated with LLMs, such as discrimination, hate speech, social rejection, harmful information, misinformation, and harmful human-computer interaction [8]. The Red Teaming of Unalignment proposed by Rishabh et al. [8] demonstrates that using only 100 samples, they can “jailbreak” ChatGPT with an 88% success rate, exposing hidden harms and biases within the brain module of AI agents. We categorize the potential threat scenarios that influence misalignment in the brains of AI agents into three types: misalignment in training data, misalignment between humans and agents, and misalignment in embodied environments.
不幸的是,目前对后门攻击的防御仍旧局限于模型的粒度,而不是整个代理生态系统。智能体内部复杂的相互作用使防御更具挑衅性。这些基于模型的后门防御措施主要包括消除毒药数据中的触发器[33],删除后门相干神经元[76]或尝试恢复触发器[18]。然而,代理交互的复杂性显然对这些防御方法施加了庞大限定。我们迫切必要额外的防御措施来解决基于代理的后门攻击。3.2.2未对准。对齐是指AI代理在广泛部署期间明白和实行人类指令的本事,确保代理的行为与人类的期望和目的保持一致,提供有效,无害,公正的相应。 AI代理中的不对齐是由开发职员的预期功能和中央实行状态之间的不测差异引起的。这种不一致可能导致与LLMs相干的道德和社会威胁,比方鄙视,仇恨言论,社会排挤,有害信息,错误信息和有害的人机交互[8]。Rishabh等人提出的不缔盟的赤色团队*。*[8]证明,仅使用100个样本,他们就可以以88%的成功率“越狱”ChatGPT,暴露AI代理大脑模块中隐蔽的危害和偏见。我们将影响AI代理大脑中的未对准的潜在威胁场景分为三种范例:练习数据中的未对准,人类与代理之间的未对准以及详细情况中的未对准。

3.2.4 Planning Threats.

3.2.4计划威胁。

The concept of planning threats suggests that AI agents are susceptible to generating flawed plans, particularly in complex and long-term planning scenarios. Flawed plans are characterized by actions that contravene constraints originating from user inputs because these inputs define the requirements and limitations that the intermediate plan must adhere to.
规划威胁的概念表明,人工智能代理很容易生成有缺陷的计划,特别是在复杂和长期的规划场景中。有缺陷的计划的特征是违反源自用户输入的约束的行为,因为这些输入界说了中央计划必须遵守的要求和限定。
Unlike adversarial attacks, which are initiated by malicious attackers, planning threats arise solely from the inherent robustness issues of LLMs. A recent work [71] argues that an agent’s chain of thought (COT) may function as an “error amplifier”, whereby a minor initial mistake can be continuously magnified and propagated through each subsequent action, ultimately leading to catastrophic failures.
与恶意攻击者发起的对抗性攻击不同,规划威胁完全来自LLMs固有的鲁棒性问题。近来的一项工作[71]认为,代理人的思维链(COT)可能会起到“错误放大器”的作用,因此最初的一个小错误可以通过随后的每个动作不断放大和传播,终极导致灾难性的失败。
Various strategies have been implemented to regulate the text generation of LLMs, including the application of hard constraints [12], soft constraints [101], or a combination of both [20]. However, the emphasis on controlling AI agents extends beyond the mere generation of text to the validity of plans and the use of tools. Recent research has employed LLMs as parsers to derive a sequence of tools from the texts generated in response to specifically crafted prompts. Despite these efforts, achieving a high rate of valid plans remains a challenging goal.
已经实施了各种策略来规范LLMs的文本生成,包括应用硬约束[12]、软约束[101]或两者的组合[20]。然而,对控制人工智能主体的夸大不仅仅是文本的生成,还包括计划的有效性和工具的使用。近来的研究接纳LLMs作为解析器,从相应于专门制作的提示而生成的文本中导出一系列工具。尽管作出了这些积极,实现高有效计划率仍旧是一个具有挑衅性的目的。
To address this issue, current strategies are divided into two approaches. The first approach involves establishing policy-based constitutional guidelines [63], while the second involves human users constructing a context-free grammar (CFG) as the formal language to represent constraints for the agent [88]. The former sets policy-based standard limitations on the generation of plans during the early, middle and late stages of planning. The latter method converts a context-free grammar (CFG) into a pushdown automaton (PDA) and restricts the language model (LLM) to only select valid actions defined by the PDA at its current state, thereby ensuring that the constraints are met in the final generated plan.
为解决这一问题,目前的战略分为两种办法。第一种方法涉及建立基于政策的宪法指南[63],而第二种方法涉及人类用户构建上下文无关语法(CFG)作为形式语言来表现代理的约束[88]。前者在规划的早期、中期和后期阶段对计划的生成设定了基于政策的尺度限定。后一种方法将上下文无关语法(CFG)转换为下推自动机(PDA),并限定语言模型(LLM)仅选择PDA在其当前状态下界说的有效操作,从而确保在终极生成的计划中满意约束。
3.3 Threats On Action 威胁行动

In connection with Gap 2, within a single agent, there exists an invisible yet complex internal execution process, which complicates the monitoring of internal states and potentially leads to numerous security threats. These internal executions are often called actions, which are tools utilized by the agent (e.g., calling APIs) to carry out tasks as directed by users. To better understand the action threats, we present the action structure as follows:
在Gap 2中,在单个代理中,存在一个不可见但复杂的内部实行过程,这使得内部状态的监控变得复杂,并可能导致许多安全威胁。这些内部实行通常被称为动作,它们是代理使用的工具(*比方,*调用API)来实行用户指示的使命。为了更好地明白动作威胁,我们将动作结构出现如下:

We categorize the threats of actions into two directions. One is the threat during the communication process between the agent and the tool (i.e., occurring in the input, observation, and final answer), termed Agent2Tool threats. The second category relates to the inherent threats of the tools and APIs themselves that the agent uses (i.e., occurring in the action execution). Utilizing these APIs may increase its vulnerability to attacks, and the agent can be impacted by misinformation in the observations and final answer, which we refer to as Supply Chain threats.
我们把行动的威胁分为两个方向。一个是Agent和工具之间通信过程中的威胁即,发生在输入、观察和终极答案中),称为Agent2Tool威胁第二类涉及代理使用的工具和API本身的固有威胁即,发生在动作实行中)。使用这些API可能会增加其对攻击的脆弱性,并且代理可能会受到观察和终极答案中的错误信息的影响,我们将其称为供应链威胁
3.3.1 Agent2Tool Threats. Agent2Tool威胁。

Agent2Tool threats refer to the hazards associated with the exchange of information between the tool and the agent. These threats are generally classified as either active or passive. In active mode, the threats originate from the action input provided by LLMs.
**Agent2Tool威胁是指与工具和代理之间的信息交换相干的伤害。**这些威胁通常分为自动或被动两类。在自动模式下,威胁源自LLMs提供的操作输入。
Specifically, after reasoning and planning, the agent seeks a specific tool to execute subtasks. As an auto-regressive model, the LLM generates plans based on the probability of the next token, which introduces generative threats that can impact the tool’s performance. ToolEmu [141] identifies some failures of AI agents since the action execution requires excessive tool permissions, leading to the execution of highly risky commands without user permission. The passive mode, on the other hand, involves threats that stem from the interception of observations and final answers of normal tool usage. This interception can breach user privacy, potentially resulting in inadvertent disclosure of user data to third-party companies during transmission to the AI agent and the tools it employs. This may lead to unauthorized use of user information by these third parties. Several existing AI agents using tools have been reported to suffer user privacy breaches caused by passive models, such as HuggingGPT [149] and ToolFormer [144].
详细来说,颠末推理和规划后,代理会探求特定的工具来实行子使命。作为一种自回归模型,LLM基于下一个令牌的概率生成计划,这引入了可能影响工具性能的生成威胁。ToolEmu [141]辨认了AI代理的一些故障,因为动作实行必要过多的工具权限,从而导致在没有效户权限的情况下实行高风险下令。另一方面,被动模式涉及源于对正常工具使用的观察和终极答案的拦截的威胁。这种拦截可能会侵犯用户隐私,可能导致用户数据在传输到AI代理及其使用的工具期间无意中泄漏给第三方公司。这可能导致这些第三方未经授权使用用户信息。 据报道,一些现有的使用工具的人工智能代理会遭受被动模型造成的用户隐私泄漏,如HuggingGPT [149]和ToolFormer [144]。
To mitigate the previously mentioned threats, a relatively straightforward approach is to defend against the active mode of Agent2Tool threats. ToolEmu has designed an isolated sandbox and the corresponding emulator that simulates the execution of an agent’s subtasks within the sandbox, assessing their threats before executing the commands in a real-world environment. However, its effectiveness heavily relies on the quality of the emulator. Defending against passive mode threats is more challenging because these attack strategies are often the result of the agent’s own incomplete development and testing. Zhang et al. [207] integrated a homomorphic encryption scheme and deployed an attribute-based forgery generative model to safeguard against privacy breaches during communication processes. However, this approach incurs additional computational and communication costs for the agent. A more detailed discussion on related development and testing is presented in Section 4.1.2.
为了减轻前面提到的威胁,一个相对简单的方法是防御Agent2Tool威胁的自动模式。ToolEmu设计了一个隔离的沙箱和相应的模拟器,模拟沙箱中代理子使命的实行,在现实情况中实行下令之前评估它们的威胁。然而,它的有效性在很大程度上依靠于仿真器的质量。防御被动模式的威胁更具挑衅性,因为这些攻击策略通常是代理自己不完备的开发和测试的结果。Zhang等人[207]集成了同态加密方案,并部署了基于属性的伪造生成模型,以防止通信过程中的隐私泄漏。然而,这种方法会为代理带来额外的计算和通信本钱。有关相干开发和测试的更详细讨论见第4.1节。
3.3.2 Supply Chain Threats. 供应链威胁

Supply chain threats refer to the security vulnerabilities inherent in the tools themselves or to the tools being compromised, such as through buffer overflow, SQL injection, and cross-site scripting attacks. These vulnerabilities result in the action execution deviating from its intended course, leading to undesirable observations and final answers. WIPI [184] employs an indirect prompt injection attack, using a malicious webpage that contains specifically crafted prompts. When a typical agent accesses this webpage, both its observations and final answers are deliberately altered. Similarly, malicious users can modify YouTube transcripts to change the content that ChatGPT retrieves from these transcripts [61]. Webpilot [39] is designed as a malicious plugin for ChatGPT, allowing it to take control of a ChatGPT chat session and exfiltrate the history of the user conversation when ChatGPT invokes this plugin.
供应链威胁是指工具本身固有的安全毛病或工具受到威胁,比方通过缓冲区溢出,SQL注入和跨站点脚本攻击。这些毛病会导致动作实行偏离其预期路线,导致不期望的观察结果和终极答案。WIPI [184]接纳间接提示注入攻击,使用包含专门制作的提示的恶意网页。当一个典范的代理访问这个网页时,它的观察结果和终极答案都被故意改变。同样,恶意用户可以修改YouTube成绩单,以更改ChatGPT从这些成绩单中检索的内容[61]。Webpilot [39]被设计为ChatGPT的恶意插件,允许它控制ChatGPT聊天会话,并在ChatGPT调用此插件时泄漏用户对话的汗青记录。
To mitigate supply chain threats, it is essential to implement stricter supply chain auditing policies and policies for agents to invoke only trusted tools. Research on this aspect is rarely mentioned in the field.
为了减轻供应链威胁,必须实施更严格的供应链审计策略,并让代理只调用可信工具。这方面的研究在该领域很少被提及。
4.1 Threats On Agent2Environment . Agent2Environment面临的威胁

In light of Gap 3 (Variability of operational environments), we shift our focus to exploring the issue of environmental threats, scrutinizing how different types of environment affect and are affected by agents. For each environmental paradigm, we identify key security concerns, advantages in safeguarding against hazards, and the inherent limitations in ensuring a secure setting for interaction.
鉴于差距3(运营情况的可变性),我们将重点转移到探索情况威胁的问题,细致研究不同范例的情况怎样影响和受代理人的影响。对于每一个情况范例,我们确定关键的安全问题,在防范伤害的优势,并确保一个安全的互动设置的固有限定。
4.1.1 Simulated & Sandbox Environment. 模拟和沙盒情况。

In the realm of computational linguistics, a simulated environment within an AI agent refers to a digital system where the agent operates and interacts [44, 93, 125]. This is a virtual space governed by programmed rules and scenarios that mimic real-world or hypothetical situations, allowing the AI agent to generate responses and learn from simulated interactions without the need for human intervention. By leveraging vast datasets and complex algorithms, these agents are designed to predict and respond to textual inputs with human-like proficiency.
在计算语言学领域,人工智能主体内的模拟情况指的是主体操作和交互的数字系统[44,93,125]。这是一个假造空间,由模拟现实天下或假设情况的编程规则和场景控制,允许AI代理生成相应并从模拟的交互中学习,而无需人工干预。通过使用庞大的数据集和复杂的算法,这些智能体被设计为以类似人类的纯熟程度来预测和相应文本输入。
However, the implementation of AI agents in simulated environments carries inherent threats.
然而,在模拟情况中实现AI代理会带来固有的威胁。
We list two threats below:
我们在下面列出两个威胁:

To address these concerns from the root, it is essential to implement rigorous ethical guidelines and oversight mechanisms that ensure the responsible use of simulated environments in AI agents.
为了从根本上解决这些问题,必须实施严格的道德准则和监视机制,以确保在人工智能代理中负责任地使用模拟情况。
4.1.2 Development & Testing Environment. 开发和测试情况。

The development and testing environment for AI agents serves as the foundation for creating sophisticated AI systems. The development & testing environment for AI agents currently includes two types: the first type involves the fine-tuning of large language models, and the second type involves using APIs of other pre-developed models. Most AI agent developers tend to use APIs from other developed LLMs. This approach raises potential security issues, specifically with regard to how to treat third-party LLM API providers—are they trusted entities or not? As discussed in Section 3.2, LLM APIs could be compromised by backdoor attacks, resulting in the “brain” of the AI agent being controlled by others.
AI代理的开发和测试情况是创建复杂AI系统的底子。目前,人工智能代理的开发和测试情况包括两种范例:第一种范例涉及大型语言模型的微调,第二种范例涉及使用其他预开发模型的API。大多数AI代理开发职员倾向于使用来自其他开发的LLMsAPI。这种方法引起了潜在的安全问题,特别是关于怎样对待第三方LLMAPI提供者-他们是否是可信实体?正如第3.2节所讨论的,LLMAPI可能会受到后门攻击的危害,导致AI代理的“大脑”被其他人控制。
To mitigate these threats, a strategic approach centered on the selection of development tools and frameworks that incorporate robust security measures is imperative. Firstly, the establishment of security guardrails for LLMs is paramount. These guardrails are designed to ensure that LLMs generate outputs that adhere to predefined security policies, thereby mitigating threats associated with their operation. Tools such as GuardRails AI [51] and NeMo Guardrails [138] exemplify mechanisms that can prevent LLMs from accessing sensitive information or executing potentially harmful code. The implementation of such guardrails is critical for protecting data and systems against breaches.
为了减轻这些威胁,必须接纳一种战略方法,重点是选择包含强大安全措施的开发工具和框架。起首,为LLMs建立安全护栏至关重要。这些护栏旨在确保LLMs生成符合预界说安全策略的输出,从而减轻与其操作相干的威胁。GuardRails AI [51]和NeMo Guardrails [138]等工具可以制止LLMs访问敏感信息或实行潜在有害代码。实施这些防护措施对于保护数据和系统免受粉碎至关重要。
Moreover, the management of caching and logging plays a crucial role in securing LLM development environments. Secure caching mechanisms, exemplified by Redis and GPTCache [6], enhance performance while ensuring data integrity and access control. Concurrently, logging, facilitated by tools like MLFlow [201] and Weights & Biases [102], provides a comprehensive record of application activities and state changes. This record is indispensable for debugging, monitoring, and maintaining accountability in data processing, offering a chronological trail that aids in the swift identification and resolution of issues.
别的,缓存和日记记录的管理在保护LLM开发情况中起着至关重要的作用。以Redis和GPTCache [6]为例的安全缓存机制在确保数据完备性和访问控制的同时提高了性能。同时,由MLFlow [201]和Weights & Biases [102]等工具提供的日记记录提供了应用步伐运动和状态更改的全面记录。该记录对于数据处置处罚中的调试、监控和维护责任是必不可少的,它提供了时间线索,有助于快速辨认和解决问题。
Lastly, model evaluation [137] is an essential component of the development process. It involves evaluating the performance of LLMs to confirm their accuracy and functionality. Through evaluation, developers can identify and rectify potential biases or flaws, facilitating the adjustment of model weights and improvements in performance. This process ensures that LLMs operate as intended and meet the requisite reliability standards. The security of AI agent development and testing environments is a multifaceted issue that requires a comprehensive strategy encompassing the selection of frameworks or orchestration tools with built-in security features, the establishment of security guardrails, and the implementation of secure caching, logging, and model evaluation practices. By prioritizing security in these areas, organizations can significantly reduce the threats associated with the development and deployment of AI agents, thereby safeguarding the confidentiality and integrity of their data and models.
最后,模型评估[137]是开发过程的重要构成部门。它涉及评估LLMs的性能,以确认其正确性和功能。通过评估,开发职员可以辨认和改正潜在的偏差或缺陷,促进模型权重的调整和性能的改善。这一过程确保LLMs按预期运行并符合须要的可靠性尺度。人工智能代理开发和测试情况的安全性是一个多方面的问题,必要一个全面的策略,包括选择具有内置安全功能的框架或编排工具,建立安全护栏,以及实施安全缓存,日记记录和模型评估实践。 通过优先考虑这些领域的安全性,组织可以显著镌汰与AI代理的开发和部署相干的威胁,从而保护其数据和模型的秘密性和完备性。
4.1.3 Computing Resources Management Environment. 计算资源管理情况。

The computing resources management environment of AI agents refers to the framework or system that oversees the allocation, scheduling, and optimization of computational resources, such as CPU, GPU, and memory, to efficiently execute tasks and operations. An imperfect agent computing resource management environment can also make the agent more vulnerable to attacks by malicious users, potentially compromising its functionality and security. There are four kinds of attacks:
人工智能代理的计算资源管理情况是指监视CPU、GPU和内存等计算资源的分配、调度和优化以高效实行使命和操作的框架或系统。不完善的代理计算资源管理情况也会使代理更容易受到恶意用户的攻击,从而可能损害其功能和安全性。有四种攻击:

4.1.4 Physical Environment. 物理情况

The term “physical environment” pertains to the concrete, tangible elements and areas that make up our real-world setting, encompassing all actual physical spaces and objects. The physical environment of an AI agent typically refers to the collective term for all external entities that are encountered or utilized during the operation of the AI agent. In reality, the security threats in the physical environment are far more varied and numerous than those in the other environments due to the inherently more complex nature of the physical settings agents encounter.
“物理情况”一词涉及构成我们现实天下情况的详细、有形的元素和区域,包括全部实际的物理空间和物体。AI代理的物理情况通常是指在AI代理的操作期间遇到或使用的全部外部实体的集合术语。实际上,由于代理遇到的物理设置的固有的更复杂的性质,物理情况中的安全威胁比其他情况中的安全威胁更加多样和浩繁。
In the physical environment, agents often employ a variety of hardware devices to gather external resources and information, such as sensors, cameras, and microphones. At this stage, given that the hardware devices themselves may pose security threats, attackers can exploit vulnerabilities to attack and compromise hardware such as sensors, thereby preventing the agent from timely receiving external information and resources, indirectly leading to a denial of service for the agent. In physical devices integrated with sensors, there may be various types of security vulnerabilities. For instance, hardware devices with integrated Bluetooth modules could be susceptible to Bluetooth attacks, leading to information leakage and denial of service for the agent[114]. Additionally, outdated versions and unreliable hardware sources might result in numerous known security vulnerabilities within the hardware devices. Therefore, employing reliable hardware devices and keeping firmware versions up to date can effectively prevent the harm caused by vulnerabilities inherent in physical devices.
在物理情况中,代理通常使用各种硬件装备来网络外部资源和信息,比方传感器,摄像机和麦克风。在此阶段,鉴于硬件装备本身可能构成安全威胁,攻击者可以使用毛病攻击和危害传感器等硬件,从而制止代理及时接收外部信息和资源,间接导致代理拒绝服务。在与传感器集成的物理装备中,可能存在各种范例的安全毛病。比方,具有集成蓝牙模块的硬件装备可能容易受到蓝牙攻击,导致代理的信息泄漏和拒绝服务[114]。别的,过时的版本和不可靠的硬件来源可能会导致硬件装备中存在许多已知的安全毛病。 因此,接纳可靠的硬件装备并保持固件版本最新,可以有效防止物理装备固有的毛病造成的危害。
Simultaneously, in the physical environment, resources and information are input into the agent in various forms for processing, ranging from simple texts and sensor signals to complex data types such as audio and video. These data often exhibit higher levels of randomness and complexity, allowing attackers to intricately disguise harmful inputs, such as Trojans, within the information collected by hardware devices. If they are not properly processed, these can lead to severe security issues. Taking the rapidly evolving field of autonomous driving safety research as an example, the myriad sensors integrated into vehicles often face the threats of interference and spoofing attacks [28]. Similarly, for hardware devices integrated with agents, there exists a comparable threat. Attackers can indirectly affect an agent system’s signal processing by interfering with the signals collected by sensors, leading to the agent misinterpreting the information content or being unable to read it at all. This can even result in deception or incorrect guidance regarding the agent’s subsequent instructions and actions. Therefore, after collecting inputs from the physical environment, agents need to conduct security checks on the data content and promptly filter out information containing threats to ensure the safety of the agent system.
同时,在物理情况中,资源和信息以各种形式输入到代理中进行处置处罚,从简单的文本和传感器信号到复杂的数据范例,如音频和视频。这些数据通常体现出更高的随机性和复杂性,使攻击者可以或许在硬件装备网络的信息中复杂地伪装有害输入,比方特洛伊木马。假如处置处罚不当,可能会导致严峻的安全问题。以快速发展的自动驾驶安全研究领域为例,集成到车辆中的无数传感器常常面临干扰和欺骗攻击的威胁。同样,对于与代理集成的硬件装备,存在类似的威胁。 攻击者可以通过干扰传感器网络的信号来间接影响代理系统的信号处置处罚,导致代理误解信息内容或根本无法读取信息。这甚至可能导致欺骗或不精确的指导有关代理的后续指令和行动。因此,在从物理情况中网络输入后,代理必要对数据内容进行安全检查,并及时过滤掉包含威胁的信息,以确保代理系统的安全。
Due to the inherent randomness in the responses of existing LLMs to queries, the instructions sent by agents to hardware devices may not be correct or appropriate, potentially leading to the execution of an erroneous movement [153]. Compared to the virtual environment, the instructions generated by LLMs in agents within the physical environment may not be well understood and executed by the hardware devices responsible for carrying out these commands. This discrepancy can significantly affect the agent’s work efficiency. Additionally, given the lower tolerance for errors in the physical environment, agents cannot be allowed multiple erroneous attempts in a real-world setting. Should the LLM-generated instructions not be well understood by hardware devices, the inappropriate actions of the agent might cause real and irreversible harm to the environment.
由于现有LLMs对查询的相应具有固有的随机性,代理向硬件装备发送的指令可能不精确或不恰当,可能导致实行错误的移动[153]。与假造情况相比,由物理情况内的代理中LLMs生成的指令可能无法被负责实行这些下令的硬件装备很好地明白和实行。这种差异会严峻影响代理的工作服从。别的,考虑到物理情况中对错误的容忍度较低,代理在现实情况中不允许多次错误尝试。假如LLM生成的指令不能被硬件装备很好地明白,代理的不恰当动作可能会对情况造成真实的和不可逆转的损害。
4.2 Threats On Agent2Agent . Agent2Agent上的威胁

Although single-agent systems excel at solving specific tasks individually, multi-agent systems leverage the collaborative effort of several agents to achieve more complex objectives and exhibit superior problem-solving capabilities. Multi-agent interactions also add new attack surfaces to AI agents. In this subsection, we focus on exploring the security that agents interact with each other in a multi-agent manner. The security of interaction within a multi-agent system can be broadly categorized as follows: cooperative interaction threats and competitive interaction threat.
虽然单智能体系统擅长于单独解决特定的使命,多智能体系统使用多个智能体的协作积极来实现更复杂的目的,并体现出上级解决问题的本事。多代理交互还为人工智能代理添加了新的攻击面。在本小节中,我们将重点探究代理以多代理方式相互交互的安全性。多智能体系统中交互的安全性可以大抵分为:互助交互威胁和竞争交互威胁。
4.2.1 Cooperative Interaction Threats. 互助互动威胁。

A kind of multi-agent system depends on a cooperative framework [54, 82, 104, 117] where multiple agents work with the same objectives. This framework presents numerous potential benefits, including improved decision-making [191] and task completion efficiency [205]. However, there are multiple potential threats for this pattern. First, a recent study [113] finds that undetectable secret collusion between agents can easily be caused through their public communication. These secret collusion may bring back biased decisions. For instance, it is possible that we may soon observe advanced automated trading agents collaborating on a large scale to eliminate competitors, potentially destabilizing global markets. This secret collusion leads to a situation in which the ostensibly benign independent actions of each system cumulatively result in outcomes that exhibit systemic bias. Second, MetaGPT [58] found that frequent cooperation between agents can amplify minor hallucinations. To mitigate hallucinations, techniques such as cross-examination [133] or external supportive feedback [106] could improve the quality of agent output. Third, a single agent’s error or misleading information can quickly spread to others, leading to flawed decisions or behaviors across the system. Pan et al. [123] established Open-Domain Question Answering Systems (ODQA) with and without propagated misinformation. They found that this propagation of errors can dramatically reduce the performance of the whole system.
一种多智能体系统依靠于一个互助框架[54,82,104,117],此中多个智能体以相同的目的工作。该框架提供了许多潜在的好处,包括改善决策[191]和使命完成服从[205]。然而,这种模式存在多种潜在威胁。起首,近来的一项研究[113]发现,代理人之间不可察觉的秘密勾结很容易通过他们的公共通信引起。这些秘密的勾结可能会带回有偏见的决定。比方,我们可能很快就会看到先进的自动化交易代理人大规模互助,以消除竞争对手,这可能会粉碎全球市场的稳定。这种秘密勾结导致了如许一种情况,即每个系统外貌上良性的独立行动累积起来,导致了体现出系统性偏见的结果。 其次,MetaGPT [58]发现,代理之间的频仍互助可以放大稍微的幻觉。为了减轻幻觉,交叉询问[133]或外部支持性反馈[106]等技能可以提高代理输出的质量。第三,单个代理人的错误或误导性信息可以迅速传播给其他人,导致整个系统的错误决策或行为。Pan等人[123]建立了开放域问题查询系统(ODQA),有和没有传播错误信息。他们发现,这种错误的传播会大大降低整个系统的性能。
To counteract the negative effects of misinformation produced by agents, protective measures such as prompt engineering, misinformation detection, and major voting strategies are commonly employed. Similarly, Cohen et al. [23] introduce a worm called Morris II, the first designed to target cooperative multi-agent ecosystems by replicating malicious inputs to infect other agents. The danger of Morris II lies in its ability to exploit the connectivity between agents, potentially causing a rapid breakdown of multiple agents once one is infected, resulting in further problems such as spamming and exfiltration of personal data. We argue that although these mitigation measures are in place, they remain rudimentary and may lead to an exponential decrease in the efficiency of the entire agent system, highlighting a need for further exploration in this field.
为了抵消代理人产生的错误信息的负面影响,通常接纳保护措施,如及时工程,错误信息检测和主要投票策略。类似地,Cohen et al. [23]先容了一种名为Morris II的蠕虫,第一个旨在通过复制恶意输入来感染其他代理来针对互助的多代理生态系统。Morris II的伤害在于它可以或许使用代理之间的连接,一旦一个代理被感染,可能会导致多个代理迅速崩溃,从而导致进一步的问题,如垃圾邮件和个人数据泄漏。我们认为,虽然这些缓解措施已经到位,但它们仍旧是基本的,可能会导致整个代理系统的服从呈指数级下降,突出了在这一领域进一步探索的须要性。
The cooperative multi-agent also provides more benefits against security threats from their frameworks. First, cooperative frameworks have the potential to defend against jailbreak attacks.
互助的多代理也提供了更多的好处,从他们的框架的安全威胁。起首,互助框架有可能防御越狱攻击。
AutoDefense [203] demonstrates the efficacy of a multi-agent cooperative framework in thwarting jailbreak attacks, resulting in a significant decrease in attack success rates with a low false positive rate on safe content. Second, the cooperative pattern for planning and execution is favorable to improving software quality attributes, such as security and accountability [100]. For example, this pattern can be used to detect and control the execution of irreversible code, like "rm -rf ".
AutoDefense [203]证明了多代理互助框架在制止越狱攻击方面的有效性,导致攻击成功率显着降低,安全内容的误报率较低。第二,计划和实行的互助模式有利于提高软件质量属性,如安全性和责任性[100]。比方,该模式可用于检测和控制不可逆代码的实行,如“rm -rf“。
4.2.2 Competitive Interaction Threats. Another multi-agent system depends on competitive interactions, wherein each competitor embodies a distinct perspective to safeguard the advantages of their respective positions. Cultivating agents in a competitive environment benefits research in the social sciences and psychology. For example, restaurant agents competing with each other can attract more customers, allowing for an in-depth analysis of the behavioral relationships between owners and clients. Examples include game-simulated agent interactions [156, 213] and societal simulations [44].
4.2.2竞争性互动威胁。

另一个多主体系统依靠于竞争性互动,此中每个竞争者都体现了不同的观点,以维护各自地位的优势。在竞争情况中造就代理人有利于社会科学和心理学的研究。比方,相互竞争的餐厅代理商可以吸引更多的客户,从而可以深入分析业主和客户之间的行为关系。例子包括游戏模拟代理交互[156,213]和社会模拟[44]。
Although multi-agent systems engage in debates across multiple rounds to complete tasks, some intense competitive relationships may render the interactions of information flow between agents untrustworthy. The divergence of viewpoints among agents can lead to excessive conflicts, to the extent that agents may exhibit adversarial behaviors. To improve their own performance relative to their competitors, agents may engage in tactics such as the generation of adversarial inputs aimed at misleading other agents and degrading their performance [189]. For example, O’Gara [119] designed a game in which multiple agents, acting as players, search for a key within a locked room.
虽然多智能体系统参与多轮辩论来完成使命,但一些猛烈的竞争关系可能会使智能体之间的信息流交互变得不可信。代理人之间的观点分歧可能会导致过分的辩论,在某种程度上,代理人可能会体现出敌对行为。为了提高自己相对于竞争对手的体现,代理人可能会接纳一些策略,比方产生旨在误导其他代理人并降低其体现的对抗性输入[189]。比方,O’Gara [119]设计了一个游戏,在这个游戏中,多个代理人扮演玩家,在一个锁着的房间里探求一把钥匙。
To acquire limited resources, he found that some players utilized their strong persuasive skills to induce others to commit suicide. Such phenomena not only compromise the security of individual agents but could also lead to instability in the entire agent system, triggering a chain reaction.
为了获得有限的资源,他发现一些玩家使用他们强大的说服本事来诱使他人自尽。这种征象不仅危及个体代理的安全,而且还可能导致整个代理系统的不稳定,引发连锁反应。
Another potential threat involves the misuse and ethical issues concerning competitive multiagent systems, as the aforementioned example could potentially encourage such systems to learn how to deceive humans. Park et al. [126] provide a detailed analysis of the threats posed by agent systems, including fraud, election tampering, and loss of control over AI systems. One notable case study involves Meta’s development of the AI system Cicero for a game named Diplomacy. Meta aimed to train Cicero to be “largely honest and helpful to its speaking partners” [41]. Despite these intentions, Cicero became an expert at lying. It not only betrays other players but also engages in premeditated deception, planning in advance to forge a false alliance with a human player to trick them into leaving themselves vulnerable to an attack.
另一个潜在的威胁涉及竞争性多智能体系统的滥用和道德问题,因为上述例子可能会鼓励这些系统学习怎样欺哄人类。Park等人[126]对代理系统构成的威胁进行了详细分析,包括欺诈、选举篡改和对人工智能系统的失控。一个值得注意的案例研究涉及Meta为一款名为Diplomacy的游戏开发的人工智能系统Cicero。Meta的目的是练习西塞罗“在很大程度上是老实的,并有助于其发言的同伴”[41]。尽管有这些意图,西塞罗还是成为了撒谎的专家。它不仅背叛其他玩家,而且还进行有预谋的欺骗,提前计划与人类玩家建立虚假联盟,欺骗他们让自己容易受到攻击。
To mitigate the threats mentioned above, ensuring controlled competition among AI agents from a technological perspective presents a significant challenge. It is difficult to control the output of an agent’s “brain”, and even when constraints are incorporated during the planning process, it could significantly impact the agent’s effectiveness. Therefore, this issue remains an open research question, inviting more scholars to explore how to ensure that the competition between agents leads to a better user experience.
为了缓解上述威胁,从技能角度确保人工智能代理之间的受控竞争是一个庞大挑衅。很难控制代理的“大脑”的输出,即使在规划过程中参加了约束条件,也会显著影响代理的有效性。因此,这个问题仍旧是一个开放的研究问题,约请更多的学者来探究怎样确保代理之间的竞争导致更好的用户体验。
4.3 Threats On Memory 4.3内存威胁

Memory interaction within the AI agent system involves storing and retrieving information throughout the processing of agent usage. Memory plays a critical role in the operation of the AI agent, and it involves three essential phases: 1) the agent gathers information from the environment and stores it in its memory; 2) after storage, the agent processes this information to transform it into a more usable form; 3) the agent uses the processed information to inform and guide its next actions. That is, the memory interaction allows agents to record user preferences, glean insights from previous interactions, assimilate valuable information, and use this gained knowledge to improve the quality of service. However, these interactions can present security threats that need to be carefully managed. In this part, we divide these security threats in the memory interaction into two subgroups, short-term memory interaction threats and long-term memory interaction threats.
AI代理系统内的内存交互涉及在代理使用过程中存储和检索信息。影象在人工智能主体的操作中起着至关重要的作用,它涉及三个基本阶段:1)主体从情况中网络信息并将其存储在影象中; 2)存储后,主体处置处罚这些信息,将其转换为更可用的形式; 3)主体使用处置处罚后的信息来关照和指导其下一步行动。也就是说,影象交互允许代理记任命户偏好,从先前的交互中网络看法,吸收有代价的信息,并使用这些获得的知识来提高服务质量。但是,这些交互可能会带来安全威胁,必要谨慎管理。在这一部门中,我们将影象交互中的安全威胁分为两个亚组,即短时影象交互威胁和长时影象交互威胁。
4.3.1 Short-term Memory Interaction Threats. 短期影象交互威胁。

Short-term memory in the AI agent acts like human working memory, serving as a temporary storage system. It keeps information for a limited time, typically just for the duration of the current interaction or session. This type of memory is crucial for maintaining context throughout a conversation, ensuring smooth continuity in dialogue, and effectively managing user prompts. However, AI agents typically face a constraint in their working memory capacity, limited by the number of tokens they can handle in a single interaction [65, 103, 125]. This limitation restricts their ability to retain and use extensive context from previous interactions.
人工智能代理中的短期影象就像人类的工作影象一样,充当暂时存储系统。它在有限的时间内保留信息,通常仅在当前交互或会话的持续时间内。这种范例的影象对于在整个对话中保持上下文,确保对话的平稳连续性以及有效管理用户提示至关重要。然而,人工智能代理通常面临着工作影象容量的限定,受到他们在单次交互中可以处置处罚的令牌数量的限定[65,103,125]。这种限定限定了他们保留和使用以前交互的广泛上下文的本事。
Moreover, each interaction is treated as an isolated episode [60], lacking any linkage between sequential subtasks. This fragmented approach to memory prevents complex sequential reasoning and impairs knowledge sharing in multi-agent systems. Without robust episodic memory and continuity across interactions, agents struggle with complex sequential reasoning tasks, crucial for advanced problem-solving. Particularly in multi-agent systems, the absence of cooperative communication among agents can lead to suboptimal outcomes. Ideally, agents should be able to share immediate actions and learning experiences to efficiently achieve common goals [7].
别的,每个交互都被视为一个孤立的事故[60],缺乏顺序子使命之间的任何联系。这种碎片化的影象方法制止了复杂的顺序推理,并损害了多智能体系统中的知识共享。假如没有强大的景象影象和交互的连续性,智能体将难以完成复杂的顺序推理使命,这对高级问题解决至关重要。特别是在多智能体系统中,智能体之间缺乏互助通信会导致次优结果。理想情况下,代理应该可以或许分享即时行动和学习经验,以有效地实现共同目的[7]。
To address these challenges, concurrent solutions are divided into two categories, extending LLM context window [32] and compressing historical in-context contents [47, 65, 95, 118]. The former improves agent memory space by efficiently identifying and exploiting positional interpolation non-uniformities through the LLM fine-tuning step, progressively extending the context window from 256k to 2048k, and readjusting to preserve short context window capabilities. On the other hand, the latter continuously organizes the information in working memory by deploying models for summary.
为相识决这些挑衅,并发解决方案分为两类,扩展LLM上下文窗口[32]和压缩汗青上下文内容[47,65,95,118]。前者通过LLM微调步骤有效地辨认和使用位置插值非均匀性,逐步将上下文窗口从256k扩展到2048k,并重新调整以保留短上下文窗口本事,从而提高了代理存储器空间。LLM另一方面,后者通过部署模型进行汇总,不断组织工作影象中的信息。
Moreover, one crucial threat highlighted in multi-agent systems is the asynchronization of memory among agents [211]. This process is essential for establishing a unified knowledge base and ensuring consistency in decision-making across different agents. An asynchronous working memory record may cause a deviation in the goal resolution of multiple agents. However, preliminary solutions are already available. For instance, Chen et al. [19] underscore the importance of integrating synchronized memory modules for multi-robot collaboration. Communication among agents also plays a significant role, relying heavily on memory to maintain context and interpret messages. For example, Mandi et al. [104] demonstrate memory-driven communication frameworks that promote a common understanding among agents.
别的,多智能体系统中突出的一个关键威胁是智能体之间的影象的分散化[211]。这一过程对于建立同一的知识库和确保不同机构决策的一致性至关重要。异步工作影象记录可能会导致多个代理的目的分辨率的偏差。不过,目前已有开端的解决办法。比方,Chen等人[19]夸大了集成同步内存模块对多机器人协作的重要性。代理之间的通信也起着重要的作用,在很大程度上依靠于内存来维护上下文和解释消息。比方,Mandi et al. [104]展示影象驱动的通信框架,促进代理之间的共同明白。
4.3.2 Long-term Memory Interaction Threats. 长期影象交互威胁。

The storage and retrieval of long-term memory depend heavily on vector databases. Vector databases [122, 187] utilize embeddings for data storage and retrieval, offering a non-traditional alternative to scalar data in relational databases. They leverage similarity measures like cosine similarity and metadata filters to efficiently find the most relevant matches. The workflow of vector databases is composed of two main processes. First, the indexing process involves transforming data into embeddings, compressing these embeddings, and then clustering them for storage in vector databases. Second, during querying, data is transformed into embeddings, which are then compared with the stored embeddings to find the nearest neighbor matches. Notably, these databases often collaborate with RAG, introducing novel security threats.
长期影象的存储和检索在很大程度上依靠于向量数据库。向量数据库[122,187]使用嵌入进行数据存储和检索,为关系数据库中的标量数据提供了非传统的替代方案。它们使用余弦相似性和元数据过滤器等相似性度量来有效地找到最相干的匹配。矢量数据库的工作流程由两个主要过程构成。起首,索引过程涉及将数据转换为嵌入,压缩这些嵌入,然后将它们聚类以存储在向量数据库中。其次,在查询过程中,数据被转换为嵌入,然后与存储的嵌入进行比较,以找到近来的邻居匹配。值得注意的是,这些数据库常常与RAG互助,引入了新的安全威胁。
The first threat of long-term interaction is that the indexing process may inject some poisoning samples into the vector databases. It has been shown that placing one million pieces of data with only five poisoning samples can lead to a 90% attack success rate [215]. Cohen et al. [23] uses an adversarial self-replicating prompt as the worm to poison the database of a RAG-based application, extracting user private information from the AI agent ecosystem by query process.
长期相互作用的第一个威胁是,索引体例过程可能会将一些有毒样本注入矢量数据库。有研究表明,将100万条数据与5个中毒样本放在一起,可以导致90%的攻击成功率[215]。Cohen等人[23]使用对抗性自我复制提示作为蠕虫来毒害基于RAG的应用步伐的数据库,通过查询过程从AI代理生态系统中提取用户隐私信息。
The second threat is privacy issues. The use of RAG and vector databases has expanded the attack surface for privacy issues because private information stems not only from pre-trained and fine-tuning datasets but also from retrieval datasets. A study [202] carefully designed a structured prompt attack to extract sensitive information with a higher attack success rate from the vector database. Furthermore, given the potential for inversion techniques that can invert the embeddings back to words, as suggested by [152], there exists the possibility that private information stored in the long memory of AI agent Systems, which utilize vector databases, can be reconstructed and extracted by embedding inversion attacks[84, 111].
第二个威胁是隐私问题。RAG和向量数据库的使用扩大了隐私问题的攻击面,因为隐私信息不仅来自预先练习和微调的数据集,还来自检索数据集。一项研究[202]精心设计了一种结构化的提示攻击,以从向量数据库中提取具有较高攻击成功率的敏感信息。别的,考虑到可以将嵌入反转回单词的反转技能的潜力,如[152]所建议的,存在存储在AI代理系统的长内存中的私人信息的可能性,这些信息使用向量数据库,可以通过嵌入反转攻击来重修和提取[84,111]。
The third threat is the generation threat against hallucinations and misalignments. Although RAG has theoretically been proved to have a lesser generalization threat than a single LLM [74], it still fails in several ways. It is fragile for RAG to respond to time-series information queries. If the query pertains to the effective dates of various amendments within a regulation and RAG does not accurately determine these timelines, this could lead to erroneous results. Furthermore, generation threats may also arise from poor retrieval due to the lack of categorization of long-term memories [56]. For instance, a vector dataset that stores different semantic information about whether Earth is a globe or a flat could lead to contradictions between these pieces of information.
第三个威胁是对幻觉和失调的世代威胁。虽然RAG理论上被证明比单个LLM具有更小的泛化威胁[74],但它仍旧在几个方面失败。RAG对时间序列信息查询的相应是脆弱的。假如查询涉及法规中各种修订的见效日期,而RAG无法正确确定这些时间表,则可能导致错误的结果。别的,由于缺乏对长期影象的分类,检索不良也可能导致生成威胁[56]。比方,存储关于地球是地球仪还是平面的不同语义信息的向量数据集可能导致这些信息之间的矛盾。
5 Directions Of Future Research 未来研究的5个方向

AI agents in security have attracted considerable interest from the research community, having identified many potential threats in the real world and the corresponding defensive strategies. As shown in Figure 4, this survey outlines several potential directions for future research on AI agent security based on the defined taxonomy. Efficient & effective input inspection. Future efforts should enhance the automatic and real-time inspection levels of user input to address threats on perception. Maatphor assists defenders in conducting automated variant analyses of known prompt injection attacks [142]. It is limited by a success rate of only 60%. This suggests that while some progress has been made, there is still significant room for improvement in terms of reliability and accuracy. FuzzLLM [194] tends to ignore efficiency, reducing practicality in real-world applications. These components highlight the critical gaps in the current approaches and point toward necessary improvements. Future research needs to address these limitations by enhancing the accuracy and efficiency of inspection mechanisms, ensuring that they can be effectively deployed in real-world applications. Bias and fairness in AI agents. The existence of biased decision-making in Large Language Models (LLMs) is well-documented, affecting evaluation procedures and broader fairness implications [43].
安全领域的人工智能代理引起了研究界的极大爱好,它辨认了真实的天下中的许多潜在威胁以及相应的防御策略。如图4所示,这项调查概述了基于界说的分类法的AI代理安全性未来研究的几个潜在方向。高效的投入品查验。未来的积极应提高用户输入的自动和及时检查程度,以解决对感知的威胁。Maatphor资助防御者对已知的即时注入攻击进行自动化变体分析[142]。它的成功率仅为60%。这表明,虽然取得了一些进展,但在可靠性和正确性方面仍有很大的改进余地。FuzzLLM [194]倾向于忽略服从,降低了现实天下应用中的实用性。这些构成部门突出了目火线法中的关键差距,并指出了须要的改进。 未来的研究必要通过提高检测机制的正确性和服从来解决这些限定,确保它们可以有效地部署在现实天下的应用中。AI代理的偏见和公平性。大型语言模型(LLMs)中存在有偏见的决策是有据可查的,影响了评估步伐和更广泛的公平性影响[43]。
These systems, especially those involving AI agents, are less robust and more prone to detrimental behaviors, generating surreptitious outputs compared to LLM counterparts, thus raising serious safety concerns [161]. Studies indicate that AI agents tend to reinforce existing model biases, even when instructed to counterargue specific political viewpoints [38], impacting the integrity of their logical operations. Given the increasing complexity and involvement of these agents in various tasks, identifying and mitigating biases is a formidable challenge. Suresh and Guttag’s framework [157] addresses bias and fairness throughout the machine learning lifecycle but is limited in scope, while Gichoya et al. focus on bias in healthcare systems [48], highlighting the need for comprehensive approaches. Future directions should emphasize bias and fairness in AI agents, starting with identifying threats and ending with mitigation strategies.
这些系统,特别是那些涉及人工智能代理的系统,不太健壮,更容易产生有害行为,与LLM对应系统相比,会产生令人惊讶的输出,从而引发严峻的安全问题[161]。研究表明,人工智能代理倾向于加强现有的模型偏见,即使被指示反驳特定的政治观点[38],影响其逻辑操作的完备性。鉴于这些代理人在各种使掷中的复杂性和参与程度越来越高,辨认和减轻偏见是一项困难的挑衅。Suresh和Guttag的框架[157]解决了整个机器学习生命周期中的偏见和公平性问题,但范围有限,而Gichoya等人则专注于医疗系统中的偏见[48],夸大必要全面的方法。
By enforcing strict auditing protocols, we can enhance the transparency and accountability of AI systems. However, a significant challenge lies in achieving this efficiently without imposing excessive computational overhead, as exemplified by PrivacyAsst [207], which incurred 1100x extra computation cost compared to a standard AI agent while still failing to fully prevent identity disclosure. Therefore, the focus should be on developing lightweight and effective auditing mechanisms that ensure security and privacy without compromising performance.
通过实行严格的审计协议,我们可以提高人工智能系统的透明度和问责制。然而,一个庞大的挑衅在于有效地实现这一目的,而不会带来过多的计算开销,如PrivacyAsst [207]所示,与尺度AI代理相比,它产生了1100倍的额外计算本钱,同时仍旧无法完全防止身份泄漏。因此,重点应该放在开发轻量级和有效的审计机制,以确保安全性和隐私,而不影响性能。
Sound safety evaluation baselines in the AI agent. Trustworthy LLMs have already been defined in six critical trust dimensions, including stereotype, toxicity, privacy, fairness, ethics, and robustness, but there is still no unified consensus on the design standards for the safety benchmarks of the entire AI agent ecosystem. R-Judge [198] is a benchmark designed to assess the ability of large language models to judge and identify safety threats based on agent interaction records. The MLCommons group [166] proposes a principled approach to define and construct benchmarks, which is limited to a single use case: an adult conversing with a general-purpose assistant in English. ToolEmu [141] is designed to assess the threat of tool execution. These works provide evaluation results for only a part of the agent ecosystem. More evaluation questions remain open to be answered. Should we use similar evaluation tools to detect agent safety? What are the dimensions of critical trust for AI agents? How should we evaluate the agent as a whole?
AI代理中的合理安全评估基线。值得信任的LLMs已经在六个关键的信托维度上进行了界说,包括刻板印象,毒性,隐私,公平性,道德和鲁棒性,但对于整个AI代理生态系统的安全基准的设计尺度仍旧没有同一的共识。R-Judge [198]是一个基准测试,旨在评估大型语言模型基于代理交互记录判定和辨认安全威胁的本事。MLCommons小组[166]提出了一种原则性的方法来界说和构建基准,该方法仅限于单一用例:成年人与通用助理用英语交谈。ToolEmu [141]旨在评估工具实行的威胁。这些工作提供的评估结果,只有一部门的代理生态系统。还有更多的评价问题有待答复。我们是否应该使用类似的评估工具来检测代理安全性?AI代理的关键信托维度是什么? 我们应该怎样从整体上评价代理人?
Solid agent development & deployment policy. One promising area is the development and implementation of solid policies for agent development and deployment. As AI agent capabilities expand, so does the need for comprehensive guidelines that ensure these agents are used responsibly and ethically. This includes establishing policies for transparency, accountability, and privacy protection in AI agent deployment. Researchers should focus on creating frameworks that help developers adhere to these policies while also fostering innovation. Although TrustAgent [63] delves into the complex connections between safety and helpfulness, as well as the relationship between a model’s reasoning capabilities and its effectiveness as a safe agent, it did not markedly improve the development and deployment policies for agents. This highlights the necessity for strong strategies. Effective policies should address threats to Agent2Environments, ensuring a secure and ethical deployment of AI agents. Optimal interaction architectures. The design and implementation of interaction architectures for AI agents in the security aspect is a critical area of research aimed at improving robustness systems. This involves developing structured communication protocols to regulate interactions between agents, defining explicit rules for data exchange, and executing commands to minimize the threats of malicious interference. For example, CAMEL [82] utilizes inception prompting to steer chat agents towards completing tasks while ensuring alignment with human intentions.
可靠的代理开发和部署政策。一个很有希望的领域是制定和实行可靠的代理开发和部署政策。随着人工智能代理功能的扩展,必要制定全面的指导方针,确保以负责任和合乎道德的方式使用这些代理。这包括在AI代理部署中建立透明度、问责制和隐私保护政策。研究职员应该专注于创建框架,资助开发职员遵守这些政策,同时促进创新。尽管TrustAgent [63]深入研究了安全性和有效性之间的复杂联系,以及模型的推理本事和作为安全代理的有效性之间的关系,但它并没有显著改善代理的开发和部署策略。这突出了强有力的战略的须要性。有效的策略应解决对Agent2Environments的威胁,确保AI代理的安全和道德部署。 最佳交互架构。在安全方面的AI代理的交互架构的设计和实现是一个关键的研究领域,旨在提高系统的鲁棒性。这涉及开发结构化通信协议来规范代理之间的交互,界说数据交换的明确规则,以及实行下令以最大限度地镌汰恶意干扰的威胁。比方,CAMEL [82]使用初始提示来引导聊天代理完成使命,同时确保与人类意图保持一致。
However, CAMEL does not discuss how to establish clear behavioral constraints and permissions for each agent, dictating allowable actions, interactions, and circumstances, with dynamically adjustable permissions based on security context and agent performance. Additionally, Existing studies [54, 82, 104, 117] do not consider agent-agent dependencies, which can potentially lead to internal security mechanism chaos. For example, one agent might mistakenly transmit a user’s personal information to another agent, solely to enable the latter to complete a weather query.
然而,CAMEL没有讨论怎样为每个代理建立明确的行为约束和权限,规定允许的操作,交互和情况,并根据安全上下文和代理性能动态调整权限。别的,现有的研究[54,82,104,117]没有考虑代理-代理依靠关系,这可能会导致内部安全机制紊乱。比方,一个代理可能会错误地将用户的个人信息传输给另一个代理,仅仅是为了使后者可以或许完成天气查询。
Robust memory management Future directions in AI agent memory management reveal several critical findings that underscore the importance of secure and efficient practices. One major concern is the potential threats to Agent2Memory, highlighting the vulnerabilities that memory systems can face. AvalonBench [90] emerges as a crucial tool in tackling information asymmetry within multiagent systems, where unequal access to information can lead to inefficiencies and security risks.
强大的内存管理AI代理内存管理的未来方向揭示了几个关键的发现,夸大了安全和高效实践的重要性。一个主要的担忧是Agent2Memory的潜在威胁,突出了内存系统可能面临的毛病。AvalonBench [90]是解决多智能体系统中信息不对称问题的重要工具,在多智能体系统中,不平等的信息获取可能导致服从低下和安全风险。
Furthermore, PoisonedRAG [215] draws attention to the risks associated with memory retrieval, particularly the danger of reintroducing poisoned data, which can compromise the functionality and security of AI agents. Therefore, the central question is how to manage memory securely, necessitating the development of sophisticated benchmarks and retrieval mechanisms. These advancements aim to mitigate risks and ensure the integrity and security of memory in AI agents, ultimately enhancing the reliability and trustworthiness of AI systems in managing memory.
别的,PoisonedRAG [215]提请注意与影象检索相干的风险,特别是重新引入有毒数据的伤害,这可能会危及AI代理的功能和安全性。因此,核心问题是怎样安全地管理内存,这必要开发复杂的基准和检索机制。这些进步旨在降低风险,确保人工智能代理中内存的完备性和安全性,终极提高人工智能系统在管理内存方面的可靠性和可信度。
6 Conclusion 结论

In this survey, we provide a comprehensive review of LLM-based agents on their security threat, where we emphasize on four key knowledge gaps ranging across the whole lifecycle of agents. To show the agent’s security issues, we summarize 100+ papers, where all existing attack surfaces and defenses are carefully categorized and explained. We believe that this survey may provide essential references for newcomers to this field and also inspire the development of more advanced security threats and defenses on LLM-based agents.
在这项调查中,我们提供了一个全面的审查LLM底子的代理对他们的安全威胁,我们夸大在四个关键的知识差距,在整个生命周期的代理。为了展示代理的安全问题,我们总结了100多篇论文,此中对全部现有的攻击面和防御进行了细致的分类和解释。我们信赖,这项调查可能会提供须要的参考新人到这个领域,也引发了更先进的安全威胁和防御的LLM底子的代理的发展。
References

[1] Mahyar Abbasian, Iman Azimi, Amir M Rahmani, and Ramesh Jain. 2023. Conversational health agents: A personalized llm-powered agent framework. arXiv (2023).
[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 Technical Report. arXiv (2023).
[3] Divyansh Agarwal, Alexander R Fabbri, Philippe Laban, Shafiq Joty, Caiming Xiong, and Chien-Sheng Wu. 2024.
Investigating the prompt leakage effect and black-box defenses for multi-turn LLM interactions. arXiv (2024).
[4] Jacob Andreas. 2022. Language models as agent models. arXiv (2022).
[5] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv (2022).
[6] Fu Bang. 2023. GPTCache: An open-source semantic cache for LLM applications enabling faster answers and cost savings. In Workshop for Natural Language Processing Open Source Software. 212–218.
[7] Ying Bao, Wankun Gong, and Kaiwen Yang. 2023. A Literature Review of Human–AI Synergy in Decision Making: From the Perspective of Affordance Actualization Theory. Systems (2023).
[8] Rishabh Bhardwaj and Soujanya Poria. 2023. Language model unalignment: Parametric red-teaming to expose hidden harms and biases. arXiv (2023).
[9] Shikha Bordia and Samuel R. Bowman. 2019. Identifying and Reducing Gender Bias in Word-Level Language Models. In Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop.
[10] Rodney Brooks. 1986. A robust layered control system for a mobile robot. IEEE journal on robotics and automation (1986).
[11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. NeurIPS.
[12] Fredrik Carlsson, Joey Öhman, Fangyu Liu, Severine Verlinden, Joakim Nivre, and Magnus Sahlgren. 2022. Fine-grained controllable text generation using non-residual prompting. In Annual Meeting of the Association for Computational Linguistics.
[13] Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves Oudeyer. 2023.
Grounding large language models in interactive environments with online reinforcement learning. In ICML.
[14] Chun Fai Chan, Daniel Wankit Yip, and Aysan Esmradi. 2023. Detection and Defense Against Prominent Attacks on Preconditioned LLM-Integrated Virtual Assistants. In IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE). IEEE, 1–5.
[15] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2023. Jailbreaking black box large language models in twenty queries. arXiv (2023).
[16] Dake Chen, Hanbin Wang, Yunhao Huo, Yuzhao Li, and Haoyang Zhang. 2023. Gamegpt: Multi-agent collaborative framework for game development. arXiv (2023).
[17] Mengqi Chen, Bin Guo, Hao Wang, Haoyu Li, Qian Zhao, Jingqi Liu, Yasan Ding, Yan Pan, and Zhiwen Yu. 2024. The Future of Cognitive Strategy-enhanced Persuasive Dialogue Agents: New Perspectives and Trends. arXiv (2024).
[18] Tianlong Chen, Zhenyu Zhang, Yihua Zhang, Shiyu Chang, Sijia Liu, and Zhangyang Wang. 2022. Quarantine: Sparsity can uncover the trojan attack trigger for free. In CVPR.
[19] Yongchao Chen, Jacob Arkin, Yang Zhang, Nicholas Roy, and Chuchu Fan. 2023. Scalable multi-robot collaboration with large language models: Centralized or decentralized systems? arXiv (2023).
[20] Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, and Zhendong Mao. 2024. Benchmarking large language models on controllable generation under diversified instructions. arXiv (2024).
[21] Steffi Chern, Zhen Fan, and Andy Liu. 2024. Combating Adversarial Attacks with Multi-Agent Debate. arXiv (2024). [22] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. NeurIPS 30.
[23] Stav Cohen, Ron Bitton, and Ben Nassi. 2024. Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications. arXiv (2024).
[24] Maxwell Crouse, Ibrahim Abdelaziz, Kinjal Basu, Soham Dan, Sadhana Kumaravel, Achille Fokoue, Pavan Kapanipathi, and Luis Lastras. 2023. Formally specifying the high-level behavior of LLM-based agents. arXiv (2023).
[25] Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, et al. 2024. Risk taxonomy, mitigation, and assessment benchmarks of large language model systems.
arXiv (2024).
[26] Lavina Daryanani. 2023. How to jailbreak chatgpt. https://watcher.guru/news/how-to-jailbreak-chatgpt. (2023).
[27] Luigi De Angelis, Francesco Baglivo, Guglielmo Arzilli, Gaetano Pierpaolo Privitera, Paolo Ferragina, Alberto Eugenio Tozzi, and Caterina Rizzo. 2023. ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health. Frontiers in Public Health (2023).
[28] Jerry den Hartog, Nicola Zannone, et al. 2018. Security and privacy for innovative automotive applications: A survey.
Computer Communications (2018).
[29] Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu.
[30] Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in chatgpt: Analyzing persona-assigned language models. In Findings of EMNLP.
[31] V. Dibia. 2023. Generative AI: Practical Steps to Reduce Hallucination and Improve Performance of Systems Built with Large Language Models. In Designing with ML: How to Build Usable Machine Learning Applications. Self-published on designingwithml.com. (2023).
[32] Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang.
[33] Bao Gia Doan, Ehsan Abbasnejad, and Damith C Ranasinghe. 2020. Februus: Input purification defense against trojan attacks on deep neural network systems. In ACSAC. 897–912.
[34] Tian Dong, Guoxing Chen, Shaofeng Li, Minhui Xue, Rayne Holland, Yan Meng, Zhen Liu, and Haojin Zhu. 2023.
Unleashing cheapfakes through trojan plugins of large language models. arXiv (2023).
[35] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv (2023).
[36] Yuqing Du, Olivia Watkins, Zihan Wang, Cédric Colas, Trevor Darrell, Pieter Abbeel, Abhishek Gupta, and Jacob Andreas. 2023. Guiding pretraining in reinforcement learning with large language models. In ICML. PMLR.
[37] Nouha Dziri, Andrea Madotto, Osmar Zaïane, and Avishek Joey Bose. 2021. Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding. In EMNLP.
[38] Eva Eigner and Thorsten Händler. 2024. Determinants of LLM-assisted Decision-Making. arXiv (2024).
[39] Embrace The Red. 2023. ChatGPT plugins: Data exfiltration via images & cross plugin request forgery. https: //embracethered.com/blog/posts/2023/chatgpt-webpilot-data-exfil-via-markdown-injection/.
[40] Aysan Esmradi, Daniel Wankit Yip, and Chun Fai Chan. 2023. A comprehensive survey of attack techniques, implementation, and mitigation strategies in large language models. In International Conference on Ubiquitous Security.
[41] Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. 2022. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science (2022).
[42] Andrea Fioraldi, Alessandro Mantovani, Dominik Maier, and Davide Balzarotti. 2023. Dissecting American Fuzzy Lop: A FuzzBench Evaluation. ACM transactions on software engineering and methodology (2023).
[43] Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. 2023. Bias and fairness in large language models: A survey. arXiv (2023).
[44] Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. 2023.
S 3: Social-network Simulation System with Large Language Model-Empowered Agents. arXiv (2023).
[45] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Findings of EMNLP.
[46] Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. 2024. Coercing LLMs to do and reveal (almost) anything. arXiv (2024).
[47] Mingyang Geng, Shangwen Wang, Dezun Dong, Haotian Wang, Ge Li, Zhi Jin, Xiaoguang Mao, and Xiangke Liao.
[48] Judy Wawira Gichoya, Kaesha Thomas, Leo Anthony Celi, Nabile Safdar, Imon Banerjee, John D Banja, Laleh SeyyedKalantari, Hari Trivedi, and Saptarshi Purkayastha. 2023. AI pitfalls and what not to do: mitigating bias in AI. The British Journal of Radiology (2023).
[49] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In ACM Workshop on Artificial Intelligence and Security. 79–90.
[50] Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. 2024. Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast. arXiv (2024).
[51] Guardrails AI. 2024. Build AI powered applications with confidence. (2024). https://www.guardrailsai.com/ Accessed: 2024-02-27.
[52] Michael Guastalla, Yiyi Li, Arvin Hekmati, and Bhaskar Krishnamachari. 2023. Application of Large Language Models to DDoS Attack Detection. In International Conference on Security and Privacy in Cyber-Physical Systems and Smart Vehicles.
[53] Maanak Gupta, CharanKumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. 2023. From chatgpt to threatgpt: Impact of generative ai in cybersecurity and privacy. IEEE Access (2023).
[54] Rui Hao, Linmei Hu, Weijian Qi, Qingliu Wu, Yirui Zhang, and Liqiang Nie. 2023. Chatllm network: More brains, more intelligence. arXiv (2023).
[55] Peter Hase, Mona Diab, Asli Celikyilmaz, Xian Li, Zornitsa Kozareva, Veselin Stoyanov, Mohit Bansal, and Srinivasan Iyer. 2023. Methods for measuring, updating, and visualizing factual beliefs in language models. In Conference of the European Chapter of the Association for Computational Linguistics.
[56] Kostas Hatalis, Despina Christou, Joshua Myers, Steven Jones, Keith Lambert, Adam Amos-Binks, Zohreh Dannenhauer, and Dustin Dannenhauer. 2023. Memory Matters: The Need to Improve Long-Term Memory in LLM-Agents.
In Proceedings of the AAAI Symposium Series.
[57] Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. 2024. Defending Against Indirect Prompt Injection Attacks With Spotlighting. arXiv (2024).
[58] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber.
[59] Tamanna Hossain, Sunipa Dev, and Sameer Singh. 2023. MISGENDERED: Limits of Large Language Models in Understanding Pronouns. In Annual Meeting of the Association for Computational Linguistics.
[60] Yuki Hou, Haruki Tamoto, and Homei Miyashita. 2024. " My agent understands me better": Integrating Dynamic Human-like Memory Recall and Consolidation in LLM-Based Agents. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–7.
[61] https://www.theguardian.com/profile/hibaq farah. 2024. UK cybersecurity agency warns of chatbot ‘prompt injection’ attacks - theguardian.com. https://www.theguardian.com/technology/2023/aug/30/uk-cybersecurity-agency-warnsof-chatbot-prompt-injection-attacks.
[62] Bin Hu, Chenyang Zhao, Pu Zhang, Zihao Zhou, Yuanhang Yang, Zenglin Xu, and Bin Liu. 2023. Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach. (2023).
[63] Wenyue Hua, Xianjun Yang, Zelong Li, Cheng Wei, and Yongfeng Zhang. 2024. TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution. arXiv (2024).
[64] Xiaowei Huang, Wenjie Ruan, Wei Huang, Gaojie Jin, Yi Dong, Changshun Wu, Saddek Bensalem, Ronghui Mu, Yi Qi, Xingyu Zhao, et al. 2023. A survey of safety and trustworthiness of large language models through the lens of verification and validation. arXiv (2023).
[65] Xijie Huang, Li Lyna Zhang, Kwang-Ting Cheng, and Mao Yang. 2023. Boosting LLM Reasoning: Push the Limits of Few-shot Learning with Reinforced In-Context Pruning. arXiv (2023).
[66] Yue Huang, Qihui Zhang, Lichao Sun, et al. 2023. Trustgpt: A benchmark for trustworthy and responsible large language models. arXiv (2023).
[67] S Humeau, K Shuster, M Lachaux, and J Weston. 2020. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv. In ICLR.
[68] Daphne Ippolito, Florian Tramèr, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher A Choquette-Choo, and Nicholas Carlini. 2023. Preventing generation of verbatim memorization in language models gives a false sense of privacy. In International Natural Language Generation Conference.
[69] Gautier Izacard and Edouard Grave. 2021. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics.
[70] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. Comput. Surveys (2023).
[71] Zhenlan Ji, Daoyuan Wu, Pingchuan Ma, Zongjie Li, and Shuai Wang. 2024. Testing and Understanding Erroneous Planning in LLM Agents through Synthesized User Inputs. arXiv (2024).
[72] Fengqing Jiang, Zhangchen Xu, Luyao Niu, Boxin Wang, Jinyuan Jia, Bo Li, and Radha Poovendran. 2023. Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications. In ICLR.
[73] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. 2023. Large language models struggle to learn long-tail knowledge. In ICML.
[74] Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, and Bo Li. 2024. C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models. arXiv (2024).
[75] Changyeon Kim, Younggyo Seo, Hao Liu, Lisa Lee, Jinwoo Shin, Honglak Lee, and Kimin Lee. 2024. Guide Your Agent with Adaptive Multimodal Rewards. NeurIPS 36.
[76] Soheil Kolouri, Aniruddha Saha, Hamed Pirsiavash, and Heiko Hoffmann. 2020. Universal litmus patterns: Revealing backdoor attacks in cnns. In CVPR. 301–310.
[77] Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, and Hima Lakkaraju. 2024. Certifying llm safety against adversarial prompting. arXiv (2024).
[78] Keita Kurita, Paul Michel, and Graham Neubig. 2020. Weight poisoning attacks on pre-trained models. arXiv (2020).
[79] Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale N Fung, Mohammad Shoeybi, and Bryan Catanzaro. 2022.
Factuality enhanced language models for open-ended text generation. NeurIPS 35.
[80] Patrick Levi and Christoph P Neumann. 2024. Vocabulary Attack to Hijack Large Language Model Applications.
arXiv (2024).
[81] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS.
[82] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society. In NeurIPS.
[83] Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. 2023. Multi-step Jailbreaking Privacy Attacks on ChatGPT. In Findings of EMNLP.
[84] Haoran Li, Mingshi Xu, and Yangqiu Song. 2023. Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence. In Findings of ACL.
[85] Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. Halueval: A large-scale hallucination evaluation benchmark for large language models. In EMNLP.
[86] Jinfeng Li, Tianyu Du, Shouling Ji, Rong Zhang, Quan Lu, Min Yang, and Ting Wang. 2020. {TextShield}: Robust text classification based on multimodal embedding and neural machine translation. In USENIX Security.
[87] Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. 2024. Personal llm agents: Insights and survey about the capability, efficiency and security. arXiv (2024).
[88] Zelong Li, Wenyue Hua, Hao Wang, He Zhu, and Yongfeng Zhang. 2024. Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents. arXiv (2024).
[89] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. arXiv (2023).
[90] Jonathan Light, Min Cai, Sheng Shen, and Ziniu Hu. 2023. AvalonBench: Evaluating LLMs Playing the Game of Avalon. In NeurIPS Workshop.
[91] Baihan Lin, Djallel Bouneffouf, Guillermo Cecchi, and Kush R Varshney. 2023. Towards healthy AI: large language models need therapists too. arXiv (2023).
[92] Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, and Xiang Ren. 2024. Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks. NeurIPS 36.
[93] Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, and Qin Chen. 2023. Agentsims: An open-source sandbox for large language model evaluation. arXiv (2023).
[94] Aishan Liu, Tairan Huang, Xianglong Liu, Yitao Xu, Yuqing Ma, Xinyun Chen, Stephen J Maybank, and Dacheng Tao.
[95] Sheng Liu, Lei Xing, and James Zou. 2023. In-context vectors: Making in context learning more effective and controllable through latent space steering. arXiv (2023).
[96] Tong Liu, Zizhuang Deng, Guozhu Meng, Yuekang Li, and Kai Chen. 2023. Demystifying rce vulnerabilities in llm-integrated apps. arXiv (2023).
[97] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu.
[98] Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu.
[99] Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. 2023. Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models’ Alignment. arXiv (2023).
[100] Qinghua Lu, Liming Zhu, Xiwei Xu, Zhenchang Xing, Stefan Harrer, and Jon Whittle. 2023. Building the Future of Responsible AI: A Reference Architecture for Designing Large Language Model based Agents. arXiv (2023).
[101] Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi.
[102] Chris Van Pelt Lukas Biewald. 2017. Weights and Biases. https://wandb.ai/. (2017). [103] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. NeurIPS 36.
[104] Zhao Mandi, Shreeya Jain, and Shuran Song. 2023. Roco: Dialectic multi-robot collaboration with large language models. arXiv (2023).
[105] Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. 2024. The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey. arXiv (2024).
[106] Nikhil Mehta, Milagro Teruel, Xin Deng, Sergio Figueroa Sanz, Ahmed Awadallah, and Julia Kiseleva. 2024. Improving Grounded Language Understanding in a Collaborative Environment by Interacting with Agents Through Help Feedback. In Findings of EACL.
[107] Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. 2024. AIOS: LLM Agent Operating System. arXiv (2024).
[108] Vincent Micheli, Eloi Alonso, and François Fleuret. 2023. Transformers are sample-efficient world models. In ICLR.
[109] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In EMNLP.
[110] Lingbo Mo, Zeyi Liao, Boyuan Zheng, Yu Su, Chaowei Xiao, and Huan Sun. 2024. A Trembling House of Cards?
Mapping Adversarial Attacks against Language Agents. arXiv (2024).
[111] John Morris, Volodymyr Kuleshov, Vitaly Shmatikov, and Alexander Rush. 2023. Text Embeddings Reveal (Almost) As Much As Text. In EMNLP.
[112] Stephen Moskal, Sam Laney, Erik Hemberg, and Una-May O’Reilly. 2023. LLMs Killed the Script Kiddie: How Agents Supported by Large Language Models Change the Landscape of Network Threat Testing. arXiv (2023).
[113] Sumeet Ramesh Motwani, Mikhail Baranchuk, Lewis Hammond, and Christian Schroeder de Witt. 2023. A Perfect Collusion Benchmark: How can AI agents be prevented from colluding with information-theoretic undetectability?. In NeurIPS workshop.
[114] Hichem Mrabet, Sana Belguith, Adeeb Alhomoud, and Abderrazak Jemai. 2020. A survey of IoT security based on a layered architecture of sensing and data analysis. Sensors (2020).
[115] Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, and Yoav Shoham. 2024. Generating Benchmarks for Factuality Evaluation of Language Models. In Conference of the European Chapter of the Association for Computational Linguistics.
[116] Uttam Mukhopadhyay, Larry M Stephens, Michael N Huhns, and Ronald D Bonnell. 1986. An intelligent system for document retrieval in distributed office environments. Journal of the American Society for Information Science (1986).
[117] Varun Nair, Elliot Schumacher, Geoffrey Tso, and Anitha Kannan. 2023. DERA: enhancing large language model completions with dialog-enabled resolving agents. arXiv (2023).
[118] Tai Nguyen and Eric Wong. 2023. In-context example selection with influences. arXiv (2023).
[119] Aidan O’Gara. 2023. Hoodwinked: Deception and cooperation in a text-based game for language models. arXiv (2023).
[120] Nedjma Ousidhoum, Xinran Zhao, Tianqing Fang, Yangqiu Song, and Dit-Yan Yeung. 2021. Probing toxic content in large pre-trained language models. In International Joint Conference on Natural Language Processing.
[121] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.
NeurIPS.
[122] James Jie Pan, Jianguo Wang, and Guoliang Li. 2023. Survey of vector database management systems. arXiv (2023).
[123] Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Yang Wang. 2023. On the risk of misinformation pollution with large language models. arXiv (2023).
[124] Jing-Cheng Pang, Xin-Yu Yang, Si-Hang Yang, and Yang Yu. 2023. Natural Language-conditioned Reinforcement Learning with Inside-out Task Language Development and Translation. NeurIPS.
[125] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023.
Generative agents: Interactive simulacra of human behavior. In Annual ACM Symposium on User Interface Software and Technology. 1–22.
[126] Peter S Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks. 2023. AI deception: A survey of examples, risks, and potential solutions. arXiv (2023).
[127] Rodrigo Pedro, Daniel Castro, Paulo Carreira, and Nuno Santos. 2023. From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application? arXiv (2023).
[128] Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, et al. 2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv (2023).
[129] Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse, Landon Goldberg, Liane Lovitt, Martin Lucas, Michael Sellitto, Miranda Zhang, Neerav Kingsland, Nelson Elhage, Nicholas Joseph, Noemi Mercado, Nova DasSarma, Oliver Rausch, Robin Larson, Sam McCandlish, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Jack Clark, Samuel R. Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, and Jared Kaplan. 2023. Discovering Language Model Behaviors with Model-Written Evaluations. In Findings of ACL.
[130] Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. NeurIPS 2022.
[131] Steve Phelps and Rebecca Ranson. 2023. Of Models and Tin Men–a behavioural economics study of principal-agent problems in AI alignment using large-language models. arXiv (2023).
[132] Lukas Pöhler, Valentin Schrader, Alexander Ladwein, and Florian von Keller. 2024. A Technological Perspective on Misuse of Available AI. arXiv (2024).
[133] Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. 2023.
Communicative agents for software development. arXiv (2023).
[134] Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv (2021).
[135] Fathima Abdul Rahman and Guang Lu. 2023. A Contextualized Real-Time Multimodal Emotion Recognition for Conversational Agents using Graph Convolutional Networks in Reinforcement Learning. arXiv (2023).
[136] Leonardo Ranaldi and Giulia Pucci. 2023. When Large Language Models contradict humans? Large Language Models’ Sycophantic Behaviour. arXiv (2023).
[137] Zeeshan Rasheed, Muhammad Waseem, Kari Systä, and Pekka Abrahamsson. 2024. Large language model evaluation via multi ai agents: Preliminary results. arXiv (2024).
[138] Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. 2023. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. In EMNLP.
[139] Embrace The Red. 2023. Indirect Prompt Injection via YouTube Transcripts · Embrace The Red - embracethered.com.
https://embracethered.com/blog/posts/2023/chatgpt-plugin-youtube-indirect-prompt-injection/.
[140] Zohar Rimon, Tom Jurgenson, Orr Krupnik, Gilad Adler, and Aviv Tamar. 2024. MAMBA: an Effective World Model Approach for Meta-Reinforcement Learning. In ICLR.
[141] Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. 2024. Identifying the Risks of LM Agents with an LM-Emulated Sandbox. In ICLR.
[142] Ahmed Salem, Andrew Paverd, and Boris Köpf. 2023. Maatphor: Automated variant analysis for prompt injection attacks. arXiv (2023).
[143] Sergei Savvov. 2023. Fixing Hallucinations in LLMs https://betterprogramming.pub/fixing-hallucinations-in-llms9ff0fd438e33. (2023).
[144] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. Toolformer: Language models can teach themselves to use tools. NeurIPS.
[145] Leo Schwinn, David Dobre, Stephan Günnemann, and Gauthier Gidel. 2023. Adversarial attacks and defenses in large language models: Old and new threats. arXiv (2023).
[146] Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. 2023. On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. In Annual Meeting of the Association for Computational Linguistics.
[147] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. 2023. Towards understanding sycophancy in language models. arXiv (2023).
[148] Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael Abu-Ghazaleh. 2023. Survey of vulnerabilities in large language models revealed by adversarial attacks. arXiv (2023).
[149] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2024. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. NeurIPS.
[150] Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval Augmentation Reduces Hallucination in Conversation. In Findings of EMNLP.
[151] Emily H Soice, Rafael Rocha, Kimberlee Cordova, Michael Specter, and Kevin M Esvelt. 2023. Can large language models democratize access to dual-use biotechnology? arXiv (2023).
[152] Congzheng Song and Ananth Raghunathan. 2020. Information leakage in embedding models. In ACM SIGSAC conference on computer and communications security. 377–390.
[153] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. 2023. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In CVPR. 2998–3009.
[154] Shyam Sudhakaran, Miguel González-Duque, Matthias Freiberger, Claire Glanois, Elias Najarro, and Sebastian Risi.
[155] Weiwei Sun, Zhengliang Shi, Shen Gao, Pengjie Ren, Maarten de Rijke, and Zhaochun Ren. 2023. Contrastive learning reduces hallucination in conversations. In AAAI.
[156] Yuxiang Sun, Checheng Yu, Junjie Zhao, Wei Wang, and Xianzhong Zhou. 2023. Self Generated Wargame AI: Double Layer Agent Task Planning Based on Large Language Model. arXiv (2023).
[157] Harini Suresh and John V Guttag. 2019. A framework for understanding unintended consequences of machine learning. arXiv (2019).
[158] Gaurav Suri, Lily R Slater, Ali Ziaee, and Morgan Nguyen. 2024. Do large language models show decision heuristics similar to humans? A case study using GPT-3.5. Journal of Experimental Psychology: General (2024).
[159] Weihao Tan, Wentao Zhang, Shanqi Liu, Longtao Zheng, Xinrun Wang, and Bo An. 2024. True Knowledge Comes from Practice: Aligning Large Language Models with Embodied Environments via Reinforcement Learning. In ICLR.
[160] Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, et al. 2024. Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science.
arXiv (2024).
[161] Yu Tian, Xiao Yang, Jingyuan Zhang, Yinpeng Dong, and Hang Su. 2023. Evil geniuses: Delving into the safety of llm-based agents. arXiv (2023).
[162] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models.
arXiv (2023).
[163] Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell. 2024. Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game. In ICLR.
[164] Dennis Ulmer, Elman Mansimov, Kaixiang Lin, Justin Sun, Xibin Gao, and Yi Zhang. 2024. Bootstrapping llm-based task-oriented dialogue agents via self-talk. arXiv (2024).
[165] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. NeurIPS.
[166] Bertie Vidgen, Adarsh Agrawal, Ahmed M Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Borhane Blili-Hamelin, et al. 2024. Introducing v0. 5 of the AI Safety Benchmark from MLCommons. arXiv (2024).
[167] Celine Wald and Lukas Pfahler. 2023. Exposing bias in online communities through large-scale language models.
arXiv (2023).
[168] Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv (2024).
[169] Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning language models during instruction tuning. In ICML.
[170] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, and Bo Li. 2023. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In NeurIPS.
[171] Yuntao Wang, Yanghe Pan, Miao Yan, Zhou Su, and Tom H Luan. 2023. A survey on ChatGPT: AI-generated contents, challenges, and solutions. IEEE Open Journal of the Computer Society (2023).
[172] Yau-Shian Wang and Yingshan Chang. 2022. Toxicity detection with generative prompt-based inference. arXiv (2022).
[173] Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Shawn Ma, and Yitao Liang. 2024. Describe, explain, plan and select: interactive planning with LLMs enables open-world multi-task agents. NeurIPS 36.
[174] Connor Weeks, Aravind Cheruvu, Sifat Muhammad Abdullah, Shravya Kanchi, Daphne Yao, and Bimal Viswanath.
[175] Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V Le. 2023. Simple synthetic data reduces sycophancy in large language models. arXiv (2023).
[176] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022.
Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 35, 24824–24837.
[177] Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, et al. 2024. Long-form factuality in large language models. arXiv (2024).
[178] Zeming Wei, Yifei Wang, and Yisen Wang. 2023. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv (2023).
[179] Roy Weiss, Daniel Ayzenshteyn, Guy Amit, and Yisroel Mirsky. 2024. What Was Your Prompt? A Remote Keylogging Attack on AI Assistants. arXiv (2024).
[180] Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. 2021. Challenges in Detoxifying Language Models. In Findings of EMNLP.
[181] David Windridge, Henrik Svensson, and Serge Thill. 2021. On the utility of dreaming: A general model for how learning in artificial agents can benefit from data hallucination. Adaptive Behavior (2021).
[182] Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. 2023. Fundamental limitations of alignment in large language models. arXiv (2023).
[183] Michael Wooldridge and Nicholas R Jennings. 1995. Intelligent agents: Theory and practice. The knowledge engineering review (1995).
[184] Fangzhou Wu, Shutong Wu, Yulong Cao, and Chaowei Xiao. 2024. WIPI: A New Web Threat for LLM-Driven Web Agents. arXiv (2024).
[185] Fangzhou Wu, Ning Zhang, Somesh Jha, Patrick McDaniel, and Chaowei Xiao. 2024. A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems. arXiv (2024).
[186] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. 2023. The rise and potential of large language model based agents: A survey. arXiv (2023).
[187] Xingrui Xie, Han Liu, Wenzhe Hou, and Hongbin Huang. 2023. A Brief Survey of Vector Databases. In International Conference on Big Data and Information Analytics (BigDIA). IEEE.
[188] Frank Xing. 2024. Designing Heterogeneous LLM Agents for Financial Sentiment Analysis. arXiv (2024).
[189] Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, and Yang Liu. 2023. Exploring large language models for communication games: An empirical study on werewolf. arXiv (2023).
[190] Zelai Xu, Chao Yu, Fei Fang, Yu Wang, and Yi Wu. 2023. Language agents with reinforcement learning for strategic play in the werewolf game. arXiv (2023).
[191] Xue Yan, Yan Song, Xinyu Cui, Filippos Christianos, Haifeng Zhang, David Henry Mguni, and Jun Wang. 2023. Ask more, know better: Reinforce-Learned Prompt Questions for Decision Making with Large Language Models. arXiv (2023).
[192] Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. 2024. Watch Out for Your Agents!
Investigating Backdoor Threats to LLM-Based Agents. arXiv (2024).
[193] Yong Yang, Xuhong Zhang, Yi Jiang, Xi Chen, Haoyu Wang, Shouling Ji, and Zonghui Wang. 2024. PRSA: Prompt Reverse Stealing Attacks against Large Language Models. arXiv (2024).
[194] Dongyu Yao, Jianshu Zhang, Ian G Harris, and Marcel Carlsson. 2024. Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models. In ICASSP.
[195] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In ICLR.
[196] Jingwei Yi, Yueqi Xie, Bin Zhu, Keegan Hines, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. 2023.
Benchmarking and defending against indirect prompt injection attacks on large language models. arXiv (2023).
[197] Jiahao Yu, Xingwei Lin, and Xinyu Xing. 2023. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv (2023).
[198] Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, et al. 2024. R-Judge: Benchmarking Safety Risk Awareness for LLM Agents. In ICLR.
[199] Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. 2024. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. In ICLR. https://openreview.net/forum?id=MbfAK4s61A [200] Xiang Yue, Boshi Wang, Ziru Chen, Kai Zhang, Yu Su, and Huan Sun. 2023. Automatic Evaluation of Attribution by Large Language Models. In EMNLP.
[201] Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, et al. 2018. Accelerating the machine learning lifecycle with MLflow.
IEEE Data Eng. Bull. (2018).
[202] Shenglai Zeng, Jiankun Zhang, Pengfei He, Yue Xing, Yiding Liu, Han Xu, Jie Ren, Shuaiqiang Wang, Dawei Yin, Yi Chang, et al. 2024. The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG).
arXiv (2024).
[203] Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. 2024. AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks. arXiv (2024).
[204] Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. arXiv (2024).
[205] Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua Tenenbaum, Tianmin Shu, and Chuang Gan. 2023. Building Cooperative Embodied Agents Modularly with Large Language Models. In NeurIPS Workshop.
AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways 35 [206] Wanpeng Zhang and Zongqing Lu. 2023. Rladapter: Bridging large language models to reinforcement learning in open worlds. arXiv (2023).
[207] Xinyu Zhang, Huiyu Xu, Zhongjie Ba, Zhibo Wang, Yuan Hong, Jian Liu, Zhan Qin, and Kui Ren. 2024. Privacyasst: Safeguarding user privacy in tool-using large language model agents. IEEE Transactions on Dependable and Secure Computing (2024).
[208] Yiming Zhang, Nicholas Carlini, and Daphne Ippolito. 2024. Effective Prompt Extraction from Language Models.
arXiv (2024).
[209] Yuqi Zhang, Liang Ding, Lefei Zhang, and Dacheng Tao. 2024. Intention analysis prompting makes large language models a good jailbreak defender. arXiv (2024).
[210] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv (2023).
[211] Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen.
[212] Zhexin Zhang, Junxiao Yang, Pei Ke, and Minlie Huang. 2023. Defending large language models against jailbreaking attacks through goal prioritization. arXiv (2023).
[213] Qinlin Zhao, Jindong Wang, Yixuan Zhang, Yiqiao Jin, Kaijie Zhu, Hao Chen, and Xing Xie. 2023. Competeai: Understanding the competition behaviors in large language model-based agents. arXiv (2023).
[214] Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao.
[215] Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. 2024. PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models. arXiv (2024).
Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009

免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作!更多信息从访问主页:qidao123.com:ToB企服之家,中国第一个企服评测及商务社交产业平台。




欢迎光临 ToB企服应用市场:ToB评测及商务社交产业平台 (https://dis.qidao123.com/) Powered by Discuz! X3.4