Protecting sensitive and PII information in RAG with Elasticsearch and LlamaIndex ...

不到断气不罢休 · 2024-8-20 04:45:31

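Author: Srikanth Manvi, Elastic

In this post, we look at ways to protect personally identifiable information (PII) and sensitive data when using public LLMs in a RAG (Retrieval-Augmented Generation) flow. We explore masking PII and sensitive data with open-source libraries and regular expressions, as well as using local LLMs to mask data before calling a public LLM.
Before we begin, let's review some of the terminology used in this post.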

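Terminology

LlamaIndex is a leading data framework for building LLM (Large Language Model) applications. It provides abstractions for the various stages of building a RAG (Retrieval-Augmented Generation) application. Frameworks such as LlamaIndex and LangChain provide abstractions so that applications are not tightly coupled to the API of any specific LLM.
Elasticsearch is offered by Elastic, the industry leader behind Elasticsearch. It is a search and analytics engine that supports full-text search for precision, vector search for semantic understanding, and hybrid search for the best of both worlds. Elasticsearch is a distributed, RESTful search and analytics engine, a scalable data store, and a vector database. The Elasticsearch capabilities used in this blog are available in the free and open version of Elasticsearch.
Retrieval-Augmented Generation (RAG) is an AI technique/pattern in which LLMs are provided with external knowledge to generate responses to user queries. This allows LLM responses to be tailored to a specific context rather than being generic.
Embeddings are numerical representations of the meaning of text or media. They are lower-dimensional representations of high-dimensional information.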
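To make the idea of embeddings a little more concrete, here is a minimal sketch of how two embedding vectors can be compared. The three-dimensional vectors are hand-made values for illustration only (an assumption of this sketch, not output of any model); a real embedding model such as BAAI/bge-small-en-v1.5, used later in this post, produces vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; texts with similar meaning get similar vectors.
water_claim = [0.9, 0.1, 0.0]
burst_pipe = [0.8, 0.2, 0.1]
stolen_items = [0.0, 0.2, 0.9]

similar = cosine_similarity(water_claim, burst_pipe)
dissimilar = cosine_similarity(water_claim, stolen_items)
print(f"water vs pipe: {similar:.2f}, water vs theft: {dissimilar:.2f}")
```

Vector search ranks stored chunks by a similarity measure of this kind against the embedding of the user query.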

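RAG and data protection

Generally, Large Language Models (LLMs) are good at generating responses based on the information available in the model, which may have been trained on internet data. However, for queries about information that is not in the model, LLMs need to be provided with external knowledge or specific details that the model does not contain. Such information may reside in your database or internal knowledge systems. Retrieval-Augmented Generation (RAG) is a technique in which, for a given user query, you first retrieve relevant context/information from systems external to the LLM (for example, your database) and then send that context along with the user query to the LLM to generate a more specific and relevant response.
This makes the RAG technique very effective for question answering, content creation, and anywhere a deep understanding of context and detail is needed.
As a result, in a RAG pipeline you run the risk of exposing internal information such as PII (personally identifiable information) and sensitive information (for example names, dates of birth, account numbers) to a public LLM.
While your data is secure when using a vector database such as Elasticsearch (through mechanisms such as role-based access control and document-level security), care must be taken when sending data outside to public LLMs.
Protecting personally identifiable information (PII) and sensitive data is crucial when using large language models (LLMs) for several reasons: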

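Privacy compliance: Many regions have strict regulations, such as the General Data Protection Regulation (GDPR) in Europe or the California Consumer Privacy Act (CCPA) in the United States, that mandate the protection of personal data. Compliance with these laws is necessary to avoid legal consequences and fines.
User trust: Ensuring the confidentiality and integrity of sensitive information builds user trust. Users are more likely to use and interact with systems they believe protect their privacy.
Data security: Protecting against data breaches is essential. Sensitive data exposed to an LLM without adequate safeguards is susceptible to theft or misuse, leading to potential harms such as identity theft or financial fraud.
Ethical considerations: Ethically, it is important to respect users' privacy and handle their data responsibly. Mishandling PII can lead to discrimination, stigmatization, or other negative societal impacts.
Business reputation: Companies that fail to protect sensitive data can suffer reputational damage, which can have long-term negative effects on their business, including loss of customers and revenue.
Mitigating misuse risk: Handling sensitive data securely helps prevent malicious use of the data or the model, such as training models on biased data or using data to manipulate or harm individuals.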

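Overall, robust protection of PII and sensitive data is essential to ensure legal compliance, maintain user trust, ensure data security, uphold ethical standards, protect business reputation, and reduce the risk of misuse.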

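Quick recap

In the previous post, we discussed how to implement a Q&A experience using the RAG technique, with Elasticsearch as the vector database together with LlamaIndex and a locally running Mistral LLM. Here we build on that.
Reading the previous post is optional, as we will now quickly review what we did there.
We have a sample dataset of call center conversations between agents and customers of a fictional home insurance company. We built a simple RAG application that answers questions such as "What kind of water related issues are customers filing claims for?".
At a high level, the flow looks like this.

In the indexing stage, we load and index documents using a LlamaIndex pipeline. The documents are chunked and stored, along with their embeddings, in the Elasticsearch vector database.
In the query stage, when the user asks a question, LlamaIndex retrieves the top K documents most similar to the query. These top K documents are sent along with the query to the locally running Mistral LLM, which then generates the response to be sent back to the user. Feel free to read through the previous post or explore the code.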
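The query stage described above can be sketched without any external service. The following is a simplified illustration, not the LlamaIndex implementation: the word-overlap score is a stand-in for vector similarity, and the function names and sample chunks are hypothetical.

```python
import heapq


def overlap_score(query, chunk):
    """Toy relevance score: number of shared lowercase words (stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def retrieve_top_k(query, chunks, k):
    """Return the k chunks with the highest toy score."""
    return heapq.nlargest(k, chunks, key=lambda c: overlap_score(query, c))


def build_prompt(query, context_chunks):
    """Combine the retrieved context and the user question into one prompt string."""
    context = "\n---\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


chunks = [
    "Customer: a burst pipe caused water damage in my basement.",
    "Customer: there was a break-in and valuable items are missing.",
    "Customer: hail damaged my roof last week.",
]
query = "water damage claims"
top = retrieve_top_k(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Whatever ends up in the prompt string is what the LLM sees, which is why the contents of the retrieved chunks matter once the LLM is no longer local.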
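In the previous post, we ran the LLM locally. In production, however, you may want to use an external LLM provided by companies such as OpenAI, Mistral, or Anthropic. This could be because your use case needs a larger foundation model, or because running locally is not an option due to enterprise production needs such as scalability, availability, and performance.
Introducing an external LLM into your RAG pipeline exposes you to the risk of inadvertently leaking sensitive information and PII to the LLM. In this post, we explore options for masking PII as part of the RAG pipeline before documents are sent to the external LLM.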

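RAG with a public LLM

Before discussing how to protect your PII and sensitive information in a RAG pipeline, we will first build a simple RAG application using LlamaIndex, the Elasticsearch vector database, and the OpenAI LLM.

Prerequisites

We need the following.

Elasticsearch up and running as a vector database to store the embeddings. Follow the instructions in the article on installing Elasticsearch.
An OpenAI API key.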

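Simple RAG application

For reference, the complete code is available in this GitHub repository (branch: protecting-pii). Cloning the repository is optional, as we walk through the code below.
In your favorite IDE, create a new Python application with the following 3 files.

index.py, which holds the code related to indexing data.
query.py, which holds the code related to querying and LLM interaction.
.env, which holds configuration properties such as API keys.

We need to install a few packages. We begin by creating a new Python virtual environment in the root folder of the application.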

python3 -m venv .venv


Activate the virtual environment and install the required packages below.

source .venv/bin/activate
pip install llama-index
pip install llama-index-embeddings-openai
pip install llama-index-embeddings-huggingface
pip install llama-index-vector-stores-elasticsearch
pip install sentence-transformers
pip install python-dotenv
pip install openai


Configure the OpenAI and Elasticsearch connection properties in the .env file.

OPENAI_API_KEY="REPLACEME"
ELASTIC_CLOUD_ID="REPLACEME"
ELASTIC_API_KEY="REPLACEME"

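Because a forgotten placeholder in .env typically only surfaces later as an authentication error, it can help to check the settings up front. The following is a minimal sketch; the helper name missing_settings is hypothetical, and in the real application you would call load_dotenv('.env') first and pass os.environ.

```python
REQUIRED_SETTINGS = ("OPENAI_API_KEY", "ELASTIC_CLOUD_ID", "ELASTIC_API_KEY")


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the names of settings that are absent, empty, or still the REPLACEME placeholder."""
    return [name for name in required
            if env.get(name, "").strip() in ("", "REPLACEME")]


# Example with a plain dict standing in for os.environ.
example_env = {"OPENAI_API_KEY": "sk-example", "ELASTIC_CLOUD_ID": "REPLACEME"}
problems = missing_settings(example_env)
if problems:
    print("Missing or placeholder settings:", ", ".join(problems))
```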

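Indexing data

Download the conversations.json file, which contains conversations between customers and call center agents of our fictional home insurance company. Place the file in the root directory of the application, alongside the 2 Python files and the .env file created earlier. Below is a sample of the file's contents.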

{
"conversation_id": 103,
"customer_name": "Sophia Jones",
"agent_name": "Emily Wilson",
"policy_number": "JKL0123",
"conversation": "Customer: Hi, I'm Sophia Jones. My Date of Birth is November 15th, 1985, Address is 303 Cedar St, Miami, FL 33101, and my Policy Number is JKL0123.\nAgent: Hello, Sophia. How may I assist you today?\nCustomer: Hello, Emily. I have a question about my policy.\nCustomer: There's been a break-in at my home, and some valuable items are missing. Are they covered?\nAgent: Let me check your policy for coverage related to theft.\nAgent: Yes, theft of personal belongings is covered under your policy.\nCustomer: That's a relief. I'll need to file a claim for the stolen items.\nAgent: We'll assist you with the claim process, Sophia. Is there anything else I can help you with?\nCustomer: No, that's all for now. Thank you for your assistance, Emily.\nAgent: You're welcome, Sophia. Please feel free to reach out if you have any further questions or concerns.\nCustomer: I will. Have a great day!\nAgent: You too, Sophia. Take care.",
"summary": "A customer inquires about coverage for stolen items after a break-in at home, and the agent confirms that theft of personal belongings is covered under the policy. The agent offers assistance with the claim process, resulting in the customer expressing relief and gratitude."
}


Paste the following code into index.py, which takes care of indexing the data.
index.py

# index.py
# pip install sentence-transformers
# pip install llama-index-embeddings-openai
# pip install llama-index-embeddings-huggingface
import json
import os

from dotenv import load_dotenv
from llama_index.core import Document
from llama_index.core import Settings
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.elasticsearch import ElasticsearchStore


def get_documents_from_file(file):
    """Reads a json file and returns list of Documents"""
    with open(file=file, mode='rt') as f:
        conversations_dict = json.loads(f.read())
    # Build Document objects using fields of interest.
    documents = [Document(text=item['conversation'],
                          metadata={"conversation_id": item['conversation_id']})
                 for item in conversations_dict]
    return documents


# Load .env file contents into env
load_dotenv('.env')
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

# ElasticsearchStore is a VectorStore that
# takes care of Elasticsearch Index and Data management.
# It is defined at module level so that the query scripts
# can do `from index import es_vector_store`.
es_vector_store = ElasticsearchStore(index_name="convo_index",
                                     vector_field='conversation_vector',
                                     text_field='conversation',
                                     es_cloud_id=os.getenv("ELASTIC_CLOUD_ID"),
                                     es_api_key=os.getenv("ELASTIC_API_KEY"))


def main():
    # LlamaIndex Pipeline configured to take care of chunking, embedding
    # and storing the embeddings in the vector store.
    llamaindex_pipeline = IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=350, chunk_overlap=50),
            Settings.embed_model
        ],
        vector_store=es_vector_store
    )
    # Load data from a json file into a list of LlamaIndex Documents
    documents = get_documents_from_file(file="conversations.json")
    llamaindex_pipeline.run(documents=documents)
    print(".....Indexing Data Completed.....\n")


if __name__ == "__main__":
    main()


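Run the code above and see that the index is created in Elasticsearch, with the embeddings stored in an Elasticsearch index named convo_index.
If you need an explanation of the LlamaIndex IngestionPipeline, refer to the "Create IngestionPipeline" section of the previous post.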
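It is worth noticing which parts of each record end up in the index. The dependency-free sketch below mirrors the field selection made in index.py (the conversation text plus conversation_id as metadata); the helper name and the shortened sample record are illustrative assumptions.

```python
def to_index_payloads(records):
    """Mirror of the field selection in index.py: conversation text plus conversation_id metadata."""
    return [{"text": r["conversation"],
             "metadata": {"conversation_id": r["conversation_id"]}}
            for r in records]


# Shortened version of the sample record shown earlier.
records = [{
    "conversation_id": 103,
    "customer_name": "Sophia Jones",
    "policy_number": "JKL0123",
    "conversation": "Customer: Hi, I'm Sophia Jones. My Policy Number is JKL0123.\n"
                    "Agent: Hello, Sophia.",
}]

payloads = to_index_payloads(records)
structured_fields_dropped = "customer_name" not in payloads[0]["metadata"]
pii_still_in_text = ("Sophia Jones" in payloads[0]["text"]
                     and "JKL0123" in payloads[0]["text"])
print(structured_fields_dropped, pii_still_in_text)
```

Even though the structured customer_name and policy_number fields are not indexed, the same values appear inside the free-text conversation, so they are stored and later retrieved as part of the chunks.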

Querying

In the previous post, we used a local LLM for querying.
In this post, we use a public LLM, OpenAI, as shown below.
query.py

# query.py
from llama_index.core import VectorStoreIndex, QueryBundle, Settings
from llama_index.llms.openai import OpenAI
from index import es_vector_store
# Public LLM where we send user query and Related Documents
llm = OpenAI()
index = VectorStoreIndex.from_vector_store(es_vector_store)
# This query_engine, for a given user query retrieves top 10 similar documents from
# Elasticsearch vector database and sends the documents along with the user query to the LLM.
# Note that documents are sent as-is. So any PII/Sensitive data is sent to the LLM.
query_engine = index.as_query_engine(llm, similarity_top_k=10)
query="Give me summary of water related claims that customers raised."
bundle = QueryBundle(query, embedding=Settings.embed_model.get_query_embedding(query))
result = query_engine.query(bundle)
print(result)


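The code above prints the response from OpenAI, shown below.
Customers have raised various water-related claims, including issues such as water damage in basements, burst pipes, hail damage to roofs, and claim denials for reasons such as lack of timely notification, maintenance issues, gradual wear and tear, and pre-existing damage. In each case, customers expressed frustration with the denial of their claims and sought fair evaluations and decisions regarding their claims.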

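Masking PII in RAG

What we have covered so far involves sending documents as-is to OpenAI along with the user query.
In a RAG pipeline, after the relevant context is retrieved from the vector store, we have an opportunity to mask PII and sensitive information before sending the query and context to the LLM.
There are several ways to mask PII before sending it to an external LLM, each with its own advantages. Let's look at some of the options below.

Use NLP libraries such as spaCy (spacy.io) or Presidio (an open-source library maintained by Microsoft).
Use LlamaIndex's out-of-the-box NERPIINodePostprocessor.
Use local LLMs via PIINodePostprocessor.

Once you implement the masking logic using any of the above approaches, you can configure the LlamaIndex query engine with a PostProcessor (either your own custom PostProcessor or any of the out-of-the-box PostProcessors from LlamaIndex) through its node_postprocessors argument.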
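Conceptually, a postprocessor is a hook that sits between retrieval and the LLM call. The sketch below shows that shape with plain functions; it is a simplified illustration under stated assumptions, not the LlamaIndex API, and every name in it (answer, fake_retrieve, mask_policy_numbers, fake_generate) is hypothetical.

```python
from typing import Callable, List

Postprocessor = Callable[[List[str]], List[str]]


def answer(query: str,
           retrieve: Callable[[str], List[str]],
           postprocessors: List[Postprocessor],
           generate: Callable[[str, List[str]], str]) -> str:
    """Retrieve context, run every postprocessor in order, then call the LLM."""
    chunks = retrieve(query)
    for postprocess in postprocessors:
        chunks = postprocess(chunks)
    return generate(query, chunks)


# Stand-ins so the flow can be exercised without any external service.
def fake_retrieve(query: str) -> List[str]:
    return ["Customer: My Policy Number is TUV8901 and my roof is leaking."]


def mask_policy_numbers(chunks: List[str]) -> List[str]:
    return [c.replace("TUV8901", "[POLICY MASKED]") for c in chunks]


def fake_generate(query: str, chunks: List[str]) -> str:
    # Echoes what the "LLM" received, so the masking can be inspected.
    return " | ".join(chunks)


received = answer("roof claims", fake_retrieve, [mask_policy_numbers], fake_generate)
print(received)
```

The point of the design is that the masking step only touches the copy of the text that leaves your environment; the data stored in Elasticsearch stays unchanged.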

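Using an NLP library

As part of the RAG pipeline, we can mask sensitive data using NLP libraries. We will use the spaCy package (spacy.io) in this demo.
Create a new file query_masking_nlp.py and add the following code.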
query_masking_nlp.py

# query_masking_nlp.py
# pip install spacy
# python3 -m spacy download en_core_web_sm
import re
from typing import List, Optional

import spacy
from llama_index.core import VectorStoreIndex, QueryBundle, Settings
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI

from index import es_vector_store

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Compile regex patterns for performance
phone_pattern = re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b')
email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
date_pattern = re.compile(r'\b(\d{1,2}[-/]\d{1,2}[-/]\d{2,4}|\d{2,4}[-/]\d{1,2}[-/]\d{1,2})\b')
dob_pattern = re.compile(
    r"(January|February|March|April|May|June|July|August|September|October|November|December)\s(\d{1,2})(st|nd|rd|th),\s(\d{4})")
address_pattern = re.compile(r'\d+\s+[\w\s]+\,\s+[A-Za-z]+\,\s+[A-Z]{2}\s+\d{5}(-\d{4})?')
zip_code_pattern = re.compile(r'\b\d{5}(?:-\d{4})?\b')
# 3 characters followed by 4 digits, in our case e.g XYZ9876.
# Word boundaries replace the original trailing "\.$", which only matched
# a policy number followed by a period at the very end of the text.
policy_number_pattern = re.compile(r"\b[A-Z]{3}\d{4}\b")

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")


# match = re.match(policy_number_pattern, "XYZ9876")
# print(match)

def mask_pii(text):
    """
    Masks Personally Identifiable Information (PII) in the given
    text using pre-defined regex patterns and spaCy's named entity recognition.
    Args:
        text (str): The input text containing potential PII.
    Returns:
        str: The text with PII masked.
    """
    # Process the text with spaCy for NER
    doc = nlp(text)
    # Mask entities identified by spaCy NER (e.g First/Last Names etc)
    for ent in doc.ents:
        if ent.label_ in ["PERSON", "ORG", "GPE"]:
            text = text.replace(ent.text, '[MASKED]')
    # Apply regex patterns after NER to avoid overlapping issues
    text = phone_pattern.sub('[PHONE MASKED]', text)
    text = email_pattern.sub('[EMAIL MASKED]', text)
    text = date_pattern.sub('[DATE MASKED]', text)
    text = address_pattern.sub('[ADDRESS MASKED]', text)
    text = dob_pattern.sub('[DOB MASKED]', text)
    text = zip_code_pattern.sub('[ZIP MASKED]', text)
    text = policy_number_pattern.sub('[POLICY MASKED]', text)
    return text


class CustomPostProcessor(BaseNodePostprocessor):
    """
    Custom Postprocessor which masks Personally Identifiable Information (PII).
    PostProcessor is called on the Documents before they are sent to the LLM.
    """

    def _postprocess_nodes(
            self, nodes: List[NodeWithScore], query_bundle: Optional[QueryBundle]
    ) -> List[NodeWithScore]:
        # Masks PII
        for n in nodes:
            n.node.set_content(mask_pii(n.text))
        return nodes


# Use Public LLM to send user query and Related Documents
llm = OpenAI()
index = VectorStoreIndex.from_vector_store(es_vector_store)
# This query_engine, for a given user query retrieves top 10 similar documents from
# Elasticsearch vector database and sends the documents along with the user query to the LLM.
# Note that documents are masked based on custom logic defined in CustomPostProcessor._postprocess_nodes.
query_engine = index.as_query_engine(llm, similarity_top_k=10, node_postprocessors=[CustomPostProcessor()])
query = "Give me summary of water related claims that customers raised."
bundle = QueryBundle(query, embedding=Settings.embed_model.get_query_embedding(query))
response = query_engine.query(bundle)
print(response)


The response from the LLM is shown below.
Customers have raised various water-related claims, including issues such as water damage in basements, burst pipes, hail damage to roofs, and flooding during heavy rainfall. These claims have led to frustrations due to claim denials based on reasons such as lack of timely notification, maintenance issues, gradual wear and tear, and pre-existing damage. Customers have expressed disappointment, stress, and financial burden as a result of these claim denials, seeking fair evaluations and thorough reviews of their claims. Some customers have also faced delays in claim processing, causing further dissatisfaction with the service provided by the insurance company.
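In the code above, when creating the LlamaIndex QueryEngine, we supply a CustomPostProcessor.
The logic invoked by the QueryEngine is defined in the _postprocess_nodes method of CustomPostProcessor. We use the spaCy library to detect named entities and replace them, and then apply regular expressions to mask other sensitive information before the documents are sent to the LLM.
Below are a portion of the original conversation and the masked conversation created by CustomPostProcessor.
Original text: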
Customer: Hi, I'm Matthew Lopez, DOB is October 12th, 1984, and I live at 456 Cedar St, Smalltown, NY 34567. My Policy Number is TUV8901. Agent: Good afternoon, Matthew. How can I assist you today? Customer: Hello, I'm extremely disappointed with your company's decision to deny my claim.
Text masked by CustomPostProcessor:
Customer: Hi, I'm [MASKED], [MASKED] is [DOB MASKED], and I live at 456 Cedar St, [MASKED], [MASKED] 34567. My Policy Number is [MASKED]. Agent: Good afternoon, [MASKED]. How can I assist you today? Customer: Hello, I'm extremely disappointed with your company's decision to deny my claim.
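Note:
Identifying and masking PII and sensitive information is not a simple task. Covering the various formats and semantics of sensitive information requires a good understanding of your domain and data. In the masked sample above, for example, the street address and ZIP code are left in place. While the code presented above may work for some use cases, you may need to modify it based on your needs and testing.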
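One way to act on this note is to keep a small regression check over representative snippets of your data. The sketch below covers only the regex part (names still need NER, so they are expected to remain), uses only the standard library, and makes two assumptions that differ from the listing above: the address pattern runs first, before anything else rewrites the city and state, and the policy number pattern uses word boundaries rather than a trailing period.

```python
import re

dob_pattern = re.compile(
    r"(January|February|March|April|May|June|July|August|September|October|November|December)"
    r"\s(\d{1,2})(st|nd|rd|th),\s(\d{4})")
address_pattern = re.compile(r"\d+\s+[\w\s]+,\s+[A-Za-z]+,\s+[A-Z]{2}\s+\d{5}(-\d{4})?")
policy_number_pattern = re.compile(r"\b[A-Z]{3}\d{4}\b")


def mask_with_regex(text):
    """Apply the regex-only masks; the full address is masked before its parts can be altered."""
    text = address_pattern.sub("[ADDRESS MASKED]", text)
    text = dob_pattern.sub("[DOB MASKED]", text)
    text = policy_number_pattern.sub("[POLICY MASKED]", text)
    return text


sample = ("Customer: Hi, I'm Matthew Lopez, DOB is October 12th, 1984, and I live at "
          "456 Cedar St, Smalltown, NY 34567. My Policy Number is TUV8901.")
masked = mask_with_regex(sample)
print(masked)

# Values that must never survive masking in this sample.
leaks = [v for v in ("456 Cedar St", "34567", "October 12th, 1984", "TUV8901") if v in masked]
print("leaks:", leaks)
```

A list of "must not appear" values per sample makes it easy to spot which pattern needs attention when the data format changes.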

Using LlamaIndex's out-of-the-box NERPIINodePostprocessor

LlamaIndex makes it easier to protect PII in a RAG pipeline by providing the NERPIINodePostprocessor.

from llama_index.core import VectorStoreIndex, QueryBundle, Settings
from llama_index.core.postprocessor import NERPIINodePostprocessor
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
from index import es_vector_store
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Use Public LLM to send user query and Related Documents
llm = OpenAI()
ner_processor = NERPIINodePostprocessor()
index = VectorStoreIndex.from_vector_store(es_vector_store)
# This query_engine, for a given user query retrieves top 10 similar documents from
# Elasticsearch vector database and sends the documents along with the user query to the LLM.
# Note that documents masked using the NERPIINodePostprocessor so that PII/Sensitive data is not sent to the LLM.
query_engine = index.as_query_engine(llm, similarity_top_k=10, node_postprocessors=[ner_processor])
query = "Give me summary of fire related claims that customers raised."
bundle = QueryBundle(query, embedding=Settings.embed_model.get_query_embedding(query))
response = query_engine.query(bundle)
print(response)


The response is shown below:
Customers have raised fire-related claims regarding damage to their properties. In one case, a claim for fire damage to a garage was denied due to arson being excluded from coverage. Another customer filed a claim for fire damage to their home, which was covered under their policy. Additionally, a customer reported a kitchen fire and was assured that fire damage was covered.

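Using a local LLM via PIINodePostprocessor

We can also use an LLM running locally or in a private network to do the masking before sending the data to a public LLM.
We will use Mistral running on Ollama on the local machine for masking.

Running Mistral locally

Download and install Ollama. After installing Ollama, run this command to download and run Mistral.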

ollama run mistral


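Downloading and running the model locally may take a few minutes the first time. Verify that Mistral is running by asking a question such as "Write a poem about clouds" and checking that the poem is to your liking. Keep Ollama running, as we need to interact with the Mistral model later through code.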
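Besides asking for a poem, you can also check from code that the model is available before running the masking script. The sketch below assumes Ollama is listening on its default local address (http://localhost:11434) and uses its model listing endpoint; the helper names are hypothetical, and the address should be adjusted if your setup differs.

```python
import json
import urllib.request

# Default local Ollama address (assumption; change it if Ollama runs elsewhere).
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def has_model(tags_payload, model_name):
    """True if a locally available model is named model_name or model_name:<tag>."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    return any(n == model_name or n.startswith(model_name + ":") for n in names)


def fetch_tags(url=OLLAMA_TAGS_URL, timeout=5):
    """Fetch the list of locally available models from a running Ollama instance."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Example payload in the shape returned by the listing endpoint.
example_payload = {"models": [{"name": "mistral:latest"}]}
print(has_model(example_payload, "mistral"))
```

With Ollama running, has_model(fetch_tags(), "mistral") performs the same check against the live instance.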
Create a new file named query_masking_local_LLM.py and add the following code.
query_masking_local_LLM.py

# pip install llama-index-llms-ollama
from llama_index.core import VectorStoreIndex, QueryBundle, Settings
from llama_index.core.postprocessor import PIINodePostprocessor
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.llms.openai import OpenAI
from index import es_vector_store
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Use Public LLM to send user query and Related Documents and Local LLM to mask
public_llm = OpenAI()
local_llm = Ollama(model="mistral")
pii_processor = PIINodePostprocessor(llm=local_llm)
index = VectorStoreIndex.from_vector_store(es_vector_store)
# This query_engine, for a given user query retrieves top 10 similar documents from
# Elasticsearch vector database and sends the documents along with the user query to the public LLM.
# Note that documents are masked using the local llm via PIINodePostprocessor
# so that PII/Sensitive data is not sent to the public LLM.
query_engine = index.as_query_engine(public_llm, similarity_top_k=10, node_postprocessors=[pii_processor])
query = "Give me summary of fire related claims that customers raised."
bundle = QueryBundle(query, embedding=Settings.embed_model.get_query_embedding(query))
result = query_engine.query(bundle)
print(result)


The response is shown below:
Customers have raised fire-related claims regarding damage to their properties. In one case, a claim for fire damage to a garage was denied due to arson being excluded from coverage. Another customer filed a claim for fire damage to their home, which was covered under their policy. Additionally, a customer reported a kitchen fire and was assured that fire damage was covered.

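Conclusion

In this post, we showed how to protect PII and sensitive data when using public LLMs in a RAG flow. We demonstrated several ways to achieve this. It is strongly recommended to test these approaches against your use case and requirements before adopting them.
Ready to try this out on your own? Start a free trial.
Elasticsearch has integrations for tools from LangChain, Cohere and more. Join our advanced semantic search webinar to build your next GenAI app!

Original article: RAG: How to protect sensitive and PII info with Elasticsearch & LlamaIndex (Search Labs)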
