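Although our technical support knowledge base contains more than 2,800 articles, we knew that Elastic Support Assistant users would ask questions those articles could not answer. For example: What new features would be available if I upgraded from Elastic Cloud 8.11 to 8.14? This would not appear in a technical support article, because it is not a troubleshooting question, nor in the OpenAI model, because 8.14 was released after the model's training cutoff date.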
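We chose to solve this by adding more official Elastic sources (such as product documentation for all versions, the Elastic blog, the Search/Security/Observability Labs, and the Elastic onboarding guides) as sources for our semantic search implementation (similar to this example). By using semantic search to retrieve these documents when they are relevant, we enable the Support Assistant to answer a much broader range of questions.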
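The ingestion process covers several hundred thousand documents and has to handle complex sitemaps across Elastic properties. We chose a scraping and automation library called Crawlee to handle the scale and frequency needed to keep the knowledge library up to date.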
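Each of the four crawler jobs runs on Google Cloud Run. We chose this because jobs have a timeout of 24 hours and can be scheduled without using Cloud Tasks or PubSub. Our needs resulted in a total of four jobs running in parallel, each with a base URL that captures a specific category of documents. When crawling websites, we recommend starting from base URLs that have no overlapping content, to avoid ingesting duplicates. This must be balanced against crawling at too high a level and ingesting documents that are of no use to your knowledge store. For example, we crawl https://elastic.co/blog and https://www.elastic.co/search-labs/blog rather than elastic.co/, because our target is technical documentation.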
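The advice about non-overlapping base URLs can be checked mechanically before any crawl is scheduled. The following is a minimal sketch, not the crawler configuration described in this article: the helper names `is_under` and `overlapping_bases` are hypothetical, and treating `www.` as equivalent to the bare host is an assumption made for illustration.

```python
from urllib.parse import urlparse


def _normalize(url):
    """Return (host, path) with a leading 'www.' and trailing '/' removed."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):  # assumption: www and bare host serve the same content
        host = host[4:]
    return host, parsed.path.rstrip("/")


def is_under(url, base):
    """True if `url` is the base itself or sits below it in the path hierarchy."""
    host, path = _normalize(url)
    base_host, base_path = _normalize(base)
    return host == base_host and (path == base_path or path.startswith(base_path + "/"))


def overlapping_bases(bases):
    """Return every pair of base URLs where one would be crawled by the other's job."""
    pairs = []
    for i, first in enumerate(bases):
        for second in bases[i + 1:]:
            if is_under(first, second) or is_under(second, first):
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    jobs = ["https://elastic.co/blog", "https://www.elastic.co/search-labs/blog"]
    print(overlapping_bases(jobs))                            # no shared subtree
    print(overlapping_bases(["https://elastic.co/"] + jobs))  # root overlaps both
```

The same `is_under` check could also be used as a filter on discovered links, so that each job only enqueues pages below its own base URL.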
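The embeddings in the ml field are stored as keyword and vector pairs. When a search query is issued, it is also converted into an embedding. Documents whose embeddings are close to the query embedding are considered relevant and are retrieved and ranked. The following example shows what the ELSER embedding of the title field "How to install and run the support diagnostics troubleshooting utility" looks like. Although only the title is shown below, the field will also contain all of the vector embeddings for the summary.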
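To make the matching step concrete, here is a minimal sketch of how keyword/weight embeddings can be compared. It assumes relevance is the dot product over the keywords a query and a document share; the tokens, weights, and the function names `sparse_dot` and `rank` are invented for illustration and are not real ELSER output or an Elasticsearch API.

```python
def sparse_dot(query, doc):
    """Dot product of two sparse embeddings given as {keyword: weight} dicts."""
    # Iterate over the smaller dict so the cost depends on the shorter embedding.
    small, large = (query, doc) if len(query) <= len(doc) else (doc, query)
    return sum(weight * large[token] for token, weight in small.items() if token in large)


def rank(query, docs):
    """Score every document against the query and return matches, best first."""
    scored = [(doc_id, sparse_dot(query, embedding)) for doc_id, embedding in docs.items()]
    matches = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up weights standing in for the expanded query and document embeddings.
    query = {"install": 1.0, "diagnostics": 2.0}
    docs = {
        "support-diagnostics": {"install": 0.5, "diagnostics": 1.5, "utility": 1.0},
        "upgrade-guide": {"upgrade": 2.0, "install": 0.5},
        "billing-faq": {"billing": 1.0},
    }
    for doc_id, score in rank(query, docs):
        print(doc_id, score)
```

Documents that share no keyword with the query score zero and are dropped, which is why an unrelated article never reaches the ranking stage in this sketch.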
For \`generate[i]\`, you will generate increasingly concise, entity-dense summaries of \`data\`, considering content fields only, respecting the instructions in \`generate[i].prompt\` definition and the meaning of target \`generate[i].name\` field.
Repeat the following 2 steps 3 times.
Step 1. Identify 1-3 informative entities (";" delimited) from the article which are missing from the previously generated summary.
Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the missing entities.
A missing entity is:
- relevant to the main story,
- specific yet concise (5 words or fewer),
- novel (not in the previous summary),
- faithful (present in the article),
- anywhere (can be located anywhere in the article).
Guidelines:
- The first summary should be long yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g., "this article discusses") to reach ~80 words.
- Make every word count: rewrite the previous summary to improve flow and make space for additional entities.
- Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
- The summaries should become highly dense and concise yet self-contained, i.e., easily understood without the article.
- Missing entities can appear anywhere in the new summary.
- Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
Output: \`generatedField[i]\` with the resulting string.
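Because the prompt asks the model to keep every earlier entity and hold the summary length steady, the intermediate outputs can be sanity-checked after the fact. The sketch below is one hypothetical way to do that, assuming you have logged each step's summary and the entities it added. The function name `check_density_chain`, the case-insensitive substring match, and the 20% length tolerance are all assumptions, not part of the prompt or of the pipeline described here.

```python
def check_density_chain(summaries, entities_per_step, length_tolerance=0.2):
    """Return a list of rule violations; an empty list means the chain looks consistent.

    summaries: one summary string per densification step, in order.
    entities_per_step: for each step, the list of entities added at that step.
    length_tolerance: allowed relative drift in word count (assumed value).
    """
    problems = []
    carried = []  # entities that every later summary must still mention
    target_words = len(summaries[0].split())
    steps = zip(summaries, entities_per_step)
    for step, (summary, new_entities) in enumerate(steps, start=1):
        text = summary.lower()
        for entity in new_entities:
            if len(entity.split()) > 5:  # "specific yet concise (5 words or fewer)"
                problems.append(f"step {step}: entity too long: {entity!r}")
        carried.extend(new_entities)
        for entity in carried:  # "Never drop entities from the previous summary"
            if entity.lower() not in text:
                problems.append(f"step {step}: missing entity: {entity!r}")
        words = len(summary.split())
        if abs(words - target_words) > length_tolerance * target_words:
            problems.append(f"step {step}: length {words} drifts from {target_words}")
    return problems


if __name__ == "__main__":
    chain = [
        "This article discusses Crawlee and more things here",
        "Crawlee crawlers run on Google Cloud Run jobs",
        "Crawlee jobs on Google Cloud Run feed ELSER",
    ]
    added = [["Crawlee"], ["Google Cloud Run"], ["ELSER"]]
    print(check_density_chain(chain, added))
```

A substring match is a crude proxy for "the entity is still covered"; a paraphrased entity would be flagged as missing, so treat the output as a prompt for manual review rather than a hard failure.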