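The paper "Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond" uses the tree diagram shown in Figure 1 to lay out how large language models (LLMs) for natural language have developed since 2018, together with the major models along each path. Some of these models predate the Transformer and are not based on it, such as AI2's ELMo. Most of them appeared after the Transformer and are built on it, and these in turn follow three development paths:
For the knowledge base document, I use the Super Huichuan Programmatic Creative Product Manual (《超级汇川程序化创意产品手册》). I downloaded it locally in Markdown format, loaded it with UnstructuredMarkdownLoader, and split it into text chunks with MarkdownTextSplitter. For the embedding model, I use GanymedeNil/text2vec-large-chinese from HuggingFace, downloaded locally: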
cd ~/workspace/models/
git lfs install  # not needed if already run in the ChatGLM-6B section
git clone huggingface.co/GanymedeNil…
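To make the splitting step described above more concrete, here is a minimal standard-library sketch of heading-aware chunking. It is not the implementation of LangChain's MarkdownTextSplitter; the function name `split_markdown` and the `max_chars` parameter are hypothetical, chosen only for illustration.

```python
import re


def split_markdown(text: str, max_chars: int = 200) -> list[str]:
    """Split Markdown into chunks: first at headings, then by paragraph.

    Simplified assumption: a line starting with 1-6 '#' characters and a space
    is a heading. Headings inside code fences are not treated specially.
    """
    # Zero-width split before every heading line, so each heading stays
    # attached to the section it introduces.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{1,6} )", text) if s.strip()]
    chunks: list[str] = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: pack paragraphs greedily up to max_chars.
        # A single paragraph longer than max_chars is kept whole.
        current = ""
        for para in section.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            candidate = f"{current}\n\n{para}" if current else para
            if not current or len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = para
        if current:
            chunks.append(current)
    return chunks


doc = "# Guide\n\nIntro text.\n\n## Setup\n\nInstall it.\n\n## Usage\n\nRun it."
chunks = split_markdown(doc)
for chunk in chunks:
    print(repr(chunk))
```

Keeping each heading with its body means a retrieved chunk carries its own topic label, which tends to help both the embedding and the final answer.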
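For the vector index engine, I use Chroma; for the large language model, I use the ChatGLM2 class defined earlier. Given the question and the relevant text chunks returned by the vector index, RetrievalQA assembles the prompt from the following template: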
Use the following pieces of context to answer the question at the end. If you don’t know the answer, just say that you don’t know, don’t try to make up an answer.
{context}
Question: {question} Helpful Answer:
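The retrieve-then-fill flow around this template can be sketched with the standard library alone. This is an illustrative sketch, not RetrievalQA's code: the three-dimensional vectors, the chunk texts, and the helpers `cosine`, `top_k`, and `build_prompt` are all hypothetical, and the line breaks in the template and the blank-line separator between chunks are assumptions.

```python
import math

# Template text from the article; the exact line breaks are an assumption.
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed_chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(indexed_chunks, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Join the retrieved chunks and substitute them into the template."""
    return PROMPT_TEMPLATE.format(context="\n\n".join(chunks), question=question)


# Hypothetical toy index: (chunk text, embedding). A real setup would get the
# vectors from the embedding model and store them in the vector index.
index = [
    ("Chunk about creative templates.", [0.9, 0.1, 0.0]),
    ("Chunk about billing.", [0.0, 0.2, 0.9]),
    ("Chunk about creative review rules.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # hypothetical embedding of the question

retrieved = top_k(query_vec, index, k=2)
prompt = build_prompt("How are creatives reviewed?", retrieved)
print(prompt)
```

The prompt ends at "Helpful Answer:" so that the model's completion is the answer itself.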
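The output of running retrieval_qa_demo.py is shown in Figure 10.
For the large language model, I first tried the ChatGLM2 class defined earlier. As analyzed later, the execution results show that neither ChatGLM2-6B-INT4 nor ChatGLM2-6B produces an answer in the required format, so the query SQL cannot be extracted from it. The example therefore uses FakeListLLM to return fixed answers, which were generated beforehand by OpenAI ChatGPT3.5 from the same prompts.
For the database engine, I use SQLite3 (available out of the box on a MacBook). For the database instance, I use Chinook: follow the instructions at the link above to download "Chinook_Sqlite.sql" and create the database locally. Chinook models a digital media store and contains customers (Customer), employees (Employee), tracks (Track), invoices (Invoice), and their related tables and data, as shown in Figure 12. The question is "How many employees are there?". The model is expected to first produce SQL that counts the rows of the Employee table, and then give the final answer based on the query result.
At run time, SQLDatabaseChain first generates the following prompt from the question and the database schema: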
You are a SQLite expert. Given an input question, first create a syntactically correct SQLite query to run, then look at the results of the query and return the answer to the input question. Unless the user specifies in the question a specific number of examples to obtain, query for at most 5 results using the LIMIT clause as per SQLite. You can order the results to return the most informative data in the database. Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in double quotes (") to denote them as delimited identifiers. Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table. Pay attention to use date('now') function to get the current date, if the question involves "today".

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{database schema: the CREATE TABLE statements and sample rows for every table, omitted here for brevity}

Question: How many employees are there?
SQLQuery:
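The first part of this prompt is the instruction: the model is expected to act as a SQLite expert, reason under the stated requirements, and produce output in the stated format. The second part is the database schema. The third part is the question plus the start of the expected output, "SQLQuery:", so that the model continues from there and supplies the query SQL.
If this prompt is fed to ChatGPT3.5, the expected answer comes back. SQLDatabaseChain then takes the part of the answer before "\nSQLResult" to obtain the query SQL: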
SELECT COUNT(*) FROM Employee
SQLResult:
COUNT(*)
8
Answer: There are 8 employees.
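If the prompt is instead fed to the custom ChatGLM2 class (running ChatGLM2-6B-INT4), the expected answer does not come back (the answer is sensible, but it does not follow the required format):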
SQLite is a language for creating and managing databases. It does not have an SQL-specific version for getting the number of employees. However, I can provide you with an SQL query that you can run using a SQLite database to get the number of employees in the “Employee” table.
SQLite:
SELECT COUNT(*) as num_employees FROM Employee;
This query will return the count of employees in the “Employee” table. The result will be returned in a single row with a single column, labeled “num_employees”.
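The practical effect of the format mismatch can be checked with a small sketch. It assumes only what the text states, namely that the SQL is taken as everything before "\nSQLResult". The helper names `extract_sql` and `runs_as_sql`, the one-column Employee table, and the abridged ChatGLM2 answer are hypothetical stand-ins.

```python
import sqlite3


def extract_sql(completion: str) -> str:
    """Keep only the text before the first '\\nSQLResult' marker."""
    return completion.split("\nSQLResult")[0].strip()


# Hypothetical empty stand-in for the Chinook Employee table.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE Employee ("EmployeeId" INTEGER PRIMARY KEY)')


def runs_as_sql(candidate: str) -> bool:
    """Return True if SQLite accepts the candidate string as a statement."""
    try:
        conn.execute(candidate).fetchall()
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False


# Answer in the expected format, as ChatGPT3.5 returns it.
chatgpt_answer = (
    "SELECT COUNT(*) FROM Employee\n"
    "SQLResult:\n"
    "COUNT(*)\n"
    "8\n"
    "Answer: There are 8 employees."
)

# Abridged ChatGLM2-6B-INT4 answer: valid SQL is present, but wrapped in prose.
chatglm_answer = (
    "SQLite is a language for creating and managing databases.\n\n"
    "SQLite:\n\n"
    "SELECT COUNT(*) as num_employees FROM Employee;\n\n"
    "This query will return the count of employees in the \"Employee\" table."
)

for name, answer in [("ChatGPT3.5", chatgpt_answer), ("ChatGLM2", chatglm_answer)]:
    candidate = extract_sql(answer)
    print(name, "-> executable SQL:", runs_as_sql(candidate))
```

Because the ChatGLM2 answer contains no "\nSQLResult" marker and does not start with the query, the whole prose answer is passed on as SQL and SQLite rejects it.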
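The SQLDatabaseChain prompt was tuned step by step against ChatGPT and settled on that basis, so it suits ChatGPT. The official LangChain example uses OpenAI as the large language model, that is, it calls ChatGPT under the hood. Compared with ChatGPT, ChatGLM2-6B-INT4 and ChatGLM2-6B are small models with only 6 billion parameters, and they cannot give the expected answer for a long prompt like the one above. Because I do not have an OpenAI token, the example code uses FakeListLLM to return the answers produced by ChatGPT3.5 directly.
After obtaining the query SQL, SQLDatabaseChain executes it to get the query result, and then generates the following prompt from the question, the database schema, the query SQL, and the query result: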
You are a SQLite expert. Given an input question, first create a syntactically correct SQLite query to run, then look at the results of the query and return the answer to the input question. Unless the user specifies in the question a specific number of examples to obtain, query for at most 5 results using the LIMIT clause as per SQLite. You can order the results to return the most informative data in the database. Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in double quotes (") to denote them as delimited identifiers. Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table. Pay attention to use date('now') function to get the current date, if the question involves "today".

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{database schema: the CREATE TABLE statements and sample rows for every table, omitted here for brevity}

Question: How many employees are there?
SQLQuery:SELECT COUNT(EmployeeId) FROM Employee
SQLResult: [(8,)]
Answer:
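Compared with the previous prompt, this one only appends the query SQL and the query result at the end. If it is fed to ChatGPT3.5, the model continues after "Answer:" and gives the correct answer: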
There are 8 employees.
Here too, FakeListLLM returns the answer produced by ChatGPT3.5 directly, which lets the whole SQLDatabaseChain flow run locally. The result is shown in Figure 13.
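The two-call flow described above can be reproduced end to end with the standard library. This is a rough sketch under stated assumptions rather than SQLDatabaseChain's actual code: `CannedLLM` is a hypothetical stand-in for FakeListLLM, `make_demo_db` builds a hypothetical 8-row Employee table in place of Chinook, and the prompt is abbreviated to its first sentence.

```python
import sqlite3


class CannedLLM:
    """Replays pre-recorded completions in order and ignores the prompt."""

    def __init__(self, responses):
        self._responses = iter(responses)
        self.prompts = []  # prompts received, kept for inspection

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return next(self._responses)


def make_demo_db() -> sqlite3.Connection:
    """Hypothetical stand-in for Chinook: one Employee table with 8 rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE Employee ("EmployeeId" INTEGER PRIMARY KEY, "LastName" TEXT)')
    conn.executemany(
        'INSERT INTO Employee ("LastName") VALUES (?)',
        [(f"Employee{i}",) for i in range(1, 9)],
    )
    return conn


def answer_question(question, llm, conn):
    # Abbreviated prompt; the real one also carries the format rules and schema.
    prompt = f"You are a SQLite expert.\n\nQuestion: {question}\nSQLQuery:"
    # Call 1: the model continues after "SQLQuery:"; keep the text before "\nSQLResult".
    sql = llm(prompt).split("\nSQLResult")[0].strip()
    rows = conn.execute(sql).fetchall()
    # Call 2: the same prompt with the SQL and its result appended, ending at "Answer:".
    prompt += f"{sql}\nSQLResult: {rows}\nAnswer:"
    answer = llm(prompt).strip()
    return sql, rows, answer


llm = CannedLLM([
    "SELECT COUNT(EmployeeId) FROM Employee\nSQLResult: [(8,)]\nAnswer: There are 8 employees.",
    "There are 8 employees.",
])
sql, rows, answer = answer_question("How many employees are there?", llm, make_demo_db())
print(sql)
print(rows)
print(answer)
```

The query result placed in the second prompt comes from actually running the SQL, not from whatever the model wrote after "SQLResult" in its first answer, which is why the first completion is truncated at that marker.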