[AIGC Study Notes] Local Deployment of GraphRAG


[GraphRAG + Ollama Local Deployment] A fresh, piping-hot guide for beginners



  

Environment Preparation and Source Download

1. Operating system: Ubuntu 20.04
2. Editor: VS Code
3. Python 3.11
4. Ollama installed
5. GraphRAG source code: https://github.com/microsoft/graphrag.git
I. Creating and Configuring the Anaconda Virtual Environment

1. conda create --name graphR python=3.11
2. pip install graphrag==0.3.6 (newer versions of graphrag tend to fail with: No module named graphrag.index.main)
3. pip install ollama
(If any package not mentioned here is reported as missing during later steps, simply install it as prompted.)
II. Downloading the Ollama Models

1. ollama serve (start Ollama)
2. ollama pull mistral:v0.2
3. ollama pull nomic-embed-text:latest

(An optional sanity check for both models is sketched right after this list.)
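As an optional check (not part of the original steps), the following Python sketch uses the ollama package installed earlier to confirm that both models respond; it assumes ollama serve is already running:

import ollama

# Chat model: ask for a one-line reply to confirm mistral:v0.2 is available.
reply = ollama.chat(
    model="mistral:v0.2",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(reply["message"]["content"])

# Embedding model: embed a short string and report the vector length
# (nomic-embed-text normally returns 768-dimensional vectors).
emb = ollama.embeddings(model="nomic-embed-text:latest", prompt="hello world")
print(len(emb["embedding"]))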
III. Creating the Data Directory

Inside the graphrag source folder, create a ragtest folder, create an input folder inside it, and place your .txt data in the input folder (a small helper sketch follows).
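The same layout can also be created from Python if you prefer; "data.txt" below is only a placeholder name for whatever document you want to index:

from pathlib import Path

# Create ./ragtest/input inside the graphrag source folder.
root = Path("./ragtest")
(root / "input").mkdir(parents=True, exist_ok=True)

# Put your own text file here; "data.txt" is just a placeholder name.
(root / "input" / "data.txt").write_text("Your source text goes here.", encoding="utf-8")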
IV. Project Initialization

1. python -m graphrag.index --init --root ./ragtest
(Open a terminal in the graphrag source folder, activate the graphR virtual environment, and run the command above. The ragtest folder should then contain settings.yaml, the prompts folder, .env, and so on; normally there are six items in total, but sometimes only a few appear at first and the rest are generated automatically when the index is built later. A quick check is sketched below.)
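A small way to see which items are already present (a sketch, run from the graphrag source folder; the exact set of six items can vary between versions, so treat the list below as an assumption based on this guide's configuration):

from pathlib import Path

root = Path("./ragtest")
# Items this guide expects after initialization; anything missing is normally
# generated later when the index is built.
for name in ["settings.yaml", ".env", "prompts", "input", "output", "cache"]:
    status = "found" if (root / name).exists() else "missing (may appear during indexing)"
    print(f"{name:15s} {status}")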

V. Modifying the Configuration File settings.yaml

Edit the file to match the configuration below; a small verification sketch follows the listing.
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ollama
  type: openai_chat # or azure_openai_chat
  model: mistral:v0.2
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 1024
  # request_timeout: 180.0
  api_base: http://localhost:11434/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate
parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing
async_mode: threaded # or asyncio
embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  # target: required # or all
  # batch_size: 16 # the number of documents to send in a single request
  # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
  llm:
    api_key: ollama
    type: openai_embedding # or azure_openai_embedding
    model: nomic-embed-text:latest
    api_base: http://localhost:11434/api
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
chunks:
  size: 200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"
cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>
storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>
reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>
entity_extraction:
  ## strategy: fully override the entity extraction strategy.
  ##   type: one of graph_intelligence, graph_intelligence_json and nltk
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0
summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500
claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0
community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000
cluster_graph:
  max_cluster_size: 10
embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832
umap:
  enabled: false # if true, will generate UMAP embeddings for nodes
snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false
local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
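Before moving on, it can be worth confirming that the two model entries really point at the local Ollama server. A small verification sketch, assuming PyYAML is available (pip install pyyaml):

import yaml

with open("./ragtest/settings.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# Both the chat model and the embedding model should resolve to the local
# Ollama server on port 11434, using the models pulled in step II.
print("chat llm:      ", cfg["llm"]["model"], "->", cfg["llm"]["api_base"])
print("embedding llm: ", cfg["embeddings"]["llm"]["model"], "->", cfg["embeddings"]["llm"]["api_base"])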
VI. Modifying the .env File

Edit the file to contain the two lines below.
GRAPHRAG_API_KEY=ollama
GRAPHRAG_CLAIM_EXTRACTION_ENABLED=True
VII. Modifying the graphrag Package Source Inside the graphR Environment

1. Locate the installed graphrag package: /home/***/.conda/envs/graphR/lib/python3.11/site-packages
2. Modify the first file: /home/***/.conda/envs/graphR/lib/python3.11/site-packages/graphrag/llm/openai/openai_embeddings_llm.py
  1. """The EmbeddingsLLM class."""
  2. from typing_extensions import Unpack
  3. from graphrag.llm.base import BaseLLM
  4. from graphrag.llm.types import (
  5.     EmbeddingInput,
  6.     EmbeddingOutput,
  7.     LLMInput,
  8. )
  9. from .openai_configuration import OpenAIConfiguration
  10. from .types import OpenAIClientTypes
  11. import ollama # 增加依赖
  12. class OpenAIEmbeddingsLLM(BaseLLM[EmbeddingInput, EmbeddingOutput]):
  13.     """A text-embedding generator LLM."""
  14.     _client: OpenAIClientTypes
  15.     _configuration: OpenAIConfiguration
  16.     def __init__(self, client: OpenAIClientTypes, configuration: OpenAIConfiguration):
  17.         self.client = client
  18.         self.configuration = configuration
  19.     async def _execute_llm(
  20.         self, input: EmbeddingInput, **kwargs: Unpack[LLMInput]
  21.     ) -> EmbeddingOutput | None:
  22.         args = {
  23.             "model": self.configuration.model,
  24.             **(kwargs.get("model_parameters") or {}),
  25.         }
  26.         # 修改此处
  27.         #embedding = await self.client.embeddings.create(
  28.         #    input=input,
  29.         #    **args,
  30.         #)
  31.         #return [d.embedding for d in embedding.data]
  32.         
  33.         embedding_list = []
  34.         for inp in input:
  35.             embedding = ollama.embeddings(model="nomic-embed-text:latest", prompt=inp)
  36.             embedding_list.append(embedding["embedding"])
  37.         return embedding_list
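One detail worth noting: ollama.embeddings() is a blocking call inside the async _execute_llm(), so each request briefly stalls the event loop. The patch works as written; if you want to keep the method non-blocking, a possible variant (my own sketch, not from the original post) hands each call to a worker thread:

import asyncio
import ollama

async def _execute_llm(self, input, **kwargs):
    # Run each blocking ollama.embeddings() call in a thread so the asyncio
    # event loop keeps servicing other in-flight requests.
    async def embed_one(text):
        response = await asyncio.to_thread(
            ollama.embeddings, model="nomic-embed-text:latest", prompt=text
        )
        return response["embedding"]

    return await asyncio.gather(*(embed_one(text) for text in input))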
3. Modify the second file: /home/***/.conda/envs/graphR/lib/python3.11/site-packages/graphrag/query/llm/oai/embedding.py
  1. """OpenAI Embedding model implementation."""
  2. import asyncio
  3. from collections.abc import Callable
  4. from typing import Any
  5. import numpy as np
  6. import tiktoken
  7. from tenacity import (
  8.     AsyncRetrying,
  9.     RetryError,
  10.     Retrying,
  11.     retry_if_exception_type,
  12.     stop_after_attempt,
  13.     wait_exponential_jitter,
  14. )
  15. from graphrag.logging import StatusLogger
  16. from graphrag.query.llm.base import BaseTextEmbedding
  17. from graphrag.query.llm.oai.base import OpenAILLMImpl
  18. from graphrag.query.llm.oai.typing import (
  19.     OPENAI_RETRY_ERROR_TYPES,
  20.     OpenaiApiType,
  21. )
  22. from graphrag.query.llm.text_utils import chunk_text
  23. # 增加依赖
  24. import ollama
  25. class OpenAIEmbedding(BaseTextEmbedding, OpenAILLMImpl):
  26.     """Wrapper for OpenAI Embedding models."""
  27.     def __init__(
  28.         self,
  29.         api_key: str | None = None,
  30.         azure_ad_token_provider: Callable | None = None,
  31.         model: str = "text-embedding-3-small",
  32.         deployment_name: str | None = None,
  33.         api_base: str | None = None,
  34.         api_version: str | None = None,
  35.         api_type: OpenaiApiType = OpenaiApiType.OpenAI,
  36.         organization: str | None = None,
  37.         encoding_name: str = "cl100k_base",
  38.         max_tokens: int = 8191,
  39.         max_retries: int = 10,
  40.         request_timeout: float = 180.0,
  41.         retry_error_types: tuple[type[BaseException]] = OPENAI_RETRY_ERROR_TYPES,  # type: ignore
  42.         reporter: StatusLogger | None = None,
  43.     ):
  44.         OpenAILLMImpl.__init__(
  45.             self=self,
  46.             api_key=api_key,
  47.             azure_ad_token_provider=azure_ad_token_provider,
  48.             deployment_name=deployment_name,
  49.             api_base=api_base,
  50.             api_version=api_version,
  51.             api_type=api_type,  # type: ignore
  52.             organization=organization,
  53.             max_retries=max_retries,
  54.             request_timeout=request_timeout,
  55.             reporter=reporter,
  56.         )
  57.         self.model = model
  58.         self.encoding_name = encoding_name
  59.         self.max_tokens = max_tokens
  60.         self.token_encoder = tiktoken.get_encoding(self.encoding_name)
  61.         self.retry_error_types = retry_error_types
  62.     def embed(self, text: str, **kwargs: Any) -> list[float]:
  63.         """
  64.         Embed text using OpenAI Embedding's sync function.
  65.         For text longer than max_tokens, chunk texts into max_tokens, embed each chunk, then combine using weighted average.
  66.         Please refer to: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
  67.         """
  68.         token_chunks = chunk_text(
  69.             text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
  70.         )
  71.         chunk_embeddings = []
  72.         chunk_lens = []
  73.         for chunk in token_chunks:
  74.             try:
  75.                 #embedding, chunk_len = self._embed_with_retry(chunk, **kwargs)
  76.                 #修改embedding、chunk_len
  77.                 embedding = ollama.embeddings(model='nomic-embed-text:latest', prompt=chunk)['embedding']
  78.                 chunk_len = len(chunk)
  79.                 chunk_embeddings.append(embedding)
  80.                 chunk_lens.append(chunk_len)
  81.             # TODO: catch a more specific exception
  82.             except Exception as e:  # noqa BLE001
  83.                 self._reporter.error(
  84.                     message="Error embedding chunk",
  85.                     details={self.__class__.__name__: str(e)},
  86.                 )
  87.                 continue
  88.         #chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
  89.         #chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
  90.         #return chunk_embeddings.tolist()
  91.         return chunk_embeddings
  92.    
  93.     async def aembed(self, text: str, **kwargs: Any) -> list[float]:
  94.         """
  95.         Embed text using OpenAI Embedding's async function.
  96.         For text longer than max_tokens, chunk texts into max_tokens, embed each chunk, then combine using weighted average.
  97.         """
  98.         token_chunks = chunk_text(
  99.             text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
  100.         )
  101.         chunk_embeddings = []
  102.         chunk_lens = []
  103.         embedding_results = await asyncio.gather(*[
  104.             self._aembed_with_retry(chunk, **kwargs) for chunk in token_chunks
  105.         ])
  106.         embedding_results = [result for result in embedding_results if result[0]]
  107.         chunk_embeddings = [result[0] for result in embedding_results]
  108.         chunk_lens = [result[1] for result in embedding_results]
  109.         chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)  # type: ignore
  110.         chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
  111.         return chunk_embeddings.tolist()
  112.     def _embed_with_retry(
  113.         self, text: str | tuple, **kwargs: Any
  114.     ) -> tuple[list[float], int]:
  115.         try:
  116.             retryer = Retrying(
  117.                 stop=stop_after_attempt(self.max_retries),
  118.                 wait=wait_exponential_jitter(max=10),
  119.                 reraise=True,
  120.                 retry=retry_if_exception_type(self.retry_error_types),
  121.             )
  122.             for attempt in retryer:
  123.                 with attempt:
  124.                     embedding = (
  125.                         self.sync_client.embeddings.create(  # type: ignore
  126.                             input=text,
  127.                             model=self.model,
  128.                             **kwargs,  # type: ignore
  129.                         )
  130.                         .data[0]
  131.                         .embedding
  132.                         or []
  133.                     )
  134.                     return (embedding, len(text))
  135.         except RetryError as e:
  136.             self._reporter.error(
  137.                 message="Error at embed_with_retry()",
  138.                 details={self.__class__.__name__: str(e)},
  139.             )
  140.             return ([], 0)
  141.         else:
  142.             # TODO: why not just throw in this case?
  143.             return ([], 0)
  144.     async def _aembed_with_retry(
  145.         self, text: str | tuple, **kwargs: Any
  146.     ) -> tuple[list[float], int]:
  147.         try:
  148.             retryer = AsyncRetrying(
  149.                 stop=stop_after_attempt(self.max_retries),
  150.                 wait=wait_exponential_jitter(max=10),
  151.                 reraise=True,
  152.                 retry=retry_if_exception_type(self.retry_error_types),
  153.             )
  154.             async for attempt in retryer:
  155.                 with attempt:
  156.                     embedding = (
  157.                         await self.async_client.embeddings.create(  # type: ignore
  158.                             input=text,
  159.                             model=self.model,
  160.                             **kwargs,  # type: ignore
  161.                         )
  162.                     ).data[0].embedding or []
  163.                     return (embedding, len(text))
  164.         except RetryError as e:
  165.             self._reporter.error(
  166.                 message="Error at embed_with_retry()",
  167.                 details={self.__class__.__name__: str(e)},
  168.             )
  169.             return ([], 0)
  170.         else:
  171.             # TODO: why not just throw in this case?
  172.             return ([], 0)
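Note that the patched embed() now returns the list of per-chunk vectors instead of the single length-weighted average vector the original implementation computed. With the default settings most query strings fit in one chunk, so this usually does not matter; if you prefer to keep the original averaging behaviour, a variant of the method (my own sketch, relying on the np, ollama, and chunk_text imports already at the top of this file) would look like this:

def embed(self, text: str, **kwargs):
    # Same Ollama call as in the patch above, but the chunk vectors are
    # recombined with the length-weighted average of the original code.
    token_chunks = chunk_text(
        text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
    )
    chunk_embeddings, chunk_lens = [], []
    for chunk in token_chunks:
        embedding = ollama.embeddings(model="nomic-embed-text:latest", prompt=chunk)["embedding"]
        chunk_embeddings.append(embedding)
        chunk_lens.append(len(chunk))
    combined = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
    combined = combined / np.linalg.norm(combined)
    return combined.tolist()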
4. Modify the third file: /home/***/.conda/envs/graphR/lib/python3.11/site-packages/graphrag/query/llm/text_utils.py
  1. """Text Utilities for LLM."""
  2. from collections.abc import Iterator
  3. from itertools import islice
  4. import tiktoken
  5. def num_tokens(text: str, token_encoder: tiktoken.Encoding | None = None) -> int:
  6.     """Return the number of tokens in the given text."""
  7.     if token_encoder is None:
  8.         token_encoder = tiktoken.get_encoding("cl100k_base")
  9.     return len(token_encoder.encode(text))  # type: ignore
  10. def batched(iterable: Iterator, n: int):
  11.     """
  12.     Batch data into tuples of length n. The last batch may be shorter.
  13.     Taken from Python's cookbook: https://docs.python.org/3/library/itertools.html#itertools.batched
  14.     """
  15.     # batched('ABCDEFG', 3) --> ABC DEF G
  16.     if n < 1:
  17.         value_error = "n must be at least one"
  18.         raise ValueError(value_error)
  19.     it = iter(iterable)
  20.     while batch := tuple(islice(it, n)):
  21.         yield batch
  22. def chunk_text(
  23.     text: str, max_tokens: int, token_encoder: tiktoken.Encoding | None = None
  24. ):
  25.     """Chunk text by token length."""
  26.     if token_encoder is None:
  27.         token_encoder = tiktoken.get_encoding("cl100k_base")
  28.     tokens = token_encoder.encode(text)  # type: ignore
  29.     # 增加下行代码,将tokens解码成字符串
  30.     tokens = token_encoder.decode(tokens)
  31.     chunk_iterator = batched(iter(tokens), max_tokens)
  32.     #yield from (token_encoder.decode(list(chunk)) for chunk in chunk_iterator)
  33.     yield from chunk_iterator
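After this change, chunk_text() decodes the token ids back into plain text and then batches that string character by character, so each yielded chunk is a tuple of up to max_tokens characters rather than a list of token ids. A tiny illustration of the new behaviour (a sketch for inspection only):

import tiktoken
from graphrag.query.llm.text_utils import chunk_text

enc = tiktoken.get_encoding("cl100k_base")
chunks = list(chunk_text("GraphRAG with a local Ollama backend", max_tokens=10, token_encoder=enc))
print(len(chunks))           # number of character batches
print("".join(chunks[0]))    # first batch, joined back into readable text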
VIII. Building the Index

1. Keep the Ollama service open so you can watch the models' running status.
2. python -m graphrag.index --root ./ragtest
(Return to the terminal opened in step IV and run this command; the index is built once all workflows finish without errors.)
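Once the run finishes, the artifacts land under ragtest/output/<timestamp>/artifacts as Parquet files (the path comes from storage.base_dir in settings.yaml). A quick way to peek at them, assuming pandas and a Parquet engine such as pyarrow are installed:

import glob
import pandas as pd

# Each indexing run creates a timestamped folder, so glob over it instead of
# hard-coding the run name.
for path in sorted(glob.glob("./ragtest/output/*/artifacts/*.parquet")):
    df = pd.read_parquet(path)
    print(f"{path}: {len(df)} rows, columns: {list(df.columns)[:5]}")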

IX. Running Queries

1. Local query: python -m graphrag.query --root ./ragtest --method local "who is Marley?"

2. Global query: python -m graphrag.query --root ./ragtest --method global "who is Marley?"

(Note: when querying I ran into a "No module named graphrag.logging" error; copying the logging folder from the graphrag source tree into the corresponding location inside the graphrag package of the virtual environment fixed it. See the sketch below.)
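If you hit the same error, the copy can be scripted; the paths below are the example paths used earlier in this post, so adjust the home directory and environment name to your own machine:

import shutil
from pathlib import Path

# Source: the logging folder inside the cloned graphrag repository
# (run this from the repository root, as in the earlier steps).
src = Path("./graphrag/logging")
# Destination: the installed graphrag package inside the graphR conda env.
dst = Path.home() / ".conda/envs/graphR/lib/python3.11/site-packages/graphrag/logging"
shutil.copytree(src, dst, dirs_exist_ok=True)
print("copied", src, "->", dst)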
References:
1.https://blog.csdn.net/weixin_42107217/article/details/141649920
2.https://blog.csdn.net/gaotianhao123/article/details/140640415
3.https://blog.csdn.net/m0_56378800/article/details/140319467
