ToB企服应用市场:ToB评测及商务社交产业平台
Title:
Parsing PDFs with pymupdf
Author:
曂沅仴駦
Time:
2024-10-9 07:41
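When processing documents with a large language model, binary-format documents first have to be parsed to extract their text and images. This post uses the open-source pymupdf library to parse PDFs and extract their text and images.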
Installing dependencies
First, install pymupdf and the other dependencies:
pymupdf4llm==0.0.17
pymupdf==1.24.10
apscheduler==3.10.4
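The pins above can be saved to a requirements file and installed with pip. As a quick sanity check afterwards, here is a minimal sketch that compares the installed versions against those pins using only the standard library. The `PINS` dictionary and the `check_pins` helper are illustrative names, not part of the original post.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list above.
PINS = {
    "pymupdf4llm": "0.0.17",
    "pymupdf": "1.24.10",
    "apscheduler": "3.10.4",
}


def check_pins(pins):
    """Return {package: status}; status is 'ok', 'missing', or 'found <version>'."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"found {found}"
    return report


if __name__ == "__main__":
    for name, status in check_pins(PINS).items():
        print(f"{name}=={PINS[name]}: {status}")
```

A status other than "ok" does not necessarily mean the code below will fail; it only shows that the environment differs from the one the post was written against.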
PDF to Markdown
pymupdf4llm (built on pymupdf) converts a PDF to Markdown; pass in the path of the PDF to process.
import pymupdf4llm

def convert_to_md(pdf_path):
    return pymupdf4llm.to_markdown(pdf_path)  # whole document as one Markdown string
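If the Markdown needs to be kept on disk for a later LLM step, the conversion can be wrapped in a small helper. The sketch below is an assumption about how one might do that: `save_markdown`, the `markdown` output directory, and the `converter` parameter are hypothetical names introduced here. The converter defaults to `pymupdf4llm.to_markdown`, and is injectable so that the file handling can be tried without a real PDF.

```python
from pathlib import Path


def save_markdown(pdf_path, out_dir="markdown", converter=None):
    """Convert pdf_path to Markdown and write it to out_dir/<pdf stem>.md."""
    if converter is None:
        # Imported lazily so the helper can be exercised without pymupdf4llm installed.
        import pymupdf4llm
        converter = pymupdf4llm.to_markdown
    pdf_path = Path(pdf_path)
    out_path = Path(out_dir) / f"{pdf_path.stem}.md"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(converter(str(pdf_path)), encoding="utf-8")
    return out_path
```

Calling `save_markdown("report.pdf")` would then write `markdown/report.md` and return its path.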
Extracting images from a PDF
Pass in the PDF path; the function saves the images to disk and returns a list of their paths.
import random
import string
from pathlib import Path

import pymupdf

def extract_image(pdf_path):
    doc = pymupdf.open(pdf_path)  # open a document
    image_path_list = []
    # random 5-letter prefix so each call writes into its own directory
    rad_prefix = ''.join(random.choices(string.ascii_letters, k=5))
    Path(f"images/{rad_prefix}").mkdir(parents=True)
    for page_index in range(len(doc)):  # iterate over pdf pages
        page = doc[page_index]  # get the page
        image_list = page.get_images()
        # print the number of images found on the page
        if image_list:
            print(f"Found {len(image_list)} images on page {page_index}")
        else:
            print("No images found on page", page_index)
        for image_index, img in enumerate(image_list, start=1):  # enumerate the image list
            xref = img[0]  # get the XREF of the image
            pix = pymupdf.Pixmap(doc, xref)  # create a Pixmap
            if pix.n - pix.alpha > 3:  # CMYK: convert to RGB first
                pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
            image_path = Path("images/%s/page_%s-image_%s.png" % (rad_prefix, page_index, image_index)).absolute()
            image_path_list.append(image_path)
            pix.pil_save(image_path)  # save the image as png (requires Pillow)
            pix = None  # release the Pixmap
    return image_path_list
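The `pix.n - pix.alpha > 3` test in the listing is worth spelling out. `n` is the number of components per pixel including any alpha channel, so subtracting `alpha` leaves the number of color components: 1 for gray, 3 for RGB, 4 for CMYK. PNG cannot store CMYK, which is why such pixmaps are converted to RGB first. The sketch below restates that rule as a plain function; `needs_rgb_conversion` is an illustrative name and does not touch pymupdf itself.

```python
def needs_rgb_conversion(n, alpha):
    """n: components per pixel including alpha; alpha: 1/True if an alpha channel is present."""
    color_components = n - alpha
    return color_components > 3


# (n, alpha) pairs for common pixel layouts
samples = {
    "gray": (1, 0),
    "rgb": (3, 0),
    "rgba": (4, 1),
    "cmyk": (4, 0),
    "cmyk+alpha": (5, 1),
}

for name, (n, alpha) in samples.items():
    print(f"{name}: convert to RGB = {needs_rgb_conversion(n, alpha)}")
```

Only the two CMYK layouts report that a conversion is needed; RGBA also has four components per pixel, but one of them is alpha.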
Visualization
Gradio provides a simple UI for the PDF conversion.
import gradio as gr
# imported in the original listing; not used in this snippet
from apscheduler.schedulers.background import BackgroundScheduler

import pdf  # local module pdf.py containing convert_to_md and extract_image

def convert_pdf(x):
    print(x)  # log the uploaded file
    md_content = pdf.convert_to_md(x)
    image_list = pdf.extract_image(x)
    return [md_content, image_list]

with gr.Blocks() as demo:
    gr.Markdown(value="# PDF Extraction")
    with gr.Tab("Markdown"):
        md = gr.Markdown(height=500)
    with gr.Tab("Images"):
        gallery = gr.Gallery(columns=4, object_fit="none", height="auto", show_download_button=True)
    with gr.Row():
        go_btn = gr.UploadButton()
    # on upload, fill the Markdown tab and the image gallery
    go_btn.upload(convert_pdf, go_btn, [md, gallery])

demo.launch()
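`gr.Gallery` also accepts a list of `(image, caption)` tuples, which makes it easier to see which page an image came from. The sketch below derives captions from the file names that `extract_image` writes (`page_<index>-image_<n>.png`). The `with_captions` helper and the caption wording are assumptions added for illustration, not part of the original app.

```python
import re
from pathlib import Path

# Matches the file names written by extract_image, e.g. "page_0-image_1.png".
_NAME = re.compile(r"page_(\d+)-image_(\d+)\.png$")


def with_captions(image_paths):
    """Turn image paths into (path, caption) tuples for a gallery."""
    items = []
    for path in image_paths:
        match = _NAME.search(Path(path).name)
        if match:
            page_index, image_index = int(match.group(1)), int(match.group(2))
            # page_index is 0-based in the file name; show it 1-based
            caption = f"Page {page_index + 1}, image {image_index}"
        else:
            caption = Path(path).name  # fall back to the file name
        items.append((str(path), caption))
    return items
```

To use it, the upload callback would return `with_captions(image_list)` instead of `image_list` as its second value.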
Summary
pymupdf converts PDFs to Markdown quite well. You can try it out in the Modelscope studio: https://modelscope.cn/studios/model1001/pdf_converter