Batch Converting Word Files to HTML: A Python Implementation Guide

立山 · the day before yesterday, 20:31

Overview

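In day-to-day work, we may need to convert a large number of Word documents (.docx) to HTML files, for example to display their content on a web page or to offer online reading of documents. This post shares a practical tool written in Python that batch-converts all Word files in a folder to HTML while preserving document styles such as paragraph indentation, bold, and italics.

Features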

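Converts a single Word file to HTML.
Batch-processes the Word files in a folder.
Preserves paragraph styles (such as paragraph indentation, first-line indentation, and left/right margins).
Supports the bold, italic, and underline text styles.
Converts the contents of tables in Word documents (cell text only).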
Implementation Code

The complete implementation is shown below. It depends on the python-docx package:

import os
from html import escape

from docx import Document  # provided by the python-docx package


def docx_to_html(docx_path):
    """Convert a single Word file to HTML, keeping paragraphs, indents, and basic text styles."""
    document = Document(docx_path)
    html_content = "<html><head><meta charset='utf-8'></head><body>"

    for paragraph in document.paragraphs:
        # Read the paragraph's indentation settings (None when not set on the paragraph)
        left_indent = paragraph.paragraph_format.left_indent
        right_indent = paragraph.paragraph_format.right_indent
        first_line_indent = paragraph.paragraph_format.first_line_indent

        # Convert the indents to inline CSS (1 pt is roughly 1.33 px)
        styles = []
        if left_indent:
            styles.append(f"margin-left: {int(left_indent.pt * 1.33)}px;")
        if right_indent:
            styles.append(f"margin-right: {int(right_indent.pt * 1.33)}px;")
        if first_line_indent:
            styles.append(f"text-indent: {int(first_line_indent.pt * 1.33)}px;")
        style_attribute = f" style='{' '.join(styles)}'" if styles else ""

        # Convert bold, italic, and underline runs to HTML tags
        content = ""
        for run in paragraph.runs:
            run_text = escape(run.text)
            if run.bold:
                run_text = f"<b>{run_text}</b>"
            if run.italic:
                run_text = f"<i>{run_text}</i>"
            if run.underline:
                run_text = f"<u>{run_text}</u>"
            content += run_text

        # Wrap the result in a paragraph element
        html_content += f"<p{style_attribute}>{content}</p>"

    # Handle tables. Note: they are emitted after all paragraphs, so their
    # original position in the document is not preserved.
    for table in document.tables:
        html_content += "<table border='1' style='border-collapse: collapse; width: 100%;'>"
        for row in table.rows:
            html_content += "<tr>"
            for cell in row.cells:
                html_content += f"<td>{escape(cell.text)}</td>"
            html_content += "</tr>"
        html_content += "</table>"

    html_content += "</body></html>"
    return html_content


def batch_convert_to_html(input_folder, output_folder):
    """Convert every .docx document in input_folder to an HTML file in output_folder."""
    os.makedirs(output_folder, exist_ok=True)

    for filename in os.listdir(input_folder):
        # Accept .docx in any letter case; skip Word's "~$" lock files
        if not filename.lower().endswith(".docx") or filename.startswith("~$"):
            continue
        input_path = os.path.join(input_folder, filename)
        output_path = os.path.join(output_folder, f"{os.path.splitext(filename)[0]}.html")
        try:
            html_content = docx_to_html(input_path)
            with open(output_path, "w", encoding="utf-8") as html_file:
                html_file.write(html_content)
            print(f"Converted: {filename} -> {output_path}")
        except Exception as e:
            print(f"Conversion failed: {filename}, error: {e}")


if __name__ == "__main__":
    # Set the input and output folder paths
    input_folder = r"input_path"  # replace with the folder that holds the Word documents
    output_folder = r"output_path"  # replace with the folder for the HTML files
    # Run the batch conversion
    batch_convert_to_html(input_folder, output_folder)


Code Walkthrough

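1. The docx_to_html function

Purpose: converts a single Word file to HTML.
Paragraph styles:
– Uses paragraph_format to read each paragraph's left indent, right indent, and first-line indent.
– Converts them to inline HTML styles (points to pixels, using a factor of 1.33).
Text styles:
– Reads the bold, italic, and underline settings of each run object and generates the corresponding HTML tags.
Tables:
– Converts Word tables to bordered HTML tables.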
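
The paragraph-style and text-style mappings described above can be pulled out into small pure functions, which makes them easy to check without a Word file at hand. This is only a minimal sketch, not the post's own code: pt_to_px, indent_style, and wrap_run are hypothetical helper names, and the 96/72 factor assumes the common CSS convention of 96 pixels per inch (the script approximates it as 1.33).

```python
from html import escape


def pt_to_px(points):
    """Convert points to whole CSS pixels, truncating as the script does.

    Assumption: 96 CSS pixels per inch and 72 points per inch.
    """
    return int(points * 96 / 72)


def indent_style(left_pt=None, right_pt=None, first_line_pt=None):
    """Build an inline style attribute from indents given in points.

    Values of None or 0 are skipped; an empty string is returned when
    no indent is set.
    """
    styles = []
    if left_pt:
        styles.append(f"margin-left: {pt_to_px(left_pt)}px;")
    if right_pt:
        styles.append(f"margin-right: {pt_to_px(right_pt)}px;")
    if first_line_pt:
        styles.append(f"text-indent: {pt_to_px(first_line_pt)}px;")
    if not styles:
        return ""
    joined = " ".join(styles)
    return f" style='{joined}'"


def wrap_run(text, bold=False, italic=False, underline=False):
    """Escape the text, then wrap it in <b>, <i>, <u> in the script's order."""
    html = escape(text)
    if bold:
        html = f"<b>{html}</b>"
    if italic:
        html = f"<i>{html}</i>"
    if underline:
        html = f"<u>{html}</u>"
    return html
```

Escaping happens before the tags are added, so characters such as < in the document text cannot be mistaken for markup, while the generated tags stay intact.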

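2. The batch_convert_to_html function

Purpose: batch-processes the Word files in a folder.
Automatically creates the output folder if it does not exist.
Iterates over the .docx files in the input folder and calls docx_to_html on each one.
Saves each generated HTML file to the output folder.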
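
The same loop can be sketched with pathlib, taking the converter as a parameter so the folder handling can be tried without python-docx installed. This is an illustrative variant under stated assumptions, not the original function: batch_convert and its convert parameter are hypothetical names, and skipping files that start with "~$" (the temporary lock files Word creates next to open documents) is an added safeguard.

```python
from pathlib import Path


def batch_convert(input_folder, output_folder, convert):
    """Convert every .docx in input_folder; return (converted, failed) name lists.

    `convert` is any callable that takes a path and returns an HTML string,
    for example the docx_to_html function from the post.
    """
    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the output folder if missing

    converted, failed = [], []
    for src in sorted(Path(input_folder).iterdir()):
        # Skip anything that is not a .docx file, plus Word's "~$" lock files.
        if not src.is_file() or src.suffix.lower() != ".docx" or src.name.startswith("~$"):
            continue
        try:
            html = convert(src)
            (out_dir / f"{src.stem}.html").write_text(html, encoding="utf-8")
            converted.append(src.name)
        except Exception as exc:  # one bad file should not stop the batch
            print(f"Conversion failed: {src.name}, error: {exc}")
            failed.append(src.name)
    return converted, failed
```

Returning the two lists lets the caller print a summary or retry the failed files, instead of relying only on the printed messages.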

3. Main program

Sets the input and output folder paths.
Calls batch_convert_to_html to run the batch conversion.
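
As an alternative to editing the hard-coded paths, the two folders could be read from the command line. The sketch below uses the standard argparse module; parse_args, the argument names, and the "html_output" default are assumptions made for illustration and do not appear in the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse the input and output folder paths from the command line."""
    parser = argparse.ArgumentParser(
        description="Batch convert the .docx files in a folder to HTML."
    )
    parser.add_argument("input_folder", help="folder containing the Word documents")
    parser.add_argument(
        "output_folder",
        nargs="?",
        default="html_output",  # hypothetical default, not from the original post
        help="folder for the generated HTML files",
    )
    return parser.parse_args(argv)
```

In the main program, the two path assignments would then become args = parse_args(), followed by batch_convert_to_html(args.input_folder, args.output_folder).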
