Read PDF and docx files in two lines of code

Category: IT Technology · Published: 4 years ago

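While running some course code recently, I found that the function for reading PDF files no longer worked. I tracked down code that does run and, to make it easier to reuse later, wrapped the PDF and docx reading methods into a package called pdfdocx.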

pdfdocx

The package has just two simple read functions:

read_pdf(file)
read_docx(file)

file is the path to the file. Each function returns the text content of that file.

Installation

pip install pdfdocx

Usage

Reading a PDF file

from pdfdocx import read_pdf
p_text = read_pdf('test/data.pdf')
print(p_text)

Run

This is the content from the pdf file

Reading a docx file

from pdfdocx import read_docx
d_text = read_docx('test/data.docx')
print(d_text)

Run

This is the content from the docx file
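
Because both functions take a path and return a string, a small wrapper can choose between them based on the file extension. The following is only a sketch and not part of pdfdocx: read_any and the readers mapping are hypothetical names introduced here for illustration.

```python
from pathlib import Path


def read_any(file, readers):
    """Call the reader registered for the file's extension.

    readers maps a lowercase suffix such as '.pdf' to a function
    that takes a path and returns text (hypothetical convention).
    """
    suffix = Path(file).suffix.lower()
    if suffix not in readers:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return readers[suffix](file)


# Assumed usage once pdfdocx is installed:
# from pdfdocx import read_pdf, read_docx
# text = read_any('test/data.pdf', {'.pdf': read_pdf, '.docx': read_docx})
```

Passing the readers in as a dictionary keeps the wrapper independent of how read_pdf and read_docx are obtained, so it works the same with the packaged functions or with local copies of them.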

Inside pdfdocx

Hopefully the installation goes smoothly. If installing or using the package fails, the code below can serve as a fallback.

Reading PDF

from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser


def read_pdf(file):
    """
    Read a PDF file and return its text content.
    :param file: path to the PDF file
    :return: the text content of the PDF
    """
    output_string = StringIO()
    with open(file, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    text = output_string.getvalue()
    return text
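
read_pdf returns the whole document as a single string. To my knowledge, pdfminer's TextConverter writes a form feed character ('\f') at the end of each page, so, assuming that holds for the installed version, the text can be split back into pages afterwards. split_pages below is a hypothetical helper for illustration, not something pdfdocx provides.

```python
import re


def split_pages(text):
    """Split extracted PDF text into a list of per-page strings.

    Assumes pages are separated by form feed characters ('\f').
    Pages that are empty after stripping are dropped, so list
    positions may not match the original page numbers.
    """
    pages = text.split("\f")
    # Collapse runs of three or more newlines into one blank line.
    cleaned = [re.sub(r"\n{3,}", "\n\n", page).strip() for page in pages]
    return [page for page in cleaned if page]


# Assumed usage:
# pages = split_pages(read_pdf('test/data.pdf'))
```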

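Reading docx

import docx  # provided by the python-docx package


def read_docx(file):
    """
    Read a docx file and return its text content.
    :param file: path to the docx file
    :return: the text content of the docx file
    """
    text = ''
    doc = docx.Document(file)
    # Note: paragraphs are concatenated with no separator between them.
    for para in doc.paragraphs:
        text += para.text
    return text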
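
If python-docx cannot be installed either, the text can also be pulled out with the standard library alone, since a docx file is a zip archive whose main body lives in word/document.xml. The sketch below is an assumption-laden alternative, not the code used by pdfdocx: read_docx_stdlib is a hypothetical name, it joins paragraphs with newlines, and unlike doc.paragraphs it also picks up paragraphs nested inside tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_stdlib(file):
    """Return the text of a docx file, one line per paragraph."""
    with zipfile.ZipFile(file) as zf:
        xml_bytes = zf.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text sits in <w:t> nodes.
    for para in root.iter(f"{{{W_NS}}}p"):
        pieces = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)


# Assumed usage:
# d_text = read_docx_stdlib('test/data.docx')
```

This ignores headers, footers, footnotes and similar parts stored in other files inside the archive, so it is best treated as a rough fallback.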
