Parsing PDFs in Python with PDFMiner: A Code Example

2020-02-23 04:28:56

Contributed by: a site user

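While writing crawlers recently, I sometimes ran into sites that publish their content only as PDF. In that case Scrapy cannot scrape the page content directly, and the only option is to parse the PDF. The main choices at the moment are roughly pyPDF and PDFMiner. PDFMiner is said to be better suited to text extraction, and text is exactly what I needed, so I went with PDFMiner (which also means I know nothing about pyPDF).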

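First, a warning: parsing PDF is a real pain. Even PDFMiner does poorly on PDFs that are not cleanly formatted, which is why even PDFMiner's developer complains that "PDF is evil". None of that matters much here, though. The official documentation is at: http://www.unixuser.org/~euske/python/pdfminer/index.html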

I. Installation:

1. Download the source package from http://pypi.python.org/pypi/pdfminer/, unpack it, and install it from the command line: python setup.py install

2. After installation, test it with this command: pdf2txt.py samples/simple1.pdf. If it prints the following, the installation succeeded:

Hello World Hello World H e l l o W o r l d H e l l o W o r l d

3. To work with Chinese, Japanese, or Korean (CJK) text, build the CMap data first and then install:

# make cmap
python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt
reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
writing 'CNS1_H.py'...
...
(this may take several minutes)
# python setup.py install
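The command-line check in step 2 can also be run from Python. This is only a minimal sketch, assuming the maintained pdfminer.six fork is installed: pdfminer.high_level belongs to that fork and may be missing from older PDFMiner releases such as the one installed from source above. The function names pdfminer_installed and smoke_test are illustrative, not part of the library.

```python
import importlib.util


def pdfminer_installed():
    """Return True if a package named 'pdfminer' can be found for import."""
    return importlib.util.find_spec("pdfminer") is not None


def smoke_test(path="samples/simple1.pdf"):
    """Return the text of the sample PDF (assumes pdfminer.six is installed)."""
    # Imported here so that pdfminer_installed() works even without the package.
    from pdfminer.high_level import extract_text

    return extract_text(path)
```

If pdfminer_installed() returns True, calling smoke_test() from the unpacked source directory should return the same text that pdf2txt.py prints for the sample file.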

II. Usage

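Because parsing a PDF costs a lot of time and memory, PDFMiner uses a strategy called lazy parsing: it parses content only when it is needed, which reduces both. Parsing a PDF requires at least two classes, PDFParser and PDFDocument: PDFParser extracts data from the file and PDFDocument stores it. You also need PDFPageInterpreter to process page contents and PDFDevice to convert them into whatever form you need. PDFResourceManager holds shared resources such as fonts and images.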

Figure 1. Relationships between PDFMiner classes
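The way these classes fit together can be sketched as follows. This is a hedged example rather than the definitive usage: it assumes the API of the pdfminer.six fork, which differs from older PDFMiner releases (those required extra set-up calls on the document), and it picks PDFPageAggregator as the concrete PDFDevice. The function name iter_page_layouts is made up for illustration.

```python
def iter_page_layouts(path, password=""):
    """Yield one LTPage layout object per page of the PDF at `path`."""
    # Imported lazily so this module still loads where pdfminer.six is absent.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    with open(path, "rb") as fp:
        parser = PDFParser(fp)                    # extracts data from the file
        document = PDFDocument(parser, password)  # stores the parsed data
        rsrcmgr = PDFResourceManager()            # shared fonts, images, ...
        # PDFPageAggregator is one concrete PDFDevice: it collects each page
        # into an LTPage layout tree.
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)        # page content is interpreted here
            yield device.get_result()


# Example use (sample path taken from the installation test):
# for layout in iter_page_layouts("samples/simple1.pdf"):
#     print(layout)
```

Writing it as a generator keeps the file open only while pages are being consumed and processes each page on demand, which fits the lazy parsing approach described above.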

The most important piece is the layout, which mainly consists of the following components:

LTPage

Represents an entire page. May contain child objects like LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine.

LTTextBox

Represents a group of text chunks that can be contained in a rectangular area. Note that this box is created by geometric analysis and does not necessarily represent a logical boundary of the text. It contains a list of LTTextLine objects. The get_text() method returns the text content.

LTTextLine

Contains a list of LTChar objects that represent a single text line. The characters are aligned either horizontally or vertically, depending on the text's writing mode. The get_text() method returns the text content.

LTChar

LTAnno

Represent an actual letter in the text as a Unicode string. Note that, while an LTChar object has actual boundaries, LTAnno objects do not, as these are "virtual" characters inserted by the layout analyzer according to the relationship between two characters (e.g. a space).
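The nesting of these layout objects (LTPage, then LTTextBox, then LTTextLine, then LTChar or LTAnno) can be explored with a small recursive walk. This is a sketch under assumptions: walk relies only on duck typing (container objects are iterable, text objects expose get_text()), and dump_layout assumes the extract_pages helper from the pdfminer.six fork. Both function names are hypothetical.

```python
from collections.abc import Iterable


def walk(obj, depth=0):
    """Yield (depth, class name, text or None) for obj and its descendants."""
    text = obj.get_text() if hasattr(obj, "get_text") else None
    yield depth, type(obj).__name__, text
    # Container objects iterate over their children; LTChar and LTAnno are leaves.
    if isinstance(obj, Iterable):
        for child in obj:
            yield from walk(child, depth + 1)


def dump_layout(path):
    """Print the layout tree of every page (assumes pdfminer.six is installed)."""
    from pdfminer.high_level import extract_pages  # lazy import of the fork's API

    for page in extract_pages(path):
        for depth, name, text in walk(page):
            print("  " * depth + name, repr(text) if text is not None else "")
```

Calling dump_layout("samples/simple1.pdf") prints one indented line per layout object, which makes it easy to see which boxes the geometric analysis produced.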
