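Recently, while writing crawlers, I have sometimes run into sites that only provide PDFs. In that case Scrapy cannot scrape the page content directly, and the only option is to parse the PDF itself. The available solutions are roughly just pyPDF and PDFMiner. PDFMiner is said to be better suited to text parsing, and text is exactly what I need to parse, so I ended up choosing PDFMiner (which also means I know nothing about pyPDF).
First, a warning: parsing PDFs is a real pain. Even PDFMiner does poorly on PDFs with irregular formatting, so much so that PDFMiner's own developer complains that "PDF is evil". But none of that matters much here. The official documentation is at: http://www.unixuser.org/~euske/python/pdfminer/index.html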
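I. Installation:
1. First download the source package from http://pypi.python.org/pypi/pdfminer/, unpack it, then install from the command line: python setup.py install
2. After installation, test it with this command: pdf2txt.py samples/simple1.pdf. If it prints the following, the installation succeeded: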
Hello World Hello World H e l l o W o r l d H e l l o W o r l d
3. To use Chinese, Japanese, or Korean text, you need to build first and then install:
# make cmap
python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt
reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
writing 'CNS1_H.py'...
...
(this may take several minutes)
# python setup.py install
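II. Usage
Because parsing a PDF is very expensive in time and memory, PDFMiner uses a strategy called lazy parsing: it parses only when needed, to reduce time and memory use. Parsing a PDF requires at least two classes, PDFParser and PDFDocument. PDFParser extracts data from the file, and PDFDocument stores it. In addition, PDFPageInterpreter is needed to process page contents, and PDFDevice converts them into whatever output we need. PDFResourceManager stores shared resources such as fonts or images.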
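The way these classes fit together can be sketched as follows. This is a minimal sketch, not the only way to wire them up. It assumes the API of the maintained pdfminer.six package (PDFPage.create_pages, PDFPageAggregator, LAParams), which differs in places from the older release installed above; the function name iter_page_layouts is hypothetical.

```python
def iter_page_layouts(path):
    """Yield one layout object (an LTPage) per page of the PDF at `path`.

    Assumes the pdfminer.six API; older PDFMiner releases wire the
    parser and the document together differently.
    """
    # Imported here so that merely defining this function does not
    # require pdfminer to be installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    with open(path, "rb") as fp:
        parser = PDFParser(fp)            # pulls raw data out of the file
        document = PDFDocument(parser)    # holds the parsed document structure
        rsrcmgr = PDFResourceManager()    # shared resources such as fonts
        # PDFPageAggregator is a PDFDevice that collects layout objects.
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)  # pages are processed one at a time
            yield device.get_result()       # layout of the page just processed
```

Because the function is a generator, each page is interpreted only when the caller asks for it, which matches the lazy approach described above. A call such as `for layout in iter_page_layouts("samples/simple1.pdf"): ...` would visit the pages in order.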
Figure 1. Relationships between PDFMiner classes
The most important part is the layout, which mainly consists of the following components:
LTPage
Represents an entire page. May contain child objects like LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine.
LTTextBox
Represents a group of text chunks that can be contained in a rectangular area. Note that this box is created by geometric analysis and does not necessarily represent a logical boundary of the text. It contains a list of LTTextLine objects. get_text() method returns the text content.
LTTextLine
Contains a list of LTChar objects that represent a single text line. The characters are aligned either horizontally or vertically, depending on the text's writing mode. get_text() method returns the text content.
LTChar
LTAnno
Represent an actual letter in the text as a Unicode string. Note that, while an LTChar object has actual boundaries, LTAnno objects do not, as these are "virtual" characters, inserted by a layout analyzer according to the relationship between two characters (e.g. a space).
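To see this hierarchy for a real page, the tree can be walked recursively. The sketch below is deliberately duck-typed and does not import PDFMiner at all. It assumes only two things about layout objects: container objects (LTPage, LTTextBox, LTTextLine) can be iterated to reach their children, and text-bearing objects expose get_text(). The names walk_layout and dump_layout are hypothetical helpers, not part of PDFMiner.

```python
from collections.abc import Iterable


def walk_layout(obj, depth=0):
    """Yield (depth, class name, text or None) for obj and all descendants."""
    get_text = getattr(obj, "get_text", None)
    # Objects without get_text() (e.g. a page or a rectangle) report None.
    text = get_text() if callable(get_text) else None
    yield depth, type(obj).__name__, text
    # Assumption: containers are iterable, leaf objects are not.
    if isinstance(obj, Iterable):
        for child in obj:
            yield from walk_layout(child, depth + 1)


def dump_layout(obj):
    """Return the layout tree as indented text, one object per line."""
    lines = []
    for depth, name, text in walk_layout(obj):
        suffix = "" if text is None else " " + repr(text)
        lines.append("  " * depth + name + suffix)
    return "\n".join(lines)
```

Given an LTPage object `page`, `print(dump_layout(page))` would list each text box, the lines inside it, and the individual characters, which makes it easy to check how the geometric analysis grouped the text.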