Python Web Scraping: A Detailed Guide to Basic XPath Usage

2020-02-22 23:42:48

Source: reposted

Contributed by: site user

I. Introduction

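XPath is a language for locating information in XML documents. It can be used to traverse the elements and attributes of an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.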
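As a small illustration of traversing elements and attributes, the sketch below runs two XPath expressions against a tiny XML document. The document and its tag names (`library`, `book`, `title`) are invented for this example, and it assumes lxml is already installed (installation is covered in the next section).

```python
from lxml import etree

# Hypothetical XML document, invented for this illustration.
xml_doc = b"""<library>
  <book id="b1"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
</library>"""

root = etree.fromstring(xml_doc)

# Element traversal: the text of every <title> inside a <book>.
titles = root.xpath('/library/book/title/text()')

# Attribute traversal: the value of every id attribute on a <book>.
ids = root.xpath('/library/book/@id')

print(titles)
print(ids)
```

A path step such as `title` selects elements, while `@id` selects an attribute and `text()` selects text nodes.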

II. Installation

pip3 install lxml

III. Usage

1. Import

from lxml import etree

2. Basic usage

from lxml import etree

wb_data = """
    <div>
      <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
       </ul>
     </div>
    """
html = etree.HTML(wb_data)
print(html)
result = etree.tostring(html)
print(result.decode("utf-8"))

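As the output below shows, the html we print is really just a Python object (an Element), while etree.tostring(html) serializes it as complete HTML, filling in the tags that were missing: the unclosed li tag is closed and the html and body wrappers are added.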

<Element html at 0x39e58f0>
<html><body><div>
      <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
       </li></ul>
     </div>
    </body></html>
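The repair step can also be checked in isolation. The following is a minimal sketch, assuming a made-up one-line fragment (`broken`) with an unclosed li tag and no html or body wrapper; it is not part of the original example.

```python
from lxml import etree

# Hypothetical fragment: the <li> is never closed and there is no <html>/<body>.
broken = '<ul><li class="item-0"><a href="link5.html">fifth item</a></ul>'

root = etree.HTML(broken)

# etree.HTML returns the root <html> element of the repaired tree.
print(root.tag)

# Serializing the tree shows the tags that the parser added.
fixed = etree.tostring(root).decode("utf-8")
print(fixed)
```

Because the tree is always wrapped in html and body, absolute XPath expressions against it need to start from /html/body, even when the input was only a fragment.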

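3. Getting the content of a tag (basic usage). Note: to get the content of all a tags, do not add a forward slash after the a, otherwise an error is raised.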

Approach 1

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a')
print(html)
for i in html_data:
    print(i.text)

Output:

<Element html at 0x12fe4b8>
first item
second item
third item
fourth item
fifth item
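The note about the trailing slash can be verified with a short sketch. It uses a reduced, made-up document rather than wb_data, and it catches `etree.XPathError`, the base class of lxml's XPath exceptions, so the exact error message is not assumed.

```python
from lxml import etree

# Reduced, hypothetical document with a single link.
html = etree.HTML('<ul><li><a href="link1.html">first item</a></li></ul>')

# Without a trailing slash the expression is valid and selects the <a> elements.
links = html.xpath('/html/body/ul/li/a')
print([a.text for a in links])

# With a trailing slash the expression is malformed, so lxml raises an error.
try:
    html.xpath('/html/body/ul/li/a/')
    failed = False
except etree.XPathError as exc:
    failed = True
    print("XPath error:", exc)
```

A slash in XPath must be followed by another step, which is why the path has to end at the a itself.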
