Getting Started with the Python Scraping Framework Scrapy: Page Extraction

2020-02-16 10:54:39

Source: reposted

Contributed by: site user

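Preface

Scrapy is an excellent scraping framework: it ships with ready-to-use basic components and can also be customized extensively to fit your own needs. This article covers page extraction with Scrapy and is shared here for reference and study.

Before starting, for an introduction to the Scrapy framework you can refer to this article: //www.jb51.net/article/87820.htm

Next we create a crawler project that scrapes images, using Tuchong (tuchong.com) as the example.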

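1. Content Analysis

Open Tuchong. In the top menu, under "Discover" and then "Tags", images are grouped into categories. Clicking a tag, for example "美女", leads to https://tuchong.com/tags/美女/. We use this URL as the crawler's entry point and analyze the page:

The page shows a series of galleries. Clicking a gallery opens its images full screen, and scrolling down loads more galleries; there is no numbered pagination. In Chrome, right-click and choose "Inspect" to open the developer tools and examine the page source. The content area looks like this: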

<div class="content"> <div class="widget-gallery"> <ul class="pagelist-wrapper">  <li class="gallery-item...

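From this we can tell that each li.gallery-item is the entry to one gallery, stored under ul.pagelist-wrapper, with div.widget-gallery as the container. Selected with XPath, this would be //div[@class="widget-gallery"]/ul/li. Following the usual page logic, we would find the link under each li.gallery-item and then go one level deeper to scrape the images.

However, if you request the page with an HTTP debugging tool such as Postman, the returned content is: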
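The XPath selection described above can be tried out on a small stand-in document. This is only a sketch: the markup below is a simplified, hypothetical version of the rendered page (the `<a>` elements, their `href` values, and the helper name `gallery_links` are made up for illustration), and it uses the standard library's ElementTree rather than Scrapy so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the markup rendered in the browser.
# The <a> elements and their href values are invented for illustration only.
RENDERED_HTML = """
<div class="content">
  <div class="widget-gallery">
    <ul class="pagelist-wrapper">
      <li class="gallery-item"><a href="https://example.com/gallery/1/">Gallery 1</a></li>
      <li class="gallery-item"><a href="https://example.com/gallery/2/">Gallery 2</a></li>
    </ul>
  </div>
</div>
"""


def gallery_links(markup):
    """Return the href of the first link inside each gallery list item."""
    root = ET.fromstring(markup)
    # ElementTree supports a subset of XPath; this mirrors
    # //div[@class="widget-gallery"]/ul/li from the text.
    items = root.findall(".//div[@class='widget-gallery']/ul/li")
    links = []
    for item in items:
        anchor = item.find("a")
        if anchor is not None and anchor.get("href"):
            links.append(anchor.get("href"))
    return links


links = gallery_links(RENDERED_HTML)
print(links)
```

Inside a Scrapy callback the same expression would normally be passed to `response.xpath(...)`. ElementTree is used here only because the toy markup is well-formed XML; real pages usually are not.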

<div class="content"> <div class="widget-gallery"></div></div>

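In other words, the response contains no actual gallery content. We can therefore conclude that the page uses Ajax: the gallery content is requested and inserted into div.widget-gallery only when the page is loaded in a browser. The developer tools show that the XHR request URL is: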

https://tuchong.com/rest/tags/美女/posts?page=1&count=20&order=weekly&before_timestamp=

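The parameters are simple: page is the page number, count is the number of galleries per page, order is the sort order, and before_timestamp is empty. Because Tuchong is a feed-style site that pushes content, before_timestamp is presumably a time value, with different times returning different content. Here we leave it empty and simply crawl from the newest page backward, ignoring time.

The response is JSON, which makes scraping easier. It looks like this: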
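As a minimal sketch, the request URL for any tag and page can be assembled from these parameters with the standard library. The function name `build_posts_url` and its default values are assumptions chosen to match the example URL above; they are not part of any Tuchong or Scrapy API.

```python
from urllib.parse import quote, urlencode


def build_posts_url(tag, page=1, count=20, order="weekly"):
    """Build the XHR URL for one page of galleries under a tag.

    before_timestamp is deliberately left empty, as in the text.
    """
    query = urlencode({
        "page": page,
        "count": count,
        "order": order,
        "before_timestamp": "",
    })
    # quote() percent-encodes the non-ASCII tag name for use in the URL path.
    return f"https://tuchong.com/rest/tags/{quote(tag)}/posts?{query}"


first_page = build_posts_url("美女")
second_page = build_posts_url("美女", page=2)
print(first_page)
print(second_page)
```

Incrementing page while keeping the other parameters fixed gives the sequence of URLs to crawl.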

{
  "postList": [
    {
      "post_id": "15624611",
      "type": "multi-photo",
      "url": "https://weishexi.tuchong.com/15624611/",
      "site_id": "443122",
      "author_id": "443122",
      "published_at": "2017-10-28 18:01:03",
      "excerpt": "10月18日",
      "favorites": 4052,
      "comments": 353,
      "rewardable": true,
      "parent_comments": "165",
      "rewards": "2",
      "views": 52709,
      "title": "微風不燥 秋意正好",
      "image_count": 15,
      "images": [
        {
          "img_id": 11585752,
          "user_id": 443122,
          "title": "",
          "excerpt": "",
          "width": 5016,
          "height": 3840
        },
        {
          "img_id": 11585737,
          "user_id": 443122,
          "title": "",
          "excerpt": "",
          "width": 3840,
          "height": 5760
        },
        ...
      ],
      "title_image": null,
      "tags": [
        {
          "tag_id": 131,
          "type": "subject",
          "tag_name": "人像",
          "event_type": "",
          "vote": ""
        },
        {
          "tag_id": 564,
          "type": "subject",
          "tag_name": "美女",
          "event_type": "",
          "vote": ""
        }
      ],
      "favorite_list_prefix": [],
      "reward_list_prefix": [],
      "comment_list_prefix": [],
      "cover_image_src": "https://photo.tuchong.com/443122/g/11585752.webp",
      "is_favorite": false
    }
  ],
  "siteList": {...},
  "following": false,
  "coverUrl": "https://photo.tuchong.com/443122/ft640/11585752.webp",
  "tag_name": "美女",
  "tag_id": "564",
  "url": "https://tuchong.com/tags/%E7%BE%8E%E5%A5%B3/",
  "more": true,
  "result": "SUCCESS"
}
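A response like the one above could be reduced to the fields a crawler needs along these lines. This is a hedged sketch, not the article's own code: the sample is a trimmed copy of the response shown above, the helper names `image_url` and `extract_galleries` are hypothetical, the image URL pattern is an assumption inferred from the `cover_image_src` value, and treating `more` as "another page exists" is likewise a guess from the field name.

```python
import json

# Trimmed copy of the response shown above (only the fields used here).
SAMPLE_RESPONSE = """
{
  "postList": [
    {
      "post_id": "15624611",
      "url": "https://weishexi.tuchong.com/15624611/",
      "title": "微風不燥 秋意正好",
      "image_count": 15,
      "images": [
        {"img_id": 11585752, "user_id": 443122},
        {"img_id": 11585737, "user_id": 443122}
      ]
    }
  ],
  "more": true,
  "result": "SUCCESS"
}
"""


def image_url(user_id, img_id):
    # ASSUMPTION: pattern inferred from cover_image_src in the sample
    # (https://photo.tuchong.com/443122/g/11585752.webp); not confirmed.
    return f"https://photo.tuchong.com/{user_id}/g/{img_id}.webp"


def extract_galleries(payload):
    """Return (galleries, has_more) parsed from one JSON response body."""
    data = json.loads(payload)
    if data.get("result") != "SUCCESS":
        return [], False
    galleries = []
    for post in data.get("postList", []):
        galleries.append({
            "post_id": post["post_id"],
            "title": post.get("title", ""),
            "url": post["url"],
            "image_urls": [
                image_url(img["user_id"], img["img_id"])
                for img in post.get("images", [])
            ],
        })
    # ASSUMPTION: "more" signals that a further page can be requested.
    return galleries, bool(data.get("more"))


galleries, has_more = extract_galleries(SAMPLE_RESPONSE)
print(galleries[0]["title"], len(galleries[0]["image_urls"]), has_more)
```

In a Scrapy spider, a function like this could be called from the callback with `response.text`, and the returned flag used to decide whether to request the next page.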
