国产探花免费观看_亚洲丰满少妇自慰呻吟_97日韩有码在线_资源在线日韩欧美_一区二区精品毛片,辰东完美世界有声小说,欢乐颂第一季,yy玄幻小说排行榜完本

首頁 > 編程 > Python > 正文

python實現博客文章爬蟲示例

2020-02-23 05:13:07
字體:
來源:轉載
供稿:網友

代碼如下:
#!/usr/bin/python
#-*-coding:utf-8-*-
# JCrawler
# Author: Jam <810441377@qq.com>

import time
import urllib2
from bs4 import BeautifulSoup

# 目標站點
TargetHost = "http://adirectory.blog.com"
# User Agent
UserAgent  = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36'
# 鏈接采集規則
# 目錄鏈接采集規則
CategoryFind    = [{'findMode':'find','findTag':'div','rule':{'id':'cat-nav'}},
                   {'findMode':'findAll','findTag':'a','rule':{}}]
# 文章鏈接采集規則
ArticleListFind = [{'findMode':'find','findTag':'div','rule':{'id':'content'}},
                   {'findMode':'findAll','findTag':'h2','rule':{'class':'title'}},
                   {'findMode':'findAll','findTag':'a','rule':{}}]
# 分頁URL規則
PageUrl  = 'page/#page/'
PageStart = 1
PageStep  = 1
PageStopHtml = '404: Page Not Found'

def GetHtmlText(url):
    request  = urllib2.Request(url)
    request.add_header('Accept', "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp")
    request.add_header('Accept-Encoding', "*")
    request.add_header('User-Agent', UserAgent)
    return urllib2.urlopen(request).read()

def ArrToStr(varArr):
    returnStr = ""
    for s in varArr:
        returnStr += str(s)
    return returnStr


def GetHtmlFind(htmltext, findRule):
    findReturn = BeautifulSoup(htmltext)
    returnText = ""
    for f in findRule:
        if returnText != "":
            findReturn = BeautifulSoup(returnText)
        if f['findMode'] == 'find':
            findReturn = findReturn.find(f['findTag'], f['rule'])
        if f['findMode'] == 'findAll':
            findReturn = findReturn.findAll(f['findTag'], f['rule'])
        returnText = ArrToStr(findReturn)
    return findReturn

def GetCategory():
    categorys = [];
    htmltext = GetHtmlText(TargetHost)
    findReturn = GetHtmlFind(htmltext, CategoryFind)

    for tag in findReturn:
        print "[G]->Category:" + tag.string + "|Url:" + tag['href']

發表評論 共有條評論
用戶名: 密碼:
驗證碼: 匿名發表
主站蜘蛛池模板: 汝南县| 揭西县| 巫溪县| 兰州市| 西华县| 聂拉木县| 新野县| 隆回县| 荥阳市| 临桂县| 虹口区| 博客| 师宗县| 许昌县| 丽江市| 云和县| 三门峡市| 板桥市| 齐齐哈尔市| 阆中市| 稻城县| 邮箱| 盐池县| 香港 | 鞍山市| 绥滨县| 新邵县| 中方县| 永泰县| 德安县| 张家口市| 洪雅县| 无为县| 阳信县| 科技| 洞口县| 达州市| 绥德县| 新源县| 扎囊县| 富平县|