
A Simple Spider Scraping Program Based on Scrapy


This article presents, by example, a simple spider (web-scraping) program implemented with Scrapy, shared here for your reference. The details are as follows:

# Standard Python library imports

# 3rd party imports
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# My imports
from poetry_analysis.items import PoetryAnalysisItem

HTML_FILE_NAME = r'.+\.html'


class PoetryParser(object):
    """
    Provides a common parsing method for poems formatted this one specific way.
    """
    date_pattern = r'(\d{2} \w{3,9} \d{4})'

    def parse_poem(self, response):
        hxs = HtmlXPathSelector(response)
        item = PoetryAnalysisItem()
        # All poetry text is in pre tags
        text = hxs.select('//pre/text()').extract()
        item['text'] = ''.join(text)
        item['url'] = response.url
        # head/title contains "title - a poem by author"
        title_text = hxs.select('//head/title/text()').extract()[0]
        item['title'], item['author'] = title_text.split(' - ')
        item['author'] = item['author'].replace('a poem by', '')
        for key in ['title', 'author']:
            item[key] = item[key].strip()
        # The publication date sits in a <p class="small"> element
        item['date'] = hxs.select("//p[@class='small']/text()").re(self.date_pattern)
        return item


class PoetrySpider(CrawlSpider, PoetryParser):
    name = 'example.com_poetry'
    allowed_domains = ['www.example.com']
    root_path = 'someuser/poetry/'
    start_urls = ['http://www.example.com/someuser/poetry/recent/',
                  'http://www.example.com/someuser/poetry/less_recent/']
    # Follow every .html link under each start URL and parse it as a poem
    rules = [Rule(SgmlLinkExtractor(allow=[start_urls[0] + HTML_FILE_NAME]),
                  callback='parse_poem'),
             Rule(SgmlLinkExtractor(allow=[start_urls[1] + HTML_FILE_NAME]),
                  callback='parse_poem')]
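The spider imports PoetryAnalysisItem from poetry_analysis.items, but that file is not shown in the article. As a minimal sketch, assuming the item only needs the five fields populated in parse_poem, its definition could look like this (field names inferred from the code above, not taken from the original project):

from scrapy.item import Item, Field

class PoetryAnalysisItem(Item):
    # Fields assumed from their use in parse_poem: raw poem text, source URL,
    # title, author, and the extracted publication date
    text = Field()
    url = Field()
    title = Field()
    author = Field()
    date = Field()

With the project in place, the spider would typically be run from the project directory with a command such as scrapy crawl example.com_poetry -o poems.json. Note that scrapy.contrib.spiders, scrapy.contrib.linkextractors.sgml and HtmlXPathSelector belong to older Scrapy releases; in current versions the rough equivalents are scrapy.spiders (CrawlSpider, Rule), scrapy.linkextractors.LinkExtractor, and response.xpath() on the response object.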

Hopefully what is described here will be helpful to readers doing Python programming.
