
A Simple Spider Scraping Program Based on Scrapy


This article presents, by example, a simple spider (web-scraping) program implemented with Scrapy, shared here for your reference. The details are as follows:

# Standard Python library imports

# 3rd party imports
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# My imports
from poetry_analysis.items import PoetryAnalysisItem

HTML_FILE_NAME = r'.+\.html'


class PoetryParser(object):
    """
    Provides a common parsing method for poems formatted this one specific way.
    """
    date_pattern = r'(\d{2} \w{3,9} \d{4})'

    def parse_poem(self, response):
        hxs = HtmlXPathSelector(response)
        item = PoetryAnalysisItem()
        # All poetry text is in pre tags
        text = hxs.select('//pre/text()').extract()
        item['text'] = ''.join(text)
        item['url'] = response.url
        # head/title contains "title - a poem by author"
        title_text = hxs.select('//head/title/text()').extract()[0]
        item['title'], item['author'] = title_text.split(' - ')
        item['author'] = item['author'].replace('a poem by', '')
        for key in ['title', 'author']:
            item[key] = item[key].strip()
        # The publication date sits in a <p class="small"> element
        item['date'] = hxs.select("//p[@class='small']/text()").re(self.date_pattern)
        return item


class PoetrySpider(CrawlSpider, PoetryParser):
    name = 'example.com_poetry'
    allowed_domains = ['www.example.com']
    root_path = 'someuser/poetry/'
    start_urls = ['http://www.example.com/someuser/poetry/recent/',
                  'http://www.example.com/someuser/poetry/less_recent/']
    # Follow every .html link under each start URL and parse it as a poem
    rules = [Rule(SgmlLinkExtractor(allow=[start_urls[0] + HTML_FILE_NAME]),
                  callback='parse_poem'),
             Rule(SgmlLinkExtractor(allow=[start_urls[1] + HTML_FILE_NAME]),
                  callback='parse_poem')]
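The spider imports PoetryAnalysisItem from poetry_analysis.items, but that file is not shown in the article. As a minimal sketch, assuming the item only needs the five fields populated in parse_poem, its definition could look like this (field names inferred from the code above, not taken from the original project):

from scrapy.item import Item, Field

class PoetryAnalysisItem(Item):
    # Fields assumed from their use in parse_poem: raw poem text, source URL,
    # title, author, and the extracted publication date
    text = Field()
    url = Field()
    title = Field()
    author = Field()
    date = Field()

With the project in place, the spider would typically be run from the project directory with a command such as scrapy crawl example.com_poetry -o poems.json. Note that scrapy.contrib.spiders, scrapy.contrib.linkextractors.sgml and HtmlXPathSelector belong to older Scrapy releases; in current versions the rough equivalents are scrapy.spiders (CrawlSpider, Rule), scrapy.linkextractors.LinkExtractor, and response.xpath() on the response object.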

Hopefully what is described here will be helpful to readers doing Python programming.
