01_Python crawler: five ways to pass level 1 of Heibanke (黑板客)

2019-11-06 08:03:06

Source: reposted

Contributed by: a web user

I came across an interesting website for practicing web crawling. Level 1 is at: http://www.heibanke.com/lesson/crawler_ex00/

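The rule of level 1 is simple: append the number shown on the page to the URL, open the resulting page, and repeat until the level is cleared.

The chain runs to dozens of pages, so entering the numbers by hand is tedious. It is a good job for a crawler.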
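
The loop that every method below automates can be tried out without touching the network. This is only a sketch in Python 3 syntax (the article's own scripts are Python 2): the page texts in FAKE_PAGES, the helper name follow_numbers, and the max_steps guard are all made up for illustration and are not taken from the site.

```python
import re

# Hypothetical page texts standing in for the real site; the wording and
# the numbers are invented for this sketch.
FAKE_PAGES = {
    "/lesson/crawler_ex00/": "<h3>Enter the number 11111 after the URL</h3>",
    "/lesson/crawler_ex00/11111/": "<h3>The next number is 22222</h3>",
    "/lesson/crawler_ex00/22222/": "<h3>Congratulations</h3>",
}

BASE_PATH = "/lesson/crawler_ex00/"


def follow_numbers(fetch, start_path=BASE_PATH, max_steps=1000):
    """Follow the chain of five-digit numbers until a page has none.

    `fetch` is any callable that maps a path to the page's HTML text.
    Returns the list of paths visited, in order.
    """
    visited = []
    path = start_path
    for _ in range(max_steps):  # guard against an endless chain
        visited.append(path)
        match = re.search(r"<h3>[^<]*?(\d{5})[^<]*</h3>", fetch(path))
        if match is None:
            break  # no number left: this is the final page
        path = "{}{}/".format(BASE_PATH, match.group(1))
    return visited


print(follow_numbers(FAKE_PAGES.__getitem__))
```

Passing the fetch function in as an argument keeps the loop itself independent of the HTTP library, which is the only part that really changes between the five methods.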

Method 1

Using urllib and regular expressions

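    #!/usr/bin/python
    # coding:utf-8
    # Note: on Linux the first two comment lines must be written exactly like
    # this; in particular, the shebang line contains no spaces.
    # This program is a Python crawler exercise for clearing levels on Heibanke.
    # Analysis: open the first level, http://www.heibanke.com/lesson/crawler_ex00/
    # Level 1 makes you keep changing the URL and opening the new page.
    # The approach:
    # 1. Open the page.
    # 2. Extract the number from the page with bs4 or a regular expression,
    #    build the next URL from it, open that, and repeat.
    import re
    import urllib
    import datetime

    begin_time = datetime.datetime.now()
    url = 'http://www.heibanke.com/lesson/crawler_ex00/'
    html = urllib.urlopen(url).read()
    # The Chinese literals below must match the page text character for character.
    index = re.findall(r'輸入數字([0-9]{5})', html)
    while index:
        url = 'http://www.heibanke.com/lesson/crawler_ex00/%s/' % index[0]
        print url
        html = urllib.urlopen(url).read()
        index = re.findall(r'數字是([0-9]{5})', html)
    # No number left: the last page links to the next level.
    html = urllib.urlopen(url).read()
    url = 'http://www.heibanke.com' + re.findall(r'<a href="(.*?)" class', html)[0]
    print 'The final URL is %s, elapsed time %s' % (url, datetime.datetime.now() - begin_time)
    print 'just for test, right!'

The final output: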
    >>>
    http://www.heibanke.com/lesson/crawler_ex00/64899/
    http://www.heibanke.com/lesson/crawler_ex00/36702/
    http://www.heibanke.com/lesson/crawler_ex00/83105/
    http://www.heibanke.com/lesson/crawler_ex00/25338/
    http://www.heibanke.com/lesson/crawler_ex00/19016/
    http://www.heibanke.com/lesson/crawler_ex00/13579/
    http://www.heibanke.com/lesson/crawler_ex00/43396/
    http://www.heibanke.com/lesson/crawler_ex00/39642/
    http://www.heibanke.com/lesson/crawler_ex00/96911/
    http://www.heibanke.com/lesson/crawler_ex00/30965/
    http://www.heibanke.com/lesson/crawler_ex00/67917/
    http://www.heibanke.com/lesson/crawler_ex00/22213/
    http://www.heibanke.com/lesson/crawler_ex00/72586/
    http://www.heibanke.com/lesson/crawler_ex00/48151/
    http://www.heibanke.com/lesson/crawler_ex00/53639/
    http://www.heibanke.com/lesson/crawler_ex00/10963/
    http://www.heibanke.com/lesson/crawler_ex00/65392/
    http://www.heibanke.com/lesson/crawler_ex00/36133/
    http://www.heibanke.com/lesson/crawler_ex00/72324/
    http://www.heibanke.com/lesson/crawler_ex00/57633/
    http://www.heibanke.com/lesson/crawler_ex00/91251/
    http://www.heibanke.com/lesson/crawler_ex00/87016/
    http://www.heibanke.com/lesson/crawler_ex00/77055/
    http://www.heibanke.com/lesson/crawler_ex00/30366/
    http://www.heibanke.com/lesson/crawler_ex00/83679/
    http://www.heibanke.com/lesson/crawler_ex00/31388/
    http://www.heibanke.com/lesson/crawler_ex00/99446/
    http://www.heibanke.com/lesson/crawler_ex00/69428/
    http://www.heibanke.com/lesson/crawler_ex00/34798/
    http://www.heibanke.com/lesson/crawler_ex00/16780/
    http://www.heibanke.com/lesson/crawler_ex00/36499/
    http://www.heibanke.com/lesson/crawler_ex00/21070/
    http://www.heibanke.com/lesson/crawler_ex00/96749/
    http://www.heibanke.com/lesson/crawler_ex00/71822/
    http://www.heibanke.com/lesson/crawler_ex00/48739/
    http://www.heibanke.com/lesson/crawler_ex00/62816/
    http://www.heibanke.com/lesson/crawler_ex00/80182/
    http://www.heibanke.com/lesson/crawler_ex00/68171/
    http://www.heibanke.com/lesson/crawler_ex00/45458/
    http://www.heibanke.com/lesson/crawler_ex00/56056/
    http://www.heibanke.com/lesson/crawler_ex00/87450/
    http://www.heibanke.com/lesson/crawler_ex00/52695/
    http://www.heibanke.com/lesson/crawler_ex00/36675/
    http://www.heibanke.com/lesson/crawler_ex00/25997/
    http://www.heibanke.com/lesson/crawler_ex00/73222/
    http://www.heibanke.com/lesson/crawler_ex00/93891/
    http://www.heibanke.com/lesson/crawler_ex00/29052/
    http://www.heibanke.com/lesson/crawler_ex00/72996/
    http://www.heibanke.com/lesson/crawler_ex00/73999/
    http://www.heibanke.com/lesson/crawler_ex00/23814/
    The final URL is http://www.heibanke.com/lesson/crawler_ex01/, elapsed time 0:00:49.396000
    just for test, right!
    >>>

Method 2
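Using the requests and re modules together

    #!/usr/bin/python
    # coding:utf-8
    # Fetch the page with requests, then pull out the needed characters
    # with a regular expression.
    import requests
    import re
    import datetime, sys

    reload(sys)
    sys.setdefaultencoding('utf-8')

    begin_time = datetime.datetime.now()
    url = 'http://www.heibanke.com/lesson/crawler_ex00/'
    new_url = url
    # Capture the run of digits inside the <h3> element.
    num_re = re.compile(r'<h3>[^\d<]*?(\d+)[^\d<]*?</h3>')
    while True:
        print 'Reading URL ', new_url
        html = requests.get(new_url).text
        num = num_re.findall(html)
        if len(num) == 0:
            # No number left: take the link to the next level and stop.
            new_url = 'http://www.heibanke.com' + re.findall(r'<a href="(.*?)" class', html)[0]
            break
        else:
            new_url = url + num[0]
    print 'The final URL is %s, elapsed time %s' % (new_url, datetime.datetime.now() - begin_time)
    print 'just for test!'

Final elapsed time:

    The final URL is http://www.heibanke.com/lesson/crawler_ex01/, elapsed time 0:01:37.520779
    just for test!

Here is another way to do the matching, worth borrowing. It captures the whole <h3> text and then keeps only the digits:

    # Runs inside the crawl loop; content holds the page HTML.
    pattern = r'<h3>(.*)</h3>'
    result = re.findall(pattern, content)
    try:
        num = int(''.join(map(lambda n: n if n.isdigit() else '', result[0])))
    except (IndexError, ValueError):
        # No <h3> found, or no digits in it: the chain has ended.
        break

This snippet involves three things: the join() function, the map() function, and lambda.

The join() function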
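
The digit-filtering line in the snippet above relies on join() to glue the kept characters back together. As a quick check of what it does, here is a minimal sketch in Python 3 syntax; the sample string and the helper name digits_only are made up for illustration.

```python
def digits_only(text):
    """Keep only the digit characters of `text`, in their original order."""
    return "".join(ch for ch in text if ch.isdigit())


# Made-up sample text in the style of the page's <h3> element.
sample = "The next number is 36702"
print(int(digits_only(sample)))

# The map/lambda spelling from the article builds the same string.
same = "".join(map(lambda ch: ch if ch.isdigit() else "", sample))
print(same == digits_only(sample))
```

The generator expression and the map/lambda form are equivalent here; the first simply skips unwanted characters, while the second replaces them with empty strings before joining.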
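join() is simply a concatenation function. Consider these examples:

    >>> st1=['hello','world','','','j','i','m']
    # With the empty string as the separator, the list's elements are simply concatenated.
    >>> ''.join(st1)
    'helloworldjim'
    # With '.' as the separator, the empty-string elements still take up their positions.
    >>> '.'.join(st1)
    'hello.world...j.i.m'
    # The same applies to a string, which is joined character by character.
    >>> st2='this is sendy'
    >>> ''.join(st2)
    'this is sendy'
    >>> ':'.join(st2)
    't:h:i:s: :i:s: :s:e:n:d:y'
    >>>

Syntax of join(): 'sep'.join(seq)

Parameters:
sep: the separator. It may be empty.
seq: the sequence of elements to join: a list, string, tuple, or dict (for a dict, the keys are joined). Every element must be a string.

In other words, join() merges all elements of seq into one new string, using sep as the separator.

Return value: a string made of the elements of seq connected by sep.

The map() function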
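map() applies a function to every element of the list passed in and returns a new list of the mapped results.

    def format_name(s):
        s1 = s[0:1].upper() + s[1:].lower()
        return s1

    print map(format_name, ['adam', 'LISA', 'barT'])

Input: ['adam', 'LISA', 'barT']
Output: ['Adam', 'Lisa', 'Bart']

map() is a built-in higher-order function. In Python 2, as used here, it takes a function f and a list, applies f to each element of the list in turn, and returns the results as a new list.

Using lambda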
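
The map() example above passes a named function; the same call can also be written with a lambda, which is the topic of this section. The sketch below uses Python 3 syntax (the article's scripts are Python 2), where map() returns a lazy iterator rather than a list, so the result is wrapped in list().

```python
def format_name(s):
    """Capitalize the first letter and lowercase the rest."""
    return s[:1].upper() + s[1:].lower()


names = ["adam", "LISA", "barT"]

# In Python 3, map() returns an iterator; list() materializes the values.
with_def = list(map(format_name, names))
with_lambda = list(map(lambda s: s[:1].upper() + s[1:].lower(), names))

print(with_def)
print(with_lambda == with_def)
```

The named function is easier to reuse and document; the lambda keeps a one-off transformation next to the call that uses it.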
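lambda works much like a def statement: the lambda keyword is shorthand for writing a function.

    >>> aa=lambda : True if 4>6 else False
    >>> aa()
    False
    >>> aa = lambda sr1:sr1+1
    >>> aa(5)
    6

lambda exists to express simple functions concisely.

Method 3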
Using the urllib2 and re libraries

    #!/usr/bin/python
    # coding:utf-8
    # Open the page and read its content with urllib2, then match the
    # content with regular expressions.
    import re
    import urllib2
    import datetime

    begin_time = datetime.datetime.now()
    url = 'http://www.heibanke.com/lesson/crawler_ex00/'
    html = urllib2.urlopen(url).read()
    # The Chinese literals below must match the page text character for character.
    index = re.findall(r'輸入數字([0-9]{5})', html)
    while index:
        url = 'http://www.heibanke.com/lesson/crawler_ex00/%s/' % index[0]
        print url
        html = urllib2.urlopen(url).read()
        index = re.findall(r'數字是([0-9]{5})', html)
    # No number left: the last page links to the next level.
    html = urllib2.urlopen(url).read()
    url = 'http://www.heibanke.com' + re.findall(r'<a href="(.*?)" class', html)[0]
    print 'The final URL is %s, elapsed time %s' % (url, datetime.datetime.now() - begin_time)

Final elapsed time:

    The final URL is http://www.heibanke.com/lesson/crawler_ex01/, elapsed time 0:00:42.172931

Method 4
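Using the urllib2, re, and BeautifulSoup libraries

    #!/usr/bin/python
    # coding:utf-8
    # This method uses bs4 (BeautifulSoup) to pick out the useful element,
    # then processes its text with a regular expression.
    import re
    import urllib2
    import datetime
    from bs4 import BeautifulSoup

    begin_time = datetime.datetime.now()
    url = 'http://www.heibanke.com/lesson/crawler_ex00/'
    url2 = url
    while True:
        print 'Crawling', url2
        html = urllib2.urlopen(url2).read()
        soup = BeautifulSoup(html, 'html.parser', from_encoding='utf8')
        str1 = soup.find_all('h3')           # the element holding the message
        str2 = str1[0].get_text()            # its text as a string
        str3 = re.findall(r'\d{5}', str2)    # the five-digit number, if any
        if len(str3) == 0:
            # No number left: take the link to the next level and leave the loop.
            new_url = 'http://www.heibanke.com' + re.findall(r'<a href="(.*?)" class', html)[0]
            break
        else:
            url2 = url + str3[0]             # build the next URL
    print 'The final URL is %s, elapsed time %s' % (new_url, datetime.datetime.now() - begin_time)

Final elapsed time (this run was recorded with url rather than new_url in the last print, which is why it shows the starting address instead of the crawler_ex01 address):

    The final URL is http://www.heibanke.com/lesson/crawler_ex00/, elapsed time 0:00:43.508280

Method 5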
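Using webdriver together with re regular expressions

    #!/usr/bin/python
    # coding:utf-8
    # This method uses webdriver to load the page, then processes the
    # retrieved text with a regular expression.
    # PhantomJS and find_element_by_* need an older Selenium release;
    # they are not available in current Selenium 4 releases.
    import re
    import datetime
    from selenium import webdriver
    import sys

    reload(sys)
    sys.setdefaultencoding('utf-8')

    begin_time = datetime.datetime.now()
    url = 'http://www.heibanke.com/lesson/crawler_ex00/'
    driver = webdriver.PhantomJS()
    driver.get(url)
    content = driver.find_element_by_tag_name('h3').text
    print content
    content = re.findall('([0-9]{5})', content)
    while True:
        if len(content) == 0:
            # No number left: read the link to the next level and leave the loop.
            content = driver.find_element_by_xpath('/html/body/div/div/div[2]/a')
            url = content.get_attribute('href')
            break
        else:
            # Build the next URL from the number and load it.
            url = 'http://www.heibanke.com/lesson/crawler_ex00/%s' % content[0]
            driver.get(url)
            content = driver.find_element_by_tag_name('h3').text
            print content
            content = re.findall('([0-9]{5})', content)
    print 'The final URL is %s, elapsed time %s' % (url, datetime.datetime.now() - begin_time)
    driver.quit()

Elapsed time (the first line is the final page's <h3> text, translated):

    Congratulations, you found the answer. Continue your crawler journey
    The final URL is http://www.heibanke.com/lesson/crawler_ex01/, elapsed time 0:02:07.484190

A small trick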
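All five scripts use one small trick to measure how long the program runs: call datetime.datetime.now() once at the start and once at the end, then subtract the first result from the second. The difference is the elapsed time.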
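
The same subtraction can be wrapped in a small helper so that any function call can be timed. This is a sketch in Python 3 syntax; the helper name timed is made up for illustration and does not appear in the scripts above.

```python
import datetime


def timed(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed).

    elapsed is a datetime.timedelta, the same type the scripts print.
    """
    begin_time = datetime.datetime.now()
    result = func(*args, **kwargs)
    elapsed = datetime.datetime.now() - begin_time
    return result, elapsed


result, elapsed = timed(sum, range(1000))
print(result, elapsed)
```

Subtracting two datetime values gives a timedelta, which prints in the h:mm:ss.ffffff form seen in the recorded runs. For very short intervals, time.perf_counter() from the standard library is usually the better clock, since it is not affected by system clock adjustments.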