

Python Crawler with the BeautifulSoup Package: Example (3)


This article builds a crawler example step by step, scraping jokes from Qiushibaike (qiushibaike.com).

For now, the parsing is done without the BeautifulSoup package.

Step 1: request the URL and fetch the page source

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:   2016-12-22 16:16:08
# @Last Modified by:   HaonanWu
# @Last Modified time: 2016-12-22 20:17:13
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Request the URL and fetch the page source
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent': user_agent}
  try:
    request = urllib2.Request(url = url, headers = headers)
    response = urllib2.urlopen(request)
    content = response.read()
  except urllib2.HTTPError as e:
    print e
    exit()
  except urllib2.URLError as e:
    print e
    exit()
  print content.decode('utf-8')
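
The script above targets Python 2 (urllib2 and print statements do not exist in Python 3). For reference, a minimal Python 3 sketch of the same request, reusing the URL and User-Agent from above; urllib.request and urllib.error replace urllib2:

# Python 3 port of step 1 (sketch): urllib.request replaces urllib2
import urllib.request
import urllib.error

url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
headers = {'User-Agent': user_agent}

try:
  request = urllib.request.Request(url=url, headers=headers)
  response = urllib.request.urlopen(request)
  content = response.read()
except urllib.error.URLError as e:  # HTTPError is a subclass of URLError
  print(e)
  raise SystemExit(1)

print(content.decode('utf-8'))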

Step 2: extract the information with a regular expression

First, inspect the page source to find where the content you need lives and what identifies it.
Then write a regular expression to match and extract it.
Note that . in a regular expression does not match '\n' by default, so the matching mode needs to be set with the re.S flag.
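
A quick self-contained demonstration of the flag's effect, independent of the crawler code:

import re

text = 'a\nb'
print(re.findall('a.b', text))        # []       -- by default . does not match '\n'
print(re.findall('a.b', text, re.S))  # ['a\nb'] -- re.S (DOTALL) lets . match '\n'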

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:   2016-12-22 16:16:08
# @Last Modified by:   HaonanWu
# @Last Modified time: 2016-12-22 20:17:13
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Request the URL and fetch the page source
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent': user_agent}
  try:
    request = urllib2.Request(url = url, headers = headers)
    response = urllib2.urlopen(request)
    content = response.read()
  except urllib2.HTTPError as e:
    print e
    exit()
  except urllib2.URLError as e:
    print e
    exit()
  regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
  items = re.findall(regex, content)
  # Extract the data
  # Mind the newlines: re.S is set so that . can match '\n'
  for item in items:
    print item
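
To see what the pattern captures, here is a toy run on a hand-made snippet; the HTML structure is an assumption inferred from the regular expression above, not copied from the live page:

import re

# Hypothetical snippet shaped like the div.content/span structure the regex expects
html = ('<div class="content"><span>first joke\nsecond line</span></div>'
        '<div class="content"><span>another joke</span></div>')
regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
print(re.findall(regex, html))  # ['first joke\nsecond line', 'another joke']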

Step 3: clean up the data and save it to files

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:   2016-12-22 16:16:08
# @Last Modified by:   HaonanWu
# @Last Modified time: 2016-12-22 21:41:32
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
  # Request the URL and fetch the page source
  url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
  user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
  headers = {'User-Agent': user_agent}
  try:
    request = urllib2.Request(url = url, headers = headers)
    response = urllib2.urlopen(request)
    content = response.read()
  except urllib2.HTTPError as e:
    print e
    exit()
  except urllib2.URLError as e:
    print e
    exit()
  # Extract the data; re.S is set so that . can match '\n'
  regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
  items = re.findall(regex, content)
  path = './qiubai'
  if not os.path.exists(path):
    os.makedirs(path)
  count = 1
  for item in items:
    # Clean up the data: strip raw '\n' characters, turn <br/> tags into '\n'
    item = item.replace('\n', '').replace('<br/>', '\n')
    filepath = path + '/' + str(count) + '.txt'
    f = open(filepath, 'w')
    f.write(item)
    f.close()
    count += 1
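
Since this series is about BeautifulSoup, the same extraction can also be written without a hand-crafted regular expression. A minimal sketch, assuming BeautifulSoup 4 is installed and that the page keeps the div.content > span structure the regex above targets:

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')
# Each joke sits in a <span> inside <div class="content">, per the regex above
for div in soup.find_all('div', class_='content'):
  span = div.find('span')
  if span is not None:
    print(span.get_text())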