Scraping every movie on Douban's 2016 movie list with a crawler

2019-11-10 17:08:48

Source: reposted

Contributed by: a reader

For more technical articles, please visit my personal blog.

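Crawler of the day: today I use a crawler to scrape the details of every movie on Douban's 2016 movie list. I had assumed that a community as large as Douban would be well defended, and when I saw the site was served over HTTPS I was ready to write a long list of headers to imitate a real user. As it turned out, a single urlopen call fetched the page directly. Perhaps the site's designers deliberately left it fairly open, which gave me my opening. The URL is (https://www.douban.com/doulist/3516235/?start=0&sort=seq&sub_type=); have a look at it first.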
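For sites that do reject bare requests, the header-based approach mentioned above might look roughly like this. It is a minimal Python 3 sketch (the article's own code targets Python 2's urllib2); the `build_request` helper and the header values are illustrative assumptions, not something the article uses, and the request is only constructed here, not sent.

```python
from urllib.request import Request

DOULIST_URL = "https://www.douban.com/doulist/3516235/?start=0&sort=seq&sub_type="


def build_request(url, user_agent="Mozilla/5.0 (X11; Linux x86_64)"):
    """Return a Request carrying browser-like headers (hypothetical helper)."""
    headers = {
        "User-Agent": user_agent,          # assumed value, any browser string works
        "Accept-Language": "zh-CN,zh;q=0.9",
    }
    return Request(url, headers=headers)


req = build_request(DOULIST_URL)
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
# Pass `req` to urllib.request.urlopen() to actually send it.
```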

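As the screenshot shows, this is what the page basically looks like; take a look at the page source as well. What I want to do is extract the whole `<div>` for each movie. I fetched the source directly with urlopen and saved it to a file first. Testing takes many small iterations, so working from the saved file avoids fetching the page every time.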

# -*- coding: utf-8 -*-
import urllib2
import re
from bs4 import BeautifulSoup

def get_html(url):
    result = urllib2.urlopen(url)
    return result.read()

def save_file(text, filename):
    f = open(filename, 'w')
    f.write(text)
    f.close()

def read_file(filename):
    f = open(filename, 'r')
    text = f.read()
    f.close()
    return text

if __name__ == '__main__':
    url = 'https://www.douban.com/doulist/3516235/'
    html = get_html(url)
    save_file(html, 'thefile.txt')
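The save-once, test-from-file idea can also be wrapped in a single helper. The following is a Python 3 sketch under stated assumptions: `fetch_cached` and `fake_fetch` are hypothetical names, and a stand-in fetcher replaces the real network call so the example runs offline.

```python
import os
import tempfile


def fetch_cached(url, cache_path, fetch):
    """Return the page bytes, calling fetch(url) only if no cached copy exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()
    data = fetch(url)
    with open(cache_path, "wb") as f:
        f.write(data)
    return data


# Stand-in fetcher (assumption): records each call instead of hitting the network.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"<html>demo</html>"


cache_file = os.path.join(tempfile.mkdtemp(), "thefile.txt")
url = "https://www.douban.com/doulist/3516235/"
first = fetch_cached(url, cache_file, fake_fetch)   # fetches and writes the file
second = fetch_cached(url, cache_file, fake_fetch)  # served from the file
print(len(calls))
```

In real use, the `fetch` argument would be a function that wraps urlopen and returns the response body.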

The next step is to extract data from the fetched source. Each movie's description sits inside one div: `<div class="bd doulist-subject"></div>`

Use BeautifulSoup to pull these out:

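html = read_file('thefile.txt')
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's warning
text = soup.find_all('div', class_='bd doulist-subject')
# Note: this overwrites the saved page with the extracted <div>s
save_file(str(text), 'thefile.txt')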
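One caveat on the `class_` lookup: according to the Beautiful Soup documentation, a multi-word `class_` string matches only when the attribute is exactly that string, so the same two classes in a different order would be missed. A CSS selector is less brittle. The Python 3 sketch below uses made-up sample markup (an assumption, the real page may differ) and requires the bs4 package.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the real Douban page structure may differ.
SAMPLE = """
<div class="bd doulist-subject"><div class="title"><a href="#">Movie A</a></div></div>
<div class="doulist-subject bd"><div class="title"><a href="#">Movie B</a></div></div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Matches only when the class attribute is exactly this string.
exact = soup.find_all("div", class_="bd doulist-subject")
# Matches any div that has both classes, in any order.
by_css = soup.select("div.bd.doulist-subject")

titles = [div.select_one("div.title a").get_text(strip=True) for div in by_css]
print(len(exact), len(by_css), titles)
```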

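The result is shown in the figure below.

Every movie's information has now been extracted. I want the title, rating, cast and crew, and release date, so I extract only those fields; adapt this to your own needs.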

def get_movie_one(movie):
    # movie is one <div class="bd doulist-subject"> element (a bs4 Tag)
    title = movie.find('div', class_='title')
    for line in title.stripped_strings:  # text inside the <a> tag
        print line
    num = movie.find('span', class_='rating_nums')
    print num.contents[0]
    info = movie.find('div', class_='abstract')
    for line in info.stripped_strings:  # one line per field in the abstract
        print line

## Result:
一切都好
6.4
導演: 張猛
主演: 張國立 / 姚晨 / 竇驍
類型: 劇情 / 家庭
制片國家/地區: 中國大陸
年份: 2016
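Because each abstract line has a "label: value" shape, the printed lines can be turned into a dictionary for later processing. This is a Python 3 sketch, not part of the original script: `parse_abstract` is a hypothetical helper, and it assumes an ASCII colon after the label and " / " between multiple values, as in the sample output above.

```python
def parse_abstract(lines):
    """Turn 'label: value' lines into a dict; ' / '-separated values become lists."""
    fields = {}
    for line in lines:
        label, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that carry no label
        parts = [p.strip() for p in value.split(" / ")]
        fields[label.strip()] = parts if len(parts) > 1 else parts[0]
    return fields


# Sample lines copied from the output shown above.
sample = [
    "導演: 張猛",
    "主演: 張國立 / 姚晨 / 竇驍",
    "類型: 劇情 / 家庭",
    "制片國家/地區: 中國大陸",
    "年份: 2016",
]
info = parse_abstract(sample)
print(info["年份"], info["主演"])
```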

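The first movie works, so now for the batch run. There are 425 movies in total, and I write each one to a file as soon as it is extracted. This is the result for the 25 movies on the first page.

The full code is below for reference.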
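With 425 movies at 25 per page, the `start` offsets run from 0 to 400, which makes 17 pages. The Python 3 sketch below only builds the page URLs and sends nothing; `page_urls` is a hypothetical helper, and the query parameters are the ones from the URL given earlier.

```python
from urllib.parse import urlencode

BASE = "https://www.douban.com/doulist/3516235/"
TOTAL, PER_PAGE = 425, 25


def page_urls(total=TOTAL, per_page=PER_PAGE):
    """Yield one URL per page, stepping the start offset by per_page."""
    for start in range(0, total, per_page):
        query = urlencode({"start": start, "sort": "seq", "sub_type": ""})
        yield BASE + "?" + query


urls = list(page_urls())
print(len(urls), urls[0], urls[-1], sep="\n")
```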

#!/usr/bin/env python
# -*- coding=utf-8 -*-
# Python 2 script: urllib2 does not exist in Python 3.
import urllib2
import time
from bs4 import BeautifulSoup


def get_html(url):
    # Fetch the page content for a URL
    result = urllib2.urlopen(url)
    return result.read()


def get_movie_all(html):
    # Use soup to pull out each movie's complete <div>; returns a list
    soup = BeautifulSoup(html, 'html.parser')
    movie_list = soup.find_all('div', class_='bd doulist-subject')
    return movie_list


def get_movie_one(movie):
    # Return [title, rating, abstract] for one movie <div> (a bs4 Tag)
    result = []
    title = movie.find('div', class_='title')
    result.append(u' '.join(title.stripped_strings))
    # Guard against a missing rating span instead of indexing blindly
    num = movie.find('span', class_='rating_nums')
    result.append(num.get_text(strip=True) if num else u'')
    info = movie.find('div', class_='abstract')
    result.append(u' '.join(info.stripped_strings) if info else u'')
    return result


def save_file(text, filename):
    # Append text to the output file
    f = open(filename, 'ab')
    f.write(text)
    f.close()


def read_file(filename):
    # Read a file back
    f = open(filename, 'r')
    text = f.read()
    f.close()
    return text


if __name__ == '__main__':
    # 425 movies at 25 per page: start = 0, 25, ..., 400
    for i in range(0, 425, 25):
        url = 'https://www.douban.com/doulist/3516235/?start=' + str(i) + '&sort=seq&sub_type='
        html = get_html(url)
        movie_list = get_movie_all(html)
        for movie in movie_list:
            # Extract each movie on the page and write one line per movie
            result = get_movie_one(movie)
            text = u'電影名：%s | 評分：%s | %s\n' % (result[0], result[1], result[2])
            save_file(text.encode('utf-8'), 'thee.txt')
        time.sleep(5)  # wait 5 seconds between pages
