Crawling a Weibo User's Original Posts

2019-11-10 23:20:03
Contributed by: a site user

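This script crawls a Weibo user's original posts, together with their images and the image links. The links are saved in case some images fail to download, so that the failed ones can be fetched manually.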
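The fallback idea described above (keep every image link so failed downloads can be retried by hand) can be sketched roughly as follows. This is only an illustration written for Python 3, not the script's actual image code: the function names `download_images` and `write_link_list`, the `fetch` callable, and the `.jpg` file naming are all assumptions made for the example.

```python
import os


def download_images(urls, fetch, out_dir):
    """Try to save every URL; return the URLs whose download failed.

    `fetch` is a hypothetical callable that takes a URL and returns bytes,
    for example a thin wrapper around an HTTP client.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for index, url in enumerate(urls, start=1):
        try:
            data = fetch(url)
        except Exception:
            # Remember the link so it can be downloaded manually later.
            failed.append(url)
            continue
        # The .jpg extension is an assumption made for this sketch.
        with open(os.path.join(out_dir, "%d.jpg" % index), "wb") as handle:
            handle.write(data)
    return failed


def write_link_list(urls, path):
    """Write one image link per line to a plain-text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
```

Writing the full link list with `write_link_list` before downloading means nothing is lost if the run is interrupted, and the list returned by `download_images` shows which files still need a manual retry.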

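The script targets weibo.cn, the mobile version of Weibo. Feel free to visit my GitHub blog and my GitHub; stars and forks are welcome. Before running it, change the ID of the user you want to crawl and the Cookie of your own logged-in session.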
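The two values to change can be kept in one place. The sketch below, written for Python 3, builds the page URL in the same pattern the script uses and turns a Cookie header copied from the browser into a name-to-value mapping, which is the form the `cookies=` parameter of `requests` expects. The helper names `build_page_url` and `parse_cookie_header` are hypothetical, and the cookie string shown in the comment is a placeholder, not a working session.

```python
def build_page_url(user_id, page=1):
    """Return the weibo.cn URL for one page of a user's original posts."""
    return "http://weibo.cn/u/%d?filter=1&page=%d" % (user_id, page)


def parse_cookie_header(raw):
    """Split a raw 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for part in raw.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:
            # Only the first '=' separates name and value; later ones stay in the value.
            cookies[name] = value
    return cookies


# Possible usage (requires the third-party `requests` package):
#   import requests
#   cookies = parse_cookie_header("name1=value1; name2=value2")  # placeholder
#   html = requests.get(build_page_url(3805842931), cookies=cookies).content
```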

    #-*-coding:utf8-*-
    import re
    import string
    import sys
    import os
    import urllib
    import urllib2
    from bs4 import BeautifulSoup
    import requests
    import shutil
    import time
    from lxml import etree

    # Python 2 only: make utf-8 the default string encoding
    reload(sys)
    sys.setdefaultencoding('utf-8')

    # if(len(sys.argv)>=2):
    #     user_id = (int)(sys.argv[1])
    # else:
    #     user_id = (int)(raw_input(u"please_input_id: "))
    user_id = 3805842931  # Weibo user ID
    cookie = {"Cookie": "_T_WM=6a0975bd8ce171d2c8b31e48d27993b7; ALF=1488452559; SCF=Aphq2I26dyB0N2ikBftYqeJKmR_jZE3ZQPpZ78yMq5h81f2xcKuQaFOIrBttHnTRrdjH3AFD9iDcHs6SKBQDyRQ.; SUB=_2A251lB6GDeRxGeNM4lQZ-S_Jzz6IHXVXdqLOrDV6PUJbktBeLXTTkW2fnHFXxkcPdpyC7aArA3VvccZDXg..; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WWug7UMvAm9Pg91a_h6o8Ye5JpX5o2p5NHD95Qfeo.c1h.pSKBEWs4DqcjZBXxCPXSQQg4rB7tt; SUHB=05i62K5ms4yYQ4; SSOLoginState=1485860566"}

    # Fetch the first page to read the total number of pages
    url = 'http://weibo.cn/u/%d?filter=1&page=1' % user_id
    html = requests.get(url, cookies=cookie).content
    print u'user_id and cookie loaded successfully'
    selector = etree.HTML(html)
    pageNum = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])

    result = ""
    urllist_set = set()
    word_count = 1
    image_count = 1

    print u'ready'
    print pageNum
    sys.stdout.flush()

    # Walk through the pages in `times` batches
    times = 5
    one_step = pageNum / times
    for step in range(times):
        if step < times - 1:
            i = step * one_step + 1
            j = (step + 1) * one_step + 1
        else:
            # The last batch also takes the leftover pages
            i = step * one_step + 1
            j = pageNum + 1
        for page in range(i, j):
            # Fetch the page
            try:
                url = 'http://weibo.cn/u/%d?filter=1&page=%d' % (user_id, page)
                lxml = requests.get(url, cookies=cookie).content
                # Extract the post text
                selector = etree.HTML(lxml)
                content = selector.xpath('//span[@class="ctt"]')
                for each in content:
                    text = each.xpath('string(.)')
                    if word_count >= 3:
                        text = "%d: " % (word_count - 2) + text + "\n"
                    else:
                        text = text + "\n\n"
                    result = result + text
                    word_count += 1
                print page, 'word ok'
                sys.stdout.flush()
                soup = BeautifulSoup(lxml, "lxml")
                urllist = soup.find_all('a',href=re.compile(r'^http://weibo.cn/mblog/or
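The listing above breaks off partway through the image-link step. The batching arithmetic it uses before that point can be checked in isolation. The sketch below reproduces the same split for Python 3, where integer division has to be written `//` (in Python 2, `pageNum/times` on two integers already truncates). The function name `page_batches` is made up for this example.

```python
def page_batches(page_num, times=5):
    """Split pages 1..page_num into `times` batches, as the script's loop does.

    Every batch except the last holds page_num // times pages; the last
    batch also takes whatever pages are left over.
    """
    one_step = page_num // times
    batches = []
    for step in range(times):
        start = step * one_step + 1
        if step < times - 1:
            stop = (step + 1) * one_step + 1
        else:
            stop = page_num + 1
        batches.append(list(range(start, stop)))
    return batches


print(page_batches(12))
# [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10, 11, 12]]
print(page_batches(3))
# [[], [], [], [], [1, 2, 3]]
```

The second call shows that when a user has fewer pages than `times`, the first four batches are empty and the last one covers everything, so no page is skipped.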