Building a crawler with the Node.js core module http and the HTML parsing tool cheerio

2019-11-20 10:25:16


Contributed by: a site user

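I. Preface
Although this is billed as a first look at crawlers, it does not use any third-party crawler library. It relies mainly on the Node.js core module http and the HTML parsing tool cheerio: http fetches the page behind a URL directly, and cheerio then parses the result. I retyped a case study I had worked through in order to understand it better. While coding, I first tried to iterate over the object returned by a jQuery-style selection directly with forEach, and it threw an error immediately. jQuery objects have no such method; only JavaScript arrays do.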
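The forEach pitfall mentioned above can be reproduced without any library. The following is a minimal sketch, assuming a plain array-like object as a stand-in for a jQuery or cheerio result set (the object and its contents are made up for illustration). With cheerio itself, the usual routes are `.each()` or `.toArray()` followed by `forEach`.

```javascript
// Stand-in for a jQuery/cheerio result set: indexed entries plus a length,
// but no forEach method of its own (hypothetical sample data).
const arrayLike = { 0: 'chapter 1', 1: 'chapter 2', length: 2 };

// forEach is not defined here, so calling arrayLike.forEach(...) would throw a TypeError.
console.log(typeof arrayLike.forEach);

// Option 1: copy into a real array first, then use forEach.
const titles = [];
Array.from(arrayLike).forEach(function (item) {
  titles.push(item.toUpperCase());
});

// Option 2: borrow Array.prototype.forEach without making a copy.
const lengths = [];
Array.prototype.forEach.call(arrayLike, function (item) {
  lengths.push(item.length);
});

console.log(titles);
console.log(lengths);
```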

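II. Key Points
①: superagent, a tool for fetching web pages. I have not used it yet.
②: cheerio, an HTML parsing tool. You can think of it as jQuery for the server side, because the core selector and traversal syntax is the same.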
Screenshots

1. Fetching the raw HTML of the whole page

2. The data after parsing. The example shown is the output of the case implemented here.

Source code walkthrough for the crawler

var http = require('http');
var cheerio = require('cheerio');

var url = 'http://www.imooc.com/learn/348';

/****************************
Shape of the data that gets printed:
[{
  chapterTitle: '',
  videos: [{
    title: '',
    id: ''
  }]
}]
********************************/
function printCourseInfo(courseData) {
  courseData.forEach(function (item) {
    var chapterTitle = item.chapterTitle;
    console.log(chapterTitle + '\n');

    item.videos.forEach(function (video) {
      console.log(' 【' + video.id + '】' + video.title + '\n');
    });
  });
}

/************* Parse the data scraped from the page **************/
function filterChapter(html) {
  var courseData = [];

  var $ = cheerio.load(html);
  var chapters = $('.chapter');

  // chapters is a cheerio object, so iterate with each(), not forEach()
  chapters.each(function () {
    var chapter = $(this);
    var chapterTitle = chapter.find('strong').text(); // chapter title
    var videos = chapter.find('.video').children('li');

    var chapterData = {
      chapterTitle: chapterTitle,
      videos: []
    };

    videos.each(function () {
      var video = $(this).find('.studyvideo');
      var title = video.text();
      // Assumes hrefs of the form "/video/7965"; keep only the part after "video/"
      var id = video.attr('href').split('video/')[1];

      chapterData.videos.push({
        title: title,
        id: id
      });
    });

    courseData.push(chapterData);
  });

  return courseData;
}

http.get(url, function (res) {
  var html = '';

  // Decode chunks as UTF-8 so multi-byte characters are not split across chunks
  res.setEncoding('utf8');

  res.on('data', function (data) {
    html += data;
  });

  res.on('end', function () {
    var courseData = filterChapter(html);
    printCourseInfo(courseData);
  });
}).on('error', function () {
  console.log('Failed to fetch the course data');
});
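The listing derives each video id by splitting the link's href, which throws if a list item has no link and returns an odd value if the href has an unexpected shape. Below is a more defensive sketch. It assumes hrefs look like `/video/7965` (the shape of the video URL in the references), and `extractVideoId` is a hypothetical helper name, not part of the original code.

```javascript
// Hypothetical helper: pull the numeric id out of a link such as "/video/7965".
// Returns null instead of throwing when the href is missing or has another shape.
function extractVideoId(href) {
  if (typeof href !== 'string') {
    return null; // attr('href') yields undefined when the attribute is absent
  }
  const match = href.match(/\/video\/(\d+)/);
  return match ? match[1] : null;
}

console.log(extractVideoId('/video/7965'));
console.log(extractVideoId('http://www.imooc.com/video/7965'));
console.log(extractVideoId('/learn/348'));
console.log(extractVideoId(undefined));
```

Inside the inner `each` callback this could replace the split call, for example `var id = extractVideoId(video.attr('href'));`, with entries whose id is null skipped or logged as preferred.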

References:
https://github.com/alsotang/node-lessons/tree/master/lesson3

http://www.imooc.com/video/7965
