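A web crawler is a program that automatically fetches web pages. It downloads pages from the World Wide Web on behalf of a search engine and is one of the search engine's key components. Its basic architecture is shown in the figure below: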

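A traditional crawler starts from the URLs of one or more seed pages and collects the URLs found on those pages. While fetching pages, it keeps extracting new URLs from the current page and adding them to a queue until one of the system's stop conditions is met. For vertical search, a focused crawler, that is, a crawler that deliberately fetches only pages on a specific topic, is a better fit.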
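The queue-driven loop described above can be sketched without any networking. This is only an illustration under stated assumptions: the class `FrontierSketch`, the method `crawlOrder`, and the in-memory `links` map (standing in for "the URLs found on a page") are hypothetical and not part of the crawler in this article.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class FrontierSketch {
    // Hypothetical helper: visits pages breadth-first, starting from a seed URL.
    // 'links' maps each URL to the URLs found on that page (an in-memory stand-in
    // for real fetching); 'maxPages' is the stop condition.
    static List<String> crawlOrder(Map<String, List<String>> links, String seed, int maxPages) {
        Queue<String> frontier = new ArrayDeque<>(); // URLs waiting to be crawled
        Set<String> seen = new HashSet<>();          // URLs already queued or visited
        List<String> visited = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            visited.add(url);
            List<String> found = links.getOrDefault(url, Collections.<String>emptyList());
            for (String next : found) {
                if (seen.add(next)) { // add() returns false for URLs seen before
                    frontier.add(next);
                }
            }
        }
        return visited;
    }
}
```

Because the frontier is a FIFO queue, pages are visited breadth-first, which is the order the crawler in this article also relies on.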

The core code of the crawler in this article is as follows:

Java code

public void crawl() throws Throwable {
    while (continueCrawling()) {
        CrawlerUrl url = getNextUrl(); // next URL from the to-crawl queue
        if (url != null) {
            printCrawlInfo();
            String content = getContent(url); // fetch the text of the page at this URL
            // A focused crawler only keeps pages relevant to its topic;
            // a regular-expression match is used here as a simple stand-in
            if (isContentRelevant(content, this.regexpSearchPattern)) {
                saveContent(url, content); // save the page locally
                // Extract the links in the page and add them to the to-crawl queue
                Collection<String> urlStrings = extractUrls(content, url);
                addUrlsToUrlQueue(url, urlStrings);
            } else {
                System.out.println(url + " is not relevant ignoring ...");
            }
            // Pause between requests to avoid being blocked by the remote site
            Thread.sleep(this.delayBetweenUrls);
        }
    }
    closeOutputStream();
}

The whole function is built from a handful of core methods: getNextUrl, getContent, isContentRelevant, extractUrls, and addUrlsToUrlQueue. They are introduced one by one below. First, getNextUrl:

Java code

private CrawlerUrl getNextUrl() throws Throwable {
    CrawlerUrl nextUrl = null;
    while ((nextUrl == null) && (!urlQueue.isEmpty())) {
        CrawlerUrl crawlerUrl = this.urlQueue.remove();
        // doWeHavePermissionToVisit: are we permitted to visit this URL? A polite
        // crawler follows the rules the site publishes in its robots.txt.
        // isUrlAlreadyVisited: has the URL been visited already? Large search engines
        // often deduplicate with a Bloom filter; a HashMap is used here for simplicity.
        // isDepthAcceptable: is the URL still within the configured depth limit?
        // Crawlers usually work breadth-first. Some sites build crawler traps
        // (auto-generated junk links that send a crawler into an endless loop);
        // a depth limit guards against them.
        if (doWeHavePermissionToVisit(crawlerUrl)
                && (!isUrlAlreadyVisited(crawlerUrl))
                && isDepthAcceptable(crawlerUrl)) {
            nextUrl = crawlerUrl;
            // System.out.println("Next url to be visited is " + nextUrl);
        }
    }
    return nextUrl;
}
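The comments in getNextUrl mention that large search engines often deduplicate URLs with a Bloom filter instead of a HashMap. The following is a minimal sketch of that idea, not the implementation any particular search engine uses. The class `UrlBloomFilter`, its sizing parameters, and the choice of FNV-1a as the second hash are all assumptions made for illustration.

```java
import java.util.BitSet;

class UrlBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per URL

    UrlBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // FNV-1a over the characters, used as a second hash next to String.hashCode()
    private static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: the i-th position is (h1 + i * h2) mod size
    private int index(String url, int i) {
        return Math.floorMod(url.hashCode() + i * fnv1a(url), size);
    }

    void add(String url) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(url, i));
        }
    }

    // false means "definitely not added"; true means "probably added"
    boolean mightContain(String url) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(url, i))) {
                return false;
            }
        }
        return true;
    }
}
```

The trade-off is memory for certainty: the filter never forgets a URL it has seen, but it can occasionally report an unseen URL as visited, so a small share of pages may be skipped.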

For more details on how to write a robots.txt file, see the following article:

http://www.bloghuman.com/post/67/
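To give a rough idea of what a check like doWeHavePermissionToVisit has to do, here is a deliberately simplified sketch. It only honors `Disallow` prefixes in the `User-agent: *` group and ignores `Allow` lines, wildcards, and groups that list several user agents. The class `RobotsRules` and its methods are hypothetical names, not the code behind this article's crawler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    // Collects the Disallow prefixes that apply to all crawlers ("User-agent: *")
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#'); // strip trailing comments
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or something that is not "field: value"
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                rules.disallowed.add(value);
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would fetch `/robots.txt` once per host, parse it this way, and then test the path of each candidate URL before queuing it.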

getContent uses Apache HttpClient 4.1 internally to fetch the page content. The code is as follows:

Java code

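private String getContent(CrawlerUrl url) throws Throwable {
    // HttpClient 4.1 is called differently from earlier versions
    HttpClient client = new DefaultHttpClient();
    HttpGet httpGet = new HttpGet(url.getUrlString());
    StringBuilder strBuf = new StringBuilder();
    try {
        HttpResponse response = client.execute(httpGet);
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            if (HttpStatus.SC_OK == response.getStatusLine().getStatusCode()) {
                // Assumes the page is encoded as UTF-8
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(entity.getContent(), "UTF-8"));
                String line;
                // getContentLength() is negative when the length is unknown (for
                // example with chunked responses), so read to the end of the stream
                // rather than skipping the body when the length is not positive
                while ((line = reader.readLine()) != null) {
                    // readLine() strips the line break; put one back so that
                    // words on adjacent lines are not run together
                    strBuf.append(line).append('\n');
                }
            }
            // Release the underlying stream, also for non-200 responses
            entity.consumeContent();
        }
    } finally {
        // A new client is created per call, so free its connections here
        client.getConnectionManager().shutdown();
    }
    // Mark the URL as visited
    markUrlAsVisited(url);
    return strBuf.toString();
}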
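getContent decodes every page as UTF-8, which garbles pages served in another encoding. One possible refinement, sketched here under assumptions, is to read the charset from the response's Content-Type value and fall back to UTF-8 when none is given. The class `CharsetSketch` and the method `charsetOf` are hypothetical. In HttpClient the header value should be available through `entity.getContentType()`, but that wiring is left out so the example runs on its own.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class CharsetSketch {
    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1", or 'fallback' if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }
}
```

The result could then be passed to `new InputStreamReader(stream, charset)` in place of the hard-coded "UTF-8". Pages that declare their encoding only in an HTML meta tag would still need separate handling.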

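For vertical applications, the accuracy of the data is often what matters most. The defining feature of a focused crawler is that it collects only data relevant to its topic, and that is the job of the isContentRelevant method. A classification or prediction technique would arguably be called for here; for simplicity, a regular-expression match is used instead. The main code is as follows:

Java code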

public static boolean isContentRelevant(String content,
        Pattern regexpPattern) {
    boolean retValue = false;
    if (content != null) {
        // Does the content match the regular expression?
        // The content is lowercased first, so the pattern should be lowercase too.
        Matcher m = regexpPattern.matcher(content.toLowerCase());
        retValue = m.find();
    }
    return retValue;
}
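Because isContentRelevant lowercases the content before matching, a pattern containing uppercase letters would never match. One alternative, shown here as a sketch rather than as the article's method, is to compile the pattern case-insensitively and add word boundaries so that a topic such as "java" does not also match "JavaScript". The class `RelevanceSketch` and the pattern are assumptions for illustration.

```java
import java.util.regex.Pattern;

class RelevanceSketch {
    // Hypothetical topic pattern: the whole word "java", in any letter case
    static final Pattern TOPIC = Pattern.compile("\\bjava\\b", Pattern.CASE_INSENSITIVE);

    // Same contract as isContentRelevant, but without lowercasing the content
    static boolean isRelevant(String content, Pattern topic) {
        return content != null && topic.matcher(content).find();
    }
}
```

With this pattern, a page containing "Learn Java today" is relevant, while a page that only mentions "JavaScript" is not.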

The main job of extractUrls is to obtain more URLs from a page, both internal and external links. The code is as follows:

Java code

public List<String> extractUrls(String text, CrawlerUrl crawlerUrl) {
    Map<String, String> urlMap = new HashMap<String, String>();
    extractHttpUrls(urlMap, text);
    extractRelativeUrls(urlMap, text, crawlerUrl);
    return new ArrayList<String>(urlMap.keySet());
}

// Handle external (absolute) links
private void extractHttpUrls(Map<String, String> urlMap, String text) {
    Matcher m = httpRegexp.matcher(text);
    while (m.find()) {
        String url = m.group();
        String[] terms = url.split("a href=\"");
        for (String term : terms) {
            // System.out.println("Term = " + term);
            if (term.startsWith("http")) {
                // Cut the term at the closing quote of the href attribute
                int index = term.indexOf("\"");
                if (index > 0) {
                    term = term.substring(0, index);
                }
                urlMap.put(term, term);
                System.out.println("Hyperlink: " + term);
            }
        }
    }
}

// Handle internal (site-relative) links
private void extractRelativeUrls(Map<String, String> urlMap, String text,
        CrawlerUrl crawlerUrl) {
    Matcher m = relativeRegexp.matcher(text);
    URL textURL = crawlerUrl.getURL();
    String host = textURL.getHost();
    while (m.find()) {
        String url = m.group();
        String[] terms = url.split("a href=\"");
        for (String term : terms) {
            if (term.startsWith("/")) {
                // Cut the term at the closing quote of the href attribute
                int index = term.indexOf("\"");
                if (index > 0) {
                    term = term.substring(0, index);
                }
                // Prefix the host of the current page to build an absolute URL
                String s = "http://" + host + term;
                urlMap.put(s, s);
                System.out.println("Relative url: " + s);
            }
        }
    }
}
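The code above only handles hrefs that start with "http" or "/", and it always builds relative links with the "http://" scheme. As a possible alternative, and only as a sketch, the standard `java.net.URI` class can resolve any href against the page URL, including document-relative ones such as "java.html" or "../about". The class `LinkExtractorSketch` and its regular expression are assumptions. The article's httpRegexp and relativeRegexp are not shown, so this is not a reconstruction of them.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractorSketch {
    // Hypothetical pattern: captures the double-quoted href value of an <a> tag
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the distinct absolute http/https links found in 'html'
    static Set<String> extractLinks(String html, String pageUrl) {
        Set<String> links = new LinkedHashSet<>();
        URI base = URI.create(pageUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            int hash = href.indexOf('#'); // drop the fragment, it names the same page
            if (hash >= 0) {
                href = href.substring(0, hash);
            }
            if (href.isEmpty()) {
                continue;
            }
            try {
                URI resolved = base.resolve(href);
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // skip hrefs that are not valid URI references
            }
        }
        return links;
    }
}
```

A regular expression is still a fragile way to read HTML (single-quoted or unquoted attributes are missed here), so a real crawler would more likely use an HTML parser for this step.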

With that, we have built a simple web crawler. The following program can be used to test it:

Java code

public static void main(String[] args) {
    try {
        String url = "http://www.amazon.com";
        Queue<CrawlerUrl> urlQueue = new LinkedList<CrawlerUrl>();
        String regexp = "java";
        urlQueue.add(new CrawlerUrl(url, 0));
        NaiveCrawler crawler = new NaiveCrawler(urlQueue, 100, 5, 1000L,
                regexp);
        // boolean allowCrawl = crawler.areWeAllowedToVisit(url);
        // System.out.println("Allowed to crawl: " + url + " " +
        // allowCrawl);
        crawler.crawl();
    } catch (Throwable t) {
        System.out.println(t.toString());
        t.printStackTrace();
    }
}

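Of course, you can give it more advanced features, such as multithreading, smarter focusing, or index building with Lucene. For more complex scenarios, consider an open-source spider such as Nutch or Heritrix; those are beyond the scope of this article.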

Web Crawlers Explained in Plain Language