package com.dk.spider.spider_01;

import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.NoSuchElementException;
import java.util.Set;

public class UrlQueue {

    // urls that have already been requested
    Set<String> visitedSet;
    // urls waiting to be requested
    LinkedList<String> unvisitedList;

    UrlQueue(String[] seeds) {
        visitedSet = new HashSet<>();
        unvisitedList = new LinkedList<>();
        unvisitedList.addAll(Arrays.asList(seeds));
    }

    /**
     * Add a url
     *
     * @param url
     */
    public void enQueue(String url) {
        if (url != null && !visitedSet.contains(url)) {
            unvisitedList.addLast(url);
        }
    }

    /**
     * Add a collection of urls
     *
     * @param urls
     */
    public void enQueue(Collection<String> urls) {
        for (String url : urls) {
            enQueue(url);
        }
    }

    /**
     * Take out the next url
     *
     * @return the next unvisited url, or null if the queue is exhausted
     */
    public String deQueue() {
        try {
            String url = unvisitedList.removeFirst();
            // skip urls that have been visited since they were enqueued
            while (visitedSet.contains(url)) {
                url = unvisitedList.removeFirst();
            }
            visitedSet.add(url);
            return url;
        } catch (NoSuchElementException e) {
            System.err.println("No more urls in the queue");
        }
        return null;
    }

    /**
     * Get the number of urls already requested
     */
    public int getVisitedCount() {
        return visitedSet.size();
    }
}
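For reference, here is a minimal usage sketch of UrlQueue; the urls below are placeholders, not part of the crawl itself:

package com.dk.spider.spider_01;

public class UrlQueueDemo {
    public static void main(String[] args) {
        UrlQueue queue = new UrlQueue(new String[] { "http://example.com/" });
        queue.enQueue("http://example.com/a");       // accepted: not visited yet
        String first = queue.deQueue();              // returns the seed (FIFO) and marks it visited
        queue.enQueue(first);                        // dropped: already in visitedSet
        System.out.println(first);                   // http://example.com/
        System.out.println(queue.getVisitedCount()); // 1
    }
}

The HashSet gives O(1) duplicate checks, while the LinkedList preserves first-in-first-out order; that FIFO order is what makes the traversal breadth-first.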
(2) Test
Now let's test it by crawling the posts of Artech, the No. 1 blogger in the cnblogs ranking, using his blog homepage address: … as the seed node. Analysis shows that links of the form … and … are valid article addresses, while links of the form … are next-page links; these serve as our criteria for filtering urls. We adopt a breadth-first traversal strategy. Artech has a little over 500 posts, so we stop once the number of requested pages reaches 1000 or all qualifying urls have been traversed. The concrete test code follows:
package com.dk.spider.spider_01;

import java.util.Set;

import org.jsoup.nodes.Document;

public class Main {

    public static void main(String[] args) {
        // seed url
        UrlQueue urlQueue = new UrlQueue(new String[] { "" });
        JsoupDownloader downloader = JsoupDownloader.getInstance();
        long start = System.currentTimeMillis();
        while (urlQueue.getVisitedCount() < 1000) {
            String url = urlQueue.deQueue();
            if (url == null) {
                break;
            }
            Document doc = downloader.downloadPage(url);
            if (doc == null) {
                continue;
            }
            Set<String> urlSet = downloader.parsePage(doc,
                    "(|http://www.cnblogs.com/artech/default|\\d{4}/\\d{2}/\\d{2}/).*");
            urlQueue.enQueue(urlSet);
            downloader.savePage(doc, "C:/Users/Administrator/Desktop/test/", null,
                    "(|http://www.cnblogs.com/artech/archive/\\d{4}/\\d{2}/\\d{2}/).*");
            System.out.println("Requested " + urlQueue.getVisitedCount() + " pages");
        }
        long end = System.currentTimeMillis();
        System.out.println(">>>>>>>>>> Crawl finished: fetched " + urlQueue.getVisitedCount()
                + " pages in " + ((end - start) / 1000) + "s <<<<<<<<<<<<");
    }
}
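Since the url filtering above is ordinary java.util.regex matching, a small standalone sketch may help. The pattern below is one alternative lifted from the savePage filter; the two sample urls are hypothetical:

package com.dk.spider.spider_01;

import java.util.regex.Pattern;

public class UrlFilterDemo {
    public static void main(String[] args) {
        // one alternative from the savePage filter; the sample urls are made up
        Pattern article = Pattern.compile("http://www.cnblogs.com/artech/archive/\\d{4}/\\d{2}/\\d{2}/.*");
        System.out.println(article.matcher(
                "http://www.cnblogs.com/artech/archive/2016/05/06/some-post.html").matches()); // true
        System.out.println(article.matcher(
                "http://www.cnblogs.com/artech/p/12345.html").matches()); // false
    }
}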
Run result:
4. Summary
Looking closely at the process above, there are still many places worth optimizing and improving: