Nutch API advice

25-02-24 14

This post collects material on Nutch API advice, together with the release announcements for Apache Nutch 1.0, Apache Nutch 1.1, Apache Nutch 1.13 (web crawler) and Apache Nutch 1.10 (search engine).

Contents:

Nutch API advice
Apache Nutch 1.0 final released
Apache Nutch 1.1 released - download
Apache Nutch 1.13 released, web crawler
Apache Nutch 1.10 released, search engine

Nutch API advice

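I am working on a project that needs a mature crawler to do some of the work, and I am evaluating Nutch for that purpose. My current needs are fairly simple: I need a crawler that can save data to disk, and I need it to re-crawl only the updated resources of a site and skip the parts it has already crawled. Does anyone have experience using the Nutch code directly from Java rather than through the command line? I would like to start simple: create a crawler (or something similar), configure it minimally and start it, nothing fancy. Are there examples, or resources I should look at? I am reading the Nutch documentation, but most of it is about the command line, searching and other things. How usable is the Nutch crawling module without indexing and searching? Any help is appreciated. Thanks.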

Answer 1

小编典典

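Nutch is quite different from what you may have tried before. It is more like a framework: besides a front end for querying and searching (although Solr seems more powerful than the native Nutch search front end), it also has a crawling part and an indexing part (writing Lucene indexes).

If you want to use the crawl for purposes other than search, you will need to develop your own programs and be familiar with Hadoop and MapReduce programming.

I do not know what you want to do with the crawl, but it looks like Nutch is not the solution.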
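The requirement in the question, storing only resources that changed since the last crawl, can be sketched in plain Java without any crawler framework. The class below is a hypothetical helper, not part of the Nutch API: it assumes the page bytes have already been downloaded and remembers an MD5 digest per URL, so a page is reported as worth storing only when it is new or its content differs from the previous visit.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not Nutch API): remembers a content digest per URL. */
class RecrawlSketch {
    // URL -> hex digest of the content seen on the previous visit
    private final Map<String, String> seen = new HashMap<>();

    /** Returns true if the URL is new or its content differs from the last visit. */
    boolean shouldStore(String url, byte[] content) {
        String digest = md5Hex(content);
        String previous = seen.put(url, digest); // null on the first visit
        return !digest.equals(previous);
    }

    /** Hex-encoded MD5 digest of the given bytes. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        RecrawlSketch sketch = new RecrawlSketch();
        byte[] v1 = "<html>v1</html>".getBytes(StandardCharsets.UTF_8);
        byte[] v2 = "<html>v2</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(sketch.shouldStore("http://example.com/", v1)); // first visit
        System.out.println(sketch.shouldStore("http://example.com/", v1)); // unchanged content
        System.out.println(sketch.shouldStore("http://example.com/", v2)); // changed content
    }
}
```

This sketch still downloads every page and only avoids storing unchanged ones. Skipping the download itself would require something like HTTP conditional requests, which are not shown here.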

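Apache Nutch 1.0 final released

Nutch is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.

This release includes a large number of bug fixes and improvements, including Solr integration, a new indexing framework and a scoring framework.

Details: http://www.apache.org/dist/lucene/nutch/CHANGES-1.0.txt

Download: http://lucene.apache.org/nutch/release/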

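Apache Nutch 1.1 released - download

Nutch is an open-source search engine implemented in Java and built on Lucene. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.

A new Nutch release has been a long time coming: 1.1 arrives more than a year after 1.0.

Nutch 1.1 contains too many improvements to list here; see Changes for details.

Download Nutch 1.1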

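Apache Nutch 1.13 released, web crawler

The Apache Nutch Project Management Committee has announced the release of Apache Nutch 1.13. All current users and developers of the 1.X series are advised to upgrade to this release.

Nutch is a mature, production-ready web crawler. Nutch 1.x enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.

Changes: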

Sub-task

[NUTCH-2246] - Refactor /seed endpoint for backward compatibility

Bug

[NUTCH-1553] - Property 'indexer.delete.robots.noindex' not working when using parser-html.
[NUTCH-2242] - lastModified not always set
[NUTCH-2291] - Fix mrunit dependencies
[NUTCH-2337] - urlnormalizer-basic to strip empty port
[NUTCH-2345] - FetchItemQueue logs are logged with wrong class name
[NUTCH-2349] - urlnormalizer-basic NPE for ill-formed URL "http:/"
[NUTCH-2357] - Index metadata throw Exception because writable object cannot be cast to Text
[NUTCH-2359] - Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed
[NUTCH-2364] - http.agent.rotate: IllegalArgumentException / last element of agent names ignored
[NUTCH-2366] - Deprecated Job constructor in hostdb/ReadHostDb.java

Improvement

[NUTCH-1308] - Add main() to ZipParser
[NUTCH-2164] - Inconsistent 'Modified Time' in crawl db
[NUTCH-2234] - Upgrade to elasticsearch 2.3.3
[NUTCH-2236] - Upgrade to Hadoop 2.7.2
[NUTCH-2262] - Utilize parameterized logging notation across Fetcher
[NUTCH-2272] - Index checker server to optionally keep client connection open
[NUTCH-2286] - CrawlDbReader -stats to show fetch time and interval
[NUTCH-2287] - Indexer-elastic plugin should use Elasticsearch BulkProcessor and BackoffPolicy
[NUTCH-2299] - Remove obsolete properties protocol.plugin.check.*
[NUTCH-2300] - Fetcher to optionally save robots.txt
[NUTCH-2327] - Seeds injected in REST workflow must be ingested into HDFS
[NUTCH-2329] - Update Slf4j logging for Java 8 and upgrade miredot plugin version
[NUTCH-2336] - SegmentReader to implement Tool
[NUTCH-2352] - Log with Generic Class Name at Nutch 1.x
[NUTCH-2355] - Protocol plugins to set cookie if Cookie metadata field is present
[NUTCH-2367] - Get single record from HostDB

New Feature

[NUTCH-2132] - Publisher/Subscriber model for Nutch to emit events

Task

[NUTCH-2171] - Upgrade Nutch Trunk to Java 1.8

Download:

http://nutch.apache.org/downloads.html

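Apache Nutch 1.10 released, search engine

Apache Nutch 1.10 has been released and is available for download: http://syncope.apache.org/downloads.html. Note that this link and the SYNCOPE issue keys listed below belong to the Apache Syncope project rather than to Nutch.

Changes:

Bug fixes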

[SYNCOPE-654] - Some generic and uninformative error messages
[SYNCOPE-655] - Files under /etc/apache-syncope ignored
[SYNCOPE-656] - Debian configuration files overwritten
[SYNCOPE-658] - Duplicate derived attribute after sync task when it is configured as accountid for the synched resource
[SYNCOPE-659] - Wrong fasterxml.jackson, common-lang3 version in the Import-Package in the syncope-common, syncope-client
[SYNCOPE-664] - Empty string values not allowed with Oracle DB

Improvements

[SYNCOPE-663] - Option to ignore users / roles during synchronization or push

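For the full list of changes, see http://s.apache.org/S4Z.

Nutch is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.

Nutch was founded by Doug Cutting, who is also the founder of the Lucene, Hadoop and Avro open-source projects.

Nutch, started in August 2002, is an open-source search engine project under Apache, implemented in Java. After version 1.2, Nutch evolved from a search engine into a web crawler, and then split into two main branches, 1.X and 2.X. The biggest difference between them is that 2.X abstracts the underlying data storage so that it can support a variety of storage technologies.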
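The idea of abstracting the storage layer, as described for the 2.X branch, can be illustrated with a small Java sketch. Everything here is hypothetical: the interface and class names are made up for illustration and are neither Nutch nor Gora API. The point is only that crawler code written against an interface does not need to know which backend holds the data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical storage contract; illustrative names, not Nutch or Gora API. */
interface PageStore {
    void put(String url, String content);

    Optional<String> get(String url);
}

/** One possible backend: a plain in-memory map. */
class InMemoryPageStore implements PageStore {
    private final Map<String, String> pages = new HashMap<>();

    @Override
    public void put(String url, String content) {
        pages.put(url, content);
    }

    @Override
    public Optional<String> get(String url) {
        return Optional.ofNullable(pages.get(url));
    }
}

/** Crawler-side code depends only on the PageStore interface. */
class StorageAbstractionSketch {
    static void record(PageStore store, String url, String content) {
        store.put(url, content);
    }

    public static void main(String[] args) {
        PageStore store = new InMemoryPageStore(); // could be swapped for another backend
        record(store, "http://example.com/", "<html>hello</html>");
        System.out.println(store.get("http://example.com/").isPresent());
        System.out.println(store.get("http://example.com/missing").isPresent());
    }
}
```

Another implementation of `PageStore`, for example one backed by a database, could replace `InMemoryPageStore` without any change to `record`.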

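Nutch's evolution gave rise to four open-source Java projects: Hadoop, Tika, Gora and Crawler Commons. All four have grown quickly and become very popular, Hadoop in particular, which is now the de facto standard for large-scale data processing. Tika uses a number of existing open-source content-parsing projects to extract metadata and structured text from files in many formats; Gora supports persisting big data to multiple storage implementations; and Crawler Commons is a general-purpose set of web crawler components.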

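Nutch aims to make it easy and inexpensive for anyone to set up a world-class web search engine. To achieve this ambitious goal, Nutch must be able to:

Fetch several billion pages per month
Maintain an index of these pages
Search that index up to a thousand times per second
Provide high-quality search results
Operate at minimal cost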
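To give a sense of scale for the first goal, the short Java sketch below converts a monthly page count into the average fetch rate it implies. The figures are assumptions chosen for illustration: one billion pages and a 30-day month. Each billion pages per month works out to roughly 386 pages per second, sustained around the clock.

```java
/** Back-of-the-envelope arithmetic; the input figures are illustrative assumptions. */
class CrawlRateSketch {
    /** Average pages per second needed to fetch pagesPerMonth pages in daysInMonth days. */
    static double pagesPerSecond(long pagesPerMonth, int daysInMonth) {
        long secondsInMonth = daysInMonth * 24L * 60L * 60L;
        return (double) pagesPerMonth / secondsInMonth;
    }

    public static void main(String[] args) {
        // Assumed inputs: 1 billion pages, 30-day month (2,592,000 seconds).
        double rate = pagesPerSecond(1_000_000_000L, 30);
        System.out.printf("%.1f pages per second%n", rate);
    }
}
```

"Several billion" pages per month therefore means a sustained rate of well over a thousand pages per second, which is why the work has to be spread across many machines.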

Online Javadoc: http://tool.oschina.net/apidocs/apidoc?api=nutch2.0
