Nutch Web Crawl

Instructions for getting a quick web crawl going with Solr and Nutch (Nutch 1.15 is used here). Assumes a JRE and Solr are already installed.

Extract the Nutch binary release and change into its directory:

tar -xzf apache-nutch-1.15-bin.tar.gz
cd apache-nutch-1.15

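Set the crawler's agent name in conf/nutch-site.xml. Nutch will not fetch anything until http.agent.name has a value:

<configuration>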
  <property>
    <name>http.agent.name</name>
    <value>MySpider</value>
  </property>
</configuration>

Create seed.txt containing the URL to start crawling from:

https://dustinb.github.io
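
One way to write the seed URL into a file from the shell is sketched below. The file name seed.txt is taken from the -s argument of the crawl command further down; creating it in the Nutch directory is an assumption made so that the relative path resolves when bin/crawl runs from there.

```shell
# Create the seed list: one start URL per line.
# Assumes the current directory is the extracted apache-nutch-1.15 folder.
printf '%s\n' 'https://dustinb.github.io' > seed.txt

# Show the file so the contents can be checked by eye.
cat seed.txt
```

More start URLs can be appended to the same file, one per line.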

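Restrict the crawl to that site in conf/regex-urlfilter.txt by replacing the default catch-all "+." rule at the end of the file. The host must match the seed URL, otherwise the seed itself is filtered out:

# accept only URLs on the seed site
+^https://dustinb\.github\.io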
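
A filter rule can be sanity-checked before starting a crawl. The sketch below assumes the accept rule is anchored to the seed host dustinb.github.io and approximates it with grep -E. Nutch evaluates these rules as Java regular expressions, so this is only a rough check for simple patterns. The function name matches_rule and both test URLs are made up for illustration.

```shell
# Accept pattern from the filter file, without the leading "+".
rule='^https://dustinb\.github\.io'

# matches_rule is a hypothetical helper: it succeeds when the URL matches the rule.
matches_rule() {
  printf '%s\n' "$1" | grep -Eq "$rule"
}

# Example URLs chosen for illustration only.
for url in 'https://dustinb.github.io/about/' 'https://example.com/'; do
  if matches_rule "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The first URL should be reported as accepted and the second as rejected, which mirrors how Nutch drops a URL that matches no accept rule.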

Tell Nutch which Solr core to post to. The default core is nutch; to use a different one, change the url parameter in conf/index-writers.xml:

<param name="url" value="http://localhost:8983/solr/nutch"/>

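Run the crawl. Here -i indexes the results into Solr, -s names the seed file, Crawl is the directory where crawl data is stored, and 2 is the number of crawl rounds:

bin/crawl -i -s seed.txt Crawl 2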

Written on March 6, 2019