Nutch Web Crawl

Instructions for getting a quick web crawl going with the latest versions of Solr and Nutch (Nutch 1.15 at the time of writing). Assumes a JRE and Solr are already installed.

  1. Unpack Nutch
tar -xzf apache-nutch-1.15-bin.tar.gz
cd apache-nutch-1.15
  1. Set the crawler name in conf/nutch-site.xml (Nutch will not fetch without http.agent.name)
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MySpider</value>
  </property>
</configuration>
  1. Create a seed.txt file containing the start URL
https://dustinb.github.io
  1. Update the URL regex at the end of conf/regex-urlfilter.txt so that only your domain is crawled, replacing the default accept-everything rule (+.)
# accept anything else
+^https://dustinb\.github\.io
  1. Tell Nutch which Solr core to post to. The default core is nutch; to use a different one, change the url in conf/index-writers.xml
<param name="url" value="http://localhost:8983/solr/nutch"/>
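  1. Run the crawl (-i indexes the results into Solr, -s gives the path to the seeds, Crawl is the crawl directory, 2 is the number of rounds)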
bin/crawl -i -s seed.txt Crawl 2
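
If the crawl finishes without fetching anything, one likely cause is a seed URL that the accept rule in conf/regex-urlfilter.txt does not match. The following is a minimal sketch of a pre-flight check, not part of Nutch: the url_accepted helper is hypothetical, and it uses grep -E (POSIX extended regex) as a rough stand-in for the Java regex engine Nutch uses, so it is only reliable for simple patterns like the one above.

```shell
#!/bin/sh
# Hypothetical helper (not a Nutch command): check whether a URL matches
# a Nutch-style "+regex" accept rule. grep -E only approximates the Java
# regex syntax that Nutch itself uses.

# url_accepted RULE URL: succeeds if URL matches the regex after the leading '+'.
url_accepted() {
  rule=$1
  url=$2
  pattern=${rule#+}   # strip the leading '+' that marks an accept rule
  printf '%s\n' "$url" | grep -Eq -- "$pattern"
}

seed='https://dustinb.github.io'
rule='+^https://dustinb\.github\.io'

if url_accepted "$rule" "$seed"; then
  echo "accepted: $seed"
else
  echo "rejected: $seed"
fi
```

With the seed and rule from the steps above, the check reports the seed as accepted. A rule written for a different host, such as github.com instead of github.io, would report it as rejected.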
Written on March 6, 2019