Nutch Web Crawl

Instructions for getting a quick web crawl going with the latest versions of Solr and Nutch. Assumes a JRE and Solr are already installed.

  1. Unpack Nutch
tar -xzf apache-nutch-1.15-bin.tar.gz
cd apache-nutch-1.15
  1. Set the crawler name in conf/nutch-site.xml (the http.agent.name property)
  1. Create a seed.txt file listing the start URLs, one per line
  1. Update the URL regex in conf/regex-urlfilter.txt so that only your domain is crawled; the rule to replace sits under this comment
# accept anything else
  1. Tell Nutch which Solr core to post to. It defaults to nutch; to use a different core, change the url param in conf/index-writers.xml
<param name="url" value="http://localhost:8983/solr/nutch"/>
  1. Run the crawl
bin/crawl -i -s seed.txt Crawl 2
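
The configuration steps above (crawler name, seed file, URL filter) could be scripted roughly as follows. This is a minimal sketch, not part of the original instructions: the crawler name `MyTestCrawler` and the domain `example.com` are placeholders, and it assumes it is run from the apache-nutch-1.15 directory with a stock conf/nutch-site.xml, which it overwrites. The filter rule is only printed, so it can be pasted into conf/regex-urlfilter.txt by hand.

```shell
# Sketch only: AGENT_NAME and DOMAIN are hypothetical placeholders.
# Run from the apache-nutch-1.15 directory.
AGENT_NAME="MyTestCrawler"
DOMAIN="example.com"

# conf/ already exists in an unpacked Nutch; -p makes this a no-op there.
mkdir -p conf

# Crawler name: set through the http.agent.name property.
# Note: this overwrites conf/nutch-site.xml.
cat > conf/nutch-site.xml <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>${AGENT_NAME}</value>
  </property>
</configuration>
EOF

# Seed file: one start URL per line.
printf '%s\n' "https://www.${DOMAIN}/" > seed.txt

# URL filter: escape the dots in the domain, then build an accept rule
# that matches the domain and any of its subdomains.
ESCAPED_DOMAIN=$(printf '%s' "$DOMAIN" | sed 's/\./\\./g')
FILTER_RULE="+^https?://([a-z0-9-]+\\.)*${ESCAPED_DOMAIN}/"

# Print the rule so it can replace the catch-all "+." rule
# in conf/regex-urlfilter.txt.
printf '%s\n' "$FILTER_RULE"
```

Printing the rule instead of editing regex-urlfilter.txt in place keeps the sketch from touching the other filter rules in that file.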
Written on March 6, 2019