Nutch Web Crawl
Instructions for getting a quick web crawl going with the latest versions of Solr and Nutch. Assumes a JRE and Solr are already installed.
- Unpack Nutch
tar -xzf apache-nutch-1.15-bin.tar.gz
cd apache-nutch-1.15
- Update conf/nutch-site.xml with the crawler name
<configuration>
<property>
<name>http.agent.name</name>
<value>MySpider</value>
</property>
</configuration>
- Create a seed.txt file containing the start URL
https://dustinb.github.io
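- Update the URL regex in conf/regex-urlfilter.txt to only crawl your domain, replacing the default +. rule at the end of the file. The pattern has to match the seed URL, otherwise the seed itself is filtered out
# accept anything else
+^https://dustinb\.github\.io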
- Tell Nutch which Solr core to post to. It defaults to nutch; this can be changed in conf/index-writers.xml
<param name="url" value="http://localhost:8983/solr/nutch"/>
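- Run the crawl (-i indexes the results into Solr, -s gives the seed file, Crawl is the crawl directory, 2 is the number of rounds)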
bin/crawl -i -s seed.txt Crawl 2
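The steps above can be wrapped in a few small shell helpers, which makes it easier to catch a seed that the URL filter would reject and to confirm that documents reached Solr. This is only a sketch: the function names are made up for illustration and are not part of Nutch or Solr, and grep -E is used as a stand-in for the Java regex matching Nutch does, which should agree for simple patterns like the one here.

```shell
#!/bin/sh
# Sketch only: write_seeds, rejected_seeds and solr_count_url are
# hypothetical helper names, not Nutch or Solr commands.

# Write one seed URL per line to the given file.
# Usage: write_seeds <file> <url>...
write_seeds() {
  seed_file=$1
  shift
  printf '%s\n' "$@" > "$seed_file"
}

# Print every seed that the accept pattern would NOT match.
# Prints nothing when all seeds pass the filter.
# Usage: rejected_seeds <file> <pattern without the leading +>
rejected_seeds() {
  grep -Ev -- "$2" "$1" || true
}

# Build a Solr select URL that asks only for the document count (rows=0).
# Usage: solr_count_url <core url>
solr_count_url() {
  printf '%s/select?q=*:*&rows=0\n' "$1"
}
```

With these loaded, write_seeds seed.txt "https://dustinb.github.io" creates the seed file, and rejected_seeds seed.txt '^https://dustinb\.github\.io' should print nothing before the crawl is started. After the crawl, curl "$(solr_count_url http://localhost:8983/solr/nutch)" returns a JSON response whose numFound field should be greater than zero if indexing worked.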
Written on March 6, 2019