Browsertrix Crawler on Cloudron
-
"Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses puppeteer-cluster and puppeteer to control one or more browsers in parallel."
One brilliant use for this tool is demonstrated by Zimit, which lets you nominate a website, crawl it, and archive its pages into a .zim file that can be read offline in Kiwix.
https://github.com/webrecorder/browsertrix-crawler
GPL v3
Docker image is available: https://webrecorder.net/tools#browsertrix
Zimit:
https://youzim.it
Kiwix:
https://kiwix.org
.zim:
https://www.openzim.org/wiki/ZIM_file_format
This is a complex piece of software and the busy maintainer would like to help make it easier to use. It might be possible for open instances of Browsertrix Crawler to help scale up the power of a crawl on larger websites. I suppose it might also be possible, at some stage, to share crawl results between co-operating instances.
For example, you can create a dump of all of Wikipedia, about 60GB, compressed into a .zim, and then browse it offline.