Browsertrix Crawler on Cloudron

LoudLemur

"Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses puppeteer-cluster and puppeteer to control one or more browsers in parallel."

One brilliant use for this tool is demonstrated by Zimit, which lets you nominate a website, crawl it, and archive its pages into a .zim file that can be read offline in Kiwix.

https://github.com/webrecorder/browsertrix-crawler
GPL v3
Docker Image is available
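
Since the crawl runs in a single Docker container, a one-off crawl can be sketched roughly as below. This is only a rough illustration, not the project's official usage: the helper name `build_crawl_command` and the `./crawls` output directory are made up for this post, and the image name and flags (`--url`, `--collection`, `--workers`, `--generateWACZ`) are as I remember them from the project README, so check them against `crawl --help` for the version you pull. The script only builds and prints the command; it does not start Docker itself.

```python
import shlex
from pathlib import Path


def build_crawl_command(url, collection, workers=1, output_dir="./crawls"):
    """Return the argv list for a one-off Browsertrix Crawler run in Docker.

    Hypothetical helper: image name and flags are assumptions taken from
    the project README and may differ between releases.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")

    # Docker needs an absolute host path for the bind mount.
    host_dir = Path(output_dir).resolve()

    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/crawls/",        # crawl output lands on the host
        "webrecorder/browsertrix-crawler",   # assumed Docker Hub image name
        "crawl",
        "--url", url,                        # seed URL to start from
        "--collection", collection,          # name of the output collection
        "--workers", str(workers),           # browser workers run in parallel
        "--generateWACZ",                    # package the result as a WACZ file
    ]


if __name__ == "__main__":
    cmd = build_crawl_command("https://example.com/", "example", workers=2)
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(cmd))
```

Pasting the printed command into a shell on a machine with Docker would start the crawl. A Cloudron package would presumably wrap something similar, though how it would expose the options is an open question.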

https://webrecorder.net/tools#browsertrix

Zimit:
https://youzim.it
Kiwix:
https://kiwix.org
.zim file format:
https://www.openzim.org/wiki/ZIM_file_format

This is a complex piece of software, and the busy maintainer would like to help make it easier to use. It might be possible for open instances of Browsertrix Crawler to help scale up the power of a crawl on larger websites. I suppose it might also be possible, at some stage, to share crawl results between co-operating instances.

For example, you can create a dump of all of Wikipedia, about 60 GB compressed into a single .zim file, and then browse it offline.


Cloudron Forum
