Browsertrix Crawler on Cloudron
-
"Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses puppeteer-cluster and puppeteer to control one or more browsers in parallel."
One brilliant use for this tool is demonstrated by Zimit, which lets you nominate a website, crawl it, and then archive its webpages into a .zim file that can be read offline in Kiwix.
https://github.com/webrecorder/browsertrix-crawler
GPL v3
A Docker image is available:
https://webrecorder.net/tools#browsertrix
Zimit:
https://youzim.it
Kiwix:
https://kiwix.org
.zim
https://www.openzim.org/wiki/ZIM_file_format
This is a complex piece of software and the busy maintainer would like to help make it easier to use. It might be possible for open instances of Browsertrix Crawler to help scale up the power of a crawl on larger websites. I suppose it might also be possible to share the results of crawls between co-operating instances at some stage.
For example, you can create a dump of all of Wikipedia, about 60GB, compressed into a .zim, and then browse it offline.
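To give a feel for what "a crawl in a single Docker container" looks like in practice, here is a rough sketch in Python that only assembles a `docker run` command line and prints it. The image name `webrecorder/browsertrix-crawler`, the `crawl` subcommand, the `/crawls/` mount point, and the `--url`, `--collection`, `--workers` and `--generateWACZ` flags are as I remember them from the project's README, so please check them against the current README before relying on this. The helper name `build_crawl_command` and its defaults are made up for illustration.

```python
import shlex
from pathlib import Path


def build_crawl_command(url, collection, output_dir="crawls", workers=1):
    """Assemble (but do not run) a docker command for a single crawl.

    Hypothetical helper: flag names follow the browsertrix-crawler README
    as recalled here and should be verified against the current version.
    """
    # Docker bind mounts need an absolute host path.
    host_dir = Path(output_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/crawls/",        # crawl output lands on the host
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,                         # seed URL to start crawling from
        "--collection", collection,           # name of the output collection
        "--workers", str(workers),            # number of parallel browser workers
        "--generateWACZ",                     # package the result as a WACZ file
    ]


if __name__ == "__main__":
    cmd = build_crawl_command("https://example.com/", "example")
    # Print a shell-quoted version that can be pasted into a terminal.
    print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps the URL safely quoted, and the list could be passed to `subprocess.run(cmd, check=True)` on a machine where Docker is installed.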