    Cloudron Forum

    Browsertrix Crawler on Cloudron

    App Wishlist
    browsertrix crawler zim kiwix archive
      LoudLemur last edited by LoudLemur

      "Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses puppeteer-cluster and puppeteer to control one or more browsers in parallel."

      One brilliant use for this tool is demonstrated by Zimit, which lets you nominate a website, crawl it, and archive its pages into a .zim file that can be read offline in Kiwix.

      https://github.com/webrecorder/browsertrix-crawler
      GPL v3
      Docker Image is available
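
To give a rough idea of what "a crawl in a single Docker container" could look like, here is a minimal Python sketch that only assembles a `docker run` command line and does not run anything itself. The image name `webrecorder/browsertrix-crawler`, the `crawl` subcommand, the `/crawls/` mount point, and the `--url`, `--collection` and `--workers` flags are written as I recall them from the project README, so please check them against the current README before relying on this. `build_crawl_command` is a hypothetical helper name, not part of the project.

```python
import shlex
from pathlib import Path


def build_crawl_command(url, collection, output_dir, workers=1):
    """Return the argv list for a one-off crawl in a single container.

    Assumption: image name, subcommand, flags and mount point follow the
    Browsertrix Crawler README as remembered; verify before use.
    """
    return [
        "docker", "run", "--rm",
        # Mount a host directory so the crawl output survives the container.
        "-v", f"{Path(output_dir).resolve()}:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--collection", collection,
        # Number of browser workers to run in parallel.
        "--workers", str(workers),
    ]


if __name__ == "__main__":
    cmd = build_crawl_command("https://example.com/", "example", "./crawls", workers=2)
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

A Cloudron package would presumably wrap something like this behind a web UI rather than a one-off command, but that part is speculation on my side.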

      https://webrecorder.net/tools#browsertrix

      Zimit:
      https://youzim.it
      Kiwix:
      https://kiwix.org
      .zim
      https://www.openzim.org/wiki/ZIM_file_format

      This is a complex piece of software, and the busy maintainer would like to help make it easier to use. It might be possible for open instances of Browsertrix Crawler to help scale up the power of a crawl on larger websites. I suppose it might also become possible, at some stage, to share crawl results between co-operating instances.

      For example, you can create a dump of all of Wikipedia, about 60GB, compressed into a .zim, and then browse it offline.
