Browsertrix Crawler on Cloudron

Category: App Wishlist
Tags: browsertrix, crawler, zim, kiwix, archive
LoudLemur wrote (#1, last edited by LoudLemur):

    "Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses puppeteer-cluster and puppeteer to control one or more browsers in parallel."

    One brilliant use for this tool is demonstrated by Zimit, which lets you nominate a website, crawl it, and archive its pages into a .zim file that is readable offline in Kiwix.
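
    Zimit itself ships as a Docker image; a crawl-to-.zim run looks roughly like this (flag names vary between Zimit releases, e.g. --url in older versions vs. --seeds in newer ones; the URL and name are placeholders):

        # crawl a site and write example.zim into ./output
        docker run -v $PWD/output:/output ghcr.io/openzim/zimit zimit \
            --url https://example.com/ --name example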

    https://github.com/webrecorder/browsertrix-crawler
    License: GPL v3
    A Docker image is available.

    https://webrecorder.net/tools#browsertrix

    Zimit: https://youzim.it
    Kiwix: https://kiwix.org
    .zim format: https://www.openzim.org/wiki/ZIM_file_format

    This is a complex piece of software, and the busy maintainer would welcome help in making it easier to use. It might be possible for open instances of Browsertrix Crawler to help scale up a crawl across larger websites, and perhaps, at some stage, for co-operating instances to share crawl results with each other.

    For example, you can create a dump of all of Wikipedia, about 60 GB compressed into a .zim, and then browse it offline.
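
    Once you have a .zim, the Kiwix desktop and mobile apps can open it directly, or kiwix-serve can publish it over HTTP on your network (the filename below is a placeholder; real Wikipedia images are named by flavour and date):

        # serve the archive over HTTP on port 8080
        kiwix-serve --port 8080 wikipedia_en_all.zim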
