Cloudron Forum

grab-site

App Wishlist
4 Posts 4 Posters 2.0k Views 4 Watching
robi wrote (#1):

    https://github.com/ArchiveTeam/grab-site

    grab-site


    grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses a fork of wpull for crawling.

    grab-site gives you

    • a dashboard with all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.

    • the ability to add ignore patterns when the crawl is already running. This allows you to skip the crawling of junk URLs that would otherwise prevent your crawl from ever finishing. See below.

    • an extensively tested default ignore set (global) as well as additional (optional) ignore sets for forums, reddit, etc.

    • duplicate page detection: links are not followed on pages whose content duplicates an already-seen page.

    The URL queue is kept on disk instead of in memory. If you're really lucky, grab-site will manage to crawl a site with ~10M pages.
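
Not part of the README, but to make the workflow concrete: a minimal sketch of starting a crawl from Python, assuming grab-site is installed and on PATH. The target URL is a placeholder, and `--igsets` is the flag the README uses to opt into the optional ignore sets mentioned above.

```python
# Minimal sketch: start a grab-site crawl via its CLI from Python.
# Assumes grab-site is installed and on PATH; the URL is a placeholder.
import subprocess

target = "https://example.com/"  # hypothetical site to archive

# "--igsets=forums" opts into one of the optional ignore sets;
# the tested global ignore set is applied by default.
subprocess.run(["grab-site", "--igsets=forums", target], check=True)
```

If this matches the current CLI, each crawl ends up as WARC files in its own directory, which is what you would feed into the offline-reading step discussed later in this thread.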

[dashboard screenshot]

    Conscious tech

murgero (App Dev) wrote (#2):

      @robi This looks very interesting!

      --
      https://urgero.org
      ~ Professional Nerd. Freelance Programmer. ~

jdaviescoates wrote (#3):

        Useful utility.

        This free tool https://www.httrack.com/ does this very well too.
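
For anyone comparing the two, a sketch of the equivalent HTTrack invocation, using the same subprocess approach as the grab-site example above. The URL and output path are placeholders; `-O` is HTTrack's documented output-directory flag.

```python
# Sketch: mirror a site with HTTrack's CLI instead of grab-site.
# Assumes httrack is installed; URL and output path are placeholders.
import subprocess

subprocess.run(
    ["httrack", "https://example.com/", "-O", "./example-mirror"],
    check=True,
)
```

One difference worth noting: HTTrack produces a browsable local mirror, while grab-site writes WARC files.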

        I use Cloudron with Gandi & Hetzner

LoudLemur wrote (#4, last edited by LoudLemur):

          @robi grab-site is a great suggestion and I hope Cloudron supports it. @jdaviescoates makes a good recommendation too.

          After the website is grabbed, the next phase is reading and searching it offline. I don't know if you have had much joy trying that with grab-site.
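
As a rough sketch of that offline phase (this is not grab-site functionality; it uses the separate warcio Python library, and the WARC filename is a placeholder), one could scan a crawl's output for a term like this:

```python
# Sketch: grep the HTTP responses captured in a WARC file.
# Requires the warcio library (pip install warcio);
# the filename below stands in for a grab-site output file.
from warcio.archiveiterator import ArchiveIterator

needle = b"search term"

with open("example.com-crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # Skip requests, metadata, etc.; we only want captured responses.
        if record.rec_type != "response":
            continue
        body = record.content_stream().read()
        if needle in body:
            print(record.rec_headers.get_header("WARC-Target-URI"))
```

For actually browsing rather than searching, replay tools like pywb can serve a WARC back as a website.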

If grab-site can be supported, it is not very far from being able to support YaCy too, which also visits and crawls websites. There is a request for YaCy support on Cloudron here:

https://forum.cloudron.io/topic/2715/yacy-decentralized-web-search
