Cloudron makes it easy to run web apps like WordPress, Nextcloud, GitLab on your server. Find out more or install now.


    Cloudron Forum

    • Register
    • Login
    • Search
    • Categories
    • Recent
    • Tags
    • Popular

    grab-site

    App Wishlist
    4
    4
    82
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • robi
      robi last edited by

      https://github.com/ArchiveTeam/grab-site

      grab-site

      Build status

      grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses a fork of wpull for crawling.

      grab-site gives you

      • a dashboard with all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.

      • the ability to add ignore patterns when the crawl is already running. This allows you to skip the crawling of junk URLs that would otherwise prevent your crawl from ever finishing. See below.

      • an extensively tested default ignore set (global) as well as additional (optional) ignore sets for forums, reddit, etc.

      • duplicate page detection: links are not followed on pages whose content duplicates an already-seen page.

      The URL queue is kept on disk instead of in memory. If you're really lucky, grab-site will manage to crawl a site with ~10M pages.

      dashboard screenshot

      Life of Advanced Technology

      murgero L 2 Replies Last reply Reply Quote 3
      • murgero
        murgero App Dev @robi last edited by

        @robi This looks very interesting!

        --
        https://urgero.org
        ~ Professional Nerd. Freelance Programmer. ~
        Matrix: @murgero:urgero.org

        1 Reply Last reply Reply Quote 1
        • jdaviescoates
          jdaviescoates last edited by

          Useful utility.

          This free tool https://www.httrack.com/ does this very well too.

          I use Cloudron with Gandi & Hetzner

          1 Reply Last reply Reply Quote 2
          • L
            LoudLemur @robi last edited by LoudLemur

            @robi grab-site is a great suggestion and I hope Cloudron supports it. @jdaviescoates makes a good recommendation too.

            After the website is grabbed, the next phase is reading and searching it offline. I don't know if you have had much joy trying that with grab-site.

            If grab-site can be supported, it is not very far from being able to support YaCy too, which also visits websites and crawls the pages. There is a request for YaCy support on Cloudron here:

            https://forum.cloudron.io/topic/2715/yacy-decentralized-web-search?_=1673430654350

            1 Reply Last reply Reply Quote 0
            • First post
              Last post
            Powered by NodeBB