Cloudron Forum

grab-site

App Wishlist
4 Posts 4 Posters 168 Views
  • robi wrote (#1):

    https://github.com/ArchiveTeam/grab-site

    grab-site

    grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses a fork of wpull for crawling.

    grab-site gives you

    • a dashboard with all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.

    • the ability to add ignore patterns when the crawl is already running. This allows you to skip the crawling of junk URLs that would otherwise prevent your crawl from ever finishing.

    • an extensively tested default ignore set (global) as well as additional (optional) ignore sets for forums, reddit, etc.

    • duplicate page detection: links are not followed on pages whose content duplicates an already-seen page.
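The ignore-pattern and duplicate-detection ideas in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how grab-site or wpull implement them: the `CrawlFilter` class and its method names are hypothetical, ignore patterns are assumed to be regular expressions matched against the URL, and "duplicate" is assumed to mean byte-identical page content.

```python
import hashlib
import re


class CrawlFilter:
    """Hypothetical helper for illustration; not part of grab-site or wpull."""

    def __init__(self, ignore_patterns=()):
        # Assumption: each ignore pattern is a regular expression.
        self._ignores = [re.compile(p) for p in ignore_patterns]
        self._seen_digests = set()

    def add_ignore(self, pattern):
        # May be called at any time, including while a crawl is running.
        self._ignores.append(re.compile(pattern))

    def should_skip(self, url):
        # Skip the URL if any ignore pattern matches anywhere in it.
        return any(rx.search(url) for rx in self._ignores)

    def is_duplicate(self, body):
        # Assumption: two pages are duplicates when their bytes are identical.
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._seen_digests:
            return True
        self._seen_digests.add(digest)
        return False
```

A crawler loop built on this would call `should_skip(url)` before fetching and `is_duplicate(body)` before extracting links, so that links on an already-seen page are not followed.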

    The URL queue is kept on disk instead of in memory. If you're really lucky, grab-site will manage to crawl a site with ~10M pages.
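To make the on-disk queue idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite, the `DiskQueue` class, and the table layout are all assumptions chosen for illustration; the README text above does not say what storage format grab-site actually uses.

```python
import sqlite3


class DiskQueue:
    """Hypothetical on-disk FIFO of URLs; not grab-site's actual storage."""

    def __init__(self, path):
        # The queue lives in a SQLite file at `path`, not in a Python list.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "url TEXT UNIQUE NOT NULL, "
            "done INTEGER NOT NULL DEFAULT 0)"
        )
        self._db.commit()

    def push(self, url):
        # INSERT OR IGNORE stores each URL at most once.
        self._db.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
        self._db.commit()

    def pop(self):
        # Return the oldest URL not yet handed out, or None if none remain.
        row = self._db.execute(
            "SELECT id, url FROM queue WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self._db.execute("UPDATE queue SET done = 1 WHERE id = ?", (row[0],))
        self._db.commit()
        return row[1]

    def pending(self):
        # Number of URLs still waiting to be crawled.
        cur = self._db.execute("SELECT COUNT(*) FROM queue WHERE done = 0")
        return cur.fetchone()[0]
```

Because only one row is read at a time, memory use stays roughly constant however many URLs are queued, which is the property that makes very large crawls feasible.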

    [Image: dashboard screenshot]

    Life of sky tech

  • murgero (App Dev) replied to robi (#2):

    @robi This looks very interesting!

    --
    https://urgero.org
    ~ Professional Nerd. Freelance Programmer. ~
    Matrix: @murgero:urgero.org

  • jdaviescoates wrote (#3):

    Useful utility.

    The free tool HTTrack (https://www.httrack.com/) does this very well too.

    I use Cloudron with Gandi & Hetzner

  • LoudLemur replied to robi (#4):

    @robi grab-site is a great suggestion and I hope Cloudron supports it. @jdaviescoates makes a good recommendation too.

    After the website is grabbed, the next phase is reading and searching it offline. I don't know if you have had much joy trying that with grab-site.

    If grab-site can be supported, it is not very far from being able to support YaCy too, which also visits websites and crawls the pages. There is a request for YaCy support on Cloudron here:

    https://forum.cloudron.io/topic/2715/yacy-decentralized-web-search?_=1673430654350
