<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[grab-site]]></title><description><![CDATA[<p dir="auto"><a href="https://github.com/ArchiveTeam/grab-site" target="_blank" rel="noopener noreferrer nofollow ugc">https://github.com/ArchiveTeam/grab-site</a></p>
<h1>grab-site</h1>
<p dir="auto"><a href="https://travis-ci.org/ArchiveTeam/grab-site" target="_blank" rel="noopener noreferrer nofollow ugc"><img src="https://camo.githubusercontent.com/d1fc302b67c288053876c6658493360ea580169b74c3a516dc86b92bbea50ed4/68747470733a2f2f696d672e736869656c64732e696f2f7472617669732f417263686976655465616d2f677261622d736974652e737667" alt="Build status" class=" img-fluid img-markdown" /></a></p>
<p dir="auto">grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write <a href="https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem" target="_blank" rel="noopener noreferrer nofollow ugc">WARC files</a>. Internally, grab-site uses <a href="https://github.com/ArchiveTeam/ludios_wpull" target="_blank" rel="noopener noreferrer nofollow ugc">a fork</a> of <a href="https://github.com/chfoo/wpull" target="_blank" rel="noopener noreferrer nofollow ugc">wpull</a> for crawling.</p>
<p dir="auto">grab-site gives you:</p>
<ul>
<li>
<p dir="auto">a dashboard with all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.</p>
</li>
<li>
<p dir="auto">the ability to add ignore patterns when the crawl is already running. This allows you to skip the crawling of junk URLs that would otherwise prevent your crawl from ever finishing. See below.</p>
</li>
<li>
<p dir="auto">an extensively tested default ignore set (<a href="https://github.com/ArchiveTeam/grab-site/blob/master/libgrabsite/ignore_sets/global" target="_blank" rel="noopener noreferrer nofollow ugc">global</a>) as well as additional (optional) ignore sets for forums, reddit, etc.</p>
</li>
<li>
<p dir="auto">duplicate page detection: links are not followed on pages whose content duplicates an already-seen page.</p>
</li>
</ul>
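<p dir="auto">The duplicate page detection described in the list above can be sketched roughly as follows. This is only an illustration, not grab-site's actual code: it assumes "duplicate" means byte-for-byte identical content, and the names <code>DuplicateDetector</code> and <code>links_to_follow</code> are hypothetical.</p>

```python
import hashlib


class DuplicateDetector:
    """Remembers a digest of every page body seen so far (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, body):
        """Return True if an identical body was seen before; otherwise record it."""
        # Assumption: exact-match hashing. The real crawler may compare pages differently.
        digest = hashlib.sha256(body).digest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False


def links_to_follow(detector, body, extracted_links):
    """Return the links to enqueue, or an empty list when the page is a duplicate."""
    if detector.is_duplicate(body):
        return []
    return list(extracted_links)
```

<p dir="auto">The page itself is still fetched in this sketch; only its outgoing links are dropped, which is what keeps a crawl from looping through endless copies of the same page.</p>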
<p dir="auto">The URL queue is kept on disk instead of in memory. If you're really lucky, grab-site will manage to crawl a site with ~10M pages.</p>
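<p dir="auto">To illustrate why an on-disk queue helps with very large sites, here is a minimal sketch of a URL queue backed by a file instead of a Python list. SQLite is an assumption made for the example (it ships with Python), and the class name <code>DiskQueue</code> and its table layout are hypothetical, not grab-site's own schema.</p>

```python
import sqlite3


class DiskQueue:
    """A tiny first-in, first-out URL queue stored in an SQLite file (hypothetical)."""

    def __init__(self, path):
        # Pass a file path to persist the queue; ":memory:" is handy for quick experiments.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "url TEXT UNIQUE NOT NULL, "
            "done INTEGER NOT NULL DEFAULT 0)"
        )
        self._db.commit()

    def push(self, url):
        """Add a URL unless it has already been queued."""
        self._db.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
        self._db.commit()

    def pop(self):
        """Return the oldest pending URL and mark it done, or None if nothing is left."""
        row = self._db.execute(
            "SELECT id, url FROM queue WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self._db.execute("UPDATE queue SET done = 1 WHERE id = ?", (row[0],))
        self._db.commit()
        return row[1]

    def remaining(self):
        """Count the URLs still waiting to be fetched."""
        return self._db.execute(
            "SELECT COUNT(*) FROM queue WHERE done = 0"
        ).fetchone()[0]
```

<p dir="auto">Because only the row being processed is held in memory, the queue's size is bounded by disk space rather than RAM, and the pending count is the kind of figure a dashboard could display.</p>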
<p dir="auto"><a href="https://raw.githubusercontent.com/ArchiveTeam/grab-site/master/images/dashboard.png" target="_blank" rel="noopener noreferrer nofollow ugc"><img src="https://raw.githubusercontent.com/ArchiveTeam/grab-site/master/images/dashboard.png" alt="dashboard screenshot" class=" img-fluid img-markdown" /></a></p>
]]></description><link>https://forum.cloudron.io/topic/8420/grab-site</link><generator>RSS for Node</generator><lastBuildDate>Tue, 21 Apr 2026 05:00:30 GMT</lastBuildDate><atom:link href="https://forum.cloudron.io/topic/8420.rss" rel="self" type="application/rss+xml"/><pubDate>Tue, 10 Jan 2023 23:24:35 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to grab-site on Wed, 11 Jan 2023 09:59:25 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/robi" aria-label="Profile: robi">@<bdi>robi</bdi></a> grab-site is a great suggestion and I hope Cloudron supports it. <a class="plugin-mentions-user plugin-mentions-a" href="/user/jdaviescoates" aria-label="Profile: jdaviescoates">@<bdi>jdaviescoates</bdi></a> makes a good recommendation too.</p>
<p dir="auto">After the website is grabbed, the next phase is reading and searching it offline. I don't know if you have had much joy trying that with grab-site.</p>
<p dir="auto">If Cloudron can support grab-site, then supporting YaCy, which also visits websites and crawls their pages, should not be far off. There is a request for YaCy support on Cloudron here:</p>
<p dir="auto"><a href="https://forum.cloudron.io/topic/2715/yacy-decentralized-web-search?_=1673430654350">https://forum.cloudron.io/topic/2715/yacy-decentralized-web-search?_=1673430654350</a></p>
]]></description><link>https://forum.cloudron.io/post/59850</link><guid isPermaLink="true">https://forum.cloudron.io/post/59850</guid><dc:creator><![CDATA[LoudLemur]]></dc:creator><pubDate>Wed, 11 Jan 2023 09:59:25 GMT</pubDate></item><item><title><![CDATA[Reply to grab-site on Wed, 11 Jan 2023 05:11:38 GMT]]></title><description><![CDATA[<p dir="auto">Useful utility.</p>
<p dir="auto">This free tool <a href="https://www.httrack.com/" target="_blank" rel="noopener noreferrer nofollow ugc">https://www.httrack.com/</a> does this very well too.</p>
]]></description><link>https://forum.cloudron.io/post/59840</link><guid isPermaLink="true">https://forum.cloudron.io/post/59840</guid><dc:creator><![CDATA[jdaviescoates]]></dc:creator><pubDate>Wed, 11 Jan 2023 05:11:38 GMT</pubDate></item><item><title><![CDATA[Reply to grab-site on Tue, 10 Jan 2023 23:52:04 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/robi" aria-label="Profile: robi">@<bdi>robi</bdi></a> This looks very interesting!</p>
]]></description><link>https://forum.cloudron.io/post/59833</link><guid isPermaLink="true">https://forum.cloudron.io/post/59833</guid><dc:creator><![CDATA[murgero]]></dc:creator><pubDate>Tue, 10 Jan 2023 23:52:04 GMT</pubDate></item></channel></rss>