ArchiveBox -- Personal Internet Archive
heliostatic last edited by
"ArchiveBox takes a list of website URLs you want to archive, and creates a local, static, browsable HTML clone of the content from those websites (it saves HTML, JS, media files, PDFs, images and more)."
Can import links from:
- Pocket, Pinboard, Instapaper
- RSS, XML, JSON, or plain text lists
- Browser history or bookmarks (Chrome, Firefox, Safari, IE, Opera, and more)
- Shaarli, Delicious, Reddit Saved Posts, Wallabag, Unmark.it, and any other text with links in it!
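To make the "any other text with links in it" point concrete, here is a minimal sketch of pulling URLs out of arbitrary text. It is not ArchiveBox's own parser; the `extract_urls` helper and the regular expression are assumptions made for illustration only.

```python
import re

# Hypothetical helper, NOT ArchiveBox's actual import code.
# Matches http(s) URLs up to the next whitespace, quote, or bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")

def extract_urls(text):
    """Return the unique URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Drop trailing punctuation that usually belongs to the sentence.
        url = match.group(0).rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Feeding it a bookmarks export, an RSS dump, or a pasted note should give a de-duplicated list of links that could then be handed to the archiver.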
Can save these things for each site:
- favicon.ico: favicon of the site
- example.com/page-name.html: wget clone of the site, with .html appended if not present
- output.pdf: printed PDF of the site using headless Chrome
- screenshot.png: 1440x900 screenshot of the site using headless Chrome
- output.html: DOM dump of the HTML after rendering using headless Chrome
- archive.org.txt: a link to the saved site on archive.org
- warc/: the HTML plus a gzipped WARC file (.gz)
- media/: any mp4, mp3, subtitles, and metadata found using youtube-dl
- git/: clone of any repository for GitHub, Bitbucket, or GitLab links
- index.html & index.json: HTML and JSON index files containing metadata and details
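As a rough illustration of the per-site outputs listed above, the sketch below reports which of them are absent from a given archive folder. The file names come from the list; the `missing_outputs` helper and the assumption that they all sit directly in one folder per site are mine, not something confirmed by the project.

```python
from pathlib import Path

# Names taken from the list above. The wget clone is left out because its
# path depends on the archived URL. The flat one-folder-per-site layout is
# an assumption.
EXPECTED_OUTPUTS = [
    "favicon.ico", "output.pdf", "screenshot.png", "output.html",
    "archive.org.txt", "warc", "media", "git", "index.html", "index.json",
]

def missing_outputs(snapshot_dir):
    """Return the expected output names that do not exist in snapshot_dir."""
    root = Path(snapshot_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).exists()]
```

A missing media/ or git/ entry is not necessarily a failure, since those only apply to links that actually contain media or point at a repository.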
There's a Docker image, as well: https://github.com/pirate/ArchiveBox
robi last edited by robi
Came across this today. Packaging-wise, it looks like just a Python script.
infogulch last edited by
The Dockerfile looks a bit involved, but it exposes some very handy ENV vars for customizing the directories and user. An initial implementation might just override those and build using the project's own Dockerfile.
Integration with https://amberlink.org/ would be interesting.