Practical Archiving at Home
As I have been testing/using/loving ArchiveBox, I’ve been working on developing a reusable Docker script to maintain various archives. I’ve been using a Digital Ocean (referral link) Droplet with Ubuntu 19.10 and the latest Docker. This has been the easiest way to create an archive, and with some creativity it allows flexibility in what I collect and how I organize the archives.
I won’t go into configuring your droplet, but this is the script I’ve been using.
docker run -i \
  -v ~/archives:/data \
  --user $(id -u):$(id -g) \
  --name archivebox \
  nikisweeting/archivebox \
  env FETCH_TITLE=True \
  env FETCH_FAVICON=True \
  env FETCH_WGET=True \
  env FETCH_WARC=False \
  env FETCH_PDF=False \
  env FETCH_SCREENSHOT=True \
  env FETCH_DOM=False \
  env FETCH_GIT=True \
  env FETCH_MEDIA=True \
  env SUBMIT_ARCHIVE_DOT_ORG=True \
  /bin/archive && docker rm archivebox
There are a few things to note about this script. First, you have to create both the ~/archives folder AND the ~/archives/sources folder (mkdir -p ~/archives/sources); otherwise it will error out, and I can’t figure out why other than a permissions error. The --user bit passes your uid and gid to the container to help minimize further permissions errors. The next bit names the container archivebox so it can be removed later. Then we specify the image to use, which in this case is nikisweeting/archivebox. From there we configure the archive and what we pull down. In my case, I don’t want WARC files, PDFs or the DOM output, but your archive needs may vary. Note that FETCH_MEDIA=True can consume a lot of storage space, so be aware of that. For further reading on the environment variables, check out the wiki on configuration.
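One way to avoid the manual step is to have the script create the folders itself before the container starts. The sketch below assumes the script runs as the same user whose uid and gid are passed to --user; ensure_archive_dirs is a hypothetical helper name, not something ArchiveBox or Docker provides. A likely explanation for the error (an assumption, not verified on this setup) is that Docker creates a missing bind-mount folder as root, which the non-root user inside the container then cannot write to.

```shell
#!/bin/sh
# Sketch only: ensure_archive_dirs is a hypothetical helper name,
# not something ArchiveBox or Docker provides.
ensure_archive_dirs() {
    archive_dir="$1"
    # -p creates any missing parent folders and does nothing when they
    # already exist, so this is safe to call on every run.
    mkdir -p "$archive_dir/sources"
}

# Intended use at the top of archivebox.sh, before the docker run line:
#   ensure_archive_dirs "$HOME/archives"
```

Because the folders are created on the host by your own user, they should end up owned by the same uid and gid that the container runs as.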
Once you have the script and Docker installed and working, you can either run
echo 'https://collideoscope.org' | ./archivebox.sh
and let it do the rest, or, if you have a text file of links, you can let it parse everything out like this:
cat list_of_links.md | ./archivebox.sh
From there you will have your archive in the previously created ~/archives folder – just open the index.html in your browser.
Once you have your directories built out, such as ~/random_links/sources, you can create copies of the script and customize what you download for each archive. For example, if you don’t want PDFs or any videos for random_links, you would point the -v mount in that copy at ~/random_links and set both FETCH_PDF and FETCH_MEDIA to False. From there, you could set up an nginx or Apache container to serve the static files from the archive.
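Instead of keeping one copy of the script per archive, the folder and the flags can be passed in as arguments. This is a rough sketch, not a drop-in replacement: run_archive and serve_archive are hypothetical helper names, only two of the FETCH_* flags are shown for brevity, port 8080 is an arbitrary choice, and --rm is swapped in for the --name plus docker rm pair. The nginx part assumes the official nginx image, which serves files from /usr/share/nginx/html by default.

```shell
#!/bin/sh
# Sketch only: run_archive and serve_archive are hypothetical helper names.
# DOCKER can be overridden (e.g. DOCKER=echo) to preview a command
# without actually running it.

# run_archive ARCHIVE_DIR [FETCH_PDF] [FETCH_MEDIA]
# Assumes ARCHIVE_DIR and ARCHIVE_DIR/sources already exist.
run_archive() {
    archive_dir="$1"
    fetch_pdf="${2:-False}"
    fetch_media="${3:-False}"
    # --rm removes the container on exit, standing in for
    # --name archivebox followed by "docker rm archivebox".
    ${DOCKER:-docker} run -i --rm \
        -v "$archive_dir":/data \
        --user "$(id -u):$(id -g)" \
        nikisweeting/archivebox \
        env FETCH_PDF="$fetch_pdf" \
        env FETCH_MEDIA="$fetch_media" \
        /bin/archive
}

# serve_archive ARCHIVE_DIR [PORT]
# Mounts the archive read-only at the nginx image's default web root.
serve_archive() {
    archive_dir="$1"
    port="${2:-8080}"   # arbitrary default for this sketch
    ${DOCKER:-docker} run -d \
        -p "$port":80 \
        -v "$archive_dir":/usr/share/nginx/html:ro \
        nginx
}
```

With these in place, something like cat list_of_links.md | run_archive ~/random_links False False would archive without PDFs or media, and serve_archive ~/random_links would expose the result on port 8080 of the droplet.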
If you have any ideas on how to let the script/container automagically create the folders needed, please let me know! Enjoy!