Mirroring Protesilaos' videos to Internet Archive

I enjoy reading and watching the writings and videos that Protesilaos publishes on his website, with his work ranging from philosophy and various life issues to GNU Emacs and programming. Currently, Prot uploads his videos to YouTube and embeds them on his website. YouTube, diligently working their way down the spiral of enshittification, have been making it increasingly difficult to watch the videos without using their nonfree JavaScript interface or their nonfree mobile applications. This got me thinking about mirroring Prot’s videos to the Internet Archive to make them more easily accessible in freedom.

To mirror all of Prot’s videos to the Internet Archive is a nontrivial task: as of the time of this writing, there are a total of 298 videos uploaded to Prot’s YouTube channel. Thankfully, Prot makes publicly available the git repository containing the sources used to build his website, and we have several excellent tools at our disposal to help extract the information we need and carry this out.

Note: Prot publishes his works under free/libre copyleft licenses like CC BY-SA 4.0 and GPLv3+, so we do not violate his copyright by sharing or redistributing his work so long as we do it with proper credit, following the terms of the licenses.

The idea is to write a program that walks through the markdown files in the source repository for Prot’s website and, for each file that has a mediaid metadata field, downloads the video with that ID from YouTube using yt-dlp, then uploads it along with accompanying metadata to the Internet Archive using the internetarchive Python module. Given that these two key tools are written in Python, I opted to use Python for my own implementation as well. (I initially started the implementation as a POSIX shell script, but then decided that I would like the convenience of a ‘proper programming language’ and of being able to interact with these tools through their respective APIs, so I ported what I had to Python and continued there.)
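To make the first step concrete, here is a minimal sketch of the file walk, not the code from the actual script. It assumes each post begins with a front matter block delimited by `---` lines and containing a line like `mediaid: 'SOME_ID'`; the function names read_media_id and find_video_posts are made up for illustration.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple


def read_media_id(path: Path) -> Optional[str]:
    """Return the mediaid value from a file's front matter, or None."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # assumed layout: front matter starts on the first line
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of front matter; the post body follows
        key, sep, value = line.partition(":")
        if sep and key.strip() == "mediaid":
            # Drop surrounding whitespace and optional quotes.
            return value.strip().strip("'\"") or None
    return None


def find_video_posts(root: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (path, media_id) for each markdown file under root that has one."""
    for path in sorted(root.rglob("*.md")):
        media_id = read_media_id(path)
        if media_id:
            yield path, media_id
```

Calling list(find_video_posts(Path("~/src/protesilaos.gitlab.io").expanduser())) would then give the posts that need a download and an upload.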

The full implementation is available at protesilaos_videos_archive.py. Note that some of the required modules are not part of Python’s standard library, namely markdown, yt-dlp, and internetarchive. You can install these using your distribution’s package manager or using pip, the Python package manager.
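For a rough idea of how the two third-party modules might be driven from Python, here is a hedged sketch rather than an excerpt from the script. The helper names, the output file naming, and the choice of metadata fields are my assumptions for illustration; the upload also assumes Internet Archive credentials have already been configured (for example with `ia configure`).

```python
from pathlib import Path
from typing import Optional


def build_ydl_opts(working_dir: Path, cookie_file: Optional[Path] = None) -> dict:
    """Build an options dict for yt_dlp.YoutubeDL (illustrative choices)."""
    opts = {
        "paths": {"home": str(working_dir)},  # where downloads are stored
        "outtmpl": "%(id)s.%(ext)s",          # name each file after its video ID
    }
    if cookie_file is not None:
        opts["cookiefile"] = str(cookie_file)
    return opts


def download_video(media_id: str, working_dir: Path,
                   cookie_file: Optional[Path] = None) -> Path:
    """Download one video with yt-dlp and return the path of the result."""
    import yt_dlp  # third-party; imported here so the helper above works without it

    url = f"https://www.youtube.com/watch?v={media_id}"
    with yt_dlp.YoutubeDL(build_ydl_opts(working_dir, cookie_file)) as ydl:
        ydl.download([url])
    # The extension depends on the formats chosen, so look the file up by ID.
    return next(working_dir.glob(f"{media_id}.*"))


def upload_video(video_path: Path, identifier: str, title: str, description: str):
    """Upload one file to the Internet Archive with some basic metadata."""
    import internetarchive  # third-party

    metadata = {
        "mediatype": "movies",
        "title": title,
        "description": description,
    }
    return internetarchive.upload(identifier, files=[str(video_path)],
                                  metadata=metadata)
```

The item identifier passed to upload_video is left to the caller, since how items are named on the Archive is a decision of its own.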

The script takes several command line arguments. There is a required positional argument for specifying the directory to search through (recursively) for markdown files. Normally, this would be the path to your local copy of the source repository for Prot’s website. There are also two optional arguments: --cookie-file, for specifying the path to a cookie file for use with yt-dlp, and --working-dir, for specifying the working directory in which to store the downloaded videos and the progress file. Considering YouTube’s somewhat aggressive rate-limiting of IPs, if you will be downloading a nontrivial number of videos, you will probably want to use --cookie-file to specify a file that contains cookies from a YouTube session. (You would log into YouTube using your account, then use an add-on like cookies.txt to extract and save your session’s cookies into a text file.)
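An interface like this could be declared with argparse from the standard library. The following is only a sketch: the option names come from the description above, while the destination name directory, the help strings, and the default of the current directory for --working-dir are my assumptions.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Declare one required positional argument and two optional ones."""
    parser = argparse.ArgumentParser(
        description="Mirror videos referenced by markdown posts.")
    parser.add_argument(
        "directory", type=Path,
        help="directory to search recursively for markdown files")
    parser.add_argument(
        "--cookie-file", type=Path, default=None,
        help="cookie file to pass along to yt-dlp")
    parser.add_argument(
        "--working-dir", type=Path, default=Path.cwd(),  # assumed default
        help="where to keep downloaded videos and the progress file")
    return parser
```

With this, build_parser().parse_args(["--cookie-file=cf.txt", "some/dir"]) yields a namespace with the attributes directory, cookie_file, and working_dir.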

Example invocation of the program:

./protesilaos_videos_archive.py --cookie-file=cf.txt ~/src/protesilaos.gitlab.io

Since downloading and uploading this many videos makes for a long-running task, I thought it would be helpful to allow interrupting the work partway through by pressing Ctrl-c in the terminal, which sends the program a SIGINT. Upon receiving a SIGINT, the program will stop after the current download or upload is finished and write its progress to a progress file, .pva-progress.jsonl, which it will use on the next run to resume the work where it left off.
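One way to get this kind of graceful interruption is to have the signal handler merely set a flag that the main loop checks between items. The sketch below illustrates that pattern and is not the script's actual code. As a simplification, it appends one JSON line per finished video as it goes, rather than writing the progress out once on interruption; the record layout with a single mediaid key and all function names are assumptions.

```python
import json
import signal
import threading
from pathlib import Path
from typing import Callable, Iterable, Set

PROGRESS_FILE = ".pva-progress.jsonl"
stop_event = threading.Event()  # set once Ctrl-c has been pressed


def request_stop(signum, frame) -> None:
    """Signal handler: only record the request; the loop decides when to stop."""
    stop_event.set()


def install_sigint_handler() -> None:
    """Call once at startup, from the main thread."""
    signal.signal(signal.SIGINT, request_stop)


def load_done(working_dir: Path) -> Set[str]:
    """Read the IDs already recorded in the progress file, if there is one."""
    path = working_dir / PROGRESS_FILE
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["mediaid"] for line in f if line.strip()}


def process_all(media_ids: Iterable[str], working_dir: Path,
                mirror_one: Callable[[str], None]) -> None:
    """Mirror each pending video, stopping cleanly between items if asked to."""
    done = load_done(working_dir)
    with (working_dir / PROGRESS_FILE).open("a", encoding="utf-8") as progress:
        for media_id in media_ids:
            if stop_event.is_set():
                break  # the item that was in flight has already finished
            if media_id in done:
                continue  # handled during an earlier run
            mirror_one(media_id)  # download + upload, supplied by the caller
            progress.write(json.dumps({"mediaid": media_id}) + "\n")
            progress.flush()
```

Because the handler never raises, a download or upload that is in flight always runs to completion before the loop notices the flag.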

As of the time of this writing, all of the videos published by Prot on his YouTube channel have been mirrored to the Internet Archive, and are available from the Video Publications by Protesilaos Stavrou collection.

I’ll wrap up by thanking Prot for clarifying the license of his video publications and for his blessing for me to mirror them on the Internet Archive. Thanks, Prot. :)

Take care, and so long for now.

P.S. yt-dlp has a --write-description option, which causes it to write a .description file alongside the downloaded video, containing the video’s description text from YouTube. I still opted to use each post’s body text as the ‘description’, in part because the markdown source file for each video post contains more metadata fields that I was planning on uploading to the Archive anyway.

