DIY Link Persistency

By David Whelan on August 13, 2019


Persistent URLs (PURLs) were the topic of an early-career academic interview presentation I gave. It went pretty well and so did the interview. Persistency of information has remained something I’m interested in. The #AALL19 session on web archives was a good update on some of the tools that are out there. I was particularly taken with Harvard Law’s perma.cc and the Internet Archive’s Wayback Machine browser tool. It’s got me rethinking how I do links within my own content.

There are, of course, many persistency options. DOIs and Handles are common when you control both the access point and the content itself. I am more interested in content you don’t control. It’s partly because, over 20+ years of running a web site, I know there’s a lot that I’ve linked to that is no longer there.

The initial thoughts for this post came from a recent BuzzFeed post. The gist was this: lots of news stories link to content on domains that may no longer belong to the same owner. If you are able to buy domains that housed content that, say, the New York Times linked to, you can take advantage of that link juice.

I’ve blogged before about grabbing an entire web site or capturing just changed pages. This post is just about link persistency, although you may want to use a couple of approaches to ensure access.

Preservation tools can ensure that you link to the content you meant to, even when the publisher disappears. I realize that one goal is to ensure the link goes to content, but I hadn’t really considered that it might not go to the right content.

Roll Your Own Perma

The presentation on Perma.cc was interesting. It’s a Harvard Law School project that enables legal writers to make their references persistent. Judges writing opinions don’t have to worry about disappearing citations.

Some law libraries can get Perma.cc accounts but free accounts are only for academics and courts. Since our membership law library doesn’t fall into that bucket, it made me wonder about options to run our own site. Also, foreign law libraries might want to take a crack at this, whether or not they’re court libraries.

Fortunately, Perma is open source and you can create your own instance. There are some clear instructions on how to install it and on the basic technology you’ll need. Suffice it to say, I was hooked. In the end, that’s all I was, but it’s definitely something that I’ll keep simmering at the back of my head. After all, I ran my own link shortener for a while, so why not do the full content too?

The first step is to get a Docker instance, as recommended by the instructions. Not everyone’s a fan of Docker so you might also consider alternatives. But Docker’s Hub offers a single instance for free. That’s where I’d start.

I don’t mean to make this sound simple. But if you’re in a position to put together your own persistent link resource and have a solid business reason, it’s a path to investigate.

Docker’s Hub also lists images, which you can use to populate your Docker container. I think the official Python image meets the basic requirements to get Perma going. I’m making a note of all this mostly so I can return to it when I have access.

You need Docker Desktop to be able to interact with a Docker Hub instance, which I can’t run at work. And you need Windows 10 Pro or Enterprise (with Hyper-V activated), so I was out of luck with my Windows 10 Home … at home.

It’s good to know it’s an option, though. You may be in an organization that can’t partner with or join Perma.cc but still wants to run something similar.

Internet Archive Browser Extensions

Perma’s a nice global resource but I can work on a smaller scale. I don’t make a habit of going back and fixing broken links on my blog. Like news stories, they’re of the moment and rarely accessed after the fact. In my case, the content that gets the most repeat visits is “how to” posts on technology or crafts.

One of the other AALL 2019 speakers was the Internet Archive’s Mark Graham, director of the Wayback Machine. I’m an infrequent user but was familiar with the site. I did not know about their browser extensions (Firefox | Chrome). In particular, I hadn’t used the Save Page Now feature.

It saves the page you’re on now. When you click. Pretty simple.

On the original web page, click the Wayback Machine extension and choose SAVE PAGE NOW.

You can see saved versions from the same menu. RECENT VERSION takes you to the page you just saved, but shows it on the Internet Archive site.

Retrieve an archived page through the extension by clicking RECENT VERSION.
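For anyone who would rather script these two steps than click, here is a minimal Python sketch. It assumes two public Internet Archive endpoints: the `https://web.archive.org/save/` prefix, which, as far as I know, triggers a capture when a page URL is appended, and the Wayback Machine availability API, which reports the closest archived snapshot. The function names are my own and are not part of the browser extension.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoints; check the Internet Archive's documentation before relying on them.
SAVE_PREFIX = "https://web.archive.org/save/"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def save_page_now_url(page_url):
    """Build the URL that asks the Wayback Machine to capture page_url."""
    return SAVE_PREFIX + page_url


def availability_query_url(page_url):
    """Build the availability API query for page_url."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def closest_snapshot(response_text):
    """Return the archived URL from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def recent_version(page_url, timeout=30):
    """Network call: look up the closest archived copy of page_url."""
    with urlopen(availability_query_url(page_url), timeout=timeout) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

Opening the result of `save_page_now_url(...)` in a browser is roughly the scripted equivalent of SAVE PAGE NOW, and `recent_version(...)` plays the role of RECENT VERSION. The parsing is kept separate from the network call so it can be checked against a saved response without hitting the Archive.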

You can also create an Internet Archive account to manage your saved links.

List of links in my Internet Archive profile.

What it means for me is that I can increasingly capture and link to permanent copies of the information that I link to on my site. Now I’ll just have to use some judgment about when to link to what.

For example, that Yale law journal article above, on the judicial citations, might disappear. Either the link or file name might change (it’s a PDF) or it might go behind some subscription wall. I’ve converted that one to a Wayback Machine link, which you’ll notice if you click on it.

But the Perma code, for example, or any of the Docker content is more changeable. I don’t want to send people to archived copies of links to technical files. Better for them to go to the original site and get a 404 or have to do some hunting. If the technology itself is gone, no need to create a red herring.
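The rule of thumb in the last two paragraphs is simple enough to write down. This is only a sketch: the `choose_link` function and the "stable" and "technical" labels are hypothetical names invented for illustration, not anything from Perma or the Internet Archive.

```python
def choose_link(original_url, archived_url, kind):
    """Pick which URL to publish for a reference.

    kind is a hypothetical label chosen by the writer:
    "stable" for finished documents such as articles and PDFs,
    "technical" for code, images, and install instructions that keep changing.
    """
    # Only finished documents get the archived copy, and only if one exists.
    if kind == "stable" and archived_url:
        return archived_url
    # Everything else points at the live site, even if it may 404 later.
    return original_url
```

A journal article PDF would be labelled "stable" and get its Wayback Machine link, while a code repository or Docker page would stay pointed at the original.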

All in all, it’s made me think harder about persistency on a small scale – this blog – as well as what tools a law library might use to solve a persistency problem on a much larger scale. It’s useful that there are tools and partners out there to create a ready solution.