
I work in a service profession. There’s nothing worse than when something gets in the way of the customer being served. Sometimes it’s within our control to remedy and sometimes it’s not. Law libraries intermediate most of our electronic materials, so if a legal information provider fails to function, we don’t really have much to fall back on. We recently had a situation where the publisher had purposefully blocked us, which made things even more complicated.

When a legal publisher’s information is unavailable, the best we can do is make a decision about how to handle it. Is it a permanent issue? If so, we need to unlink the resource and perhaps provide some signage or other documentation to let people know it’s unavailable. The reality is that some legal publishers don’t license their content to law libraries. We were unable to license an estates and trusts module as well as the Litigator suite from Thomson Reuters at one library at which I worked. They felt a public law library was too much competition.

The worst situation to be in is to disclaim any responsibility to a customer. For sure, we can’t do anything about a permanent information loss. But we can put up signs or have staff mention to researchers and patrons that certain sites are unstable or unavailable. It may not change the end result but it helps to clarify expectations.

Permanent unavailability is easier to manage than temporary or inconsistent availability. When access comes and goes, you need some way to (a) diagnose the issue when it occurs and (b) mitigate the lack of availability. That can be hard when someone has dedicated time to be at our law library and can’t come back once the problem is sorted out.

The issue we had involved the U.S. District Court website for the Southern District of California. People who use our law library are often directed, by Southern District staff, to access the website for forms. This is a common use of court websites in every jurisdiction, I expect.

But when we accessed it, we got a web server status 403 error message. No explanation for why and no explanation for what to do next.

A screenshot of a web page in a browser that says "403 - Forbidden: Access is denied."
A screenshot of a URL on the Southern District of California’s web site.

Where Am I?

I’ve been around the web server block a bit so I knew that a 403 error was adjacent to a 404 error and generally the responsibility of the web server owner. Most of us have experienced a 404 before, when a page is missing. How do we experience it? With a 404 error page that usually warns you that what you’re looking for isn’t there.

You experience web server status codes all of the time. When a web page is returned to your web browser, that’s a 200 status (“OK”). You don’t see the status because the web server is showing you the page instead.
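To make the status code families concrete, here is a minimal Python sketch that looks up the standard phrase for a code and names its family. The function name `describe_status` is made up for illustration; the phrases come from Python's built-in `http.HTTPStatus` table. The HTTP standard labels the 400 range "client error", but the response is still produced by the web server, and so is whatever page comes with it.

```python
from http import HTTPStatus

# Family labels follow the HTTP standard's grouping by first digit.
FAMILIES = {
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe_status(code: int) -> str:
    """Return a short description of an HTTP status code (hypothetical helper)."""
    phrase = HTTPStatus(code).phrase            # e.g. "Forbidden" for 403
    family = FAMILIES.get(code // 100, "informational")
    return f"{code} {phrase}: {family}"

for code in (200, 403, 404):
    print(describe_status(code))
```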

Most websites realize you should do a bit to help the person find their way back. In fact, the one thing you don’t want to leave your visitors with is a technical error code. If someone reaches your website with an error, you want to help them find their way to what they are looking for. It’s such a common web issue that there is plenty of good advice on how to do it. WordPress has a Codex page dedicated to how to make one for your site. Some web sites have a bit of fun with it:

Screenshot of a page that says "404 Looks like you've been diverted.  Re-route back home" with a Back Home button.
Screenshot of a 404 error page for an airline website. You’ve been diverted . . . get it?

This is the 404 page you’ll get on this blog if you have an unforwarded or misspelled link:

A screen shot of a 404 error page.  Below the header image and navigation menus there is a picture of an ostrich's face, with the words "oops, that page can't be found."  Next to the photo is a search box.
A screenshot of a user friendly 404 error page, explaining why it’s being displayed and providing access to both navigation and a search box.

When a web server returns a 400- or 500-level error, the page the visitor sees is something that only the website owner can control. The error page comes from the server, not from the web browser. So when the error message is unhelpful or leaves a visitor at a dead end, it’s something that only the owner can intercept.

This is missing with the Southern District of California. You get a server error, without any navigation or a search box or any of the best practices for a 404 error page.

A screenshot of the Southern District of California district court’s website 404 error page.

Contrast it to their Northern District of California colleagues. If you type in a bad URL on that court’s web site, you get a custom page:

A screenshot of the Northern District of California’s court website 404 error page. Note that it doesn’t say “404” anywhere. It just explains what happened and you can still see navigation.

It’s really not hard. There are only a handful of error messages that a website is going to return that inhibit researcher progress and you can make a template for each error and then leave it alone.

The error we were receiving was a 403 error, which is slightly different from a 404 status code. It’s a security error. It means that we do not have permission to access the files on that server or in that folder. This actually gives information they might not want to share, since now I know there’s a /forms/ folder on their website. It makes me kind of curious to see what is in it!

I just typed in the /forms/ URL. I didn’t know if it existed or not. If you are able to access the court’s website and get to the forms, you can mouse over a form name to see they’re stored at /_assets/pdf/forms/. You get the same 403 error if you try to browse that folder.

It means that the people managing the web server have blocked directory browsing. But they have not taken the time to put in place pages to intercept people who get to those URLs. It’s sloppy.

But Wait, There’s More

That didn’t explain everything though. We were typing in the court’s domain name only: https://www.casd.uscourts.gov. 403 error. Google search and clicked the link to the court. 403 error. Repeat both steps on Google Chrome. 403. Firefox. 403. Microsoft Edge. 403. All of this was very strange.

You wouldn’t block directory browsing on your domain name; you’d redirect people to your home page. Otherwise no one would ever see a website without knowing the exact page name. Serving a default page is a standard function in a web server configuration or .htaccess file.

In the end, I called the court and asked them if they knew their web server was throwing a 403 error. In hindsight, I realize this was not the best next step I could have taken. But I explained our problem to the court clerk I spoke to and they told me that their website was working fine. I was given a couple of suggestions (don’t use Chrome, try to access the site using Google) that either didn’t make a difference or that I’d already tried.

The next time, this will be what I do first: test on a PC on the LAN, test on a phone, test on a mobile device on our public wifi. If all three are a failure, then I can realistically assume that it’s entirely on the external provider’s shoulders. It would be helpful if IT staff at places like the Southern District also let their clerks know that they were IP blocking and that it might hit users like law libraries.

It was at that point that I fell back on my technology background. How would you debug a web server error? You’d clear your cache, right (even though it shouldn’t matter if it’s a server-side error)? Try multiple web browsers (might work)? Then try multiple networks. I pulled out my phone, turned on data, and accessed the court website without a problem. Then I fired up a portable device and got on our law library’s public wifi network. Same thing, the court website loaded without an issue.

Huh?

At that point, I delegated. I kicked this over to our law library IT folks and they ended up calling the court IT staff because it really didn’t make sense how our technology could be causing their 403 error to appear. It turns out the court was blocking our patron and staff PCs by IP address. We were using the court’s website too much and it triggered some process that flagged us.
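The reasoning behind testing from several networks can be written down as a small decision rule. This is only a sketch in Python: the function name `diagnose` and the network labels are invented for illustration, and the status codes would come from whatever each device actually showed.

```python
def diagnose(results: dict[str, int]) -> str:
    """Interpret the status codes seen from several networks.

    `results` maps a label for each vantage point (hypothetical names)
    to the HTTP status code observed there.
    """
    working = [name for name, code in results.items() if code == 200]
    failing = [name for name, code in results.items() if code != 200]
    if not failing:
        return "Reachable everywhere tested."
    if not working:
        return "Failing everywhere: likely a problem on the provider's side."
    return (
        "Failing only from " + ", ".join(failing)
        + ": suspect a network-specific cause such as an IP block."
    )

# The pattern we saw: staff and patron PCs blocked, other networks fine.
print(diagnose({"library LAN": 403, "phone data": 200, "public wifi": 200}))
```

A mixed result like ours is the useful one. It tells you the site is up and that the difference lies in where the request is coming from.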

Now, if you’re like me, you’re thinking, well, a law library is probably going to use a Federal court website more than most individual IP addresses on the internet. But we only have dozens of patrons a day. So even if all of our patrons hit the Federal court website at the same time, with all of our staff also joining in, we’re not approaching DDoS levels. However much use we were generating, we’d made it onto a naughty list.

I’m not sure why we were put on a deny list. The only purpose of a court website is to provide access to information. If you decide that too much use is a problem, the way to resolve that is not to just block access. It’s to intercept it and inform the visitor about the issue.

But, you might ask, what other alternative do you have?

I’m glad you asked.

Intercept with Information

Let’s say that the court did not see us as a threat but merely as an annoyance. I have experienced this with my personal website. There are bots that hammer your web server and other automated tools that, frankly, you don’t want to provide access to. How do you distinguish a legitimate resource from an automated tool? You can’t really.

The solution, then, is to put a resource in between that allows the visitor to still access the content but that also blocks automated tools. You don’t just throw up a server error, because you don’t know who is seeing it. In fact, you’re more likely to frustrate a legitimate information seeker than the automated tools.

One way to do that is with a page like the 404 page. You can make them for 403 status codes too. There are plugins for content management systems like WordPress that will handle templates for a variety of error codes.

If I were at the Southern District, I would put a uniform intercept page in place. Right now, someone nefarious could determine the difference between what was missing (404) and what existed but was prohibited (403). If you have a universal page (“I’m sorry, you can’t access what you tried to access, try again”), you obscure the difference. I tried to use wget to pull the /forms/ URL, but I don’t know enough to interpret the error I got. Still, I would think that a universal page would get retrieved by a browser-emulating bot.
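Here is one way the uniform page idea could look, sketched as a tiny Python WSGI application. Everything in it is an assumption for illustration: the page contents, the `/forms/` prefix, and the choice to answer both cases with the same 404 status line so that neither the page nor the code gives away which one occurred. The site owner still records the real reason in a log.

```python
# Hypothetical site content and a folder that should not be browsable.
PAGES = {"/": b"<h1>Court home page</h1>"}
FORBIDDEN_PREFIXES = ("/forms/",)

# One body for both cases, with a way back for the visitor.
INTERCEPT_BODY = (
    b"<h1>Sorry, we can't show that page</h1>"
    b"<p>Try the <a href='/'>home page</a> or the site search.</p>"
)

AUDIT_LOG = []  # the owner still sees the real reason for each refusal

def classify(path: str) -> int:
    """Return the status code the request would normally earn."""
    if path in PAGES:
        return 200
    if path.startswith(FORBIDDEN_PREFIXES):
        return 403
    return 404

def app(environ, start_response):
    """WSGI app: forbidden and missing paths get an identical response."""
    path = environ.get("PATH_INFO", "/")
    code = classify(path)
    AUDIT_LOG.append((path, code))
    if code == 200:
        status, body = "200 OK", PAGES[path]
    else:
        status, body = "404 Not Found", INTERCEPT_BODY
    start_response(status, [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

def fetch(path: str):
    """Call the app directly (no network) and return (status, body)."""
    captured = {}
    def start_response(status, headers):
        captured["status"] = status
    body = b"".join(app({"PATH_INFO": path}, start_response))
    return captured["status"], body

# A forbidden folder and a missing page look the same from the outside.
print(fetch("/forms/") == fetch("/no-such-page"))
```

To try it in a browser, the same `app` could be served locally with `wsgiref.simple_server.make_server` from the standard library.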

In general, someone who is hitting a 403 error code may not be using your website in the way it was intended. You can push some of that work off your web server, if the reason you’re concerned is that it’s using up web server resources.

I use Cloudflare and I watched my security logs. A half dozen countries were regular sources of automated site visits. I watched their IP addresses and used whois to identify their sources. Then I created an access rule on Cloudflare that forced visitors from all of those countries to respond to a challenge. It eliminated unwanted traffic (2% of traffic to my site is snared by this rule) but I still see occasional, real visitors from those countries progress to the content they seek.

A screenshot of a table of information.  Each row shows an attempt to visit my website, the time of access, the country and IP address, and what happened to that request.
A screenshot of a Cloudflare rule result page showing a list of managed challenges from Russia.

In the case of Cloudflare, they show a captcha that requires you to click. It’s not onerous and it’s something that an automated bot is probably not able to manage. Here’s what it looks like for people visiting my website from an IP range or country that I’ve added to my rule.

A screenshot of a Cloudflare managed access request page, with a captcha box in the middle.  It explains why the person is seeing it and how to bypass it.

Cloudflare has a free account level, which is the one that I use. This is not difficult technology to license and to add to your web server approach. In fact, I would be reluctant to run a website any longer unless I could put it behind a firewall like Cloudflare.

But you could even do it on your own website. I’ve used the Redirection and IP geo location tools to restrict and redirect website visitors on my own site. Rather than blocking an IP address or IP range and throwing it at a 403 error, you could intercept it and redirect it to an information page. Given the choice, I would shove this work off my web server and onto a service like Cloudflare.
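The redirect-instead-of-block idea can be sketched in a few lines of Python using the standard `ipaddress` module. This is not how the Redirection plugin or Cloudflare work internally. The flagged ranges are reserved documentation addresses, and the page name `/heavy-use-notice` and function `route_request` are invented for the example.

```python
import ipaddress

# Hypothetical flagged ranges (reserved documentation addresses, not real ones).
FLAGGED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]
INFO_PAGE = "/heavy-use-notice"  # hypothetical page explaining the restriction

def route_request(client_ip: str, path: str) -> tuple[int, str]:
    """Return (status code, location) for a request.

    Flagged addresses are redirected to an explanation instead of a bare 403.
    """
    address = ipaddress.ip_address(client_ip)
    flagged = any(address in network for network in FLAGGED_NETWORKS)
    if flagged and path != INFO_PAGE:
        return 302, INFO_PAGE   # temporary redirect to the information page
    return 200, path            # everyone else, and the info page itself

print(route_request("203.0.113.45", "/forms/"))
print(route_request("192.0.2.10", "/forms/"))
```

The information page is where you would say why the visitor was flagged and who to contact, which is what we never got from the 403.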

In any event, problem solved. I think. We will just have to periodically monitor whether the Southern District is up for us. It’s not something you can remotely monitor, because the Court may not block the monitoring tool (I use Uptime, for example, to warn me if sites go offline). You may already do this within your integrated library catalog. When I was working in Canada, we rigged up a simple Google Drive spreadsheet to link check 856 fields.

A screenshot of a spreadsheet showing website URLs down the center and green or red on the right side, with green meaning the URL is returning a 200 status code.
A screenshot of a spreadsheet with URLs and test results for status codes on the right.

It requires adding a script to Google Drive like this one, although the Google Sheets interface has changed since those instructions. You can do the same thing with Excel but it looks like it’s a bit more tricky. I think you’d need to use Excel, and on a local machine, to have an accurate test (if it’s in the cloud, it may register a site as working when it’s not actually working from your PCs). But, once the spreadsheet is in place, you could open it each day to see if the site is showing 200 (available) or a 400-level status code.
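If a spreadsheet script feels fiddly, the same daily check could be run from a staff PC with a short Python script. This is a rough sketch using only the standard library, not the script linked above. The function names `check_url` and `report` are made up, and the `opener` parameter exists only so the logic can be tried without touching the network.

```python
import urllib.error
import urllib.request

def check_url(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for `url`, or None if no response arrived."""
    request = urllib.request.Request(url)
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code     # urllib raises 4xx and 5xx responses as exceptions
    except OSError:
        return None         # DNS failure, refused connection, or timeout

def report(urls, opener=urllib.request.urlopen):
    """Return one line per URL, flagging anything that is not a 200."""
    lines = []
    for url in urls:
        code = check_url(url, opener=opener)
        label = "OK" if code == 200 else "CHECK"
        shown = code if code is not None else "no response"
        lines.append(f"{label:5} {shown} {url}")
    return lines
```

Calling `print("\n".join(report(["https://www.casd.uscourts.gov"])))` from a PC on the library network would show what our patrons see. Run it from a machine on the same network as the patron PCs, because an IP block only shows up from the addresses being blocked.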

You could always just open a browser and look. But if you have other websites or web publishers that became unstable, a spreadsheet might allow a more efficient process.

In any event, the work of the law librarian never ends. Sometimes that work is made harder when legal publishers or information providers do not subscribe to common information delivery expectations. We’ll have to monitor and mitigate as problems arise in the future.