
I am sensing a disturbance in the Force. Where things like Google alerts used to be reliable sources of news, their reliance on deteriorating search is becoming visible. Websites are also starting to pull their RSS feeds in favor of emailed newsletters. For every information action, a reaction. I have had to deploy an RSS extension to identify feeds and, when that fails, learn how to create my own by scraping a new feed.

RSS (RDF Site Summary or Really Simple Syndication) is a preferred information gathering tool for me. It moves information gathering out of my email inbox. It works in the background so that, rather than me visiting a bunch of sites, I just access my feed reader and catch up on what’s new from as many sites as I am following.

By far, I prefer to just use pre-built RSS feeds. I am a bit uncomfortable bypassing the distribution choices media platforms make. At the same time, if the information is web accessible, I am fine using tools that allow me to access it for my own use, in my own way. Frankly, I think RSS is an oversight rather than an explicit decision. One finds old RSS feeds left undocumented and old links to feeds that lead nowhere. If anything, media platforms seem unaware or unmindful of RSS as a delivery point.

First, A Reader

I use FreshRSS, a web-based feed reader. This will matter later if you want to scrape your own feeds and have more than a couple. But there are lots of feed readers.

You can even use Outlook. RSS feeds can be added to your account the same way you can add an Outlook .pst file or an archive mailbox.

A screenshot of the Outlook account dialog on the RSS feed tab. If you have an RSS or Atom feed, you can add them here.

I thought the Outlook RSS experience left a lot to be desired. Outlook treats each RSS item much like an email, blocking some elements. When you add an RSS URL, Outlook prompts you with some configuration choices. I selected downloading the entire content, which seems to work around some of the rendering problems I saw.

A screenshot of the Microsoft Outlook RSS feed configuration dialog box. There are options to automatically download enclosures as well as the full article.

Once you have an RSS reader, you can go hunting RSS feeds.

RSS Awareness

Ideally, a website highlights its RSS feeds. Unfortunately, most highlight their social media platforms (Facebook, Twitter, and so on) instead. The Atlantic is a great example of how to do it right. The RSS feed icon, which looks like a waterfall or a radar sending out waves up and to the right, sits among the icons for the commercial social media platforms.

A screenshot of social media icons above a footer that says The Atlantic. The RSS waterfall or radar icon is among them.
A screenshot of social media icons on the Atlantic website.

It’s a logical place to put it. Like social media feeds, you can follow the RSS feed. For me, the power of RSS is that you do not need to enter Meta’s or LinkedIn’s walled garden to see it. The publisher delivers it and allows me to opt in without any account information.

Associated Press used to provide a great list of RSS feeds. They are gone now and I’m not sure if that’s on purpose. I used to follow them in my RSS feed reader and they suddenly stopped working. I went looking for a page of feeds at the AP site but it had disappeared.

I have started using the Get RSS Feed URL web browser extension (Chrome and Edge). When you are on a web page that you think might have an RSS feed (where information is presented in a repetitive format, like a list of items), you can click the extension's icon. It will scan the underlying web page code to see if an RSS feed is listed.

This worked great on The Chronicle of Higher Education. The site doesn’t make its RSS feeds obvious. But if you land on a topic page, you can find that it has an RSS feed in place.

A screenshot of a web browser window. At the top, an orange RSS icon has been clicked. A dialog has dropped down and has a link and a button that says copy URL. The web browser window contains content from the web site in a list of news items.
A screenshot of the Get RSS Feed URL extension finding a Chronicle RSS feed.

Here are a couple of other feeds I recently added in this way:

This discovery method did not work so well on the Associated Press site, a site I know used to have RSS feeds. If you install the extension and go to a landing page like World News, the extension will find a feed. If you click the Copy URL button and open the URL, though, the page will not be found.

A screenshot of the Get RSS Feed URL button finding a feed on the AP news site.

This is a good example of the publisher not knowing what’s on their platform. The reason the extension can find the feed is because it’s in the webpage’s code. If I had to guess, someone just deleted the old .RSS files without thinking about how people found them.

Here’s the HTML in the page:

<link type="application/rss+xml" rel="alternate" title="World News: Top &amp; Breaking World News Today" href="https://apnews.com/world-news.rss">
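What the extension does can be sketched with Python's standard library: parse the page and collect any `<link rel="alternate">` tags that advertise an RSS or Atom feed. This is a minimal illustration, not the extension's actual code; the sample page below just embeds the AP tag shown above.

```python
from html.parser import HTMLParser

class FeedLinkFinder(HTMLParser):
    """Collect RSS/Atom feed URLs advertised in a page's <link> tags."""
    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and a.get("type") in self.FEED_TYPES):
            # HTMLParser has already unescaped &amp; in attribute values.
            self.feeds.append((a.get("title", ""), a.get("href")))

# Sample page containing the AP <link> tag shown above.
page = """<html><head>
<link type="application/rss+xml" rel="alternate"
      title="World News: Top &amp; Breaking World News Today"
      href="https://apnews.com/world-news.rss">
</head><body></body></html>"""

finder = FeedLinkFinder()
finder.feed(page)
print(finder.feeds)
```

Whether the URL the tag advertises still resolves is, as the AP example shows, a separate question.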

When the world gives you lemons, make lemonade.

Using XPath to Extract Web Page Content

I had never heard of XPath (XML Path Language) before. But this frustration with RSS not being where it was supposed to be, or not existing at all, drove me to root around the web for solutions. As is often the case, someone had already walked this … path! I won’t repeat their explanation, but it uses FreshRSS, so if you don’t use that app, you may need to explore whether your RSS reader supports XPath.

If it does not, or if you only have a few sites you want to scrape, there are applications that will do it for you. I played around with RSS.app, which offers a free account with 15 feeds. You can upgrade to a premium account for more feeds. Their free account would be a good, low tech, low friction way to start.

No matter which option you choose, you can now start to create RSS feeds by scraping the content from sites.

Not the Worst: American Bar Association

I started with the ABA’s website. It’s a great example of a media platform publishing content in a standard list without providing its own RSS feed. The HTML is not great, but it’s the repetition that matters.

I ended up grabbing feeds for the ABA Journal topic pages for Practice Technology, for Law Libraries, and Law Schools. One of the flexible benefits of scraping is that, if the site’s content management system puts out standard content, you can reuse the XPath.

In FreshRSS, you add the URL just as you would a new RSS feed. Instead of leaving the type as RSS/Atom, you flip it to HTML & XPath (Web scraping). All three of the Journal pages use the same HTML. FreshRSS can capture a variety of information, but I found you really only need a few pieces: enough for a title, a description, and a link to appear in the RSS reader.
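Whatever tool does the scraping, the goal is the same: each matched block of HTML becomes a standard RSS item with roughly this shape (the values here are illustrative):

```xml
<item>
  <title>Headline scraped from the page</title>
  <link>https://www.abajournal.com/example-article</link>
  <description>Teaser text scraped from the page</description>
</item>
```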

This is the code in the ABA webpage. You can see it yourself if you go to one of the topic web pages linked above, right click on it, and click Inspect on the menu that appears.

A screenshot of the HTML code for an ABA Journal topic web page. It shows the hierarchy that will be reflected in XPath using the descendant element.

It took a bit of tinkering but here is the XPath to get the output from the ABA Journal site. You can compare the items below with the arrows pointing at the content above.

XPath for news items:
//div[@class="article-teaser"]

XPath for item title:
descendant::h3[@class="article_list_headline"]

XPath for item content:
descendant::h3/a

XPath for item link (URL):
descendant::h3/a/@href

Once you save the XPath, you can test it. Refresh the feed and see what appears. If it is empty, then you are not yet gathering the right elements together. Using the items above, my ABA Journal Practice Technology feed looks like this in FreshRSS:

A screenshot of a FreshRSS feed showing ABA Journal practice technology content.

While the ABA’s website code isn’t beautiful, it’s reliable: relatively clean and simple, though it may change in the future. Even so, scraping like this is a bit fragile. It relies on the publisher leaving their markup alone; if they change their styles or tags regularly, maintaining the feed would be challenging.

Harder: Christian Science Monitor

This one was a bit harder to manage. I was trying to get a feed of the Christian Science Monitor’s world news. Like the ABA Journal content, it’s easy to see that there is a pattern to the content. The HTML is just a bit more complicated and nested.

Here’s the XPath that worked for me. The somewhat non-standard use of classes made it a bit harder. But at least I learned that you can create a multi-level path for the overall news items container, pointing at each list item within it.

XPath for news items:
//ul[@class="list-unstyled"]/li

XPath for item title:
descendant::a/div/span[@class="content-title"]/span

XPath for item content:
descendant::a/div/div/p

XPath for item link (URL):
descendant::a/@href
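One wrinkle worth noting: links scraped this way are often site-relative (`/World/...` rather than a full URL). FreshRSS resolves them for you, but if you post-process a scraped feed yourself, you need to join them against the page's base URL. A small sketch, using hypothetical markup that echoes the nested structure described above:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE = "https://www.csmonitor.com/World"  # the page being scraped

# Hypothetical markup echoing the nested CSM structure described above.
html = """<ul class="list-unstyled">
  <li><a href="/World/2024/0101/example-story">
    <div>
      <span class="content-title"><span>Example headline</span></span>
      <div><p>Example teaser text.</p></div>
    </div>
  </a></li>
</ul>"""

root = ET.fromstring(html)
stories = []
for li in root.findall("li"):  # //ul[@class="list-unstyled"]/li
    a = li.find("a")
    title = a.find(".//span[@class='content-title']/span").text
    teaser = a.find(".//div/p").text
    # The href is site-relative, so resolve it against the page URL.
    stories.append({"title": title, "summary": teaser,
                    "link": urljoin(BASE, a.get("href"))})

print(stories)
```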

Hardest: Library Journal

I was very disappointed at the absolute gobbledy-gook in the Library Journal’s web pages. Surely they know some librarians. Lots of long, complicated style classes, often several per HTML tag. It took a long time to figure this one out, but it led me to a solution that lets you match just one class name, or part of one, out of several.

This is what I used to grab the Library Journal’s Technology page. You can look at the HTML to see why I had to do this. When a <div> tag has three or four classes listed, this method lets you look for and match just one of them. Once you’ve closed the square brackets, the use of slashes to identify levels works the same as above.

XPath for news items:
//div[contains(concat(' ',@class,' '),' filter-story-section ')]

XPath for item title:
descendant::div[contains(concat(' ',@class,' '),' article-headline ')]/a/h3

XPath for item content:
descendant::div[contains(concat(' ',@class,' '),' recommended-description ')]/p

XPath for item link (URL):
descendant::div[contains(concat(' ',@class,' '),' article-headline ')]/a/@href

This is what the output looks like:

A screenshot of FreshRSS showing Library Journal Technology content.
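The `contains(concat(' ', @class, ' '), ' name ')` pattern looks baroque, but there is a reason for it: `@class` is a space-separated list, and a plain `contains(@class, 'article-headline')` would also match a class like `article-headline-compact`. Padding both sides with spaces forces a whole-word match. The same check, emulated in Python:

```python
# What contains(concat(' ', @class, ' '), ' name ') actually tests:
# pad the class attribute with spaces so only whole class names match.
def has_class(class_attr: str, name: str) -> bool:
    return f" {name} " in f" {class_attr} "

print(has_class("card article-headline large", "article-headline"))  # whole-word match
print(has_class("article-headline-compact", "article-headline"))     # substring alone is not enough
```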

Impossible: Reuters

As a customer of Thomson Reuters, part of me is not surprised that I was unable to scrape it (yet). It’s not because it’s secured or professional or anything. The HTML is just awful. System-generated ID tags are unique to each element, so every news item is distinct from the others. The class segments are long and are applied both to the list items and to the topic section and so on. It’s a bit like taking a plate of food and slathering butter over everything, without distinguishing between the types of food underneath.

A screenshot of a web browser view of a Reuters news page, with a news item highlighted, and the matching code shown on the right.

They have also just announced a very light paywall ($4 / month) and so they have no incentive to make their content more machine-readable. Unless they—and this goes for the AP as well—stop syndicating their news, there will always be a way around the paywall.

Next Steps

I have only been using XPath for a short while, and it is not clear to me how stable these feeds are. If you add your feeds to a website or to a law firm intranet, you will want to monitor that aspect. Fragile news feeds may break, and while the impact on your audience may just be that they don’t see new information, it can undermine any news sharing you are attempting. It’s the downside of scraping your own feeds, even though doing so opens up a much wider set of information resources.

The Reuters page is a good example of a nut I still want to crack. It presents the same challenge I’ve faced when trying to use the Stylus web browser extension. Sites like Twitter use classes and other code that makes it hard to distinguish, at a machine level, what the content is: a title, a link, a description, and so on. But the more I learn about XPath, the more I feel like I can find a solution.