They take a dressed-up version of a static “screenshot” image and save it as a time-stamped PDF file.
Both of these approaches have serious flaws that have grown more obvious as the web has grown more complex. If you’re responsible for allocating part of your annual budget to procure technology that will solve your organization’s web archiving challenges, you have to ask yourself one question: do I have the risk appetite to buy technology that won’t work all the time and could expose my company to regulatory failure and further scrutiny?
The biggest and most dangerous problem with relying on APIs to archive web and social content is simply that they leave you at the mercy of the API provider. An API might change for countless reasons—and when it does, technology that relies on that API will “break” until it can be re-engineered. A prime example of this took place in 2018, the year of data privacy awareness, GDPR regulation, and the Cambridge Analytica scandal.
“With all of the recent attention on how companies gather, mine, and use personal data, many platforms have disabled or limited their API access to control external access to private user data. Facebook announced in April that it would limit the user data that developers could access; it then cut API access further in July. Instagram’s API access was similarly limited, as was Twitter’s.” – Evan Gumz, Hanzo
The one silver lining with an API is that, in most cases, it preserves the original context and interactivity of a piece of online content, which is something we can’t say for the PDF screenshot approach. The truth is that Facebook, Instagram, LinkedIn, Twitter, and your awesome new website cannot be translated into a piece of paper. When vendors that use this approach originally introduced their archiving technology, it was primarily used for email and a more basic, text-based web browsing experience.
Since then, the nature of the web has changed, growing more interactive, video-based, and dynamic—but the technology they want to sell you has, at its core, stayed the same. This creates unacceptable levels of risk for your organization.
First and foremost, if you try to translate your website and social media presence to a static piece of paper, there is a 99.9 percent chance that not all of the content on those pages will be captured, archived, and preserved. Right off the bat, from day one, the fidelity and accuracy of your archives are compromised.
Next, the context in which those messages, communications, and content were shared with your customers has been stripped away in the conversion to a static image—which FINRA, the SEC, and other regulators really don’t like.
Lastly, precious data and metadata found within the website and social media pages you’re trying to capture are likely to suffer one of two fates. They may be lost entirely in the translation from digital to PDF, leaving you with an incomplete picture of the facts. Alternatively, they may be easily tampered with, rendering them inadmissible as evidence in court.
All told, these two fundamentally flawed archiving approaches put your organization at risk of both not capturing and preserving everything you intend to and losing essential context and data within the files that you actually do capture.
Hanzo’s CEO, Kevin Gibson, summarized these problems quite eloquently in a recent article he wrote for Artificial Lawyer:
“Web data differs from every previous form of communication in that it is dynamic. Unlike email or Word documents, the internet has never been a paper-based medium; online data cannot be reduced to paper without losing critical information. The very nature of the internet is its interconnectedness and ceaseless changeability. The linkages between pages have inherent meaning, providing context, detail, and richness.”
Hanzo’s technology was built with the dynamic nature of today’s internet experience in mind, as well as the regulatory and legal requirements around it.
Regulatory Requirements and Legal Defensibility: SEC, FINRA, and ISO
We’ve covered some of the technical reasons why you shouldn’t choose other web archiving vendors, but there are legal and regulatory reasons to consider too. In a perfect world, the data and content you archive and preserve will never have to be demonstrated to a regulator or used as evidence in an investigation or trial. But, unfortunately, we don’t live in a perfect world, and compliance teams can’t afford to operate in an environment where they’re only circumstantially mitigating risks and complying with regulations.
Let’s start with ISO 28500. Most compliance and risk professionals are familiar with the International Organization for Standardization, but perhaps not with this specific standard. Originally published in 2009 and updated in 2017, it defines the WARC (Web ARChive) file format: a standardized format for collecting navigable websites without the loss of information that could amount to spoliation of data and evidence.
There’s a little-known fact about the WARC file and ISO 28500—Hanzo’s founders were among those involved in creating and establishing this file type and demonstrating why it’s the gold standard for storing, managing, and preserving billions of saved web pages in a universally recognized file format.
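To make the format less abstract, here is a minimal sketch of what a single WARC 1.0 “response” record looks like on disk, built with only the Python standard library. This is a simplified illustration of the layout defined by ISO 28500, not Hanzo’s implementation; production archivers use dedicated WARC libraries and add further headers such as block digests.

```python
from datetime import datetime, timezone
from uuid import uuid4

def warc_response_record(url: str, http_payload: bytes) -> bytes:
    """Build a minimal WARC/1.0 'response' record (simplified sketch)."""
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {url}\r\n"
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid4()}>\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record ends with two CRLFs after its payload block.
    return headers + http_payload + b"\r\n\r\n"

# The payload is the raw HTTP response exactly as received from the server,
# which is what lets a WARC replay the original browsing experience.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>...</html>"
record = warc_response_record("https://example.com/", payload)
```

Because the record stores the raw HTTP exchange rather than a rendered picture of it, nothing is flattened away in the capture.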
If you don’t use WARC files, you could find yourself in the situation demonstrated by Leidig v. BuzzFeed, where the court deemed one party’s archived content to be useless as evidence. Better still, archiving web and social content with WARC files brings you into compliance with FINRA and SEC rules and regulations, as we’ve detailed in our article on FINRA Regulatory Requirements for Archiving, Recordkeeping, and Supervision in 2019.
For example, under FINRA Regulatory Notice 11-39, a firm “may not establish a link to any third-party site that the firm knows or has reason to know contains false or misleading content. A firm should not include a link on its website if there are any red flags that indicate the linked site contains false or misleading content.”
What this Regulatory Notice means is essentially that for any link your website makes to another website, you need to archive the linked page too, or you’re at risk. A PDF may not capture the address for the link, much less the actual content of the referenced page. Additionally, the link in question could change, or the page could be deleted, removing essential evidence and data from your archive.
At Hanzo, we call these links “hops.” Anything we archive can include a “hop” to a linked page, and another hop after that. That allows you to:
(a) follow third party and social media links;
(b) archive linked pages in their native format at the moment they are referenced, in case they are deleted or altered in the future;
(c) capture those pages in the context in which they are referenced, so the full story and experience of navigating to that piece of information, including how it was presented, is fully preserved.
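The first step in following a “hop” is discovering where a captured page links to. Hanzo’s actual crawler is proprietary; as a rough illustration only, the sketch below extracts link targets from a page using just the Python standard library (the function names and URLs are hypothetical). Each discovered target would then be fetched and archived as its own record.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in a captured page."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the captured page's URL.
                    self.links.append(urljoin(self.base_url, value))

def one_hop_targets(base_url: str, html: str) -> list[str]:
    """Return the hop destinations referenced by a page's HTML."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

page = '<a href="/about">About</a> <a href="https://twitter.com/hanzo">Tweet</a>'
targets = one_hop_targets("https://example.com/", page)
```

Repeating this on each archived hop target is what produces the second-level hop described above.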
A similar rule, and concern, applies to social media. FINRA Regulatory Notice 17-18 states that “By sharing or linking to specific content, the firm has adopted the content and would be responsible for ensuring that, when read in context with the statements in the originating post, the content complies with the same standards as communications created by, or on behalf of, the firm.” That means you need to capture “hops” and linked content on social media too.
Now let’s turn our attention to the SEC, specifically SEC Rule 17a-4, which lays out the details and criteria for how electronic communications need to be preserved. It requires that:
(a) archived electronic communications are stored in a format that is non-rewritable and non-erasable, also known as WORM (write once, read many);
(b) there is a way to verify the quality and accuracy of the archived content;
(c) the archives are serialized, retained on electronic storage media, and downloadable to any other accepted medium.
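The verification requirement is commonly satisfied with cryptographic fingerprints. As a hedged sketch of the general technique (not Hanzo’s specific mechanism): compute a SHA-256 digest of each record at capture time, store it, and recompute it later to prove the archive hasn’t changed.

```python
import hashlib

def fingerprint(record: bytes) -> str:
    """SHA-256 digest of an archived record, computed at capture time."""
    return hashlib.sha256(record).hexdigest()

def verify(record: bytes, stored_digest: str) -> bool:
    """Recompute and compare later: any tampering changes the digest."""
    return fingerprint(record) == stored_digest

captured = b"WARC/1.0\r\nWARC-Type: response\r\n..."
digest = fingerprint(captured)
```

Because even a one-byte alteration produces a completely different digest, a matching fingerprint demonstrates that what you present to a regulator is bit-for-bit what you captured.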
Hanzo’s web archiving technology, and the WARC files at the heart of it, check all of these boxes—because that’s what they were designed to do. By contrast, the PDFs and other image file formats that other vendors use may meet some of those requirements, or none at all. When exploring other vendors and weighing your options, make sure you ask whether they use WARC files. If they don’t, ask them why. If they can’t give you a good answer, you may want to reconsider whether they’re the right fit for your organization.
Trust, Longevity, and Quality Assurance
We recently conducted an archive risk analysis for an organization working with another vendor. Upon completing that analysis, we discovered that the majority of their website was not being captured and archived, despite their expectation that it would be. The core of this problem wasn’t any malicious attempt on the vendor’s part to mislead the organization. Rather, the vendor took a faulty approach, relying on a sitemap to determine what to capture, and then magnified its error through a lack of quality assurance (QA) and testing. When it comes to risk management and compliance, it’s always easier to prevent and prepare for what you know; it’s the blind spots, and the false sense of confidence they enable, that can cause the biggest problems.
Here’s the thing: sitemaps change, and frankly, they aren’t always 100 percent accurate. Whenever a new page is added to your website, it complicates the existing sitemap model, which can result in missing content and data on both the front end and the back end. In other words, working with a vendor that relies exclusively on sitemap information without implementing a certain standard of quality control can result in days, weeks, or even years of missing data and unarchived content.
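To make the sitemap blind spot concrete, here is a simple illustration (all URLs hypothetical): compare the URLs a sitemap declares against the URLs an actual crawl discovers. The difference is content a sitemap-only vendor would silently fail to archive.

```python
def coverage_gap(sitemap_urls: set[str], discovered_urls: set[str]) -> set[str]:
    """Pages reachable by crawling but absent from the sitemap."""
    return discovered_urls - sitemap_urls

sitemap = {"https://example.com/", "https://example.com/about"}
crawled = {"https://example.com/", "https://example.com/about",
           "https://example.com/new-product"}

# The new product page was never added to the sitemap, so a
# sitemap-driven archive would miss it entirely.
gap = coverage_gap(sitemap, crawled)
# gap == {"https://example.com/new-product"}
```

A crawl that discovers links directly, rather than trusting a declared list, keeps this gap empty by construction.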
It’s also essential, over the long run, to be able to manage and search your entire archive with specific criteria, keywords, and Boolean operators that help you find relevant information.
Hanzo’s crawling technology automatically discovers new web pages created by your team and adapts to capture those pages. It can also deliver alerts when certain keywords do or don’t appear, and it’s capable of searching for conditional information across millions of archived pages.
During our onboarding process, we aren’t just analyzing the accuracy of your crawl and determining whether it’s capturing the pages you need. We’re also measuring the fidelity and purity of the archived content against its live counterpart. This level of qualitative QA goes well beyond the numbers, ensuring that the content you capture is actually usable, and valuable, in the future.
Last, but certainly not least, we have the ability to customize our technology and write new bespoke code to meet a unique need or use case. For instance, we might create code to archive a very specific, personalized path on your website that someone would encounter if they provided certain information and criteria.
If you have a problem that isn’t solved by our existing web archiving technology, Hanzo Dynamic Capture, or our wider technology set, Hanzo Dynamic Archive, we’re confident that we’ll be able to custom build a solution that does exactly what you need it to.
So, where do you go from here? Bookmark this article for future reference or send it to the person responsible for these decisions within your organization. If you want to experience firsthand how our archives are different, step into our Time Machine to navigate a set of archived content we captured in March 2019. We’ve got content there from MarketWatch, SB Nation, Daily Beast, and BBC’s news websites.
Ready to start a conversation with an archiving expert at Hanzo? Complete the forms on these pages to request a free consultation about your website or your social media profiles and learn more about how we can help.