Reviewing the thousands of electronic documents requested or produced during discovery in litigation, in response to a subpoena, or in an internal investigation takes time. As we all know, time is money.


Leveraging technology goes a long way toward cutting down on time and costs. Whether review is done manually or with the help of document review software, the first step in reducing review costs is culling data sets down to only potentially relevant documents. When we tackle large document reviews for our clients, there are three techniques we use to cull massive data sets. These techniques limit our review to the documents most likely to be relevant to the project.


(These are obviously just a few suggestions and do not include other culling methods, such as technology-assisted review (TAR) or artificial intelligence.)


Leveraging Metadata Fields

Metadata fields are a goldmine for culling redundant and irrelevant data. Good fields to utilize include file type, file size, email domain, date, and custodian.
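
Most review platforms can export these fields in a load file, at which point even a few lines of scripting can surface culling candidates. Below is a minimal sketch of the idea in pandas; the file name, column names, and flagged values are assumptions for illustration, not a real export format.

```python
import pandas as pd

# Load an exported metadata load file (file and column names are assumed).
docs = pd.read_csv("loadfile.csv")

# Flag common culling candidates based on metadata alone.
irrelevant_types = ["exe", "dll", "css", "js"]  # system/program file types
bulk_domains = ["newsletter.example.com", "mailer.example.com"]  # bulk senders

cull_candidates = docs[
    docs["file_type"].str.lower().isin(irrelevant_types)
    | docs["email_domain"].str.lower().isin(bulk_domains)
    | (docs["file_size_kb"] == 0)  # zero-byte files
]
print(f"{len(cull_candidates)} of {len(docs)} documents flagged for culling review")
```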


File Name or Email Subject

If your metadata contains file name or email subject fields, it's easy to use them to filter out irrelevant data. For example, many companies send out weekly newsletters that may be irrelevant to your case, and these newsletters usually share the same email subject. As you go through the documents in your database, noting any generic, irrelevant titles can help cut down on document review volume.


Other irrelevant data that may be identified via the email subject field includes out-of-office notices, travel itineraries, meeting requests, conference agendas, HR and administrative documents, and “do not reply” emails.
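
A simple pattern match over the subject field can surface these categories for sampling. The sketch below assumes the same hypothetical load file as above; the patterns are illustrative starting points, not an exhaustive list.

```python
import re

import pandas as pd

docs = pd.read_csv("loadfile.csv")  # assumed export with an "email_subject" column

# Generic subject patterns that often signal non-substantive email.
generic_patterns = [
    r"out of (the )?office",
    r"automatic reply",
    r"do not reply",
    r"weekly newsletter",
    r"meeting request",
    r"travel itinerary",
]
pattern = re.compile("|".join(generic_patterns), flags=re.IGNORECASE)

flagged = docs[docs["email_subject"].fillna("").str.contains(pattern)]
print(f"{len(flagged)} documents match a generic subject pattern")
```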


You can also leverage file names to easily identify irrelevant documents. Scroll through the file names and pick out keywords that you can search for in the file name field to pull up additional irrelevant documents to cull. Be sure to sample the results to confirm that everything pulled back is actually irrelevant.
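
That workflow, sketched with the same assumed load file (the keywords are hypothetical examples of what you might spot while scrolling):

```python
import pandas as pd

docs = pd.read_csv("loadfile.csv")  # assumed export with a "file_name" column

# Keywords spotted while scrolling the file name field (illustrative only).
keywords = ["newsletter", "holiday_schedule", "cafeteria_menu"]

hits = docs[docs["file_name"].fillna("").str.contains("|".join(keywords), case=False)]

# Always sample before culling: eyeball a random slice of the hits.
sample = hits.sample(n=min(25, len(hits)), random_state=1)
print(sample[["file_name"]])
```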


Additionally, file names can be helpful in finding duplicates and near duplicates. Oftentimes the file names of duplicate documents are identical, or near duplicates are marked by version numbers.
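
For instance, stripping trailing version markers lets you group likely duplicate families by file name. This is a rough sketch; the version-numbering conventions it handles are assumptions about how your custodians name files.

```python
import re

import pandas as pd

docs = pd.read_csv("loadfile.csv")  # assumed export with a "file_name" column

def base_name(name: str) -> str:
    """Strip common version markers like 'Contract_v2.docx' or 'Report (3).pdf'."""
    name = re.sub(r"[_ ]v\d+", "", name, flags=re.IGNORECASE)
    name = re.sub(r"\s*\(\d+\)", "", name)
    return name.lower().strip()

docs["dupe_group"] = docs["file_name"].fillna("").map(base_name)
families = docs.groupby("dupe_group").size().sort_values(ascending=False)
print(families.head(10))  # largest potential duplicate families
```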


Extracted Text Size

Extracted text size is not a metadata field that is always populated. However, scripts can be used to populate it. In Relativity, for instance, this script comes pre-made in the application library; once run, it populates the size of each document's extracted text in kilobytes. This is helpful because culling by file size alone can be deceiving, as files with zero content can still carry a large file size. Using the extracted text to determine the size of the document provides you with another data point for filtering out trivial data.
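
Outside of a platform script, the same field is easy to derive yourself if the extracted text files are on disk. A sketch, assuming a hypothetical column pointing at each document's text file:

```python
from pathlib import Path

import pandas as pd

docs = pd.read_csv("loadfile.csv")  # assumed export with an "extracted_text_path" column

def text_size_kb(path: str) -> float:
    """Size of a document's extracted text file in kilobytes (0 if missing)."""
    p = Path(path)
    return p.stat().st_size / 1024 if p.is_file() else 0.0

docs["extracted_text_kb"] = docs["extracted_text_path"].fillna("").map(text_size_kb)
```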


To use this culling method, try filtering your extracted text size by 0 or other small values, then check out the resulting documents. I suggest using extracted text size in conjunction with a filter for documents that are small in file size as well. I also highly recommend reviewing a statistical sample before deciding that the values you filtered by are appropriate for culling entire sets of search results.
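
Continuing the sketch above (the thresholds here are placeholders to illustrate the mechanics, not recommendations):

```python
# Candidates: little or no extracted text AND a small native file.
candidates = docs[(docs["extracted_text_kb"] <= 1) & (docs["file_size_kb"] <= 10)]

# Review a statistical sample before culling the whole set.
sample = candidates.sample(n=min(50, len(candidates)), random_state=1)
print(f"Flagged {len(candidates)} documents; review the {len(sample)}-document sample first")
```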


Clustering

Clustering is a feature that groups similar documents by content and concepts. Many document review platforms have clustering capabilities. Running clustering on your documents enables you to find common themes without looking at a single document. The database will highlight the key concepts for each cluster. Once you view the concepts, you can identify clusters of documents that are not relevant to the matter and can be culled.
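
Review platforms handle this natively, but as a minimal sketch of the underlying idea, here is a toy clustering with scikit-learn that surfaces the top terms per cluster, the same "key concepts" a platform would display. The sample texts are invented for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "quarterly earnings report revenue forecast",
    "weekly cafeteria menu lunch specials",
    "merger agreement due diligence schedule",
    "cafeteria menu holiday hours",
    "earnings call transcript revenue guidance",
    "merger closing conditions escrow terms",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(texts)

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(tfidf)

# Top terms per cluster: the "key concepts" you would scan for relevance.
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(km.cluster_centers_):
    top = center.argsort()[-3:][::-1]
    print(f"cluster {i}:", ", ".join(terms[j] for j in top))
```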


An additional way to find irrelevant documents with clustering is to examine the unclustered documents. Often, these documents were not pulled into cluster groups because they do not contain enough searchable or relevant data. Groups of unclustered documents are often irrelevant and can be culled.
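
The toy example above has no "unclustered" bucket, but you can approximate the idea by setting aside documents with too little searchable text before clustering (the cutoff is illustrative):

```python
MIN_WORDS = 10  # illustrative cutoff for "enough searchable text"

clusterable = [t for t in texts if len(t.split()) >= MIN_WORDS]
unclustered = [t for t in texts if len(t.split()) < MIN_WORDS]
# "unclustered" is a good candidate pool to sample for bulk culling.
```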


Clustering can also be leveraged during the document review phase. Setting up a dashboard with a cluster visualization of responsive documents allows you to quickly see whether certain clusters are consistently marked as not responsive. Clicking on the cluster you want to analyze will break down the responsiveness coding for that cluster. If a cluster contains only irrelevant documents, you can home in on it to analyze whether the remaining untagged documents in that cluster are also irrelevant.
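
The same breakdown is easy to reproduce outside a dashboard. A sketch with pandas, using made-up cluster assignments and coding values:

```python
import pandas as pd

coding = pd.DataFrame({
    "cluster":    [0, 0, 0, 1, 1, 2, 2, 2],
    "responsive": ["no", "no", None, "yes", "no", "no", "no", None],
})

# Responsiveness breakdown per cluster, untagged documents included.
print(pd.crosstab(coding["cluster"], coding["responsive"].fillna("untagged")))
```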


Near Duplicate Analysis

Nowadays, deduplication is standard operating procedure for culling data sets before document review. Near duplicate analysis can also be very helpful in the review process.


Near duplicate analysis organizes documents first by size, largest to smallest, and then designates the largest document as the principal document. All other documents are compared and ranked by their similarity percentage to this principal document. Documents similar to the principal document are grouped together.
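
As a rough sketch of that grouping logic (real platforms use more sophisticated text comparison; Python's difflib stands in here purely for illustration):

```python
from difflib import SequenceMatcher

# Toy documents; the largest becomes the principal.
docs = {
    "contract_v1.txt": "This agreement is made between Acme and Widgetco.",
    "contract_v2.txt": "This agreement is made between Acme and Widgetco, as amended.",
    "menu.txt": "Cafeteria specials: soup, salad, sandwich.",
}

principal_name = max(docs, key=lambda name: len(docs[name]))
principal_text = docs[principal_name]

# Rank every other document by its similarity percentage to the principal.
for name, text in docs.items():
    if name == principal_name:
        continue
    similarity = SequenceMatcher(None, principal_text, text).ratio() * 100
    print(f"{name}: {similarity:.0f}% similar to {principal_name}")
```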


The textual near duplicate similarity percentage gives you an accurate measure of how similar the documents in a group are: the higher the percentage, the more similar the document. As you begin culling data and identifying documents that are not relevant, you can use the near duplicates feature to pull in all documents with a high textual near duplicate similarity percentage and mass code them.

Note: It is important to confirm that the documents are similar enough to the principal document before applying the same coding.
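
A final sketch of that mass-coding step; the 90% threshold is an assumption chosen to illustrate the mechanics and, per the note above, should be validated with a sample first.

```python
import pandas as pd

# Hypothetical near-duplicate group: % similarity to the principal document.
near_dupes = pd.DataFrame({
    "doc_id":     [101, 102, 103, 104],
    "similarity": [100, 97, 91, 62],
})

THRESHOLD = 90  # illustrative cutoff; sample before relying on it

# Propagate the principal document's coding only to close near duplicates.
to_code = near_dupes[near_dupes["similarity"] >= THRESHOLD]
print(f"Mass coding {len(to_code)} of {len(near_dupes)} near duplicates")
```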