Using TAR for Asian Language Discovery

By John Tredennick on October 29, 2018

In the early days, many questioned whether technology assisted review (TAR) would work for non-English documents. There were a number of reasons for this, but one fear was that TAR only “understood” the English language.

Ironically, that was true in a way for the early days of e-discovery. At the time, most litigation support systems were built for ASCII text. The indexing and search software didn’t understand Asian character combinations and thus couldn’t recognize which characters should be grouped together in order to index them properly. In English (and most other Western languages) we have spaces between words, but there are no such obvious markers in many Asian languages to denote which characters go together to form useful units of meaning (equivalent to English words).

Over time, litigation support systems advanced and added the capability to recognize different languages, handle Unicode rather than ASCII text and, ultimately, properly search words or phrases in the challenging CJK languages (Chinese, Japanese and Korean) as well as other non-Western languages such as Arabic, Hebrew and those written in Cyrillic alphabets. Catalyst, for example, upgraded its search system in 2008 to handle multi-language discovery.

We faced the same issues when the first TAR 1.0 systems hit the market in the 2010s. They, like many of their underlying search systems, were still not built to handle the more challenging Asian and Eastern languages, and many questioned whether they had any utility for foreign language discovery. In 2014, the U.S. Department of Justice published a memo that talked about the importance of TAR for its investigations but expressed skepticism about the use of TAR for mixed or non-English language collections.

TAR 2.0 and Multi-Language Discovery

All that changed in the TAR 2.0 era. To understand why TAR can work with non-English documents, you need to know two basic points:

TAR doesn’t understand English or any other language. It uses an algorithm to associate words/tokens with relevant or irrelevant documents.
To use the process for non-English documents, particularly those in Chinese and Japanese, the TAR system must first tokenize the document text so it can identify individual words.

We will hit these topics in order.

1. TAR doesn’t understand English.

It is beyond the province of this blog to provide a detailed explanation of how TAR algorithms work, but a basic explanation will suffice for our purposes. (You can learn more in TAR for Smart People, Third Edition, which is available in downloadable PDF format.) Let me start with this: the TAR algorithm doesn’t understand English or any other language, or the actual meaning of documents. Rather, it simply analyzes words (tokens) according to their frequency in relevant documents compared to their frequency in non-relevant documents.

Think of it this way: We train the system by marking documents as relevant or irrelevant. When we mark a document relevant, the computer algorithm analyzes the words in that document and ranks them based on frequency, proximity and weight. When we mark a document non-relevant, the algorithm does the same, this time giving the words a negative score. At the end of the training process, the computer sums up the analysis from the individual training documents and uses that information to build a model it can use to analyze a larger set of documents.
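To make the training step concrete, here is a minimal sketch in Python. It is not the algorithm any particular TAR product uses: it assumes a simple smoothed log-ratio of token frequencies and ignores proximity entirely. The names `tokenize`, `train_weights` and the four training documents are invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # English-style tokenization: runs of letters/digits, lowercased.
    return re.findall(r"\w+", text.lower())

def train_weights(labeled_docs):
    """Return a token -> weight map from (text, is_relevant) pairs."""
    relevant, non_relevant = Counter(), Counter()
    for text, is_relevant in labeled_docs:
        (relevant if is_relevant else non_relevant).update(tokenize(text))
    vocab = set(relevant) | set(non_relevant)
    # Add-one smoothing so unseen tokens never produce log(0).
    rel_total = sum(relevant.values()) + len(vocab)
    non_total = sum(non_relevant.values()) + len(vocab)
    # Positive weight: relatively more frequent in relevant documents.
    # Negative weight: relatively more frequent in non-relevant documents.
    return {
        tok: math.log((relevant[tok] + 1) / rel_total)
        - math.log((non_relevant[tok] + 1) / non_total)
        for tok in vocab
    }

# Hypothetical training set: (document text, marked relevant?)
training = [
    ("merger pricing agreement", True),
    ("pricing memo for the merger", True),
    ("lunch menu for friday", False),
    ("friday parking notice", False),
]
weights = train_weights(training)
```

In this toy set, "merger" ends up with a positive weight and "friday" with a negative one, purely from counting. Nothing in the code knows what either word means.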

While algorithms work differently, think of a TAR system as creating searches using the words developed during training. Some of the words will have a positive weight and some will have a negative weight. Documents responding to the search are then returned in an ordered ranking, with the most likely relevant ones coming first.
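The ranking idea can be sketched the same way. This example assumes a document's score is simply the sum of the weights of its tokens, which is one of many possible scoring rules. The weights and documents are hypothetical, and the documents are already tokenized, which is all the ranker needs.

```python
def score(tokens, weights):
    # Sum the weight of every token occurrence; unknown tokens contribute nothing.
    return sum(weights.get(tok, 0.0) for tok in tokens)

def rank(documents, weights):
    """Return document ids ordered from most to least likely relevant."""
    return sorted(
        documents,
        key=lambda doc_id: score(documents[doc_id], weights),
        reverse=True,
    )

# Hypothetical weights, as if produced by an earlier training round.
weights = {"merger": 1.2, "pricing": 0.9, "lunch": -1.1, "friday": -0.8}

# Hypothetical collection: document id -> list of tokens.
documents = {
    "doc1": ["friday", "lunch", "menu"],
    "doc2": ["merger", "pricing", "update"],
    "doc3": ["pricing", "for", "friday"],
}
ranking = rank(documents, weights)
```

Here `ranking` puts doc2 first and doc1 last. The tokens could just as well be Japanese or Chinese strings; the arithmetic would not change.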

None of this requires that the computer know the English language or understand anything about the meaning of the returned documents. All the computer needs to know is which words are in which documents and how often they appear.

2. If documents are properly tokenized, the TAR process will work.

Tokenization may be an unfamiliar term to many but it is not difficult to understand. When a computer processes documents for search, it pulls out all of the words and places them in a combined index. When you run a search, the computer doesn’t go through all of your documents one by one. Rather, it goes to an ordered index of terms to find out which documents contain which terms. That’s why search works so quickly. Even Google works this way, using huge indexes of words.
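A toy version of such an index, often called an inverted index, might look like the following. This is a sketch only: it assumes whitespace-separated English text, and the function names and sample documents are made up for the example.

```python
from collections import defaultdict

def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, term):
    # One dictionary lookup; no document is re-read at query time.
    return sorted(index.get(term.lower(), set()))

# Hypothetical three-document collection.
documents = {
    "doc1": "the bank approved the loan",
    "doc2": "the barnyard door was open",
    "doc3": "bank records from the barnyard sale",
}
index = build_index(documents)
```

Calling `search(index, "bank")` returns doc1 and doc3 without scanning any document text, which is why lookups stay fast as the collection grows.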

As we mentioned, however, the computer doesn’t understand words or even that a word is a word. Rather, for English documents it identifies a word as a series of characters separated by spaces or punctuation marks. Thus, it recognizes the words in this sentence because each has a space (or a comma or a period) before and after it. Since not every group of characters is necessarily an actual “word,” information retrieval scientists call these groupings “tokens.” They call the act of recognizing these tokens for inclusion in an index “tokenization.”

All of these are tokens:

Bank
door
12345J
barnyard
mixxpelling

If they are separated by a space or punctuation, they will be recognized by the indexer and cataloged as tokens (words or otherwise) in the index.
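As a rough illustration, a tokenizer of this kind can be written in a few lines of Python. The regular expression below is an assumption made for the example (runs of ASCII letters and digits); production indexers apply more elaborate rules.

```python
import re

def tokenize(text):
    """Return runs of letters and digits; spaces and punctuation act as separators."""
    return re.findall(r"[A-Za-z0-9]+", text)

# The five tokens above, separated by a mix of spaces and punctuation.
tokens = tokenize("Bank, door 12345J; barnyard. mixxpelling!")
print(tokens)
```

The misspelled "mixxpelling" and the mixed "12345J" are cataloged just like real words, because the tokenizer only looks at separators.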

Certain languages, such as Chinese, Japanese and Korean, don’t delineate words with spaces or Western punctuation. Rather, they string individual characters together, often with no breaks at all. It is up to the reader to group characters into words or phrases in order to understand their meaning.

Here is an example:

I do not speak Japanese.

In English it is easy to distinguish the words (tokens) because they are separated by spaces. With this same sentence in Japanese, it is much more difficult to determine where individual words begin and end.

にほんごをはなしません。
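A short Python sketch shows what happens when a separator-based tokenizer meets this sentence. It assumes the tokenizer treats any run of Unicode word characters as one token, which is how Python's `\w` behaves on strings.

```python
import re

def tokenize(text):
    # \w matches Unicode letters and digits, so kana count as word characters.
    return re.findall(r"\w+", text)

english = tokenize("I do not speak Japanese.")
japanese = tokenize("にほんごをはなしません。")

print(len(english), english)
print(len(japanese), japanese)
```

The English sentence yields five tokens. The Japanese sentence yields a single token containing the whole sentence, because there is no space to split on.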

Special tokenization software is required to determine what language is being used (Japanese in this case) and which characters should be grouped together to form what we think of as “words” in English. Properly tokenized, the Japanese sentence would look this way:

にほんご を はなし ません。

The indexer would then catalog these tokens:

にほんご
を
はなし
ません
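One classic way to produce such a segmentation is greedy longest-match against a dictionary. The sketch below is only a teaching example: the four-entry `LEXICON` is hypothetical and hand-built to cover this one sentence, whereas real tokenization libraries rely on large dictionaries and statistical models to resolve ambiguity.

```python
# Hypothetical lexicon covering only the example sentence.
LEXICON = {"にほんご", "を", "はなし", "ません"}
MAX_LEN = max(len(word) for word in LEXICON)

def tokenize(text):
    """Greedy longest-match segmentation against LEXICON."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in LEXICON:
                tokens.append(candidate)
                i += length
                break
        else:
            # No entry starts here (for example, punctuation): skip one character.
            i += 1
    return tokens

tokens = tokenize("にほんごをはなしません。")
print(tokens)
```

Silently skipping unknown characters keeps the sketch short. A real tokenizer would need a fallback for words that are missing from its dictionary.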

It is easy to imagine that a good TAR system can find relevant documents more efficiently when the words or phrases are properly tokenized.

TAR 1.0 Tokenization

TAR 1.0 systems were focused on English-language documents and could not tokenize Asian text. As a result, they defaulted to treating each character as a token, or taking the whole line of text (which could include parts of several sentences) as a token. To use an English analogy, it would be like indexing the sentence above (“I do not speak Japanese”) using the following tokens with their respective counts:

i 1
d 1
o 2
n 2
t 1
s 2
p 2
e 3
a 3
k 1
j 1
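These counts can be reproduced with a few lines of Python, treating every letter as its own token. This is an illustration of the analogy only, not a description of how any TAR 1.0 product was implemented.

```python
from collections import Counter

sentence = "I do not speak Japanese"

# Treat every letter as its own token, ignoring case and spaces.
counts = Counter(ch for ch in sentence.lower() if ch.isalpha())

for ch, n in counts.items():
    print(ch, n)
```

Any other sentence built from the same letters in the same quantities would produce an identical index entry, which hints at how little information survives this kind of tokenization.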

That approach might surface some relevant documents based on character overlap alone. However, because it cannot tell one word from another, the system is unlikely to be accurate or effective.

At Catalyst, we added special language identification and tokenization software to make sure that we handle these languages properly. As a result, our TAR system can analyze Chinese, Japanese and Korean documents just as well as English documents. Word frequency counts work the same way for these documents, and the resulting rankings are just as effective.
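To give a feel for what language identification involves, here is a deliberately crude sketch that guesses a language from the Unicode scripts present in the text. It is not Catalyst's method or that of any library. The function name and the rule that ideographs without kana mean Chinese are assumptions made for the example, and real identifiers use statistical models over much more evidence.

```python
import unicodedata

def guess_language(text):
    """Very rough guess based only on which Unicode scripts appear in the text."""
    names = [unicodedata.name(ch, "") for ch in text]
    if any(n.startswith(("HIRAGANA", "KATAKANA")) for n in names):
        return "japanese"
    if any(n.startswith("HANGUL") for n in names):
        return "korean"
    if any(n.startswith("CJK UNIFIED IDEOGRAPH") for n in names):
        # Assumption: ideographs with no kana are treated as Chinese.
        return "chinese"
    return "other"

for sample in ["にほんごをはなしません。", "한국어", "我不会说日语", "I do not speak Japanese."]:
    print(sample, "->", guess_language(sample))
```

The result of a step like this decides which tokenizer to apply, which is why identification has to come before tokenization.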

Conclusion

As corporations grow globally, legal matters are increasingly likely to involve non-English language documents. Many believed that TAR was not up to the task of analyzing non-English documents. The truth, however, is that with the proper technology and expertise, TAR can be used with any language, even historically difficult languages such as Chinese and Japanese.

We know from experience that a TAR 2.0 system with proper language identification (required to know how to tokenize) and a strong tokenization library can be as effective with non-English languages as with English. To go further on this subject, read our blog post, Using TAR Across Borders: Myths & Facts, and our case study, 57 Ways to Leave Your (Linear) Lover, which involved a Japanese language document collection.

Whether for English or non-English documents, the benefits of TAR are the same. By using computer algorithms to rank documents by relevance, lawyers can review the most important documents first, review far fewer documents overall, and ultimately cut both the cost and time of review. In the end, that is something their clients will understand, no matter what language they speak.

This is an excerpt from TAR for Smart People, Third Edition, which is available in print and in a downloadable PDF format.