Understanding foreign language document reviews


At this past April’s International Legal Support Leaders Conference (ILSLC) in Washington, DC, John Tredennick, an attorney and the founder and CEO of Catalyst Repository Systems, hosted a seminar devoted to the foreign-language component of e-discovery and document review. He walked through the “how tos” of these reviews.

As he pointed out, the common reference point is Unicode – the buzzword on the lips of every e-discovery provider and in seemingly every press release of every litigation software and service provider.

We’ll try to summarize the “tech talk” as follows:

  • Unicode is the de facto standard for translating characters and symbols of written language – both English and other languages – into numerical values for processing on computers.
  • Software that doesn’t support Unicode may work on Unicode documents in English, but it most likely won’t work on documents in other languages.
  • Using noncompliant software on Unicode documents may cause incorrect display of non-Latin characters and sometimes data-file corruption.
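
To make the first bullet concrete, here is a minimal Python sketch of what “translating characters into numerical values” looks like in practice. The helper names `code_points` and `utf8_length` are ours, chosen for illustration; they are not part of any review tool.

```python
def code_points(text):
    """Return the Unicode code point of each character, e.g. 'A' -> 'U+0041'."""
    return [f"U+{ord(ch):04X}" for ch in text]


def utf8_length(text):
    """Number of bytes the text occupies when stored as UTF-8."""
    return len(text.encode("utf-8"))


english = "law"
japanese = "法律"  # Japanese for "law"

# Each English letter fits in one UTF-8 byte; each of these kanji needs three.
print(code_points(english), utf8_length(english))
print(code_points(japanese), utf8_length(japanese))
```

Software that assumes one byte per character can get by on the English sample but will miscount or misread the Japanese one, which is the gap the second and third bullets describe.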

That last point is crucial. If, for example, Japanese characters cause the review tool to omit documents from search results or to display them improperly during a privilege review, an inadvertent disclosure could result. Further, a noncompliant tool can mangle Unicode characters when exporting documents for production, substituting placeholder symbols for characters it does not recognize.
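
The export problem can be simulated in a few lines of Python. This is only a sketch: `export_legacy` is a hypothetical stand-in for a non-Unicode tool, and forcing text through ASCII is our assumption about how such a tool might behave, not a description of any particular product.

```python
def export_legacy(text, encoding="ascii"):
    """Simulate a non-Unicode export: unknown characters become '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


doc = "Privileged: 弁護士 memo"  # 弁護士 is Japanese for "attorney"
exported = export_legacy(doc)

print(exported)              # the Japanese term is replaced by placeholders
print("弁護士" in doc)        # the original document matches the search term
print("弁護士" in exported)   # the exported copy no longer does
```

Once the characters have been replaced, no search on the exported copy can find the original term, which is how a privileged document could slip through.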

So it is prudent to inquire about the specific Unicode capabilities of your software vendors and demand similar due diligence from service providers.

Is Unicode compliance the same thing as universal language support? No. Unicode compliance means only that the software has the ability to handle documents in languages that include characters beyond the A-Z scheme used in the Latin alphabet. The complexities of searching and reviewing a multilingual document collection are numerous and may require advanced functionality offered in very few of the available litigation tools.

Here are the other buzzwords/concepts to know:

Compounding: Some languages, including German, Dutch, Swedish and Finnish, use compound nouns that may complicate searching. For example, without the proper search syntax, a search based on the German word Kontaktlinse (contact lens) would miss a document that included the word Kontaktlinsenverträglichkeitstest (contact lens compatibility test). Specialized tools exist to facilitate searching individual components of compound nouns, but few litigation support tools have incorporated such technology.
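
The Kontaktlinse example is easy to reproduce. The sketch below contrasts a whole-word search with a crude substring search; both function names are hypothetical, and substring matching is only a rough stand-in for the dictionary-based decompounding that specialized tools perform (it will also over-match unrelated words that happen to contain the term).

```python
import re


def whole_word_hits(term, text):
    """Words in the text that equal the search term exactly."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w == term.lower()]


def compound_hits(term, text):
    """Words in the text that contain the search term anywhere inside them."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if term.lower() in w]


document = "Der Kontaktlinsenverträglichkeitstest wurde bestanden."

print(whole_word_hits("Kontaktlinse", document))  # finds nothing
print(compound_hits("Kontaktlinse", document))    # finds the compound noun
```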

Tokenization: To facilitate rapid searching on large document collections, search tools use a tokenization process to identify discrete words and add them to a searchable index. For most Asian languages – which use very little punctuation, don’t insert spaces between all words, and can have the meaning of characters change based on context – the process for breaking down documents into individual words can be very complex and require language-specific dictionaries. Again, few litigation tools are sophisticated enough to accommodate the idiosyncrasies of some languages.
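
To see why a dictionary matters, compare splitting on spaces with a simple greedy longest-match segmenter. This is a toy sketch: the four-entry dictionary is invented for the example, and production segmenters rely on far larger dictionaries and statistical models.

```python
def whitespace_tokens(text):
    """Tokenize the way a space-delimited language allows."""
    return text.split()


def longest_match_tokens(text, dictionary):
    """Greedy segmentation: at each position take the longest dictionary word,
    falling back to a single character when nothing matches."""
    tokens, i = [], 0
    max_len = max(len(word) for word in dictionary)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


sentence = "契約書を確認した"  # roughly "reviewed the contract"
toy_dictionary = {"契約書", "を", "確認", "した"}  # hypothetical word list

print(whitespace_tokens(sentence))                     # one unsearchable blob
print(longest_match_tokens(sentence, toy_dictionary))  # four indexable words
```

With whitespace splitting, a search for 契約書 (contract) would only match if the index supported substring lookup; with segmentation, the word gets its own index entry.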

Canonicalization: In most languages, there are multiple ways to express a single concept. Most search engines are good at handling the most common form of this in English, the synonym. Other languages, however, have more complex systems for representing concepts in multiple ways. For example, the meaning behind a Japanese ideogram can also be “spelled out” in one of several different kana character sets or transliterated phonetically into the Latin alphabet using the romaji system. Problems also arise in languages where nouns take on prefixes or suffixes based on the context in which they are used. For example, in Arabic the words for “my apple” and “your apple” are distinct written forms built on the same underlying noun, so a search tool has to reduce both to one canonical form to match them.
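
A rough sketch of the Japanese case: map every written variant of a term to one canonical form before indexing and searching. The `VARIANTS` table is hypothetical and hand-built for this example; a real tool would need a full dictionary. The only library call used is Python’s standard `unicodedata.normalize`, which folds full-width and half-width forms together.

```python
import unicodedata

# Hypothetical variant table: kanji, hiragana, katakana and romaji spellings
# of "Tokyo" all point at one canonical form.
VARIANTS = {
    "東京": "東京",
    "とうきょう": "東京",
    "トウキョウ": "東京",
    "tokyo": "東京",
    "toukyou": "東京",
}


def canonical(term):
    """Fold width and case differences, then look up the canonical form."""
    folded = unicodedata.normalize("NFKC", term).casefold()
    return VARIANTS.get(folded, folded)


for spelling in ["東京", "とうきょう", "トウキョウ", "Tokyo", "ＴＯＫＹＯ"]:
    print(spelling, "->", canonical(spelling))
```

If both the index and the query pass through the same function, a search typed in romaji can match a document written in kanji.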

And a big issue: what role does automated document translation play in discovery? It is not an “either/or” situation. You need a blended approach. Catalyst recommends using search experts (such as Lilliam Clementi’s firm Lingua Legal) who are fluent in the languages present in the document collection. Human expertise is a reasonable choice for certain phases of the process, such as creating search-term lists for culling, reviewing documents, and final quality control. But having a translator “shadowing” everyone on the litigation team to translate every search isn’t always practical, especially if more than one foreign language is involved.

Machine translation can help. Although notoriously inaccurate compared to a manual process, less-expensive machine translation still can assist litigators in situations where it is impractical to have a human translator standing at the ready. Although it is not advisable to conclude definitively that there are no relevant documents based on only a search of machine-translated versions of documents, it is quite reasonable to use automated translations to make first-pass culling decisions.
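
One way to picture that first-pass role is the sketch below. It assumes each document already carries a machine-translated text; the record layout (`id`, `machine_translation`) and the function name are hypothetical. Note that documents without hits are routed to a human spot check rather than discarded, in keeping with the caution above.

```python
def first_pass_cull(docs, terms):
    """Split document ids into 'review' (term hit in the machine translation)
    and 'spot_check' (no hit, so sample for human verification)."""
    review, spot_check = [], []
    lowered_terms = [t.lower() for t in terms]
    for doc in docs:
        text = doc["machine_translation"].lower()
        if any(term in text for term in lowered_terms):
            review.append(doc["id"])
        else:
            spot_check.append(doc["id"])
    return review, spot_check


# Hypothetical records; a real collection would come from the review platform.
sample_docs = [
    {"id": "DOC-001", "machine_translation": "The contract was signed in Osaka."},
    {"id": "DOC-002", "machine_translation": "Lunch menu for the cafeteria."},
]

review, spot_check = first_pass_cull(sample_docs, ["contract", "license"])
print(review, spot_check)
```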

Overall, in dealing with foreign languages in discovery, it is important to take a proactive stance, be knowledgeable about the complexities, and ask the right questions of vendors early in the process to avoid costly mistakes in managing discovery.