A world of seamless and immediate language translation gets closer

Machine translation

 

25 January 2014 – “Literature is the opposite of data,” wrote novelist Stephen Marche in the Los Angeles Review of Books last year. He cited his favorite line from Shakespeare’s Macbeth:

“Light thickens, and the crows make wing to the rooky wood.”

Marche went on to ask, “What is the difference between a crow and a rook? Nothing. What does it mean that light thickens? Who knows?” Although the words work, they make no sense as pure data, according to Marche.

But there are many who would disagree with him. With the rise of digital technologies, the paramount role of human intuition and interpretation in humanistic knowledge is being challenged as never before, and the scientific method has tiptoed into the English department. There is a pitched debate over what it means for the profession, and whether the attempt to quantify something as elusive as human intuition is simply misguided.

It all started when sophisticated technology was developed to perform “topic modeling,” which looks beyond the words to the context in which they are used. It can infer what topics are discussed in each book, revealing patterns in a body of literature that no human scholar could ever spot. Topic-modeling algorithms let us view literature as if through a telescope, scanning vast swaths of text and searching for constellations of meaning: “distant reading,” a term coined by Franco Moretti of Stanford University.

Topic modeling produces “bags” of words that belong together, such as “species global climate co2 water”. Related is Google’s Ngram Viewer, which lets you track the frequency of words or word combinations in the Google Books database over time, revealing shifts in word usage and meaning.
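To make the n-gram idea concrete, here is a minimal sketch of counting bigram frequencies in a plain-text corpus. This is our toy illustration only; the file name is hypothetical, and Google runs this kind of count across millions of scanned books, bucketed by publication year:

```python
from collections import Counter
import re

def bigram_counts(path):
    """Count bigram (2-gram) frequencies in a plain-text file."""
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    return Counter(zip(words, words[1:]))

# Hypothetical corpus file standing in for one year's worth of books.
counts = bigram_counts("books_1900.txt")
print(counts.most_common(5))
```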

But topic modeling overcomes a fundamental limitation of N-grams: you don’t know the context in which the word appeared. Which documents used “black” to mean a color, and which ones used it to refer to a race? N-grams cannot tell you. A topic-modeling algorithm infers, for each word in a document, what topic that word refers to. Automatically, without human intervention, it makes the call as to whether “black” refers to a race or a color. In theory, at least, it reaches beyond the word to capture the meaning.
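As a concrete (and much simplified) illustration, here is a minimal topic-modeling sketch using the open-source gensim library; the two toy documents are invented to show how “black” can land in different topics:

```python
from gensim import corpora, models

# Toy corpus: two documents using "black" in different senses.
docs = [
    "the black paint dried to a deep black color on the canvas",
    "black voters and white voters debated race and civil rights",
]
texts = [doc.split() for doc in docs]

dictionary = corpora.Dictionary(texts)           # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors

# Fit a two-topic LDA model; each topic is a weighted "bag" of words.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=50)
for topic in lda.print_topics(num_words=5):
    print(topic)
```

With a real corpus of thousands of documents, the inferred topics separate word senses far more cleanly than this toy run can.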

In the legal industry, especially e-discovery and document review, quantitative analysis of this kind is becoming old hat. The advent of computer assisted review/technology assisted review/predictive coding … let’s just call it TAR … has shown that advanced computer analytics can produce more accurate results than reviews relying only on keyword search and human eyes. Oh, some debate still lingers over when to use TAR and when not to, and a few quibbles over the numbers, but it is being embraced, slowly, and applied beyond the straight document review use case. And with the e-discovery industry pounding, pounding, pounding away at its value (just scan Rob Robinson’s weekend collection below, or any weekend, really) it will probably become a regular tool in the lawyer’s toolkit.
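At its core, predictive coding is supervised text classification: attorneys code a seed set, a model learns from those calls, and the model then ranks the unreviewed collection. Here is a minimal sketch of that loop using scikit-learn; the documents and labels are invented, and real TAR platforms add sampling, validation and iterative training rounds on top of this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical seed set reviewed by attorneys: 1 = responsive.
seed_docs = [
    "merger negotiations with acme corp q3 pricing",
    "lunch order for the team friday",
    "draft term sheet acme acquisition confidential",
    "fantasy football league standings",
]
seed_labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(seed_docs)
clf = LogisticRegression().fit(X, seed_labels)

# Score the unreviewed collection and surface likely-responsive docs.
unreviewed = ["acme corp due diligence schedule", "holiday party rsvp"]
scores = clf.predict_proba(vec.transform(unreviewed))[:, 1]
for doc, score in sorted(zip(unreviewed, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {doc}")
```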

But the big money play is really making automated language translation easier, faster and more reliable: a world of seamless and immediate translation. Google, Facebook, IBM and Microsoft are all devoting gobs of money to perfecting instant, seamless translation, with Google and IBM creating special legal translation units.

For the vendors and the multinational companies who need it, the business model is a no-brainer: an automated, instant, seamless translation platform would be so valuable to a corporation that Google and IBM could charge a substantial amount for such a tool.

The usual issues are well known: context, syntax, intonation and ambiguity. Because a computer system is not context-aware, it can grab the wrong word. It does not understand the language at all; it tries to decode words rather than meaning. And many languages are simply dissimilar: they lack corresponding common words, or use them in very different ways.

But the technology is getting much, much better. Last week in Paris we attended a Microsoft Research event with its Natural Language Processing group, and this coming week we’ll be at a Google workshop on machine translation (MT) projects focused on creating MT systems and technologies that cater to the multitude of translation scenarios today, including legal. As we reported last year, Tomas Mikolov and his team at Google have developed a technique that automatically generates dictionaries and phrase tables that convert one language into another.

The Google advances are incredible. In short (to summarize the paper we just linked to under Tomas’ name above), the Google team translates missing word and phrase entries by learning language structures from large monolingual data and a mapping between languages from small bilingual data. Their system uses distributed representations of words and “learns” a linear mapping between the vector spaces of the two languages. Despite its simplicity, the method is surprisingly effective: they achieved almost 90% precision for translation of words between English and Spanish. The method makes few assumptions about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pair.
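A rough sketch of the core idea, with random vectors standing in for real embeddings (in the paper the vectors come from word2vec models trained on large monolingual corpora, and the seed dictionary is the small bilingual data):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50          # embedding dimension
n_pairs = 500   # size of the small bilingual seed dictionary

# Stand-ins for trained monolingual embeddings.
X = rng.standard_normal((n_pairs, d))                # source-language vectors
true_W = rng.standard_normal((d, d))
Z = X @ true_W.T + 0.01 * rng.standard_normal((n_pairs, d))  # target vectors

# Learn the linear map W by least squares: min_W sum ||x_i W - z_i||^2
W, *_ = np.linalg.lstsq(X, Z, rcond=None)            # maps source -> target

# Translate a new source word: project it, then take the nearest
# target-language vector by cosine similarity.
x_new = rng.standard_normal(d)
projected = x_new @ W
sims = (Z @ projected) / (
    np.linalg.norm(Z, axis=1) * np.linalg.norm(projected))
print("best translation candidate (index):", int(np.argmax(sims)))
```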

But the “Big Plan” is to represent an entire language using the relationships between its words. The set of all those relationships, the so-called “language space”, can be thought of as a set of vectors, each pointing from one word to another. And in recent years, linguists have discovered that these vectors can be manipulated mathematically. For example, the operation ‘king’ – ‘man’ + ‘woman’ results in a vector that is similar to ‘queen’.
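Here is what that arithmetic looks like with gensim and a pretrained word2vec model; the file is the vectors file Google released with its word2vec toolkit, and the path is assumed (adjust it to wherever you have the download):

```python
from gensim.models import KeyedVectors

# Pretrained vectors released with Google's word2vec toolkit
# (path assumed; the download is large, roughly 1.5 GB).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# 'king' - 'man' + 'woman' lands near 'queen' in the vector space.
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=3))
```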

We are going to have much, much, much more on this in March. In late February the founder of The Posse List will be meeting again with IBM Research’s Cognitive Systems team and Watson team at the Mobile World Congress, with language translation at the top of the agenda, and then attending a “global imperialism of English” workshop in Paris. Such fun!

So for now, contract attorneys with non-English language skills will stay in demand. But just as predictive coding has upended the English language document review market (and is being used in more and more non-English language document reviews), those nasty algorithms are making their way across all languages. As Marc Andreessen said several years ago in his prescient essay “Why Software Is Eating the World”: “all of the technology required to transform industries through software is finally working and can be widely delivered at global scale … don’t be on the wrong side of software-based disruption”.

 

Catarina Conti, CEO

Eric De Grasse, Chief Technology Officer