8 July 2017 – Last week we attended a Google Research workshop in Zurich, Switzerland on Google’s quest to “end the language barrier”. A significant part of the program was devoted to non-English e-discovery document reviews, and two of the “Googlers” who work at Google’s e-discovery unit in Mountain View, CA were there.
As we have noted in past posts, we have come a long way toward making automated language translation easier, faster and more reliable, but a world of seamless and immediate translation is still out of our grasp.
But it is getting better and better. Google, Facebook, IBM and Microsoft are all devoting gobs of money to perfecting instant, seamless translation, and Google, IBM and Microsoft have each created special legal translation units. In the 1960s the first machine translation architectures were developed by IBM based on mathematical models.
The Google engineers had a simple premise: translate one language into another by finding the linear transformation that maps one onto the other:
“We have developed a technique that automatically generates dictionaries and phrase tables that convert one language into another. The new technique does not rely on versions of the same document in different languages. Instead, it uses data mining techniques to model the structure of a single language and then compares this to the structure of another language. This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs.”
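To make the “linear transformation” idea concrete, here is a minimal sketch in Python. It is not Google’s code. It assumes each word is already represented as a vector, that a small seed dictionary of known word pairs is available (an assumption of this sketch, not something stated in the quote), and that the toy two-dimensional vectors below, which are invented, stand in for real ones. The names `EN`, `ES`, `SEED`, `fit_linear_map` and `translate` are all hypothetical.

```python
import math

# Toy 2-D word vectors; every number here is invented for illustration.
EN = {"one": [1.0, 0.0], "two": [0.0, 1.0], "three": [1.0, 1.0], "four": [-1.0, 1.0]}
# The "Spanish" space is simply the English one rotated by 90 degrees.
ES = {"uno": [0.0, 1.0], "dos": [-1.0, 0.0], "tres": [-1.0, 1.0], "cuatro": [-1.0, -1.0]}
# Hypothetical seed dictionary used to fit the map ("four" is held out).
SEED = [("one", "uno"), ("two", "dos"), ("three", "tres")]


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fit_linear_map(pairs, steps=500, lr=0.1):
    """Fit W so that W applied to EN[src] lands near ES[tgt] (gradient descent)."""
    W = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        for src, tgt in pairs:
            x, z = EN[src], ES[tgt]
            err = [p - t for p, t in zip(matvec(W, x), z)]
            # Gradient of ||W x - z||^2 with respect to W[i][j] is 2 * err[i] * x[j].
            for i in range(2):
                for j in range(2):
                    W[i][j] -= lr * 2 * err[i] * x[j]
    return W


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def translate(word, W):
    """Map an English vector into the Spanish space and pick the nearest word."""
    mapped = matvec(W, EN[word])
    return max(ES, key=lambda cand: cosine(mapped, ES[cand]))


W = fit_linear_map(SEED)
print(translate("four", W))  # prints: cuatro
```

The map is fitted on three word pairs only, yet it places the held-out word “four” next to “cuatro”, because the two toy spaces share the same geometry. That shared structure is the property the premise relies on.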
Ah, the sheer beauty of algorithms.
As one of the presenters noted, Arab newspapers have a reputation, partly deserved, for tamely taking the official line. On any given day, for example, you might read that “a source close to the Iranian Foreign Ministry told Al-Hayat that ‘Tehran will continue to abide by the terms of the nuclear agreement as long as the other side does the same.'”
But the exceptional thing about this unexceptional story is that, thanks to Google, English-speaking readers can now read this in the Arab papers themselves. In the past year free online translators have suddenly got much better. This may come as a surprise to those who have tried to make use of them in the past. But last November Google unveiled a new version of Translate. The old version, called “phrase-based” machine translation, worked on hunks of a sentence separately, with an output that was usually choppy and often inaccurate.
The new system still makes mistakes, but these are now relatively rare, where once they were ubiquitous. It uses an artificial neural network, linking digital “neurons” in several layers, each one feeding its output to the next layer, in an approach that is loosely modelled on the human brain. Neural-translation systems, like the phrase-based systems before them, are first “trained” on huge volumes of text translated by humans.
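For readers who want to see what “each one feeding its output to the next layer” means in practice, here is a bare-bones sketch in Python. It is an illustration only and bears no relation to the size or design of Google’s system. The weights in `LAYERS` are arbitrary placeholders (a real network learns them during training), and the function names are hypothetical.

```python
import math

# Each layer is (weights, biases). The numbers are arbitrary placeholders;
# a real system would learn them from training data.
LAYERS = [
    # 2 inputs -> 3 "neurons"
    ([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, -0.1]),
    # 3 "neurons" -> 1 output
    ([[0.7, -0.5, 0.2]], [0.05]),
]


def layer_forward(weights, biases, inputs):
    """One layer: weighted sum of the inputs plus a bias, squashed by tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs):
    """Feed the output of each layer into the next one."""
    activations = inputs
    for weights, biases in LAYERS:
        activations = layer_forward(weights, biases, activations)
    return activations


print(forward([1.0, 0.0]))
```

Training, which this sketch leaves out, consists of nudging those weights until the network’s outputs match the human translations it is shown.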
But the neural version takes each word, and uses the surrounding context to turn it into a kind of abstract digital representation. It then tries to find the closest matching representation in the target language, based on what it has learned before. Neural translation handles long sentences much better than previous versions did.
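The two steps described above, building a context-dependent representation and then finding the closest match in the target language, can be sketched in a few lines of Python. This is a drastic simplification and not how a production system computes its representations: here “context” is just the average of a word’s vector with its neighbours’ vectors, and all the vectors and names (`EN_VEC`, `ES_VEC`, `contextual_vector`, `closest_target`) are invented for the example.

```python
import math

# Invented 2-D vectors: one axis loosely means "finance", the other "nature".
EN_VEC = {"bank": [0.5, 0.5], "river": [0.0, 1.0], "money": [1.0, 0.0]}
ES_VEC = {"banco": [1.0, 0.2], "orilla": [0.2, 1.0]}


def contextual_vector(tokens, index, window=1):
    """Average a word's vector with those of its neighbours within the window."""
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    neighbours = [EN_VEC[t] for t in tokens[lo:hi]]
    return [sum(col) / len(neighbours) for col in zip(*neighbours)]


def closest_target(vector):
    """Return the target-language word whose vector is nearest (Euclidean)."""
    return min(ES_VEC, key=lambda word: math.dist(vector, ES_VEC[word]))


print(closest_target(contextual_vector(["river", "bank"], 1)))  # prints: orilla
print(closest_target(contextual_vector(["money", "bank"], 1)))  # prints: banco
```

The same English word ends up with a different Spanish match depending on its neighbours, which is the kind of disambiguation the older phrase-based approach tended to fumble.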
The new Google Translate began by translating eight languages to and from English, most of them European. It is much easier for machines (and humans) to translate between closely related languages. But Google has also extended its neural engine to languages like Chinese (included in the first batch) and, more recently, to Arabic, Hebrew, Russian and Vietnamese, an exciting leap forward for these languages that are both important and difficult. This past April Google extended neural translation to nine Indian languages. Microsoft also has a neural system for several hard languages.
OK, Google Translate does still occasionally garble sentences. The introduction to a Haaretz story in Hebrew had text that Google translated as:
“According to the results of the truth in the first round of the presidential elections, Macaron and Le Pen went to the second round on May 7. In third place are Francois Peyon of the Right and Jean-Luc of Lanschon on the far left.”
If you don’t know what this is about, it is nigh on useless. But if you know that it is about the French election, you can see that the engine has badly translated “samples of the official results” as “results of the truth”. It has also given odd transliterations for (Emmanuel) Macron and (François) Fillon (P and F can be the same letter in Hebrew). And it has done something particularly funny with Jean-Luc Mélenchon’s surname. “Me-” can mean “of” in Hebrew. The system is “dumb”, having no way of knowing that Mr Mélenchon is a French politician. It has merely been trained on lots of text previously translated from Hebrew to English.
Such fairly predictable errors should gradually be winnowed out as the programmers improve the system. But some “mistakes” from neural-translation systems can seem mysterious. Users have found that typing in random characters in languages such as Thai, for example, results in Google producing oddly surreal “translations” like: “There are six sparks in the sky, each with six spheres. The sphere of the sphere is the sphere of the sphere.”
Although this might put a few postmodern poets out of work, neural-translation systems aren’t ready to replace humans any time soon. Literature requires far too supple an understanding of the author’s intentions and culture for machines to do the job.
And for critical work – technical, financial or legal, say – small mistakes (of which even the best systems still produce plenty) are unacceptable; a human will at the very least have to be at the wheel to vet and edit the output of automatic systems.
But online translating is of great benefit to the globally curious. Many people long to see what other cultures are reading and talking about, but have no time to learn the languages. Though still finding its feet, the new generation of translation software dangles the promise of being able to do just that.