Why are Neural Networks so Effective for Translation?

Three years ago, neural network-based translation emerged and gave better results than anything researchers had developed in the previous twenty years. Its sentences sound a lot more natural than those produced by earlier translation methods.

First Try: Rule-Based Translation

Translation can be done by adding grammatical rules to a basic bilingual dictionary, but programming those rules takes a huge amount of time, and the results seldom work in the real world. Languages are full of exceptions, variations, and local idioms, which make rule-based translations difficult to understand.
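To make the idea concrete, here is a minimal sketch of a rule-based translator. The tiny English-to-French dictionary, the adjective list, and the single word-order rule are all assumptions invented for illustration; they are not taken from any real system.

```python
# Toy rule-based translator: a bilingual dictionary plus one hand-written rule.
# The vocabulary and the rule are illustrative assumptions, not a real system.
DICTIONARY = {
    "the": "le", "cat": "chat", "black": "noir",
    "it": "il", "rains": "pleut", "cats": "chats",
    "and": "et", "dogs": "chiens",
}
ADJECTIVES = {"black"}


def translate_rule_based(sentence):
    words = sentence.lower().split()
    # Rule: in French, most adjectives follow the noun they modify,
    # so swap each adjective with the word that comes after it.
    i = 0
    while i < len(words) - 1:
        if words[i] in ADJECTIVES:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2
        else:
            i += 1
    # Word-for-word lookup; unknown words are passed through unchanged.
    return " ".join(DICTIONARY.get(word, word) for word in words)


print(translate_rule_based("the black cat"))           # le chat noir
print(translate_rule_based("it rains cats and dogs"))  # il pleut chats et chiens
```

The first sentence comes out fine, but the idiom is translated word for word, where a French speaker would say "il pleut des cordes". Handling it would take yet another hand-written rule, which is exactly the scaling problem described above.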

A Significant Improvement with Statistical Translation

This is where statistical translation came in. This method is a type of machine learning that uses parallel bilingual corpora to associate words or phrases in one language with the corresponding ones in the other language. The algorithm breaks the original sentence into smaller pieces and finds every possible translation for each piece. It then generates all the possible sentences in the target language from the translated pieces, and picks the most likely one by comparing each candidate with the sentences available in its database.
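The steps described here can be sketched in a few lines. Everything in this example is a simplifying assumption: the phrase table, the four-sentence "database", and the scoring function (a plain count of adjacent word pairs) are hypothetical stand-ins for what a real system would learn from large corpora.

```python
from itertools import product

# Hypothetical phrase table: each English piece maps to candidate French pieces.
PHRASE_TABLE = {
    "the": ["le", "la"],
    "cat": ["chat"],
    "sleeps": ["dort", "sommeille"],
}

# Hypothetical "database" of sentences in the target language.
TARGET_CORPUS = [
    "le chat dort",
    "le chat mange",
    "la souris dort",
    "le chien dort",
]


def bigrams(words):
    """Return the pairs of adjacent words in a list of words."""
    return list(zip(words, words[1:]))


# Count how often each pair of adjacent words appears in the database.
BIGRAM_COUNTS = {}
for sentence in TARGET_CORPUS:
    for pair in bigrams(sentence.split()):
        BIGRAM_COUNTS[pair] = BIGRAM_COUNTS.get(pair, 0) + 1


def score(candidate):
    """Score a candidate by how familiar its word pairs are."""
    return sum(BIGRAM_COUNTS.get(pair, 0) for pair in bigrams(candidate))


def translate(sentence):
    pieces = sentence.lower().split()
    options = [PHRASE_TABLE[piece] for piece in pieces]
    # Generate every combination of translated pieces, then keep the best one.
    candidates = [list(combo) for combo in product(*options)]
    return " ".join(max(candidates, key=score))


print(translate("the cat sleeps"))  # le chat dort
```

Four candidates are generated here ("le chat dort", "le chat sommeille", "la chat dort", "la chat sommeille"), and the first wins because its word pairs occur most often in the database. The number of candidates grows multiplicatively with sentence length, which hints at why long sentences are harder.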

Thus, the machine can estimate which word has the greatest probability of fitting in with the surrounding ones. For example, when translating from English to French, the phrase “the cat” will be translated with the correct gender because the machine mostly sees the word “chat” with the word “le” placed before it.
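As a rough illustration of this estimate, the sketch below counts which word comes right before "chat" in a handful of sentences and turns the counts into probabilities. The five-sentence corpus is made up for the example, and the function name is hypothetical.

```python
from collections import Counter

# Made-up French sentences standing in for a much larger corpus.
CORPUS = [
    "le chat dort",
    "le chat noir mange",
    "la chatte dort",
    "un chat dort",
    "le chat joue",
]


def preceding_word_probabilities(corpus, target):
    """Estimate how likely each word is to appear right before `target`."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for previous, current in zip(words, words[1:]):
            if current == target:
                counts[previous] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


probabilities = preceding_word_probabilities(CORPUS, "chat")
print(probabilities)  # {'le': 0.75, 'un': 0.25}
```

In this toy corpus "la" never precedes "chat", so the machine would never pick "la chat", without having been told anything about grammatical gender.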

Limitations of Statistical Machine Translation

However, this method can only translate words within their immediate context. The task becomes more complicated when the sentence is long, especially where word order is concerned. The meaning of the sentence can easily be lost, and every new pair of languages you want to translate means that experts will need to step in and tune a new translation algorithm. This is why neural networks were introduced as part of the process.

Introduction to Neural Networks

Neural networks are machine learning algorithms used to predict a result (say, the weather) from the information fed into them (atmospheric pressure, temperature, and so on, recorded over past years together with the weather that resulted). Feeding the network these past examples is called “training”. With this model, humans need to supply the data themselves.
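Here is a minimal sketch of what training looks like, using a single artificial neuron (the smallest possible network) rather than a full one. The weather observations are invented and already normalised around zero, and the learning rate and number of passes are arbitrary choices made for the example.

```python
import math

# Made-up past observations: (pressure, temperature) -> 1 if it rained, else 0.
# Values are normalised around zero; they are illustrative, not real measurements.
HISTORY = [
    ((-1.0, 0.2), 1),
    ((-0.8, -0.1), 1),
    ((-0.5, 0.3), 1),
    ((0.6, 0.1), 0),
    ((0.9, -0.2), 0),
    ((1.0, 0.4), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Return the estimated probability of rain for one observation."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def train(history, epochs=1000, learning_rate=0.5):
    """Adjust the weights a little after each example to reduce the error."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, rained in history:
            error = rained - predict(weights, bias, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train(HISTORY)
print(predict(weights, bias, (-0.9, 0.0)))  # low pressure: should be above 0.5
print(predict(weights, bias, (0.8, 0.0)))   # high pressure: should be below 0.5
```

The neuron is never told that low pressure means rain. It only sees past examples and nudges its weights until its predictions match them.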

However, recurrent neural networks (RNNs) have a slight tweak, in that they include the result of previous calculations as a new input. Such a model can then learn by itself, as it identifies patterns in data sequences from previous calculations. For example, an RNN built to identify word patterns could predict the next most likely word in a given sentence. At the same time, the machine learns new phrases and sentences as the user types and confirms what they want to write.
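The "slight tweak" can be shown with a one-number recurrent cell. This is only a sketch: the weights below are fixed by hand rather than learned, and a real RNN works on vectors, not single numbers.

```python
import math


def rnn_step(x, h_previous, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: the new state depends on the current input
    AND on the state left over from the previous step."""
    return math.tanh(w_x * x + w_h * h_previous + b)


def run_rnn(inputs):
    """Feed a sequence through the cell and return the state after each step."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states


# The second input is 0.0 in both sequences, but the final state differs
# because the network still "remembers" what came first.
print(run_rnn([1.0, 0.0]))
print(run_rnn([0.0, 0.0]))
```

This memory of earlier inputs is what lets an RNN treat a sentence as a sequence rather than a bag of unrelated words.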

The Power of Encoding

The only problem with this is that computers aren’t very good at processing information in words and sentences. They’d much rather have a series of numbers. We need a second method: encoding. Encoding means transforming information into code. Here the information is the sentence, and the code is a series of numbers unique to that sentence. No other sentence in that language will have the exact same numbers associated with it. We do not know exactly which numbers distinguish one sentence from another, but that doesn’t matter much, so long as the numerical data can give us back the original sentence.
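The round trip from sentence to numbers and back can be illustrated with a deliberately simple scheme: give every word an index. This is an assumption made for the example and is much cruder than the encoding an RNN learns, where the numbers are not readable word by word, but the principle of recovering the sentence from its code is the same.

```python
def build_vocabulary(sentences):
    """Assign each distinct word an integer, in order of first appearance."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def encode(sentence, vocabulary):
    """Turn a sentence into a list of numbers."""
    return [vocabulary[word] for word in sentence.lower().split()]


def decode(numbers, vocabulary):
    """Turn a list of numbers back into the original sentence."""
    index_to_word = {index: word for word, index in vocabulary.items()}
    return " ".join(index_to_word[n] for n in numbers)


vocabulary = build_vocabulary(["the cat sleeps", "the dog sleeps"])
code = encode("the dog sleeps", vocabulary)
print(code)                      # [0, 3, 2]
print(decode(code, vocabulary))  # the dog sleeps
```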

Now, if you’re using an RNN to encode data from English into code and then back into English, it’s kind of a waste of time. However, if you use one RNN to encode English sentences into code, and add a second RNN that converts the coded information into German, trained on parallel sentences as in the statistical method, you can get a natural-sounding German sentence in a short amount of time, with a database that keeps on growing with each new translation.
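The two-RNN arrangement can be sketched structurally as follows. Be aware of the assumptions: the vocabularies, the state size, and all the weights are invented, and the weights are random rather than trained, so the German words that come out are meaningless. The sketch only shows how data flows from the encoder, through a fixed-size code, into the decoder.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
HIDDEN = 4              # size of the code (an arbitrary choice)

SOURCE_VOCAB = ["the", "cat", "sleeps", "eats"]
TARGET_VOCAB = ["die", "Katze", "schläft", "frisst", "<end>"]


def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Untrained (random) parameters. A real system would learn these
# from a large set of parallel English/German sentences.
EMBED = {w: [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for w in SOURCE_VOCAB}
W_IN = random_matrix(HIDDEN, HIDDEN)
W_ENC = random_matrix(HIDDEN, HIDDEN)
W_DEC = random_matrix(HIDDEN, HIDDEN)
W_OUT = random_matrix(len(TARGET_VOCAB), HIDDEN)


def encode(sentence):
    """First RNN: fold the English words, one by one, into a fixed-size code."""
    h = [0.0] * HIDDEN
    for word in sentence.lower().split():
        from_input = matvec(W_IN, EMBED[word])
        from_memory = matvec(W_ENC, h)
        h = [math.tanh(a + b) for a, b in zip(from_input, from_memory)]
    return h


def decode(code, max_words=5):
    """Second RNN: unroll the code into target-language words, one per step."""
    h = code
    words = []
    for _ in range(max_words):
        h = [math.tanh(v) for v in matvec(W_DEC, h)]
        scores = matvec(W_OUT, h)
        word = TARGET_VOCAB[scores.index(max(scores))]
        if word == "<end>":
            break
        words.append(word)
    return words


code = encode("the cat sleeps")
print(len(code))     # always HIDDEN numbers, whatever the sentence length
print(decode(code))  # meaningless until the weights are trained
```

The design point worth noticing is that the code has the same length for a two-word sentence as for a twenty-word one, so the decoder never needs to know anything about English.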

Why Data is so Precious

Besides the huge processing power it requires, this approach is limited by the amount of data. The RNN learns by itself from existing sentences. That represents terabytes of information, which leads us to the idea of Big Data. Only the giants (Facebook, Microsoft, Google) have developed such a translation method, because they are the only ones that can afford it!

Read our interview with Gauthier Vasseur for more details about Big Data and the translation industry.
