Three years ago, neural network-based translation started to give better results than anything researchers had developed in the previous twenty years. The sentences it produces sound far more natural than those of earlier translation methods.

First Try: Rule-Based Translation

Translation can be done by adding grammatical rules on top of a basic bilingual dictionary, but programming those rules takes a huge amount of time, and the approach seldom works well in the real world. Languages are full of exceptions, variations, and local idioms, so the resulting translations are often difficult to understand.

A Significant Improvement with Statistical Translation

This is where statistical translation came in. This method applies machine learning to parallel bilingual corpora, associating words or phrases in one language with their counterparts in the other. The algorithm breaks the original sentence into smaller pieces and finds every possible translation for each piece. It then generates all the possible sentences in the target language from the translated pieces, and picks the most likely one by comparing the candidates with the sentences available in its database.
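
To make that concrete, here is a minimal sketch of the idea in Python. The tiny phrase table and sentence counts are invented for illustration; a real statistical system learns them from millions of sentence pairs.

```python
from itertools import product

# Toy phrase table: each English piece maps to its possible French translations.
# (Invented for illustration; real tables are learned from parallel corpora.)
phrase_table = {
    "the cat": ["le chat", "la chatte"],
    "sleeps":  ["dort", "sommeille"],
}

# Toy "database" of French sentences and how often each was observed.
corpus_counts = {
    "le chat dort": 42,
    "la chatte dort": 3,
    "le chat sommeille": 1,
}

def translate(sentence_pieces):
    # 1. Find every possible translation for each piece.
    options = [phrase_table[piece] for piece in sentence_pieces]
    # 2. Generate all possible target sentences from the translated pieces.
    candidates = [" ".join(combo) for combo in product(*options)]
    # 3. Keep the candidate seen most often in the database.
    return max(candidates, key=lambda c: corpus_counts.get(c, 0))

print(translate(["the cat", "sleeps"]))  # -> "le chat dort"
```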

In this way, the machine can estimate which word has the greatest probability of fitting in with the surrounding ones. For example, when translating from English into French, the phrase “the cat” will be given the correct gender because the machine almost always sees the word “chat” with the word “le” placed before it.
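
The same intuition can be shown with a couple of invented word-pair counts:

```python
# Invented counts of word pairs seen in a French corpus: "chat" is almost
# always preceded by "le", so "le" wins when translating "the cat".
bigram_counts = {("le", "chat"): 980, ("la", "chat"): 5}

total = sum(bigram_counts.values())
probabilities = {pair: count / total for pair, count in bigram_counts.items()}

best_article = max(probabilities, key=probabilities.get)[0]
print(best_article, "chat")   # -> "le chat"
print(probabilities)          # "le chat" accounts for ~99.5% of the observations
```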

Limitations of Statistical Machine Translation

However, this method can only translate words within their immediate context. The task becomes much harder when the sentence is long, especially where word order is concerned: the meaning can easily be lost. On top of that, every new pair of languages you want to translate means that experts will need to step in and tune a new translation algorithm. This is why neural networks were introduced into the process.

Introduction to Neural Networks

Neural networks are machine learning algorithms used to predict a result (say, the weather) from the information fed into them (atmospheric pressure, temperature, and so on from past years, paired with the weather that followed). Feeding in this historical data is called “training”. With this kind of model, humans still need to supply the data themselves.
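
As a rough illustration, the sketch below fits a simple linear model rather than an actual neural network, with made-up weather readings, but the principle is the same: show the algorithm past inputs and their outcomes (training), then ask it about a case it has never seen.

```python
import numpy as np

# Invented historical observations: [atmospheric pressure (hPa), temperature (°C)]
# paired with the temperature measured the next day.
features = np.array([
    [1012.0, 18.0],
    [1003.0, 14.0],
    [1021.0, 22.0],
    [ 998.0, 11.0],
])
next_day_temp = np.array([19.0, 13.0, 23.0, 10.0])

# "Training": fit weights so that features @ weights approximates the outcomes.
X = np.column_stack([features, np.ones(len(features))])  # add a bias column
weights, *_ = np.linalg.lstsq(X, next_day_temp, rcond=None)

# Prediction for a new day the model has never seen.
new_day = np.array([1015.0, 20.0, 1.0])
print(round(float(new_day @ weights), 1))
```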

Recurrent neural networks (RNNs), however, have a slight tweak: they feed the result of previous calculations back in as a new input. Such a model can then learn by itself, as it identifies patterns in sequences of data from its previous calculations. For example, an RNN built to identify word patterns could predict the most likely next word in a given sentence. At the same time, the machine learns new phrases and sentences as the user types and confirms what they want to write.
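
Here is a bare-bones sketch of that recurrence. The weights are random and untrained, so the prediction itself is arbitrary; the point is that each step takes both the current word and the result of the previous step.

```python
import numpy as np

# A minimal, untrained recurrent cell (random weights), just to show the
# recurrence: each step combines the current word with the previous result.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, H = len(vocab), 8                      # vocabulary size, hidden size

W_in  = rng.normal(size=(H, V)) * 0.1     # input -> hidden
W_rec = rng.normal(size=(H, H)) * 0.1     # previous hidden -> hidden
W_out = rng.normal(size=(V, H)) * 0.1     # hidden -> scores over the vocabulary

def one_hot(word):
    x = np.zeros(V)
    x[vocab.index(word)] = 1.0
    return x

hidden = np.zeros(H)                      # "memory" of previous calculations
for word in ["the", "cat", "sat"]:
    hidden = np.tanh(W_in @ one_hot(word) + W_rec @ hidden)

scores = W_out @ hidden
print("predicted next word:", vocab[int(np.argmax(scores))])
# Arbitrary here, but after training on real text it would tend to be a
# plausible continuation of the sentence.
```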

The Power of Encoding

The only problem with this is that computers aren’t very good at processing information as words and sentences; they’d much rather have a series of numbers. So we need a second technique: encoding. Encoding means transforming information into code. Here, the information is the sentence, and the code is a series of numbers uniquely associated with that sentence: no other sentence in the language will have exactly the same numbers associated with it. We do not know exactly which numbers distinguish one sentence from another, but that does not really matter, as long as the system can give us back the original sentence from this numerical data.
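
A deliberately simple way to picture this is to give each word a number, as in the sketch below. Real systems encode the whole sentence into a dense series of numbers, but the important property is the same: the numbers can be decoded back into the original sentence.

```python
# A simple stand-in for the encoder: map each word to an ID and back.
vocabulary = {}

def encode(sentence):
    ids = []
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)   # assign the next free number
        ids.append(vocabulary[word])
    return ids

def decode(ids):
    reverse = {number: word for word, number in vocabulary.items()}
    return " ".join(reverse[number] for number in ids)

code = encode("the cat sat on the mat")
print(code)          # e.g. [0, 1, 2, 3, 0, 4]
print(decode(code))  # "the cat sat on the mat"
```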

Now, if you use an RNN to encode an English sentence into code and then decode it straight back into English, that is rather a waste of time. However, if you use one RNN to encode English sentences into code, and add another RNN that converts the encoded information into German via the statistical method, you can get a natural-sounding German sentence in a short amount of time, with a database that keeps on growing with each new translation.
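
Put together, the pipeline looks roughly like the sketch below: one recurrent pass squeezes the English sentence into a series of numbers, and a second pass unrolls those numbers into German words. The vocabularies and weights are invented and untrained, so the output is gibberish until the model has learned from real data.

```python
import numpy as np

# Schematic encoder/decoder pair with random, untrained weights.
rng = np.random.default_rng(1)
en_vocab = ["the", "cat", "sleeps"]
de_vocab = ["die", "katze", "schläft", "<end>"]
H = 8

enc_in, enc_rec = rng.normal(size=(H, len(en_vocab))), rng.normal(size=(H, H))
dec_rec, dec_out = rng.normal(size=(H, H)), rng.normal(size=(len(de_vocab), H))

def encode(words):
    state = np.zeros(H)
    for w in words:                                   # read English word by word
        x = np.eye(len(en_vocab))[en_vocab.index(w)]
        state = np.tanh(enc_in @ x + enc_rec @ state)
    return state                                      # the whole sentence as numbers

def decode(state, max_len=5):
    words = []
    for _ in range(max_len):                          # unroll the code into German
        state = np.tanh(dec_rec @ state)
        word = de_vocab[int(np.argmax(dec_out @ state))]
        if word == "<end>":
            break
        words.append(word)
    return " ".join(words)

print(decode(encode(["the", "cat", "sleeps"])))       # gibberish until trained
```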

Why Data is so Precious

Besides requiring huge processing power, the main limit to this approach is the amount of data it needs. The RNN learns by itself from existing sentences, which represents terabytes of information and leads us to the idea of Big Data. Only the giants (Facebook, Microsoft, Google) have developed such a translation method – because they are the only ones that can afford it!

Read our interview with Gauthier Vasseur for more details about Big Data and the translation industry.


XML Localization Interchange File Format (XLIFF), Translation Memory eXchange (TMX), and Term Base eXchange (TBX). Do any of these sound familiar to you? These are file formats widely used in the translation and localization industry.

The types of files that commonly use these formats are:

  • Files prepared for translation/translated files
  • Translation memories
  • Terminology databases

Built on XML technology, these formats can be used across multiple translation software programs and tools without any file conversion. Let’s say you have a Trados translation memory (TM) that you need to use in Catalyst. You can export your TM as an XML file from Trados and then import that same XML file into Catalyst. Both software programs will be able to read, write, and use the information in the exported file.
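
For instance, a few lines of Python are enough to read the segments of a TMX file. The snippet below is a minimal, hand-written example rather than an actual Trados export.

```python
import xml.etree.ElementTree as ET

# A minimal, hand-written TMX snippet containing one translation unit
# with an English and a French segment.
tmx = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" adminlang="en"
          segtype="sentence" creationtool="example" creationtoolversion="1"
          o-tmf="example"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The cat sleeps.</seg></tuv>
      <tuv xml:lang="fr"><seg>Le chat dort.</seg></tuv>
    </tu>
  </body>
</tmx>"""

root = ET.fromstring(tmx)
for tu in root.iter("tu"):
    for tuv in tu.iter("tuv"):
        lang = tuv.get("{http://www.w3.org/XML/1998/namespace}lang")
        print(lang, "->", tuv.find("seg").text)
```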

How Does XML Work in Translation and Localization?

Software programs read XML files in the same way one would methodically look for information in folders: open the first folder, read what’s inside, close the folder and open the next folder.

We could compare the tags to folder names. When a software program meets an opening tag such as <xliff …>, it opens the “xliff” folder. It closes the folder when it meets the </xliff> closing tag. In the meantime, it has read everything between these two tags.

An XLIFF file contains many translation units, each associating a source segment with its target equivalent. Extra information is added through attributes. That’s how you know, for example, the name of the original file, the approved status of a translation unit, or the language of each element.
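
Here is what that looks like on a minimal, hand-written XLIFF 1.2-style snippet (namespace declarations are left out to keep the example short): the program “opens” each element like a folder and reads the attributes along the way.

```python
import xml.etree.ElementTree as ET

# A minimal XLIFF 1.2-style snippet: one file element holding one translation
# unit, with extra information carried by attributes.
xliff = """<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2">
  <file original="example.txt" source-language="en" target-language="fr"
        datatype="plaintext">
    <body>
      <trans-unit id="1" approved="yes">
        <source>The cat sleeps.</source>
        <target>Le chat dort.</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

root = ET.fromstring(xliff)                 # "opens" the <xliff> folder
for unit in root.iter("trans-unit"):        # then each <trans-unit> folder inside
    print("id:", unit.get("id"), "| approved:", unit.get("approved"))
    print("source:", unit.find("source").text)
    print("target:", unit.find("target").text)
```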

More about XLIFF, TMX and TBX

These three file formats are fairly standard, and each has its own specific tags.
Here is some of the information specific to each (a short sketch for telling the files apart follows the list):

  • XLIFF (translation file): matching translation memory segments, quality match, translator contact details…
  • TMX (translation memory): information about the usage of the term, about how to translate a specific proper name, formatting information…
  • TBX (terminology base): date and author of the term translation entry, everything about the term (entry, definition, field, context…)
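
If you ever need to tell these files apart programmatically, a rough check of the root element is usually enough. The root names below correspond to XLIFF 1.2 and TMX 1.4; older TBX releases use “martif” as the root while newer ones use “tbx”, so both are listed.

```python
import xml.etree.ElementTree as ET

# Rough format detection by root element name; namespaces are stripped
# to keep the check simple.
ROOTS = {"xliff": "XLIFF", "tmx": "TMX", "martif": "TBX", "tbx": "TBX"}

def detect_format(path):
    root = ET.parse(path).getroot()
    name = root.tag.rsplit("}", 1)[-1]      # drop any "{namespace}" prefix
    return ROOTS.get(name.lower(), "unknown")

# Usage: detect_format("memory.tmx") -> "TMX"
```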

You now know a little more about these XML formats and the role they play in exchanging files across translation and localization tools. You can better understand how their files are structured, and how their versatile nature is key to rapid and efficient localization.