How machine translation sometimes struggles with non-English languages
And why it matters for global development
Machine translation is an important piece of infrastructure for the modern world because it helps people communicate within and across cultures. Unfortunately, because machine translation models are developed in an English-centric way, they often make mistakes when translating between languages other than English, especially when those languages make distinctions that English lacks.
Example 1: Second-person pronouns
For example, English has essentially one second-person pronoun: you. In modern English, you serves in both formal and informal contexts, and it officially doubles as the plural pronoun, although various dialects have workarounds for the lack of a distinct plural form. By contrast, Spanish, French, German, and other European languages have singular and plural second-person pronouns as well as formal and informal ones.
For instance, most dialects of Spanish have the following second-person pronouns:[1]
tú: singular, informal
usted: singular, formal
ustedes: plural
In addition, many Latin American dialects have the singular pronoun vos, while the standard dialect in Spain has the plural pronoun vosotros.
Similarly, the most commonly used second-person pronouns in Mandarin Chinese are:[2]
你 nǐ: singular, informal
您 nín: singular, formal
你們 nǐmen: plural
So roughly speaking, tú and vos correspond to 你 nǐ in Mandarin, usted corresponds to 您 nín, and vosotros and ustedes correspond to 你們 nǐmen. But Google Translate doesn’t seem to understand these distinctions. For example, at the time of writing, Google translates tú as 您的 nín de (“yours”), which is a possessive phrase, and it translates usted as the informal pronoun 你 nǐ instead of its formal counterpart 您 nín. Going from Chinese to Spanish, Google collapses the informal 你 nǐ, the formal 您 nín, and the plural 你們 nǐmen all into usted.
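The correspondence described above can be written out as a small lookup table. This is an illustrative sketch, not a real translation system: the table encodes only the rough mapping given in this section, it ignores dialect and context, and the variable names are my own.

```python
# Rough correspondence between Spanish and Mandarin second-person
# pronouns, as described above. Real usage depends on dialect,
# register, and context; this table is illustrative only.
ES_TO_ZH_PRONOUNS = {
    "tú": "你",         # singular, informal
    "vos": "你",        # singular, informal (Latin America)
    "usted": "您",      # singular, formal
    "ustedes": "你們",  # plural
    "vosotros": "你們", # plural (Spain)
}
```

A faithful Spanish-to-Mandarin mapping like this one preserves both formality (usted → 您) and number (ustedes → 你們), which is exactly the information Google Translate loses in the examples above.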
Example 2: Japanese demonstratives
Another linguistic concept that Google Translate struggles with is demonstratives. English, for example, has a two-way demonstrative distinction based on distance from the speaker: this, here, and so on for nearby objects and places; that, there, and so on for distant ones. However, both Spanish and Japanese have a three-way distinction: demonstratives for objects near the speaker, near the listener, and away from both the speaker and the listener.
For example, the word for here in Spanish is aquí or acá, and it refers to a place near the speaker. To refer to a place near the listener, we say ahí, and to refer to a place away from the speaker and the listener, we say allí or allá. In Japanese, we use ここ koko to refer to a place near the speaker, そこ soko for a place near the listener, and あそこ asoko for a place away from the speaker and the listener.[3]
To test Google Translate’s fluency with these linguistic concepts, I wrote some example sentences in Spanish and had Google translate them into Japanese:
Spanish: El banco está aquí. (The bank is here.)
Japanese: 銀行はここにあります。(Ginkō wa koko ni arimasu.)
Spanish: El banco está ahí. (The bank is there [near the listener].)
Japanese: 銀行はそこにあります。(Ginkō wa soko ni arimasu.)
Spanish: El banco está allí. (The bank is over there [away from both the speaker and the listener].)
Japanese: 銀行はあそこにあります。(Ginkō wa asoko ni arimasu.)
When I put the first two Spanish sentences (with aquí and ahí) into Google Translate, it translated them just fine. But when I put in the third sentence (with allí), it translated allí as そこ soko, conflating “there” (near the listener) with “over there” (away from both the speaker and the listener).
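The three-way system can be summarized in the same lookup-table style. Again, a sketch rather than a translation system: the table encodes only the correspondences described above, and the function name is my own.

```python
# Spanish and Japanese locative demonstratives, keyed by which of the
# three distance categories they express (as described above).
DEMONSTRATIVES = {
    "near speaker":   {"es": ["aquí", "acá"],  "ja": "ここ"},   # koko
    "near listener":  {"es": ["ahí"],          "ja": "そこ"},   # soko
    "away from both": {"es": ["allí", "allá"], "ja": "あそこ"}, # asoko
}

def correct_japanese(spanish_word):
    """Return the Japanese demonstrative matching a Spanish one."""
    for category in DEMONSTRATIVES.values():
        if spanish_word in category["es"]:
            return category["ja"]
    raise KeyError(spanish_word)
```

Under this mapping, allí should come out as あそこ asoko; Google's output of そこ soko corresponds to the wrong row of the table.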
Explanation
According to this research blog post from Meta (formerly Facebook), most multilingual machine translation models rely on English-language training data, even when translating between two languages other than English. Machine translation algorithms like Google’s are trained on datasets containing millions of pairs of equivalent sentences in different languages. If there are many training examples for a given pair of languages, then the algorithm can learn to translate directly between them. But if there aren’t enough, then the algorithm can bridge the gap by going through a third language.
For example, if there are many English–Spanish and English–Chinese training examples, but too few Spanish–Chinese ones, then the algorithm won’t have enough knowledge to translate directly from Spanish to Chinese or vice versa, and will have to translate Spanish or Chinese text indirectly through English. And when going through English, the algorithm loses much of the meaning in the original text, such as whether the text used a singular pronoun like vos or a plural one like ustedes, since both get flattened into the generic pronoun you.
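This information loss can be simulated with a toy pivot translator. To be clear, this is a deliberately simplified sketch of the pivoting idea, not how Google's model actually works; the dictionaries and function name are my own.

```python
# Toy pivot translation: Spanish -> English -> Mandarin.
# English collapses all five Spanish second-person pronouns into
# "you", so the English-to-Mandarin step has no way to recover
# formality or number and must fall back on a single default form.
ES_TO_EN = {"tú": "you", "vos": "you", "usted": "you",
            "ustedes": "you", "vosotros": "you"}
EN_TO_ZH = {"you": "你"}  # one default chosen; the distinctions are gone

def pivot_translate(spanish_pronoun):
    return EN_TO_ZH[ES_TO_EN[spanish_pronoun]]
```

Every input comes out as 你 nǐ, regardless of whether the original was formal (usted) or plural (ustedes): once the distinction is flattened at the English step, no downstream step can restore it.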
Logically, I would expect the number of training examples available for translation tasks to be related to the amount of written material translated between any two languages, since it’s far easier to scrape existing material for training examples than to write new examples from scratch. For example, English is now a global lingua franca, so a lot of documents are translated between English and every other language in the world. Similarly, I would expect to find many documents translated between the languages of Europe, especially English, French, and German, due to their high level of economic, political, and cultural integration. However, I would expect to find fewer existing translations between Spanish and Japanese, let alone good translations.
Why it matters
Machine translation eases communication and cooperation between people of different cultural backgrounds, so in our diverse and increasingly connected world, it can help bring all of us closer together. This makes high-quality machine translation important, especially between languages other than English.
For example, about 900,000 Chinese, 432,000 Japanese, and 230,000 Koreans live in Mexico.[4] High-quality Spanish–Chinese, Spanish–Japanese, and Spanish–Korean machine translation models would be of immediate benefit to these populations. And as the developing regions of Asia, Latin America, and Africa grow their economies, high-quality machine translation between their respective languages would facilitate trade, migration, and collaboration between people in these regions. Improved machine translation software can also improve cooperation and civic engagement in multilingual societies, where it’s important for citizens to understand media produced in multiple languages.
Fortunately, some machine learning scientists at Meta have developed an improved multilingual machine translation model to better serve the more than 3 billion users across their platforms, including Facebook and WhatsApp. Rather than relying on English as the sole bridge language, as many existing translation models do, the Meta team grouped languages into 14 clusters based on linguistic and cultural similarities. They generated training data for all possible pairs of languages within each group, and selected a handful of bridge languages to enable translation between groups. This made much more training data available for translation than with English alone.
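To see why clustering helps, compare how many translation directions each strategy can mine training data for. The following is a toy calculation under a simplified version of the scheme described above (all pairs within a cluster, plus all pairs of bridge languages); the cluster assignments are made up for illustration.

```python
from itertools import permutations

def english_centric_directions(langs):
    # Only pairs involving English are mined directly.
    return {(a, b) for a, b in permutations(langs, 2) if "en" in (a, b)}

def clustered_directions(clusters, bridges):
    # All ordered pairs within each cluster, plus all ordered pairs
    # of bridge languages to connect the clusters.
    dirs = set()
    for cluster in clusters:
        dirs.update(permutations(cluster, 2))
    dirs.update(permutations(bridges, 2))
    return dirs

langs = ["en", "es", "pt", "zh", "ja"]
clusters = [["en", "es", "pt"], ["zh", "ja"]]  # hypothetical grouping
bridges = ["en", "zh"]                         # one bridge per cluster
```

Even in this five-language toy, the clustered strategy covers more directions (10 versus 8) and, more importantly, includes direct pairs like Spanish–Portuguese that the English-centric strategy never trains on at all.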
The result? Average translation quality across more than 2,500 non-English translation directions improved by 5.5 BLEU over an English-centric baseline that first translates the source text to English and then into the target language. The improvement was especially large for pairs of similar languages, such as Spanish and Portuguese, and for pairs of languages spoken in the same country, such as Russian and Kazakh in Kazakhstan, and Afrikaans and Xhosa in South Africa.
Hopefully, with improvements like this, machine translation software will become better able to help people around the world communicate and cooperate.
References
[1] Spanish personal pronouns – Wikipedia
[2] Chinese pronouns – Wikipedia
[3] Japanese grammar § Demonstratives – Wikipedia
[4] Asian Latin Americans § Composition – Wikipedia