> we group languages into 14 language groups based on linguistic classification, geography, and cultural similarities. We did this because people living in countries with languages of the same family tend to communicate more often and would benefit from high-quality translations. For instance, one group would include languages spoken in India, like Bengali, Hindi, Marathi, Nepali, Tamil, and Urdu. We systematically mined all possible language pairs within each group.
> To connect the languages of different groups, we identified a small number of bridge languages, which are usually one to three major languages of each group. In the example above, Hindi, Bengali, and Tamil would be bridge languages for Indo-Aryan languages. We then mined parallel training data for all possible combinations of these bridge languages.
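The mining strategy described in the quoted passage (all pairs within each group, plus all pairs of bridge languages across groups) can be sketched as follows. The group names, memberships, and bridge choices here are illustrative stand-ins, not the paper's actual 14 groups:

```python
from itertools import combinations

# Hypothetical grouping for illustration only; the paper's real groups
# and bridge languages differ.
groups = {
    "indo_aryan": ["bn", "hi", "mr", "ne", "ta", "ur"],
    "germanic": ["de", "en", "nl", "no"],
}
bridges = {
    "indo_aryan": ["hi", "bn", "ta"],
    "germanic": ["en", "de"],
}

def mining_pairs(groups, bridges):
    pairs = set()
    # All language pairs within each group.
    for langs in groups.values():
        pairs.update(combinations(sorted(langs), 2))
    # All pairs among the bridge languages, connecting groups.
    all_bridges = sorted({b for bs in bridges.values() for b in bs})
    pairs.update(combinations(all_bridges, 2))
    return sorted(pairs)

pairs = mining_pairs(groups, bridges)
```

Note how the pair count stays far below full quadratic all-pairs mining: non-bridge languages from different groups (say Marathi and Dutch) are never mined directly and are reached only through bridges.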
A big problem with this kind of massively multilingual machine learning research is that the researchers in question know almost nothing about most of the languages they're dealing with. They also grouped Malayalam with Malay, presumably because the names look similar; the languages themselves are unrelated (Malayalam is Dravidian, Malay is Austronesian). (Though they also say that they focused on languages that get the most translation requests, so maybe this is down to users getting confused about which language they want.)
Their parallel sentence mining project LASER also has problems that are obvious when you know the languages involved. Some time ago I looked at their most confident matches for English-Chinese and briefly thought I was looking at the least confident ones, because Bible quotes were paired with random snippets in Classical Chinese. I think their embedding model was confused by the archaic language.
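A minimal toy of that similarity-based mining shows how this failure mode looks. The "embeddings" below are fabricated three-dimensional vectors standing in for real multilingual sentence embeddings (actual systems use a learned encoder and a margin criterion over nearest neighbours, not raw cosine); the point is that an encoder confused by register can rank a wrong pairing as its single most confident match:

```python
from math import sqrt

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Fabricated vectors: dimension 0 loosely encodes "archaic register",
# dimension 1 "modern news register". An encoder that keys on register
# rather than meaning maps a Bible verse and an unrelated Classical
# Chinese snippet to nearly the same point.
src = {"en_bible_verse": [0.9, 0.1, 0.0],
       "en_news_line":   [0.1, 0.9, 0.2]}
tgt = {"zh_classical_snippet": [0.9, 0.1, 0.0],
       "zh_news_line":         [0.2, 0.8, 0.3]}

# Rank all candidate pairs by similarity, highest first.
matches = sorted(
    ((cosine(es, et), s, t) for s, es in src.items() for t, et in tgt.items()),
    reverse=True,
)
best_score, best_src, best_tgt = matches[0]
```

Here the top-ranked ("most confident") pair is exactly the Bible-verse/Classical-Chinese mismatch, even though the two sentences share no meaning.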
So I'm glad they also used human evaluators and not just BLEU scores, but I'd really have liked to see a human evaluation of their training data. It's possible that the model can average out noise and produce better garbage when you put garbage in, but it might also get completely confused and produce worse garbage. With their testing setup, it's impossible to tell whether more data or better data is needed to improve this model's performance.
Some of the assumptions about language in this paper are just total junk lol... this one is particularly good: "...and for the rest, overlapping vocabulary is a good proxy for similar languages" - this is so wrong I don't even know where to start. The grouping of languages by family is also bizarre; the genetic groupings they give for each language are at all sorts of different levels. They say that cultural and geographic proximity was also a factor in grouping, but e.g. the Mongolic and Kra-Dai families have essentially nothing in common apart from the fact that the people who speak them look sort of similar to a European. Grouping the Afroasiatic languages Somali and Amharic with the Niger-Congo set also seems like the only criterion was the physical appearance of the speakers...
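To see why surface-vocabulary overlap is a poor proxy for relatedness, here's a toy Jaccard comparison. The word lists are tiny hand-picked samples for illustration, not real corpora:

```python
def jaccard(a, b):
    """Jaccard overlap between two vocabularies (sets of surface forms)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hindi and Urdu: largely the same spoken language, but written in
# Devanagari vs. Perso-Arabic script, so surface forms never match.
hindi = ["पानी", "आदमी", "किताब"]
urdu  = ["پانی", "آدمی", "کتاب"]

# English and French: different branches of Indo-European, yet many
# loanwords/cognates are spelled identically.
english = ["table", "nation", "important"]
french  = ["table", "nation", "important"]

related_overlap = jaccard(hindi, urdu)        # closely related pair
unrelated_overlap = jaccard(english, french)  # distant pair
```

The closely related pair scores zero overlap while the distant pair scores maximally, i.e. the proxy gets the ordering exactly backwards whenever script or borrowing dominates.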
There is also no way for a reader of the paper to judge the effectiveness of the algorithm. They cite this evaluation of "semantic accuracy", but nothing about the design of the task, participant selection, example data.
This paper is pretty much junk science. Even the reference section is amateurishly formatted
I don't know anything about the situation there, but it might still make sense to group it if it's in the same "linguistic area" (see https://en.wikipedia.org/wiki/Sprachbund ). E.g. the Apertium translator from Northern Saami to Norwegian is very useful since both languages – though from very different families – are spoken in the same country and speakers have had millennia of contact, so there's more translated text available than you'd otherwise expect from such different languages and there's need for more translations.
TIL that the German "Morgenstund hat Gold im Mund", and I guess the Norwegian version, too (?), is a translation from Latin "Aurora habet aurum in ore". Apparently, this was, at least partially, meant literally (Aurora was said to have worn gold in her hair and mouth).
I find it really interesting how idioms survive even as the language around them changes. In Dutch it's "Morgenstond heeft goud in de mond", which is incredibly close to the German and Norwegian, and also the only place you would realistically find the word "morgenstond", as it has fallen completely out of use.
> "morgenstund har gull i munn" was translated to "early bird has gold in its mouth" which is just weird
I beg your pardon, but - how would you translate it instead? Edit: sorry, is it because of 'morgenstund' → 'early bird'? It shouldn't be legitimate to go in that direction of metaphor - in fact, the engine probably treated "morning time" as the metaphor and "early bird" as the more direct reference - but there is surely a definite link.
In almost all contexts I'd translate it as "the early bird catches the worm". Maybe there is another common English idiom that matches better, but the semantics are very similar. "You snooze, you lose" is another option I found. Neither carries the lyrical aspect. Translating idioms is hard.
Reminds me of mixed idioms like "it's as easy as falling off a piece of cake" or "does the pope shit in the woods?" or "he's burning the midnight oil from both ends".
It tries to be smart, and just like a Google search for something unusual over the last ten years, it sometimes works but most of the time fails in spectacular ways.
Before an AI confidently proposes interpretations of metaphors, it should really provide credentials of intelligence. Otherwise you risk heading toward that fault which is a hallmark of the "especially unrefined": taking one's conclusions as stable and true instead of tentative.
In this specific case, some will interpret "gold in mouth" as "the best time to be active", which is very different from "delightful". So it is best to remain literal, to avoid polluting the output with prejudicial interpretations.
I think "gjøre stas på" must be considered an idiom, and automatic translation services will always have problems with that. I'm fluent in both Norwegian and English, and I wouldn't know how to translate that expression.
> As for how to translate it, I'd say almost anything would be better.
More specifically, now that I've left the office (and cannot work on the train anyway, because Windows apparently activated "Boil my laptop" mode once too often on my way out), I'd suggest something like:
My point is that it's hard to identify - and translate - idioms like these.
One of my favorites is the English idiom "make one's hair stand on end", which can be translated directly as "fikk ens hår til å stå opp", but we Norwegians prefer "gåsehud" (literally "goose skin") or "fikk hårene til å reise seg" (literally "made the hairs rise") instead.
Add to the problem if the scared or excited entity in question is actually called "One."
Facebook seems to be releasing a lot of news about their projects, lately. It all stinks of desperation. I'm thinking Facebook is going the way of MySpace, but it's going to be drawn out and ugly because Instagram and other properties will keep it on life support. If they're broken up by antitrust actions, Facebook is toast.
> Facebook seems to be releasing a lot of news about their projects, lately.
A cursory check of the Facebook Research Blog will tell you that there is no connection between the Research Blog and Facebook business news [0]. If you are looking for correlations, check the timing of conference paper acceptances instead; you will typically see a much better correlation there. The paper linked in the OP will be published at EMNLP 2021, and guess when the conference happens? Surprise: November 10-11, 2021 [1].
This is super true; a lot of the time Facebook's translations are garbage. TikTok's translations, on the other hand, seem super good to me. I wonder who they are using...
> To connect the languages of different groups, we identified a small number of bridge languages, which are usually one to three major languages of each group. In the example above, Hindi, Bengali, and Tamil would be bridge languages for Indo-Aryan languages. We then mined parallel training data for all possible combinations of these bridge languages.
Pretty cool.
https://scontent-arn2-2.xx.fbcdn.net/v/t39.8562-6/122141102_... seems like they actually do improve on quite a lot of single-pair WMT scores.