> we group languages into 14 language groups based on linguistic classification, geography, and cultural similarities. We did this because people living in countries with languages of the same family tend to communicate more often and would benefit from high-quality translations. For instance, one group would include languages spoken in India, like Bengali, Hindi, Marathi, Nepali, Tamil, and Urdu. We systematically mined all possible language pairs within each group.
> To connect the languages of different groups, we identified a small number of bridge languages, which are usually one to three major languages of each group. In the example above, Hindi, Bengali, and Tamil would be bridge languages for Indo-Aryan languages. We then mined parallel training data for all possible combinations of these bridge languages.
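The mining strategy described in the quoted passage (all pairs within each group, plus all pairs of bridge languages across groups) can be sketched as follows. The group names, memberships, and bridge choices here are illustrative stand-ins, not the paper's actual 14 groups:

```python
from itertools import combinations

# Hypothetical grouping for illustration only; the paper's real groups
# and bridge languages differ.
groups = {
    "indo_aryan": ["bn", "hi", "mr", "ne", "ta", "ur"],
    "germanic": ["de", "en", "nl", "no"],
}
bridges = {
    "indo_aryan": ["hi", "bn", "ta"],
    "germanic": ["en", "de"],
}

def mining_pairs(groups, bridges):
    pairs = set()
    # All language pairs within each group.
    for langs in groups.values():
        pairs.update(combinations(sorted(langs), 2))
    # All pairs among the bridge languages, connecting groups.
    all_bridges = sorted({b for bs in bridges.values() for b in bs})
    pairs.update(combinations(all_bridges, 2))
    return sorted(pairs)

pairs = mining_pairs(groups, bridges)
```

Note how the pair count stays far below full quadratic all-pairs mining: non-bridge languages from different groups (say Marathi and Dutch) are never mined directly and are reached only through bridges.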
A big problem with this kind of massively multilingual machine learning research is that the researchers in question know almost nothing about most of the languages they're dealing with. They also grouped Malayalam with Malay, presumably because the names look similar; the languages themselves are unrelated (Malayalam is Dravidian, Malay is Austronesian). (Though they also say that they focused on languages that get the most translation requests, so maybe this is down to users getting confused about which language they want.)
Their parallel sentence mining project LASER also has problems that are obvious when you know the languages involved. Some time ago I looked at their most confident matches for English-Chinese and briefly thought I was looking at the least confident ones, because Bible quotes were paired with random snippets in Classical Chinese. I think their embedding model was confused by the archaic language.
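A minimal toy of that similarity-based mining shows how this failure mode looks. The "embeddings" below are fabricated three-dimensional vectors standing in for real multilingual sentence embeddings (actual systems use a learned encoder and a margin criterion over nearest neighbours, not raw cosine); the point is that an encoder confused by register can rank a wrong pairing as its single most confident match:

```python
from math import sqrt

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Fabricated vectors: dimension 0 loosely encodes "archaic register",
# dimension 1 "modern news register". An encoder that keys on register
# rather than meaning maps a Bible verse and an unrelated Classical
# Chinese snippet to nearly the same point.
src = {"en_bible_verse": [0.9, 0.1, 0.0],
       "en_news_line":   [0.1, 0.9, 0.2]}
tgt = {"zh_classical_snippet": [0.9, 0.1, 0.0],
       "zh_news_line":         [0.2, 0.8, 0.3]}

# Rank all candidate pairs by similarity, highest first.
matches = sorted(
    ((cosine(es, et), s, t) for s, es in src.items() for t, et in tgt.items()),
    reverse=True,
)
best_score, best_src, best_tgt = matches[0]
```

Here the top-ranked ("most confident") pair is exactly the Bible-verse/Classical-Chinese mismatch, even though the two sentences share no meaning.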
So I'm glad they also used human evaluators and not just BLEU scores, but I'd really have liked to see a human evaluation of their training data. It's possible that the model can average out noise and produce better garbage when you put garbage in, but it might also get completely confused and produce worse garbage. With their testing setup, it's impossible to tell whether more data or better data is needed to improve this model's performance.
Some of the assumptions about language in this paper are just total junk lol... this one is particularly good: "...and for the rest, overlapping vocabulary is a good proxy for similar languages" - this is so wrong I don't even know where to start. The grouping of languages by family is also bizarre; the genetic groupings they give for each language are at all sorts of different levels. They say that cultural and geographic proximity was also a factor in grouping, but e.g. the Mongolic and Kra-Dai families have essentially nothing in common apart from the fact that the people who speak them look sort of similar to a European. Grouping the Afroasiatic languages Somali and Amharic with the Niger-Congo set also seems like the only criterion was the physical appearance of the speakers...
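To see why surface-vocabulary overlap is a poor proxy for relatedness, here's a toy Jaccard comparison. The word lists are tiny hand-picked samples for illustration, not real corpora:

```python
def jaccard(a, b):
    """Jaccard overlap between two vocabularies (sets of surface forms)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hindi and Urdu: largely the same spoken language, but written in
# Devanagari vs. Perso-Arabic script, so surface forms never match.
hindi = ["पानी", "आदमी", "किताब"]
urdu  = ["پانی", "آدمی", "کتاب"]

# English and French: different branches of Indo-European, yet many
# loanwords/cognates are spelled identically.
english = ["table", "nation", "important"]
french  = ["table", "nation", "important"]

related_overlap = jaccard(hindi, urdu)        # closely related pair
unrelated_overlap = jaccard(english, french)  # distant pair
```

The closely related pair scores zero overlap while the distant pair scores maximally, i.e. the proxy gets the ordering exactly backwards whenever script or borrowing dominates.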
There is also no way for a reader of the paper to judge the effectiveness of the algorithm. They cite this evaluation of "semantic accuracy", but nothing about the design of the task, participant selection, example data.
This paper is pretty much junk science. Even the reference section is amateurishly formatted
I don't know anything about the situation there, but it might still make sense to group it if it's in the same "linguistic area" (see https://en.wikipedia.org/wiki/Sprachbund ). E.g. the Apertium translator from Northern Saami to Norwegian is very useful since both languages – though from very different families – are spoken in the same country and speakers have had millennia of contact, so there's more translated text available than you'd otherwise expect from such different languages and there's need for more translations.
TIL that the German "Morgenstund hat Gold im Mund", and I guess the Norwegian version, too (?), is a translation from Latin "Aurora habet aurum in ore". Apparently, this was, at least partially, meant literally (Aurora was said to have worn gold in her hair and mouth).
I find it really interesting how idioms survive even as the language around them changes. In Dutch it's "Morgenstond heeft goud in de mond", which is incredibly close to the German and Norwegian, and also the only place you would realistically find the word "morgenstond", as it has fallen completely out of use.
> "morgenstund har gull i munn" was translated to "early bird has gold in its mouth" which is just weird
I beg your pardon, but - how would you translate it instead? Edit: sorry, is it because of 'morgenstund' → 'early bird'? It shouldn't be legitimate to go in that direction of metaphor - in fact, the engine probably treated "morning time" as the metaphor and "early bird" as the more direct reference - but there is surely a definite link.
In almost all contexts I'd translate it as "the early bird catches the worm". Maybe there is another common English idiom that matches better, but the semantics are very similar. "You snooze, you lose" is another option I found. Neither carries the lyrical aspect. Translating idioms is hard.
Reminds me of mixed idioms like "it's as easy as falling off a piece of cake" or "does the pope shit in the woods?" or "he's burning the midnight oil from both ends".
It tries to be smart, and just like a Google search for something unusual over the last ten years, it sometimes works but most of the time fails in spectacular ways.
Before an AI confidently proposes interpretations of metaphors, it should really provide credentials of intelligence. Otherwise you risk heading toward that fault which is a hallmark of the "especially unrefined": taking one's conclusions as stable and true instead of tentative.
In this specific case, some will interpret "gold in mouth" as "the best time to be active", which is very different from "delightful". So it is best to remain literal, to avoid polluting the output with prejudicial interpretations.
I think "gjøre stas på" must be considered an idiom, and automatic translation services will always have problems with that. I'm fluent in both Norwegian and English, and I wouldn't know how to translate that expression.
> As for how to translate it, I'd say almost anything would be better.
More specifically, now that I've left the office (and cannot work on the train anyway, because Windows apparently activated "Boil my laptop" mode once too often on my way out), I'd suggest something like:
My point is that it's hard to identify - and translate - idioms like these.
One of my favorites is the English idiom "make one's hair stand on end", which can be translated directly as "fikk ens hår til å stå opp", but we Norwegians prefer "gåsehud" (literally "goose skin") or "fikk hårene til å reise seg" (literally "made the hairs rise") instead.
Add to the problem if the scared or excited entity in question is actually called "One."
Facebook seems to be releasing a lot of news about their projects, lately. It all stinks of desperation. I'm thinking Facebook is going the way of MySpace, but it's going to be drawn out and ugly because Instagram and other properties will keep it on life support. If they're broken up by antitrust actions, Facebook is toast.
> Facebook seems to be releasing a lot of news about their projects, lately.
A cursory check of the Facebook Research Blog will tell you that there is no connection between the Research Blog and Facebook business news [0]. If you are looking for correlations, check the timing of conference paper acceptances instead; you will typically see a much better correlation there. The paper linked in the OP will be published at EMNLP 2021, and guess when the conference happens? Surprise: November 10-11, 2021 [1].
This is super true; a lot of the time Facebook's translations are garbage. TikTok's translations, on the other hand, seem super good to me. I wonder who they are using...
> To connect the languages of different groups, we identified a small number of bridge languages, which are usually one to three major languages of each group. In the example above, Hindi, Bengali, and Tamil would be bridge languages for Indo-Aryan languages. We then mined parallel training data for all possible combinations of these bridge languages.
Pretty cool.
https://scontent-arn2-2.xx.fbcdn.net/v/t39.8562-6/122141102_... seems like they actually do improve on quite a lot of single-pair WMT scores.