To transcribe spoken language into writing, most alphabets offer a largely unambiguous sound-to-letter mapping. However, some writing systems have distanced themselves from this simple principle, and little work exists in Natural Language Processing (NLP) on measuring that distance [orthographic depth].
In this study, we use an Artificial Neural Network (ANN) model to evaluate the transparency between written words and their pronunciation, hence its name Orthographic Transparency Estimation with an ANN (OTEANN). Based on datasets derived from Wikimedia dictionaries, we trained and tested this model to score the percentage of correct predictions in phoneme-to-grapheme and grapheme-to-phoneme translation tasks.
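The transparency score described here amounts to exact-match accuracy over held-out word pairs. A minimal sketch of that metric (the function name, the toy identity "model", and the sample pairs are illustrative, not from the paper):

```python
def transparency_score(model, pairs):
    """Percentage of words whose predicted transcription exactly matches
    the reference, e.g. in the phoneme-to-grapheme direction."""
    correct = sum(1 for source, reference in pairs if model(source) == reference)
    return 100.0 * correct / len(pairs)

# Toy stand-in for the trained ANN: an identity mapping, which is only
# correct when the orthography is perfectly transparent.
pairs = [("kato", "kato"), ("domo", "domo"), ("birdo", "birdoj")]
score = transparency_score(lambda s: s, pairs)  # 2 of 3 exact matches
```

A real evaluation would replace the identity function with the trained sequence-to-sequence model and the toy pairs with dictionary-derived test data.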
The scores obtained on 17 orthographies were in line with the estimations of other studies. Interestingly, the model also provided insight into typical mistakes made by learners who only consider the phonemic rule in reading and writing.
["Among the tested orthographies, Chinese and French, followed by English and Russian, are the most opaque for writing (i.e. the phoneme-to-grapheme direction), and English, followed by Dutch, is the most opaque for reading (i.e. the grapheme-to-phoneme direction); Esperanto, Arabic, Finnish, Korean, Serbo-Croatian, and Turkish are very shallow both to read and to write; Italian is shallow to read and very shallow to write; Breton, German, Portuguese, and Spanish are shallow to read and to write."]
…Our study first confirms that orthographies like Arabic, Finnish, Korean, Serbo-Croatian, and Turkish are highly transparent, whereas others like Chinese, French, and English are highly opaque. For example, relying solely on phoneme-grapheme correspondence, we estimated the chance of correctly writing a French word at 28%; similarly, relying solely on grapheme-phoneme correspondence, we estimated the chance of correctly pronouncing an English word at 31%. For the Dutch, English, and French reading tasks, our ranking is in line with that of van den Bosch et al 1994.
Table 3: Phonemic transparency scores. (OTEANN trained with 10,000 samples.)
Figure 3: Scatterplot of the mean scores. (OTEANN trained with 10,000 samples.)
…Surprisingly, the model also predicted spellings that do not exist but could have existed, in the same vein as ThisWordDoesNotExist.com. For instance, OTEANN predicted the spelling of the French /swaʁe/ as “soirer”, which does not exist but looks like a French infinitive verb that would mean “to celebrate at a party”.
…As OTEANN also points out possible grapheme or phoneme errors when writing or reading phonemically, it could be used to detect likely errors in the dictionaries of transparent orthographies, or to evaluate proposals for improving opaque orthographies.
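For a transparent orthography, that kind of dictionary error detection could be as simple as flagging entries where the model's phoneme-to-grapheme prediction disagrees with the listed spelling. A sketch under that assumption (function and data names are hypothetical):

```python
def flag_suspect_entries(model, dictionary):
    """Return (pronunciation, listed_spelling, predicted_spelling) triples
    for entries where the model disagrees with the dictionary."""
    return [(pron, spelling, model(pron))
            for pron, spelling in dictionary.items()
            if model(pron) != spelling]

# Toy example: in a fully transparent orthography the model reduces to an
# identity mapping, so any mismatch is a candidate dictionary error.
entries = {"kato": "kato", "birdo": "birdoo"}
suspects = flag_suspect_entries(lambda p: p, entries)  # flags "birdoo"
```

In a shallow orthography most disagreements would indicate dictionary typos; in an opaque one the same list would instead enumerate the orthography's irregular words.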
Finally, it would be worthwhile to investigate whether our ANN and its artificial neural units somehow imitate the way a beginner learns to read and write a language. If so, it might suggest that a transparent orthography is easier and faster to learn than an opaque one.