“How Machine Learning Can Help Unlock the World of Ancient Japan”, Alex Lamb, 2019-11-17:

Humanity’s rich history has left behind an enormous number of historical documents and artifacts. However, virtually none of these documents, containing stories and recorded experiences essential to our cultural heritage, can be understood by non-experts due to changes in language and writing over time… This is a global problem, yet one of the most striking examples is the case of Japan. From 800 until 1900 AD, Japan used a writing system called Kuzushiji, which was removed from the curriculum in 1900 when elementary school education was reformed. Currently, the overwhelming majority of Japanese speakers cannot read texts more than 150 years old. The volume of these texts—comprising over three million books in storage but readable only by a handful of specially-trained scholars—is staggering. One library alone has digitized 20 million pages from such documents. The total number of documents—including, but not limited to, letters and personal diaries—is estimated to be over one billion. Given that very few people can understand these texts, mostly those with PhDs in classical Japanese literature and Japanese history, it would be very expensive and time-consuming to pay scholars to convert these documents to modern Japanese. This has motivated the use of machine learning to automatically understand these texts.

…Given its importance to Japanese culture, the problem of using computers to help with Kuzushiji recognition has been explored extensively using various methods from deep learning and computer vision. However, these models were unable to achieve strong performance on Kuzushiji recognition. This was due to an inadequate understanding of Japanese historical literature in the optical character recognition (OCR) community and the lack of high-quality standardized datasets. To address this, the National Institute of Japanese Literature (NIJL) created and released a Kuzushiji dataset, curated by the Center for Open Data in the Humanities (CODH). The dataset currently contains over 4,000 character classes and a million character images.

KuroNet: KuroNet is a Kuzushiji transcription model that I developed with my collaborators, Tarin Clanuwat and Asanobu Kitamoto from the ROIS-DS Center for Open Data in the Humanities at the National Institute of Informatics in Japan. The KuroNet method is motivated by the idea of processing an entire page of text together, with the goal of capturing both long-range and local dependencies. KuroNet passes images containing an entire page of text through a residual U-Net architecture (FusionNet) in order to obtain a feature representation…For more information about KuroNet, please check out our paper “KuroNet: Pre-Modern Japanese Kuzushiji Character Recognition with Deep Learning”, which was accepted to the 2019 International Conference on Document Analysis and Recognition (ICDAR).
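The shape bookkeeping behind this whole-page approach can be sketched in a few lines: the page is downsampled through an encoder, upsampled back through a decoder with skip connections (the U-Net/FusionNet pattern), and every spatial position is then projected to per-character-class scores. This is a toy illustration under assumed shapes, not the actual KuroNet code—learned convolutions are stood in for by average pooling, nearest-neighbour upsampling, and a random linear projection:

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling (assumes even height/width)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    """2x nearest-neighbour upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_like_forward(page, num_classes, rng):
    """Toy U-Net-style pass over a whole page: encode by downsampling,
    decode by upsampling, fuse encoder features via skip connections,
    then score every pixel against each character class."""
    # Encoder: two 2x downsamplings, keeping intermediate features as skips.
    skip1 = page
    d1 = avg_pool2(page)
    skip2 = d1
    d2 = avg_pool2(d1)
    # Decoder: upsample and fuse with the matching skip (additive, residual-style).
    u1 = upsample2(d2) + skip2
    u0 = upsample2(u1) + skip1
    # Per-pixel "classifier": a random linear map from the single feature
    # channel to num_classes scores (a stand-in for learned conv weights).
    w = rng.standard_normal(num_classes)
    return u0[None, :, :] * w[:, None, None]  # shape (num_classes, H, W)

rng = np.random.default_rng(0)
page = rng.random((64, 48))                      # stand-in for a page image
scores = unet_like_forward(page, num_classes=10, rng=rng)
pred = scores.argmax(axis=0)                     # per-pixel class prediction
print(scores.shape, pred.shape)
```

Because the decoder restores the input resolution, the model can emit a prediction at every pixel of the page at once, which is what lets a single forward pass capture both local strokes and page-level context.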

Kaggle Kuzushiji Recognition Competition: While KuroNet achieved state-of-the-art results at the time of its development and was published at a top-tier conference on document analysis and recognition, we wanted to open this research up to the broader community. We did this partly to stimulate further research on Kuzushiji and partly to discover ways in which KuroNet may be deficient. Ultimately, after 3 months of competition, which saw 293 teams, 338 competitors, and 2,652 submissions, the winner achieved an F1 score of 0.950. When we evaluated KuroNet on the same setup, it achieved an F1 score of 0.902, which would have put it in 12th place: acceptable, but well below the best-performing solutions.
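For readers unfamiliar with the metric: F1 is the harmonic mean of precision (fraction of predicted characters that are correct) and recall (fraction of ground-truth characters that were found). A minimal sketch, with illustrative counts that are not taken from the competition:

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # correct detections / all detections
    recall = tp / (tp + fn)      # correct detections / all ground-truth characters
    return 2 * precision * recall / (precision + recall)

# e.g. 95 correctly detected characters, 5 spurious detections, 5 missed:
print(round(f1_score(tp=95, fp=5, fn=5), 3))  # → 0.95
```

Because it is a harmonic mean, F1 penalizes a model that trades one kind of error for the other, so a 0.950 score requires both few missed characters and few spurious ones.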

Future Research: The work done by CODH has already led to substantial progress in transcribing Kuzushiji documents; however, the overall problem of unlocking the knowledge in historical documents is far from solved.