Vietnamese scientists use AI to transliterate Nôm script into Quốc Ngữ script

Using a database of hundreds of millions of words, a team of scientists in Ho Chi Minh City has developed an artificial intelligence (AI) application that transliterates Nôm script into the modern Vietnamese script (Quốc Ngữ).

The research team consists of 10 lecturers from the Department of Information Technology at the University of Science and the Department of Han-Nom Studies at the Faculty of Literature, University of Social Sciences and Humanities (Vietnam National University, Ho Chi Minh City). The team has been developing the automatic transliteration system since 2020, and it is now complete. Users can access it at: tools.clc.hcmus.edu.vn.

Associate Professor Dr. Dinh Dien, Director of the Computational Linguistics Center at the University of Science, conceived the idea of an automatic transliteration system more than 20 years ago. At that time, however, few Han-Nom resources were available and advanced machine learning models did not yet exist. Years later, with the emergence of deep learning models in artificial intelligence, the team began developing the system.

Associate Professor Dr. Dinh Dien, head of the research team, uses machine learning to translate Nôm script into Vietnamese script. (Photo: Ha An).

The research team collected Han-Nom resources from research institutes, libraries, websites, and scholars in Vietnam and abroad, amassing a database of hundreds of millions of words. The system uses this data in a hybrid model that combines statistical machine translation (SMT) and neural machine translation (NMT).

According to Dr. Dien, NMT is generally better at translating between natural languages, but SMT has advantages for transliterating Han-Nom script into Vietnamese script, because transliteration does not involve the word-order changes commonly encountered in translation. The team therefore combines the two models, depending on the specific case, to achieve the best results. The system runs on a website, and its accuracy varies by field.
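To illustrate why a task without word reordering suits a statistics-driven approach, here is a minimal Python sketch. It is not the team's system: the lexicon, the bigram scores, and the function name `transliterate` are hypothetical toy values, and the example uses Sino-Vietnamese readings of Han characters for simplicity. Each character yields exactly one syllable in the same position, so the only decision is which reading to choose, and neighbouring syllables help make that choice.

```python
# Toy lexicon (illustrative only): each character maps to its candidate readings.
READINGS = {
    "銀": ["ngân"],
    "行": ["hành", "hàng"],  # one character, several possible readings
    "動": ["động"],
}

# Toy bigram scores standing in for statistics learned from a large corpus.
BIGRAM = {
    ("ngân", "hàng"): 5.0,
    ("hành", "động"): 5.0,
}


def transliterate(text: str) -> str:
    """Choose one reading per character, left to right, with no reordering.

    A small Viterbi search: best[r] holds the highest-scoring path that
    ends in reading r at the current position.
    """
    best = {"<s>": (0.0, [])}
    for ch in text:
        candidates = READINGS.get(ch, [ch])  # unknown characters pass through
        new_best = {}
        for reading in candidates:
            new_best[reading] = max(
                (
                    (score + BIGRAM.get((prev, reading), 0.0), path + [reading])
                    for prev, (score, path) in best.items()
                ),
                key=lambda item: item[0],
            )
        best = new_best
    _, path = max(best.values(), key=lambda item: item[0])
    return " ".join(path)


print(transliterate("銀行"))  # ngân hàng
print(transliterate("行動"))  # hành động
```

The same character 行 comes out as "hàng" in one context and "hành" in the other, while the output order always follows the input order.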

For documents in history, literature, and the social sciences, the system is over 90% accurate. For traditional medicine texts and other specialized documents, accuracy is 70%. Notably, it can transliterate the famous poem “Truyện Kiều” with up to 99% accuracy.

To facilitate usage, the research team is developing a model that can translate Nôm script from images. When users upload images containing Nôm script, the application will process and convert it into Vietnamese text.

Dr. Dien noted that in old texts where characters are faded or missing strokes, the model may misidentify them. The team is researching ways to predict handwriting from stroke patterns and from contextual information across the text, so that unclear characters can be guessed accurately. The image translation feature is still experimental and has not yet been released publicly. Initial tests on some low-quality text images show that the model can accurately recognize 95% of the content.
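One way to picture combining visual evidence with context is sketched below in Python. This is an assumption-laden illustration, not the team's method: the function `pick_characters`, the confidence values, and the character-pair probabilities are all hypothetical. It assumes an image recognizer returns several candidate characters with confidence scores for each position (at least one per position), and a table of character-pair probabilities then favours the sequence that reads plausibly.

```python
import math


def pick_characters(candidates, bigram_prob, weight=1.0, floor=1e-6):
    """Pick the character sequence with the best combined score.

    candidates:  one list per position of (character, recognizer confidence)
    bigram_prob: probability of a character following the previous one
    Unseen pairs and zero confidences fall back to a small floor value.
    """
    best = {None: (0.0, "")}  # previous character -> (score, text so far)
    for options in candidates:
        new_best = {}
        for ch, conf in options:
            visual = math.log(max(conf, floor))
            scored = []
            for prev, (score, text) in best.items():
                context = 0.0
                if prev is not None:
                    pair = bigram_prob.get((prev, ch), floor)
                    context = weight * math.log(pair)
                scored.append((score + visual + context, text + ch))
            new_best[ch] = max(scored, key=lambda item: item[0])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical recognizer output: the second character is faded, and the
# visually similar 圍 scores slightly higher than the intended 國.
candidates = [
    [("南", 0.9)],
    [("圍", 0.6), ("國", 0.4)],
    [("山", 0.9)],
]
# Hypothetical pair probabilities estimated from a text corpus.
bigram_prob = {("南", "國"): 0.5, ("國", "山"): 0.3}

print(pick_characters(candidates, bigram_prob))  # 南國山
```

Here the context outweighs the recognizer's slight preference for the wrong character, which is the kind of correction contextual information makes possible for damaged text.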

User interface of the Nôm to Vietnamese transliteration website by the research team.

The research team emphasizes that this is a non-profit project aimed at creating an accurate tool for converting Han-Nom into Vietnamese script. The community can contribute Han-Nom resources to update the training database, making the model more comprehensive and accurate. On the website, researchers can also correct misidentified Han-Nom characters and incorrect transliterations, continually improving the system.

Testing the translation ability of the model from images, with accuracy over 95%. (Photo: NVCC).

Dr. Ho Minh Quang, head of the Department of Oriental Studies at the University of Social Sciences and Humanities (Vietnam National University, Ho Chi Minh City), said the research has significant implications for preserving the Han-Nom linguistic heritage. Previously, reading and understanding Nôm script was largely confined to research circles. The team’s product can help users recognize Nôm texts and study their content once it has been rendered in Vietnamese script. He also noted that community data contributions are important for making the model smarter and more accurate.

Nôm script is still widely found in the community, in royal decrees, genealogy books, contracts, wills, herbal remedies, and other documents. These documents were recorded hundreds of years ago on low-quality materials that deteriorate easily unless preserved under special conditions. They may contain valuable information, but the general public usually cannot read them and must rely on people knowledgeable in Han-Nom to translate them into Vietnamese script.

The research team believes that having a tool to translate Nôm script into Vietnamese script will enable individuals unfamiliar with Han-Nom to decode valuable information from many historical documents left by their ancestors, including traditional herbal remedies still practiced in traditional medicine among the people.