Handwritten Text Recognition of Ukrainian Manuscripts in the 21st Century: Possibilities, Challenges, and the Future of the First Generic AI-based Model

Authors

DOI:

https://doi.org/10.18523/2313-4895.11.2024.226-247

Keywords:

Ukrainian, handwritten text recognition, manuscripts, handwriting, AI

Abstract

This article reports on the development and evaluation of a generic Handwritten Text Recognition (HTR) model for the automatic, computer-assisted transcription of Ukrainian handwriting. The model is publicly available via the HTR platform Transkribus. Its training data encompasses diverse datasets, including historical manuscripts by the renowned poets Taras Shevchenko and Lesya Ukrainka, private correspondence used for the General Regionally Annotated Corpus of Ukrainian (GRAC), and a diary from the Holodomor Museum collection. We evaluate the model’s performance by comparing its theoretical accuracy, a character error rate (CER) of 4.2%, against its practical efficacy when augmented with an AI-based language model for Ukrainian and a Large Language Model. The model is versatile and functional and can thus be applied to the mass digitization of Ukrainian cultural heritage. In our outlook section, we identify possibilities for further improving the model.
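To make the reported figure concrete, the sketch below computes a character error rate under the common definition: the Levenshtein edit distance between a reference transcription and the recognized text, divided by the length of the reference. The function names and the sample strings are illustrative assumptions, not taken from the article, and the abstract does not specify exactly how Transkribus computes the metric.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed reference prefix
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical sample line and recognizer output (one wrong final letter).
    reference = "Садок вишневий коло хати"
    hypothesis = "Садок вишневий коло хаты"
    print(f"CER: {character_error_rate(reference, hypothesis):.1%}")
```

In this illustrative case, one wrong character in a 24-character line yields a CER of about 4.2%, which gives a rough sense of the error density the reported figure implies.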

Author Biographies

Aleksej Tikhonov, University of Freiburg, University of Zurich

Aleksej Tikhonov is a Postdoctoral Researcher in Slavic Linguistics. He earned his doctorate with a dissertation on 18th-century Czech manuscripts of Protestant refugees in exile in Berlin. He is currently working on his habilitation on the identity-forming function of Slavic languages in German rap. Dr. Tikhonov’s research interests include the application of digital humanities methods in philology, corpus linguistics, language contact, and the use of Slavic languages in the online world.

Achim Rabus, University of Freiburg

Achim Rabus is a Full Professor of Slavic Philology (Linguistics) and the Managing Director of the Slavic Seminar at the University of Freiburg, Germany. He earned his doctorate with a dissertation on the language of East Slavic spiritual songs in a cultural context and completed his habilitation focusing on the role of language contact in the development of Slavic standard languages. Professor Rabus’s research interests encompass Paleoslavistics, the computer-assisted transcription and analysis of Slavic languages using AI methods, and corpus linguistics.

Published

2024-12-30

How to Cite

Tikhonov, A., & Rabus, A. (2024). Handwritten Text Recognition of Ukrainian Manuscripts in the 21st Century: Possibilities, Challenges, and the Future of the First Generic AI-based Model. Kyiv-Mohyla Humanities Journal, (11), 226–247. https://doi.org/10.18523/2313-4895.11.2024.226-247