Deep Learning for Amharic Text-Image Recognition: Algorithm, Dataset and Application

Optical Character Recognition (OCR) is one of the central problems in pattern recognition. Its applications have played a great role in the digitization of document images collected from heterogeneous sources. Many of the well-known scripts have OCR systems with sufficiently high performance to enable OCR applications in industrial and commercial settings. However, OCR systems yield very good results only on a narrow domain and in very specific use cases. Thus, OCR is still a challenging task, and there are other languages with indigenous scripts, such as Amharic, for which no well-developed OCR systems exist. As many as 100 million people speak Amharic, and it is an official working language of Ethiopia. The Amharic script contains about 317 different characters derived from 34 consonants with small changes. The changes involve shortening or elongating one of a consonant's main legs or adding small diacritics to the right, left, top, or bottom of the consonant character. Such modifications give the characters similar shapes and make the recognition task complex, which makes the script particularly interesting for character recognition research. So far, Amharic script recognition models have been developed based on classical machine learning techniques, and they are very limited in addressing the issues of Amharic OCR. The motivation of this thesis is, therefore, to explore and tailor contemporary deep learning techniques for the OCR of Amharic. This thesis addresses the challenges in Amharic OCR through two main contributions. The first is an algorithmic contribution in which we investigate deep learning approaches that suit the demands of Amharic OCR. The second is a technical contribution that comprises several works towards OCR model development; it introduces a new Amharic database consisting of collections of images annotated at the character and text-line level.
It also presents a novel CNN-based framework designed by leveraging the grapheme of characters in Fidel-Gebeta (where Fidel-Gebeta consists of the full set of Amharic characters in a matrix structure) and achieves 94.97% overall character recognition accuracy. In addition to character-level methods, text-line level methods are also investigated and developed based on sequence-to-sequence learning. These models avoid several of the pre-processing stages used in prior works by eliminating the need to segment individual characters. In this design, we use a stack of CNNs before the Bi-LSTM layers and train end-to-end. This model outperforms the LSTM-CTC based network, on average, by a character error rate (CER) of 3.75% on the ADOCR test set. Motivated by the success of attention in addressing the problems of long sequences in Neural Machine Translation (NMT), we propose a novel attention-based methodology that blends the attention mechanism into the CTC objective function. This model performs far better than the existing techniques, with a CER of 1.04% and 0.93% on printed and synthetic text-line images, respectively. Finally, this thesis provides details on the various tasks that have been performed for the development of Amharic OCR. As per our empirical analysis, the majority of the errors are due to poor annotation of the dataset. As future work, the methods proposed in this thesis should be further investigated and extended to deal with handwritten and historical Amharic documents.
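The results above are reported as character error rates. As a rough illustration only, the sketch below computes CER in the way it is commonly defined for OCR: the Levenshtein edit distance between the recognized text and the ground truth, divided by the ground-truth length. That the thesis uses exactly this definition is an assumption, and the function names `edit_distance` and `cer` are hypothetical helpers, not taken from the thesis.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate (assumed definition): edit distance / len(ref)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four gives a CER of 0.25 (25%).
    print(cer("abcd", "abxd"))
```

Because each Amharic syllable in Fidel is a single Unicode code point, comparing Python strings character by character is a reasonable unit for this metric.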
Author: Birhanu Hailu Belay
URN (permanent link): urn:nbn:de:hbz:386-kluedo-62917
Advisor: Didier Stricker
Document Type: Doctoral Thesis
Language of publication: English
Publication Date: 2021/03/09
Year of Publication: 2021
Publishing Institute: Technische Universität Kaiserslautern
Granting Institute: Technische Universität Kaiserslautern
Acceptance Date of the Thesis: 2021/02/10
Date of the Publication (Server): 2021/03/10
Tags: Amharic, Attention, Factored Convolutional Neural Network, OCR
Number of pages: XV, 137
Faculties / Organisational entities: Fachbereich Informatik (Department of Computer Science)
CCS-Classification (computer science): I. Computing Methodologies / I.7 DOCUMENT AND TEXT PROCESSING (H.4-5) (REVISED)
DDC-Classification: 0 General works, computer science, information science / 004 Computer science
Licence: Creative Commons 4.0 - Attribution, NonCommercial, NoDerivatives (CC BY-NC-ND 4.0)