I.7 DOCUMENT AND TEXT PROCESSING (H.4-5) (REVISED)
Year of publication: 2021
Document Type: Doctoral Thesis
Language: English
Optical Character Recognition (OCR) is one of the central problems in pattern recognition. Its
applications have played a great role in the digitization of document images collected from
heterogeneous sources. Many well-known scripts have OCR systems whose performance is high
enough to enable OCR applications in industrial and commercial settings. However, these systems
yield very good results only in narrow domains and for very specific use cases. OCR therefore
remains a challenging task, and for some languages with indigenous scripts, such as Amharic,
no well-developed OCR systems exist.
As many as 100 million people speak Amharic, and it is an official working language of Ethiopia.
The Amharic script contains about 317 distinct characters, derived from 34 consonants through small
modifications. These modifications involve shortening or elongating one of a consonant's main legs, or
adding small diacritics to the right, left, top, or bottom of the consonant character. As a result, many
characters have similar shapes, which makes the recognition task complex but also makes the script
particularly interesting for character recognition research. So far, Amharic script recognition models
have been developed with classical machine learning techniques, and they address the issues of
Amharic OCR only to a very limited extent. The motivation of this thesis is, therefore, to explore and
tailor contemporary deep learning techniques for the OCR of Amharic.
This thesis addresses the challenges in Amharic OCR through two main contributions. The first
is an algorithmic contribution, in which we investigate deep learning approaches that
suit the demands of Amharic OCR. The second is a technical contribution that comprises several
works towards OCR model development. It introduces a new Amharic database consisting
of collections of images annotated at the character and text-line levels. It also presents a novel
CNN-based framework designed by leveraging the grapheme of characters in Fidel-Gebeta (the full
set of Amharic characters arranged in a matrix structure), which achieves 94.97% overall character
recognition accuracy.
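The matrix structure of Fidel-Gebeta can be illustrated with a small sketch. It assumes, purely for illustration, that the layout of the Unicode Ethiopic block (eight code points per consonant series, starting at U+1200) approximates the rows and columns of Fidel-Gebeta. The function names are hypothetical, and this is not the CNN framework of the thesis, whose own row and column indexing may differ.

```python
from typing import Tuple

# Assumption: the Unicode Ethiopic block stands in for the Fidel-Gebeta matrix.
ETHIOPIC_BASE = 0x1200   # first syllable code point in the Ethiopic block
LAST_SYLLABLE = 0x135A   # last syllable code point in the Ethiopic block
SLOTS_PER_ROW = 8        # Unicode reserves 8 code points per consonant series


def fidel_position(ch: str) -> Tuple[int, int]:
    """Return (row, column): row = consonant series, column = vowel order."""
    code_point = ord(ch)
    if not ETHIOPIC_BASE <= code_point <= LAST_SYLLABLE:
        raise ValueError(f"{ch!r} is not an Ethiopic syllable")
    return divmod(code_point - ETHIOPIC_BASE, SLOTS_PER_ROW)


def fidel_char(row: int, column: int) -> str:
    """Inverse mapping: rebuild the character from its (row, column) pair."""
    return chr(ETHIOPIC_BASE + row * SLOTS_PER_ROW + column)


# U+1208 and U+1209 belong to the same consonant series (same row) and
# differ only in vowel order (column).
print(fidel_position("\u1208"), fidel_position("\u1209"))
```

Under this assumption, a classifier could predict the row and the column as two smaller label sets instead of one label per character, which is one way to read "leveraging the grapheme of characters".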
In addition to character-level methods, text-line-level methods are also investigated and developed
based on sequence-to-sequence learning. These models avoid several of the pre-processing
stages used in prior works by eliminating the need to segment individual characters. In this design,
we use a stack of CNNs before the Bi-LSTM layers and train the network end-to-end. This model
outperforms the LSTM-CTC based network, on average, by a character error rate (CER) of 3.75% on
the ADOCR test set. Motivated by the success of attention in addressing the problem of long sequences
in Neural Machine Translation (NMT), we propose a novel attention-based methodology that blends the
attention mechanism into the CTC objective function. This model performs far better than the existing
techniques, with CERs of 1.04% and 0.93% on printed and synthetic text-line images, respectively.
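To make the CTC and CER terminology above concrete, the following minimal sketch shows standard best-path (greedy) CTC decoding and a character error rate computed from Levenshtein distance. It is not the evaluation code of the thesis: the function names are hypothetical, and the blank index of 0 is an assumption.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label


def ctc_greedy_decode(frame_labels: Sequence[int], blank: int = BLANK) -> List[int]:
    """Collapse repeated frame labels, then drop blanks (best-path decoding)."""
    decoded = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    row = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,        # delete a reference symbol
                row[j - 1] + 1,    # insert a hypothesis symbol
                prev_diag + cost,  # match or substitute
            )
    return row[len(hyp)]


def cer(ref: Sequence, hyp: Sequence) -> float:
    """Character error rate in percent: edit distance over reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Per-frame argmax labels from a hypothetical recognizer; 0 is the blank.
frames = [0, 3, 3, 0, 3, 5, 5, 0]
print(ctc_greedy_decode(frames))  # [3, 3, 5]
print(cer("abcd", "abed"))        # 25.0
```

The blank between the two runs of label 3 is what lets CTC emit a doubled character. Without it, the repeated frames would collapse into a single symbol.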
Finally, this thesis provides details on the various tasks performed during the development
of Amharic OCR. According to our empirical analysis, the majority of the errors are due to poor
annotation of the dataset. As future work, the methods proposed in this thesis should be further
investigated and extended to deal with handwritten and historical Amharic documents.