I.4 IMAGE PROCESSING AND COMPUTER VISION (REVISED)
3D hand pose and shape estimation from a single depth image is a challenging computer vision and graphics problem with many applications, such as human-computer interaction and animation of a personalized hand shape in augmented reality (AR). The problem is challenging due to several factors, for instance the high number of degrees of freedom, viewpoint variations, and varying hand shapes. Hybrid approaches based on deep learning followed by model fitting preserve the structure of the hand. However, a pre-calibrated hand model limits the generalization of these approaches. To address this limitation, we proposed a novel hybrid algorithm for simultaneous estimation of the 3D hand pose and the bone lengths of a hand model, which allows training on datasets that contain varying hand shapes. Direct joint regression methods, on the other hand, achieve high accuracy but do not incorporate the structure of the hand in the learning process. Therefore, we introduced a novel structure-aware algorithm which learns to estimate the 3D hand pose jointly with new structural constraints. These constraints include finger lengths, distances between joints along the kinematic chain, and inter-finger distances. Learning these constraints helps to maintain a structural relation between the estimated joint keypoints. Previous methods addressed only the problem of 3D hand pose estimation. We opened a new research topic and proposed the first deep network which jointly estimates 3D hand shape and pose from a single depth image. Manually annotating real data for shape is laborious and sub-optimal. Hence, we created a million-scale synthetic dataset with accurate joint annotations and mesh files for the depth maps. However, the performance of this deep network is restricted by the limited representation capacity of the hand model. Therefore, we proposed a novel regression-based approach in which the dense 3D hand mesh is recovered from the sparse 3D hand pose, and weak supervision is provided by a depth image synthesizer. The above-mentioned approaches regressed 3D hand meshes from 2D depth images via 2D convolutional neural networks, which leads to artefacts in the estimations due to perspective distortions in the images. To overcome this limitation, we proposed a novel voxel-based deep network with 3D convolutions, trained in a weakly-supervised manner. Finally, an interesting application is presented: in-air signature acquisition and verification based on deep hand pose estimation. Experiments showed that depth itself is an important feature, which is sufficient for verification.
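To make the structural constraints mentioned above more concrete, the following is a minimal sketch of how finger lengths, distances along the kinematic chain, and inter-finger distances could be computed from a set of 3D joint keypoints and compared between two poses. The 21-joint layout, the function names, and the mean-squared penalty are illustrative assumptions and are not taken from the thesis.

```python
import math

# Hypothetical 21-joint layout (an assumption, not the thesis's definition):
# index 0 is the wrist, followed by four joints per finger from base to tip.
WRIST = 0
FINGERS = {
    "thumb": [1, 2, 3, 4],
    "index": [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring": [13, 14, 15, 16],
    "little": [17, 18, 19, 20],
}


def chain_distances(joints):
    """Distances between consecutive joints along each wrist-to-tip chain."""
    result = {}
    for name, indices in FINGERS.items():
        chain = [WRIST] + indices
        result[name] = [math.dist(joints[a], joints[b]) for a, b in zip(chain, chain[1:])]
    return result


def finger_lengths(joints):
    """Length of each finger, taken here as the sum of its base-to-tip segments."""
    return {name: sum(segments[1:]) for name, segments in chain_distances(joints).items()}


def inter_finger_distances(joints):
    """Distances between corresponding joints of neighbouring fingers."""
    names = list(FINGERS)
    result = {}
    for left, right in zip(names, names[1:]):
        result[(left, right)] = [
            math.dist(joints[a], joints[b]) for a, b in zip(FINGERS[left], FINGERS[right])
        ]
    return result


def structural_penalty(predicted, reference):
    """Mean squared difference between the structural terms of two poses."""
    errors = []
    pred_len, ref_len = finger_lengths(predicted), finger_lengths(reference)
    errors += [(pred_len[n] - ref_len[n]) ** 2 for n in FINGERS]
    pred_chain, ref_chain = chain_distances(predicted), chain_distances(reference)
    for n in FINGERS:
        errors += [(p - r) ** 2 for p, r in zip(pred_chain[n], ref_chain[n])]
    pred_gap, ref_gap = inter_finger_distances(predicted), inter_finger_distances(reference)
    for pair in pred_gap:
        errors += [(p - r) ** 2 for p, r in zip(pred_gap[pair], ref_gap[pair])]
    return sum(errors) / len(errors)


def flat_hand(scale=1.0):
    """Toy pose: wrist at the origin, finger f has joints at (f, k, 0) * scale."""
    joints = [(0.0, 0.0, 0.0)]
    for f in range(5):
        joints += [(f * scale, k * scale, 0.0) for k in range(1, 5)]
    return joints
```

In a learning setting, a term such as `structural_penalty` would typically be added to the per-joint regression loss, so that a prediction with plausible individual joints but implausible bone lengths is still penalized. How the thesis weights or formulates these terms is not stated in the abstract.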
Nowadays, a large part of communication takes place on social media platforms such as Twitter, Facebook, Instagram, or YouTube, where messages often include multimedia content (e.g., images, GIFs, or videos). Since such messages are in digital form, computers can in principle process them in order to make our lives more convenient and help us overcome arising issues. However, these goals require the ability to capture what these messages mean to us, that is, how we interpret them from our own subjective points of view. Thus, the main goal of this dissertation is to advance a machine's ability to interpret social media content in a more natural, subjective way.
To this end, three research questions are addressed. The first question aims at answering "How to model human interpretation for machine learning?" We describe a way of modeling interpretation which allows for analyzing single or multiple ways of interpretation of both humans and computer models within the same theoretic framework. In a comprehensive survey we collect various possibilities for such a computational analysis. Particularly interesting are machine learning approaches where a single neural network learns multiple ways of interpretation. For example, a neural network can be trained to predict user-specific movie ratings from movie features and user ID, and can then be analyzed to understand how users rate movies. This is a promising direction, as neural networks are capable of learning complex patterns. However, how analysis results depend on network architecture is a largely unexplored topic. For the example of movie ratings, we show that the way of combining information for prediction can affect both prediction performance and what the network learns about the various ways of interpretation (corresponding to users).
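The point about combining information can be illustrated with a toy example. The sketch below contrasts two hypothetical ways of merging a user identity with movie features: an additive combination (which is what a single linear layer on concatenated inputs reduces to) and a multiplicative one. The feature values, weights, and function names are invented for illustration and are not the architectures studied in the dissertation.

```python
def additive_rating(user_bias, shared_weights, movie):
    """Shared movie score plus a per-user offset.

    A linear layer applied to [one-hot user ID, movie features] has this form.
    """
    return user_bias + sum(w * x for w, x in zip(shared_weights, movie))


def multiplicative_rating(user_vector, movie):
    """Dot product of a per-user preference vector with the movie features."""
    return sum(u * x for u, x in zip(user_vector, movie))


def ranking(scores):
    """Movie indices ordered from highest to lowest predicted rating."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Toy movie features: (amount of action, amount of romance).
MOVIES = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Additive model: both users share the weights and differ only by a bias.
SHARED_WEIGHTS = [2.0, 1.0]
additive_a = [additive_rating(0.5, SHARED_WEIGHTS, m) for m in MOVIES]
additive_b = [additive_rating(-0.5, SHARED_WEIGHTS, m) for m in MOVIES]

# Multiplicative model: each user has an own preference vector.
multiplicative_a = [multiplicative_rating([2.0, 1.0], m) for m in MOVIES]
multiplicative_b = [multiplicative_rating([1.0, 2.0], m) for m in MOVIES]
```

In the additive variant, the user ID can only shift all ratings up or down, so every user ends up with the same ordering of movies. The multiplicative variant can represent users who prefer different kinds of movies. This is one simple way in which the combination scheme constrains what a model can learn about individual ways of interpretation.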
Since some application-specific details for dealing with human interpretation only become visible when going deeper into particular use-cases, the other two research questions of this dissertation are concerned with two selected application domains: subjective visual interpretation and gang violence prevention. The first application study deals with subjectivity that comes from personal attitudes and aims at answering "How can we predict the subjective image interpretation one would expect from the general public on photo-sharing platforms such as Flickr?" The predictions in this case take the form of subjective concepts or phrases. Our study on gang violence prevention is more community-centered and considers the question "How can we automatically detect tweets of gang members which could potentially lead to violence?" There, the psychosocial codes aggression, loss, and substance use serve as proxies to estimate the subjective implications of online messages.
In these two distinct application domains, we develop novel machine learning models for predicting subjective interpretations of images or tweets with images, respectively. In the process of building these detection tools, we also create three different datasets which we share with the research community. Furthermore, we see that some domains such as Chicago gangs require special care due to the high vulnerability of the users involved. This motivated us to establish and describe an in-depth collaboration between social work researchers and computer scientists. As machine learning is incorporating more and more subjective components and gaining societal impact, we have good reason to believe that similar collaborations between the humanities and computer science will become increasingly necessary to advance the field in an ethical way.
Generic layout analysis, the process of decomposing a document image into homogeneous regions for a collection of diverse document images, has many important applications in document image analysis and understanding, such as preprocessing of degraded, warped, camera-captured document images, high-performance layout analysis of document images containing complex cursive scripts, and word spotting in historical document images at page level. Many goals in this field, such as a generic text line extraction method, have so far remained elusive and beyond the reach of state-of-the-art methods [NJ07, LSZT07, KB06]. This thesis addresses this problem by presenting generic, domain-independent methods for text line extraction and for text and non-text segmentation, and then describes some important applications that were developed based on these methods. An overview of the key contributions of this thesis is as follows.
The first part of this thesis presents a generic text line extraction method using a combination of matched filtering and ridge detection techniques, which are commonly used in computer vision. Unlike the state-of-the-art text line extraction methods in the literature, the generic text line extraction method can be applied equally and robustly to a large variety of document image classes, including scanned and camera-captured documents, binary and grayscale documents, typed-text and handwritten documents, historical and contemporary documents, and documents containing different scripts. Standard datasets from different categories of document images are selected for performance evaluation: the UW-III [GHHP97] dataset of scanned documents, the ICDAR 2007 [GAS07] and UMD [LZDJ08] datasets of handwritten documents, the DFKI-I [SB07] dataset of camera-captured documents, a dataset of Arabic/Urdu script documents, and a dataset of historical documents in German calligraphic (Fraktur) script. The generic text line extraction method achieves a text line detection accuracy of 86% (n = 23,763 text lines in 650 documents), which is better than the aggregate accuracy of 73% of the best performing domain-specific state-of-the-art methods. To the best of the author's knowledge, it is the first general-purpose text line extraction method that can be used equally for a diverse collection of documents.
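The general idea of combining smoothing with ridge detection can be sketched as follows. This is a strongly simplified illustration, assuming roughly horizontal text lines, a single elongated Gaussian filter instead of a bank of oriented filters, and ridges defined as vertical local maxima of the smoothed ink density. The input convention (1.0 for ink, 0.0 for background), the function names, and the thresholds are assumptions for this sketch, not the method of the thesis.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel with a radius of about three sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_rows(image, kernel):
    """Convolve every row with the kernel, treating pixels outside as zero."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [
            sum(kernel[k + radius] * row[x + k]
                for k in range(-radius, radius + 1) if 0 <= x + k < width)
            for x in range(width)
        ]
        for row in image
    ]


def transpose(image):
    return [list(column) for column in zip(*image)]


def smooth_anisotropic(ink, sigma_x, sigma_y):
    """Separable Gaussian smoothing, wider along x so words merge into lines."""
    horizontal = convolve_rows(ink, gaussian_kernel(sigma_x))
    return transpose(convolve_rows(transpose(horizontal), gaussian_kernel(sigma_y)))


def ridge_mask(smoothed, min_strength):
    """Mark pixels that are vertical local maxima above a minimum response."""
    height, width = len(smoothed), len(smoothed[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(width):
            value = smoothed[y][x]
            mask[y][x] = (value > min_strength
                          and value > smoothed[y - 1][x]
                          and value >= smoothed[y + 1][x])
    return mask


def text_line_rows(mask, min_fraction=0.5):
    """Rows in which at least min_fraction of the pixels lie on a ridge."""
    return [y for y, row in enumerate(mask) if sum(row) / len(row) >= min_fraction]
```

For an image with two bands of ink separated by white space, `text_line_rows(ridge_mask(smooth_anisotropic(ink, 3.0, 1.0), 0.1))` returns the central row of each band. Curled or skewed lines would require oriented filters and a ridge definition based on local curvature rather than a purely vertical comparison.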
This thesis also presents an active contour (snake) based curled text line extraction method for warped, camera-captured document images. The presented approach is applied to the DFKI-I [SB07] dataset of camera-captured, Latin script document images for curled text line extraction. It achieves a text line detection accuracy above 95% (n = 3,091 text lines in 102 documents), which is significantly better than the competing state-of-the-art curled text line extraction methods. With small modifications, the presented text line extraction method can also be applied to document images containing other scripts such as Chinese, Devanagari, and Arabic.
The second part of this thesis presents an improved version of the state-of-the-art multiresolution morphology (Leptonica) based text and non-text segmentation method [Blo91], which is a domain-independent page segmentation approach and can be equally applied to a diverse collection of binarized document images. It is demonstrated that the presented improvements result in an increase in segmentation accuracy from 93% to 99% (n = 113 documents).
This thesis also introduces a discriminative learning based approach for page segmentation, where a self-tunable multi-layer perceptron (MLP) classifier [BS10] is trained to distinguish between text and non-text connected components. Unlike other classification based page segmentation approaches in the literature, the presented connected component based approach is faster than pixel based classification methods and does not require a block segmentation method beforehand. It achieves a segmentation accuracy of 96% (n = 113 documents), compared to 93% for the state-of-the-art multiresolution morphology (Leptonica) based page segmentation method [Blo91]. In addition to text and non-text segmentation of Latin script documents, the presented approach can also be adapted for document images containing other scripts as well as for other specialized layout analysis tasks such as digit and non-digit segmentation [HBSB12], orientation detection [RBSB09], and body-text and side-note segmentation [BAESB12].
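As a rough illustration of what classifying connected components involves, the sketch below extracts 8-connected components from a small binary image and computes a few simple shape descriptors per component. The descriptors (`width`, `height`, `aspect`, `fill`) are hypothetical examples of inputs a classifier might receive. The features and the MLP configuration actually used in the thesis are not given in the abstract.

```python
from collections import deque


def connected_components(binary):
    """Return the 8-connected foreground (value 1) components as pixel lists."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def shape_features(pixels):
    """Hypothetical per-component descriptors based on the bounding box."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {
        "width": width,
        "height": height,
        "aspect": width / height,
        "fill": len(pixels) / (width * height),  # share of the box covered by ink
    }
```

Working on components rather than pixels keeps the number of classifier evaluations small, since a page usually contains far fewer components than pixels. This is consistent with the speed advantage over pixel based classification described above.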
Finally, this thesis presents important applications of the two generic layout analysis techniques discussed above: the ridge-based text line extraction method and the multiresolution morphology based text and non-text segmentation method. First, a complete preprocessing pipeline is described for removing different types of degradations from grayscale, warped, camera-captured document images. It includes removal of grayscale degradations such as non-uniform shadows and blurring through binarization, noise cleanup using page frame detection, and document rectification using monocular dewarping. Each of these preprocessing steps shows significant improvement in comparison to the analyzed state-of-the-art methods in the literature. Second, a high-performance layout analysis method is described for complex Arabic script document images written in different languages, such as Arabic, Urdu, and Persian, and in different styles, for example Naskh and Nastaliq. The presented layout analysis system is robust against different types of document image degradations and shows better performance for text and non-text segmentation, text line extraction, and reading order determination on a variety of Arabic and Urdu document images as compared to the state-of-the-art methods. It can be used for large-scale digitization of Arabic and Urdu documents. These applications demonstrate that the layout analysis methods, ridge-based text line extraction and multiresolution morphology based text and non-text segmentation, are generic and can be applied easily to a large collection of diverse document images.
Ultrasound is one of the most frequently used imaging modalities in cardiology. This is due to the low cost of image generation, its non-invasiveness, and its harmlessness for patients. A drawback of existing devices is that they can only generate two-dimensional images. In addition, for anatomical reasons, these images cannot be acquired from an arbitrary position. This complicates the analysis of the data and consequently the diagnosis. This thesis addresses new algorithmic aspects of four-dimensional cardiac ultrasound, from the acquisition of the raw data through their synchronization and reconstruction to visualization. An additional chapter develops a new technique for further enhancing the visualization and for visually editing the ultrasound data. The methods developed here make it possible to remove, or at least mitigate, certain limitations of cardiac ultrasound. These include, above all, the restriction to two-dimensional cross-sectional images and the limited choice of view.