Today, many publishers (e.g., websites, mobile application developers) commonly use third-party analytics services and social widgets. Unfortunately, this scheme allows these third parties to track individual users across the web, creating privacy concerns and leading to reactions to prevent tracking via blocking, legislation and standards. While improving user privacy, these efforts do not consider the functionality third-party tracking enables publishers to use: to obtain aggregate statistics about their users and increase their exposure to other users via online social networks. Simply preventing third-party tracking without replacing the functionality it provides cannot be a viable solution; leaving publishers without essential services will hurt the sustainability of the entire ecosystem.
Computer Vision (CV) problems, such as image classification and segmentation, have traditionally been solved by manual construction of feature hierarchies or incorporation of other prior knowledge. However, noisy images, varying viewpoints and lighting conditions of images, and clutters in real-world images make the problem challenging. Such tasks cannot be efficiently solved without learning from data. Therefore, many Deep Learning (DL) approaches have recently been successful for various CV tasks, for instance, image classification, object recognition and detection, action recognition, video classification, and scene labeling. The main focus of this thesis is to investigate a purely learning-based approach, particularly, Multi-Dimensional LSTM (MD-LSTM) recurrent neural networks to tackle the challenging CV tasks, classification and segmentation on 2D and 3D image data. Due to the structural nature of MD-LSTM, the network learns directly from raw pixel values and takes the complex spatial dependencies of each pixel into account. This thesis provides several key contributions in the field of CV and DL.
Several MD-LSTM network architectural options are suggested based on the type of input and output, as well as the requiring tasks. Including the main layers, which are an input layer, a hidden layer, and an output layer, several additional layers can be added such as a collapse layer and a fully connected layer. First, a single Two Dimensional LSTM (2D-LSTM) is directly applied on texture images for segmentation and show improvement over other texture segmentation methods. Besides, a 2D-LSTM layer with a collapse layer is applied for image classification on texture and scene images and have provided an accurate classification results. In addition, a deeper model with a fully connected layer is introduced to deal with more complex images for scene labeling and outperforms the other state-of-the-art methods including the deep Convolutional Neural Networks (CNN). Here, several input and output representation techniques are introduced to achieve the robust classification. Randomly sampled windows as input are transformed in scaling and rotation, which are integrated to get the final classification. To achieve multi-class image classification on scene images, several pruning techniques are introduced. This framework provides a good results in automatic web-image tagging. The next contribution is an investigation of 3D data with MD-LSTM. The traditional cuboid order of computations in Multi-Dimensional LSTM (MD-LSTM) is re-arranged in pyramidal fashion. The resulting Pyramidal Multi-Dimensional LSTM (PyraMiD-LSTM) is easy to parallelize, especially for 3D data such as stacks of brain slice images. PyraMiD-LSTM was tested on 3D biomedical volumetric images and achieved best known pixel-wise brain image segmentation results and competitive results on Electron Microscopy (EM) data for membrane segmentation.
To validate the framework, several challenging databases for classification and segmentation are proposed to overcome the limitations of current databases. First, scene images are randomly collected from the web and used for scene understanding, i.e., the web-scene image dataset for multi-class image classification. To achieve multi-class image classification, the training and testing images are generated in a different setting. For training, images belong to a single pre-defined category which are trained as a regular single-class image classification. However, for testing, images containing multi-classes are randomly collected by web-image search engine by querying the categories. All scene images include noise, background clutter, unrelated contents, and also diverse in quality and resolution. This setting can make the database possible to evaluate for real-world applications. Secondly, an automated blob-mosaics texture dataset generator is introduced for segmentation. Random 2D Gaussian blobs are generated and filled with random material textures. These textures contain diverse changes in illumination, scale, rotation, and viewpoint. The generated images are very challenging since they are even visually hard to separate the related regions.
Overall, the contributions in this thesis are major advancements in the direction of solving image analysis problems with Long Short-Term Memory (LSTM) without the need of any extra processing or manually designed steps. We aim at improving the presented framework to achieve the ultimate goal of accurate fine-grained image analysis and human-like understanding of images by machines.
Today's ubiquity of visual content as driven by the availability of broadband Internet, low-priced storage, and the omnipresence of camera equipped mobile devices conveys much of our thinking and feeling as individuals and as a society. As a result the growth of video repositories is increasing at enourmous rates with content now being embedded and shared through social media. To make use of this new form of social multimedia, concept detection, the automatic mapping of semantic concepts and video content has to be extended such that concept vocabularies are synchronized with current real-world events, systems can perform scalable concept learning with thousands of concepts, and high-level information such as sentiment can be extracted from visual content. To catch up with these demands the following three contributions are made in this thesis: (i) concept detection is linked to trending topics, (ii) visual learning from web videos is presented including the proper treatment of tags as concept labels, and (iii) the extension of concept detection with adjective noun pairs for sentiment analysis is proposed.
In order for concept detection to satisfy users' current information needs, the notion of fixed concept vocabularies has to be reconsidered. This thesis presents a novel concept learning approach built upon dynamic vocabularies, which are automatically augmented with trending topics mined from social media. Once discovered, trending topics are evaluated by forecasting their future progression to predict high impact topics, which are then either mapped to an available static concept vocabulary or trained as individual concept detectors on demand. It is demonstrated in experiments on YouTube video clips that by a visual learning of trending topics, improvements of over 100% in concept detection accuracy can be achieved over static vocabularies (n=78,000).
To remove manual efforts related to training data retrieval from YouTube and noise caused by tags being coarse, subjective and context-depedent, this thesis suggests an automatic concept-to-query mapping for the retrieval of relevant training video material, and active relevance filtering to generate reliable annotations from web video tags. Here, the relevance of web tags is modeled as a latent variable, which is combined with an active learning label refinement. In experiments on YouTube, active relevance filtering is found to outperform both automatic filtering and active learning approaches, leading to a reduction of required label inspections by 75% as compared to an expert annotated training dataset (n=100,000).
Finally, it is demonstrated, that concept detection can serve as a key component to infer the sentiment reflected in visual content. To extend concept detection for sentiment analysis, adjective noun pairs (ANP) as novel entities for concept learning are proposed in this thesis. First a large-scale visual sentiment ontology consisting of 3,000 ANPs is automatically constructed by mining the web. From this ontology a mid-level representation of visual content – SentiBank – is trained to encode the visual presence of 1,200 ANPs. This novel approach of visual learning is validated in three independent experiments on sentiment prediction (n=2,000), emotion detection (n=807) and pornographic filtering (n=40,000). SentiBank is shown to outperform known low-level feature representations (sentiment prediction, pornography detection) or perform comparable to state-of-the art methods (emotion detection).
Altogether, these contributions extend state-of-the-art concept detection approaches such that concept learning can be done autonomously from web videos on a large-scale, and can cope with novel semantic structures such as trending topics or adjective noun pairs, adding a new dimension to the understanding of video content.