There are a number of designs for an online advertising system that allow for behavioral targeting without revealing user online behavior or user interest profiles to the ad network. Although these designs purport to be practical solutions, none of them adequately consider the role of ad auctions, which today are central to the operation of online advertising systems. Moreover, none of the proposed designs have been deployed in real-life settings. In this thesis, we present an effort to fill this gap. First, we address the challenge of running ad auctions that leverage user profiles while keeping the profile information private. We define the problem, broadly explore the solution space, and discuss the pros and cons of these solutions. We analyze the performance of our solutions using data from Microsoft Bing advertising auctions. We conclude that, while none of our auctions are ideal in all respects, they are adequate and practical solutions. Second, we build and evaluate a fully functional prototype of a practical privacy-preserving ad system at a reasonably large scale. With more than 13K opted-in users, our system was in operation for over two months serving an average of 4800 active users daily. During the last month alone, we registered 790K ad views, 417 clicks, and even a small number of product purchases. Our system obtained click-through rates comparable with those for Google display ads. In addition, our prototype is equipped with a differentially private analytics mechanism, which we used as the primary means for gathering experimental data. In this thesis, we describe our first-hand experience and lessons learned in running the world's first fully operational “private-by-design” behavioral advertising and analytics system.
Open distributed systems are a class of distributed systems where (i) only partial information about the environment, in which they are running, is present, (ii) new resources may become available at runtime, and (iii) a subsystem may become aware of other subsystems after some interaction. Modeling and implementing such systems correctly is a complex task due to the openness and the dynamicity aspects. One way to ensure that the resulting systems behave correctly is to utilize formal verification.
Formal verification requires an adequate semantic model of the implementation, a specification of the desired behavior, and a reasoning technique. The actor model is a semantic model that captures the challenging aspects of open distributed systems by utilizing actors as universal primitives to represent system entities and allowing them to create new actors and to communicate by sending directed messages as reply to received messages. To enable compositional reasoning, where the reasoning task is reduced to independent verification of the system parts, semantic entities at a higher level of abstraction than actors are needed.
This thesis proposes an automaton model and combines sound reasoning techniques to compositionally verify implementations of open actor systems. Based on I/O automata, the model allows automata to be created dynamically and captures dynamic changes in communication patterns. Each automaton represents either an actor or a group of actors. The specification of the desired behavior is given constructively as an automaton. As the basis for compositionality, we formalize a component notion based on the static structure of the implementation instead of the dynamic entities (the actors) occurring in the system execution. The reasoning proceeds in two stages. The first stage establishes the connection between the automata representing single actors and their implementation description by means of weakest liberal preconditions. The second stage employs this result as the basis for verifying whether a component specification is satisfied. The verification is done by building a simulation relation from the automaton representing the implementation to the component's automaton. Finally, we validate the compositional verification approach through a number of examples by proving correctness of their actor implementations with respect to system specifications.
The goal of this work is to develop statistical natural language models and processing techniques
based on Recurrent Neural Networks (RNN), especially the recently introduced Long Short-
Term Memory (LSTM). Due to their adapting and predicting abilities, these methods are more
robust, and easier to train than traditional methods, i.e., words list and rule-based models. They
improve the output of recognition systems and make them more accessible to users for browsing
and reading. These techniques are required, especially for historical books which might take
years of effort and huge costs to manually transcribe them.
The contributions of this thesis are several new methods which have high-performance computing and accuracy. First, an error model for improving recognition results is designed. As
a second contribution, a hyphenation model for difficult transcription for alignment purposes
is suggested. Third, a dehyphenation model is used to classify the hyphens in noisy transcription. The fourth contribution is using LSTM networks for normalizing historical orthography.
A size normalization alignment is implemented to equal the size of strings, before the training
phase. Using the LSTM networks as a language model to improve the recognition results is
the fifth contribution. Finally, the sixth contribution is a combination of Weighted Finite-State
Transducers (WFSTs), and LSTM applied on multiple recognition systems. These contributions
will be elaborated in more detail.
Context-dependent confusion rules is a new technique to build an error model for Optical
Character Recognition (OCR) corrections. The rules are extracted from the OCR confusions
which appear in the recognition outputs and are translated into edit operations, e.g., insertions,
deletions, and substitutions using the Levenshtein edit distance algorithm. The edit operations
are extracted in a form of rules with respect to the context of the incorrect string to build an
error model using WFSTs. The context-dependent rules assist the language model to find the
best candidate corrections. They avoid the calculations that occur in searching the language
model and they also make the language model able to correct incorrect words by using context-
dependent confusion rules. The context-dependent error model is applied on the university of
Washington (UWIII) dataset and the Nastaleeq script in Urdu dataset. It improves the OCR
results from an error rate of 1.14% to an error rate of 0.68%. It performs better than the
state-of-the-art single rule-based which returns an error rate of 1.0%.
This thesis describes a new, simple, fast, and accurate system for generating correspondences
between real scanned historical books and their transcriptions. The alignment has many challenges, first, the transcriptions might have different modifications, and layout variations than the
original book. Second, the recognition of the historical books have misrecognition, and segmentation errors, which make the alignment more difficult especially the line breaks, and pages will
not have the same correspondences. Adapted WFSTs are designed to represent the transcription. The WFSTs process Fraktur ligatures and adapt the transcription with a hyphenations
model that allows the alignment with respect to the varieties of the hyphenated words in the line
breaks of the OCR documents. In this work, several approaches are implemented to be used for
the alignment such as: text-segments, page-wise, and book-wise approaches. The approaches
are evaluated on German calligraphic (Fraktur) script historical documents dataset from “Wan-
derungen durch die Mark Brandenburg” volumes (1862-1889). The text-segmentation approach
returns an error rate of 2.33% without using a hyphenation model and an error rate of 2.0%
using a hyphenation model. Dehyphenation methods are presented to remove the hyphen from
the transcription. They provide the transcription in a readable and reflowable format to be used
for alignment purposes. We consider the task as classification problem and classify the hyphens
from the given patterns as hyphens for line breaks, combined words, or noise. The methods are
applied on clean and noisy transcription for different languages. The Decision Trees classifier
returns better performance on UWIII dataset and returns an accuracy of 98%. It returns 97%
on Fraktur script.
A new method for normalizing historical OCRed text using LSTM is implemented for different texts, ranging from Early New High German 14th - 16th centuries to modern forms in New
High German applied on the Luther bible. It performed better than the rule-based word-list
approaches. It provides a transcription for various purposes such as part-of-speech tagging and
n-grams. Also two new techniques are presented for aligning the OCR results and normalize the
size by using adding Character-Epsilons or Appending-Epsilons. They allow deletion and insertion in the appropriate position in the string. In normalizing historical wordforms to modern
wordforms, the accuracy of LSTM on seen data is around 94%, while the state-of-the-art combined rule-based method returns 93%. On unseen data, LSTM returns 88% and the combined
rule-based method returns 76%. In normalizing modern wordforms to historical wordforms, the
LSTM delivers the best performance and returns 93.4% on seen data and 89.17% on unknown
In this thesis, a deep investigation has been done on constructing high-performance language
modeling for improving the recognition systems. A new method to construct a language model
using LSTM is designed to correct OCR results. The method is applied on UWIII and Urdu
script. The LSTM approach outperforms the state-of-the-art, especially for unseen tokens
during training. On the UWIII dataset, the LSTM returns reduction in OCR error rates from
1.14% to 0.48%. On the Nastaleeq script in Urdu dataset, the LSTM reduces the error rate
from 6.9% to 1.58%.
Finally, the integration of multiple recognition outputs can give higher performance than a
single recognition system. Therefore, a new method for combining the results of OCR systems is
explored using WFSTs and LSTM. It uses multiple OCR outputs and votes for the best output
to improve the OCR results. It performs better than the ISRI tool, Pairwise of Multiple Sequence and it helps to improve the OCR results. The purpose is to provide correct transcription
so that it can be used for digitizing books, linguistics purposes, N-grams, and part-of-speech
tagging. The method consists of two alignment steps. First, two recognition systems are aligned
using WFSTs. The transducers are designed to be more flexible and compatible with the different symbols in line and page breaks to avoid the segmentation and misrecognition errors.
The LSTM model then is used to vote the best candidate correction of the two systems and
improve the incorrect tokens which are produced during the first alignment. The approaches
are evaluated on OCRs output from the English UWIII and historical German Fraktur dataset
which are obtained from state-of-the-art OCR systems. The Experiments show that the error
rate of ISRI-Voting is 1.45%, the error rate of the Pairwise of Multiple Sequence is 1.32%, the
error rate of the Line-to-Page alignment is 1.26% and the error rate of the LSTM approach has
the best performance with 0.40%.
The purpose of this thesis is to contribute methods providing correct transcriptions corresponding to the original book. This is considered to be the first step towards an accurate and
more effective use of the documents in digital libraries.
Today's ubiquity of visual content as driven by the availability of broadband Internet, low-priced storage, and the omnipresence of camera equipped mobile devices conveys much of our thinking and feeling as individuals and as a society. As a result the growth of video repositories is increasing at enourmous rates with content now being embedded and shared through social media. To make use of this new form of social multimedia, concept detection, the automatic mapping of semantic concepts and video content has to be extended such that concept vocabularies are synchronized with current real-world events, systems can perform scalable concept learning with thousands of concepts, and high-level information such as sentiment can be extracted from visual content. To catch up with these demands the following three contributions are made in this thesis: (i) concept detection is linked to trending topics, (ii) visual learning from web videos is presented including the proper treatment of tags as concept labels, and (iii) the extension of concept detection with adjective noun pairs for sentiment analysis is proposed.
In order for concept detection to satisfy users' current information needs, the notion of fixed concept vocabularies has to be reconsidered. This thesis presents a novel concept learning approach built upon dynamic vocabularies, which are automatically augmented with trending topics mined from social media. Once discovered, trending topics are evaluated by forecasting their future progression to predict high impact topics, which are then either mapped to an available static concept vocabulary or trained as individual concept detectors on demand. It is demonstrated in experiments on YouTube video clips that by a visual learning of trending topics, improvements of over 100% in concept detection accuracy can be achieved over static vocabularies (n=78,000).
To remove manual efforts related to training data retrieval from YouTube and noise caused by tags being coarse, subjective and context-depedent, this thesis suggests an automatic concept-to-query mapping for the retrieval of relevant training video material, and active relevance filtering to generate reliable annotations from web video tags. Here, the relevance of web tags is modeled as a latent variable, which is combined with an active learning label refinement. In experiments on YouTube, active relevance filtering is found to outperform both automatic filtering and active learning approaches, leading to a reduction of required label inspections by 75% as compared to an expert annotated training dataset (n=100,000).
Finally, it is demonstrated, that concept detection can serve as a key component to infer the sentiment reflected in visual content. To extend concept detection for sentiment analysis, adjective noun pairs (ANP) as novel entities for concept learning are proposed in this thesis. First a large-scale visual sentiment ontology consisting of 3,000 ANPs is automatically constructed by mining the web. From this ontology a mid-level representation of visual content – SentiBank – is trained to encode the visual presence of 1,200 ANPs. This novel approach of visual learning is validated in three independent experiments on sentiment prediction (n=2,000), emotion detection (n=807) and pornographic filtering (n=40,000). SentiBank is shown to outperform known low-level feature representations (sentiment prediction, pornography detection) or perform comparable to state-of-the art methods (emotion detection).
Altogether, these contributions extend state-of-the-art concept detection approaches such that concept learning can be done autonomously from web videos on a large-scale, and can cope with novel semantic structures such as trending topics or adjective noun pairs, adding a new dimension to the understanding of video content.
In this thesis, an approach is presented that turns the currently unstructured process of automotive hazard analysis and risk assessments (HRA), which relies on creativity techniques, into a structured, model-based approach that makes the HRA results less dependent on experts' experience, more consistent, and gives them higher quality. The challenge can be subdivided into two steps. The first step is to improve the HRA as it is performed in current practice. The second step is to go beyond the current practice and consider not only single service failures as relevant hazards, but also multiple service failures. For the first step, the most important aspect is to formalize the operational situation of the system and to determine its likelihood. Current approaches use natural-language textual descriptions, which makes it hard to ensure consistency and increase efficiency through reuse. Furthermore, due to ambiguity in natural language, it is difficult to ensure consistent likelihood estimates for situations.
The main aspect of the second step is that considering multiple service failures as hazards implies that one needs to analyze an exponential number of hazards. Due to the fact that hazard assessments are currently done purely manually, considering multiple service failures is not possible. The only way to approach this challenge is to formalize the HRA and make extensive use of automation support.
In SAHARA we handle these challenges by first introducing a model-based representation of an HRA with GOBI. Based on this, we formalized the representation of operational situations and their likelihood assessment in OASIS and HEAT, respectively. We show that more consistent situation assessments are possible and that situations (including their likelihood) can be efficiently reused. The second aspect, coping with multiple service failures, is addressed in ARID. We show that using our tool-supported HRA approach, 100% coverage of all possible hazards (including multiple service failures) can be achieved by relying on very limited manual effort. We furthermore show that not considering multiple service failures results in insufficient safety goals.
We intend to find optimal deterministic and randomized algorithms for three related problems: multivariate integration, parametric multivariate integration, and parametric initial value problems. The main interest is concentrated on the question, in how far randomization affects the precision of an approximation. We want to understand when and to which extent randomized algorithms are superior to deterministic ones.
All problems are studied for Banach space valued input functions. The analysis of Banach space valued problems is motivated by the investigation of scalar parametric problems; these can be understood as particular cases of Banach space valued problems. The gain achieved by randomization depends on the underlying Banach space.
For each problem, we introduce deterministic and randomized algorithms and provide the corresponding convergence analysis.
Moreover, we also provide lower bounds for the general Banach space valued settings, and thus, determine the complexity of the problems. It turns out that the obtained algorithms are order optimal in the deterministic setting. In the randomized setting, they are order optimal for certain classes of Banach spaces, which includes the L_p spaces and any finite dimensional Banach space. For general Banach spaces, they are optimal up to an arbitrarily small gap in the order of convergence.
The last couple of years have marked the entire field of information technology with the introduction of a new global resource, called data. Certainly, one can argue that large amounts of information and highly interconnected and complex datasets were available since the dawn of the computer and even centuries before. However, it has been only a few years since digital data has exponentially expended, diversified and interconnected into an overwhelming range of domains, generating an entire universe of zeros and ones. This universe represents a source of information with the potential of advancing a multitude of fields and sparking valuable insights. In order to obtain this information, this data needs to be explored, analyzed and interpreted.
While a large set of problems can be addressed through automatic techniques from fields like artificial intelligence, machine learning or computer vision, there are various datasets and domains that still rely on the human intuition and experience in order to parse and discover hidden information. In such instances, the data is usually structured and represented in the form of an interactive visual representation that allows users to efficiently explore the data space and reach valuable insights. However, the experience, knowledge and intuition of a single person also has its limits. To address this, collaborative visualizations allow multiple users to communicate, interact and explore a visual representation by building on the different views and knowledge blocks contributed by each person.
In this dissertation, we explore the potential of subjective measurements and user emotional awareness in collaborative scenarios as well as support flexible and user- centered collaboration in information visualization systems running on tabletop displays. We commence by introducing the concept of user-centered collaborative visualization (UCCV) and highlighting the context in which it applies. We continue with a thorough overview of the state-of-the-art in the areas of collaborative information visualization, subjectivity measurement and emotion visualization, combinable tabletop tangibles, as well as browsing history visualizations. Based on a new web browser history visualization for exploring user parallel browsing behavior, we introduce two novel user-centered techniques for supporting collaboration in co-located visualization systems. To begin with, we inspect the particularities of detecting user subjectivity through brain-computer interfaces, and present two emotion visualization techniques for touch and desktop interfaces. These visualizations offer real-time or post-task feedback about the users’ affective states, both in single-user and collaborative settings, thus increasing the emotional self-awareness and the awareness of other users’ emotions. For supporting collaborative interaction, a novel design for tabletop tangibles is described together with a set of specifically developed interactions for supporting tabletop collaboration. These ring-shaped tangibles minimize occlusion, support touch interaction, can act as interaction lenses, and describe logical operations through nesting operations. The visualization and the two UCCV techniques are each evaluated individually capturing a set of advantages and limitations of each approach. Additionally, the collaborative visualization supported by the two UCCV techniques is also collectively evaluated in three user studies that offer insight into the specifics of interpersonal interaction and task transition in collaborative visualization. The results show that the proposed collaboration support techniques do not only improve the efficiency of the visualization, but also help maintain the collaboration process and aid a balanced social interaction.
Self-adaptation allows software systems to autonomously adjust their behavior during run-time by handling all possible
operating states that violate the requirements of the managed system. This requires an adaptation engine that receives adaptation
requests during the monitoring process of the managed system and responds with an automated and appropriate adaptation
response. During the last decade, several engineering methods have been introduced to enable self-adaptation in software systems.
However, these methods lack addressing (1) run-time uncertainty that hinders the adaptation process and (2) the performance
impacts resulted from the complexity and the large number of the adaptation space. This paper presents CRATER, a framework
that builds an external adaptation engine for self-adaptive software systems. The adaptation engine, which is built on Case-based
Reasoning, handles the aforementioned challenges together. This paper is braced with an experiment illustrating the benefits of
this framework. The experimental results shows the potential of CRATER in terms handling run-time uncertainty and adaptation
remembrance that enhances the performance for large number of adaptation space.
Sequential Consistency (SC) is the memory model traditionally applied by programmers and verification tools for the analysis of multithreaded programs.
SC guarantees that instructions of each thread are executed atomically and in program order.
Modern CPUs implement memory models that relax the SC guarantees: threads can execute instructions out of order, stores to the memory can be observed by different threads in different order.
As a result of these relaxations, multithreaded programs can show unexpected, potentially undesired behaviors, when run on real hardware.
The robustness problem asks if a program has the same behaviors under SC and under a relaxed memory model.
Behaviors are formalized in terms of happens-before relations — dataflow and control-flow relations between executed instructions.
Programs that are robust against a memory model produce the same results under this memory model and under SC.
This means, they only need to be verified under SC, and the verification results will carry over to the relaxed setting.
Interestingly, robustness is a suitable correctness criterion not only for multithreaded programs, but also for parallel programs running on computer clusters.
Parallel programs written in Partitioned Global Address Space (PGAS) programming model, when executed on cluster, consist of multiple processes, each running on its cluster node.
These processes can directly access memories of each other over the network, without the need of explicit synchronization.
Reorderings and delays introduced on the network level, just as the reorderings done by the CPUs, may result into unexpected behaviors that are hard to reproduce and fix.
Our first contribution is a generic approach for solving robustness against relaxed memory models.
The approach involves two steps: combinatorial analysis, followed by an algorithmic development.
The aim of combinatorial analysis is to show that among program computations violating robustness there is always a computation in a certain normal form, where reorderings are applied in a restricted way.
In the algorithmic development we work out a decision procedure for checking whether a program has violating normal-form computations.
Our second contribution is an application of the generic approach to widely implemented memory models, including Total Store Order used in Intel x86 and Sun SPARC architectures, the memory model of Power architecture, and the PGAS memory model.
We reduce robustness against TSO to SC state reachability for a modified input program.
Robustness against Power and PGAS is reduced to language emptiness for a novel class of automata — multiheaded automata.
The reductions lead to new decidability results.
In particular, robustness is PSPACE-complete for all the considered memory models.
Embedded systems, ranging from very simple systems up to complex controllers, may
nowadays have quite challenging real-time requirements. Many embedded systems are reactive
systems that have to respond to environmental events and have to guarantee certain real-time
constrain. Their execution is usually divided into reaction steps, where in each step, the
system reads inputs from the environment and reacts to these by computing corresponding
The synchronous Model of Computation (MoC) has proven to be well-suited for the
development of reactive real-time embedded systems whose paradigm directly reflects the
reactive nature of the systems it describes. Another advantage is the availability of formal
verification by model checking as a result of the deterministic execution based on a formal
semantics. Nevertheless, the increasing complexity of embedded systems requires to compensate
the natural disadvantages of model checking that suffers from the well-known state-space
explosion problem. It is therefore natural to try to integrate other verification methods with
the already established techniques. Hence, improvements to encounter these problems are
required, e.g., appropriate decomposition techniques, which encounter the disadvantages
of the model checking approach naturally. But defining decomposition techniques for synchronous
language is a difficult task, as a result of the inherent parallelism emerging from
the synchronous broadcast communication.
Inspired by the progress in the field of desynchronization of synchronous systems by
representing them in other MoCs, this work will investigate the possibility of adapting and use
methods and tools designed for other MoC for the verification of systems represented in the
synchronous MoC. Therefore, this work introduces the interactive verification of synchronous
systems based on the basic foundation of formal verification for sequential programs – the
Hoare calculus. Due to the different models of computation several problems have to be
solved. In particular due to the large amount of concurrency, several parts of the program
are active at the same point of time. In contrast to sequential programs, a decomposition
in the Hoare-logic style that is in some sense a symbolic execution from one control flow
location to another one requires the consideration of several flows here. Therefore, different
approaches for the interactive verification of synchronous systems are presented.
Additionally, the representation of synchronous systems by other MoCs and the influence
of the representation on the verification task by differently embedding synchronous system
in a single verification tool are elaborated.
The feasibility is shown by integration of the presented approach with the established
model checking methods by implementing the AIFProver on top of the Averest system.