A prime motivation for using XML to directly represent pieces of information is the ability of supporting ad-hoc or 'schema-later' settings. In such scenarios, modeling data under loose data constraints is essential. Of course, the flexibility of XML comes at a price: the absence of a rigid, regular, and homogeneous structure makes many aspects of data management more challenging. Such malleable data formats can also lead to severe information quality problems, because the risk of storing inconsistent and incorrect data is greatly increased. A prominent example of such problems is the appearance of the so-called fuzzy duplicates, i.e., multiple and non-identical representations of a real-world entity. Similarity joins correlating XML document fragments that are similar can be used as core operators to support the identification of fuzzy duplicates. However, similarity assessment is especially difficult on XML datasets because structure, besides textual information, may exhibit variations in document fragments representing the same real-world entity. Moreover, similarity computation is substantially more expensive for tree-structured objects and, thus, is a serious performance concern. This thesis describes the design and implementation of an effective, flexible, and high-performance XML-based similarity join framework. As main contributions, we present novel structure-conscious similarity functions for XML trees - either considering XML structure in isolation or combined with textual information -, mechanisms to support the selection of relevant information from XML trees and organization of this information into a suitable format for similarity calculation, and efficient algorithms for large-scale identification of similar, set-represented objects. Finally, we validate the applicability of our techniques by integrating our framework into a native XML database management system; in this context we address several issues around the integration of similarity operations into traditional database architectures.
We tackle the problem of obtaining statistics on content and structure of XML documents by using summaries which may provide cardinality estimations for XML query expressions. Our focus is a data-centric processing scenario in which we use a query engine to process such query expressions. We provide three new summary structures called LESS (Leaf-Element-in-Subtree), LWES (Level-Wide Element Summarization), and EXsum (Element-centered XML Summarization) which are targeted to base an estimation process in an XML query optimizer. Each of these collects structural statistical information of XML documents, and the latter (EXsum) gathers, in addition, statistics on document content. Estimation procedures and/or heuristics for specic types of query expressions of each proposed approach are developed. We have incorporated and implemented our proposals in XTC, a native XML database management system (XDBMS). With this common implementation base, we present an empirical and comparative study in which our proposals are stressed against others published in the literature, which are also incorporated into the XTC. Furthermore, an analysis is made based on criteria pertinent to a query optimizer process.
Most software systems are described in high-level model or programming languages. Their runtime behavior, however, is determined by the compiled code. For uncritical software, it may be sufficient to test the runtime behavior of the code. For safety-critical software, there is an additional aggravating factor resulting from the fact that the code must satisfy the formal specification which reflects the safety policy of the software consumer and that the software producer is obliged to demonstrate that the code is correct with respect to the specification using formal verification techniques. In this scenario, it is of great importance that static analyses and formal methods can be applied on the source code level, because this level is more abstract and better suited for such techniques. However, the results of the analyses and the verification can only be carried over to the machine code level, if we can establish the correctness of the translation. Thus, compilation is a crucial step in the development of software systems and formally verified translation correctness is essential to close the formalization chain from high-level formal methods to the machine-code level. In this thesis, I propose an approach to certifying compilers which achieves the aim of closing the formalization chain from high-level formal methods to the machine-code level by applying techniques from mathematical logic and programming language semantics. I propose an approach called foundational translation validation (FTV) in which the software producer implements an FTV system comprising a compiler and a specification and verification framework (SVF) which is implemented in higher-order logic (HOL). The most important part of the SVF is an explicit translation contract which comprises the formalizations of the source and the target languages of the compiler and the formalization of a binary translation correctness predicate corrTrans(S,T) for source programs S and target programs T. The formalizations of the languages are realized as deep embeddings in HOL. This enables one to declare the whole program in a formalized language as a HOL constant. The predicate formally specifies when T is considered to be a correct translation of S. Its definition is explicitly based on the program semantics definitions provided by the translation contract. Subsequent to the translation, the compiler translates the source and the target programs into their syntactic representations as HOL constants, S and T, and generates a proof of corrTrans(S,T). We call a compiler which follows the FTV approach a proof generating compiler. Our approach borrows the idea of representing programs in correctness proofs as logic constants from the foundational proof-carrying code (FPCC) approach. Novel features that distinquish our approach from further approaches to certifying compilers, such as proof-carrying code (PCC) and translation validation (TV) are the following: Firstly, the presence of an explicit translation contract formalized in HOL: The approaches PCC and TV do not formalize a translation contract explicitly. Instead of this, they incorporate operational semantics and translation correctness criterion in translation validation tools on the programming language level. Secondly, representation of programs in correctness proofs as logic constants: The approaches PCC and the TV translate programs into their representations as semantic abstractions that serve as inputs for translation validation tools. Thirdly, certification of program transformation chains: Unlike the TV approach, which certifies single program transformations, the FTV approach achieves the aim of certifying whole chains of program transformations. This is possible due to the fact that the translation contract provides, for all programming languages involved in the program transformation chain, definitions of program semantics functions which map programs to mathematical objects that are elements of a set with an (at least) partial order "<=". Then, the proof makes use of the fact that the relation "<=" is transitive. In this thesis, the feasibility of the FTV approach is exemplified by the implementation of an FTV system. The system comprises a compiler front-end that certifies its optimization phase and an accompanying SVF that is implemented in the theorem prover Isabelle/HOL. The compiler front-end translates programs in a small C-like programming language, performs three optimizations: constant folding, dead assignment elimination, and loop invariant hoisting, and generates translation certificates in the form of Isabelle/HOL theories. The main focus of the thesis is on the description of the SVF and its translation verification techniques.
This PhD thesis aims at finding a global robot navigation strategy for rugged off-road terrain which is robust against inaccurate self-localization, scalable to large environments, but also cost-efficient, e.g. able to generate navigation paths which optimize a cost measure closely related to terrain traversability. In order to meet this goal, aspects of both metrical and topological navigation techniques are combined. A primarily topological map is extended with the previously lacking capability of cost-efficient path planning and map extension. Further innovations include a multi-dimensional cost measure for topological edges, a method to learn these costs based on live feedback from the robot and a set of extrapolation methods to predict the traversability costs for untraversed edges. The thesis presents two sophisticated new image analysis techniques to optimize cost prediction based on the shape and appearance of surrounding terrain. Experimental results indicate that the proposed global navigation system is indeed able to perform cost-efficient, large scale path planning. At the same time, the need to maintain a fine-grained, global world model which would reduce the scalability of the approach is avoided.
The primary emphasis of this thesis concerns the extraction and representation of intrinsic properties of three-dimensional (3D) unorganized point clouds. The points establishing a point cloud as it mainly emerges from LiDaR (Light Detection and Ranging) scan devices or by reconstruction from two-dimensional (2D) image series represent discrete samples of real world objects. Depending on the type of scenery the data is generated from the resulting point cloud may exhibit a variety of different structures. Especially, in the case of environmental LiDaR scans the complexity of the corresponding point clouds is relatively high. Hence, finding new techniques allowing the efficient extraction and representation of the underlying structural entities becomes an important research issue of recent interest. This thesis introduces new methods regarding the extraction and visualization of structural features like surfaces and curves (e.g. ridge-lines, creases) from 3D (environmental) point clouds. One main part concerns the extraction of curve-like features from environmental point data sets. It provides a new method supporting a stable feature extraction by incorporating a probability-based point classification scheme that characterizes individual points regarding their affiliation to surface-, curve- and volume-like structures. Another part is concerned with the surface reconstruction from (environmental) point clouds exhibiting objects that are more or less complex. A new method providing multi-resolutional surface representations from regular point clouds is discussed. Following the applied principles of this approach a volumetric surface reconstruction method based on the proposed classification scheme is introduced. It allows the reconstruction of surfaces from highly unstructured and noisy point data sets. Furthermore, contributions in the field of reconstructing 3D point clouds from 2D image series are provided. In addition, a discussion concerning the most important properties of (environmental) point clouds with respect to feature extraction is presented.
Knowledge discovery from large and complex collections of today’s scientific datasets is a challenging task. With the ability to measure and simulate more processes at increasingly finer spatial and temporal scales, the increasing number of data dimensions and data objects is presenting tremendous challenges for data analysis and effective data exploration methods and tools. Researchers are overwhelmed with data and standard tools are often insufficient to enable effective data analysis and knowledge discovery. The main objective of this thesis is to provide important new capabilities to accelerate scientific knowledge discovery form large, complex, and multivariate scientific data. The research covered in this thesis addresses these scientific challenges using a combination of scientific visualization, information visualization, automated data analysis, and other enabling technologies, such as efficient data management. The effectiveness of the proposed analysis methods is demonstrated via applications in two distinct scientific research fields, namely developmental biology and high-energy physics. Advances in microscopy, image analysis, and embryo registration enable for the first time measurement of gene expression at cellular resolution for entire organisms. Analysis of highdimensional spatial gene expression datasets is a challenging task. By integrating data clustering and visualization, analysis of complex, time-varying, spatial gene expression patterns and their formation becomes possible. The analysis framework MATLAB and the visualization have been integrated, making advanced analysis tools accessible to biologist and enabling bioinformatic researchers to directly integrate their analysis with the visualization. Laser wakefield particle accelerators (LWFAs) promise to be a new compact source of highenergy particles and radiation, with wide applications ranging from medicine to physics. To gain insight into the complex physical processes of particle acceleration, physicists model LWFAs computationally. The datasets produced by LWFA simulations are (i) extremely large, (ii) of varying spatial and temporal resolution, (iii) heterogeneous, and (iv) high-dimensional, making analysis and knowledge discovery from complex LWFA simulation data a challenging task. To address these challenges this thesis describes the integration of the visualization system VisIt and the state-of-the-art index/query system FastBit, enabling interactive visual exploration of extremely large three-dimensional particle datasets. Researchers are especially interested in beams of high-energy particles formed during the course of a simulation. This thesis describes novel methods for automatic detection and analysis of particle beams enabling a more accurate and efficient data analysis process. By integrating these automated analysis methods with visualization, this research enables more accurate, efficient, and effective analysis of LWFA simulation data than previously possible.
Interactive visualization of large structured and unstructured data sets is a permanent challenge for scientific visualization. Large data sets are for example created by magnetic resonance imaging (MRI), computed tomography (CT), Computational fluid dynamics (CFD) finite element method (FEM), and computer aided design (CAD). For visualizing those data sets not only accelerated rasterization by means of using specialized hardware i.e. graphics cards is of interest, but also ray casting, as it is perfectly suited for scientific visualization. Ray casting does not only support many rendering modes (e.g., opaque rendering, semi transparent rendering, iso surface rendering, maximum intensity projection, x-ray, absorption emitter model, ...) for which it allows the creation of high quality images, but it also supports many primitives (e.g., not only triangles but also spheres, curved iso surfaces, NURBS, implicit functions, ...). It furthermore scales basically linear to the amount of processor cores used and - this makes it highly interesting for the visualization of large data sets - it scales for static scenes sublinear to data size. Interactive ray casting is currently not widely used within the scientifc visualization community. This is mainly based on historical reasons, as just a few years ago no applicable interactive ray casters for commodity hardware did exist. Interactive scientific visualization has only been possible by using graphics cards or specialized and/or expensive hardware. The goal of this work is to broaden the possibilies for interactive scientific visualization, by showing that interactive CPU based ray casting is today feasible on commodity hardware and that it may efficiently be used together with GPU based rasterization. In this thesis it is first shown that interactive CPU based ray casters may efficiently be integrated into already existing OpenGL frameworks. This is achieved through an OpenGL friendly interface that supports multiple threads and single instruction multiple data (SIMD) operations. For the visualization of rectilinear (and not necessarily cartesian) grids are new implicit kd-trees introduced. They have fast construction times, low memory requirements, and allow ontoday's commodity desktop machines interactive iso surface ray tracing and maximum intensity projection of large scalar fields. A new interactive SIMD ray tracing technique for large tetrahedral meshes is introduced. It is very portable and general and is therefore suited for portation upon different (future) hardware and for usage upon several applications. The thesis ends with a real life commercial application which shows that CPU-based ray casting has already reached the state where it may outperform GPU-based rasterization for scientific visualization.
Today’s high-resolution digital images and videos require large amounts of storage space and transmission bandwidth. To cope with this, compression methods are necessary that reduce the required space while at the same time minimize visual artifacts. We propose a compression method based on a piecewise linear color interpolation induced by a triangulation of the image domain. We present methods to speed up significantly the optimization process for finding the triangulation. Furthermore, we extend the method to digital videos. Laser scanners to capture the surface of three-dimensional objects are widely used in industry nowadays, e.g., for reverse engineering or quality measurement. Hand-held scanning devices have the advantage that the laser device can be moved to any position, permitting a scan of complex objects. But operating a hand-held laser scanner is challenging. The operator has to keep track of the scanned regions in his mind, and has no feedback of the sample density unless he starts the surface reconstruction after finishing the scan. We present a system to support the operator by computing and rendering high-quality surface meshes of the captured data online, i.e., while he is still scanning, and in real time. Furthermore, it color-codes the rendered surface to reflect the surface quality. Thereby, instant feedback is provided, resulting in better scans in less time.
In engineering and science, a multitude of problems exhibit an inherently geometric nature. The computational assessment of such problems requires an adequate representation by means of data structures and processing algorithms. One of the most widely adopted and recognized spatial data structures is the Delaunay triangulation which has its canonical dual in the Voronoi diagram. While the Voronoi diagram provides a simple and elegant framework to model spatial proximity, the core of which is the concept of natural neighbors, the Delaunay triangulation provides robust and efficient access to it. This combination explains the immense popularity of Voronoi- and Delaunay-based methods in all areas of science and engineering. This thesis addresses aspects from a variety of applications that share their affinity to the Voronoi diagram and the natural neighbor concept. First, an idea for the generalization of B-spline surfaces to unstructured knot sets over Voronoi diagrams is investigated. Then, a previously proposed method for \(C^2\) smooth natural neighbor interpolation is backed with concrete guidelines for its implementation. Smooth natural neighbor interpolation is also one of many applications requiring derivatives of the input data. The generation of derivative information in scattered data with the help of natural neighbors is described in detail. In a different setting, the computation of a discrete harmonic function in a point cloud is considered, and an observation is presented that relates natural neighbor coordinates to a continuous dependency between discrete harmonic functions and the coordinates of the point cloud. Attention is then turned to integrating the flexibility and meritable properties of natural neighbor interpolation into a framework that allows the algorithmically transparent and smooth extrapolation of any known natural neighbor interpolant. Finally, essential properties are proved for a recently introduced novel finite element tessellation technique in which a Delaunay triangulation is transformed into a unique polygonal tessellation.
Modern digital imaging technologies, such as digital microscopy or micro-computed tomography, deliver such large amounts of 2D and 3D-image data that manual processing becomes infeasible. This leads to a need for robust, flexible and automatic image analysis tools in areas such as histology or materials science, where microstructures are being investigated (e.g. cells, fiber systems). General-purpose image processing methods can be used to analyze such microstructures. These methods usually rely on segmentation, i.e., a separation of areas of interest in digital images. As image segmentation algorithms rarely adapt well to changes in the imaging system or to different analysis problems, there is a demand for solutions that can easily be modified to analyze different microstructures, and that are more accurate than existing ones. To address these challenges, this thesis contributes a novel statistical model for objects in images and novel algorithms for the image-based analysis of microstructures. The first contribution is a novel statistical model for the locations of objects (e.g. tumor cells) in images. This model is fully trainable and can therefore be easily adapted to many different image analysis tasks, which is demonstrated by examples from histology and materials science. Using algorithms for fitting this statistical model to images results in a method for locating multiple objects in images that is more accurate and more robust to noise and background clutter than standard methods. On simulated data at high noise levels (peak signal-to-noise ratio below 10 dB), this method achieves detection rates up to 10% above those of a watershed-based alternative algorithm. While objects like tumor cells can be described well by their coordinates in the plane, the analysis of fiber systems in composite materials, for instance, requires a fully three dimensional treatment. Therefore, the second contribution of this thesis is a novel algorithm to determine the local fiber orientation in micro-tomographic reconstructions of fiber-reinforced polymers and other fibrous materials. Using simulated data, it will be demonstrated that the local orientations obtained from this novel method are more robust to noise and fiber overlap than those computed using an established alternative gradient-based algorithm, both in 2D and 3D. The property of robustness to noise of the proposed algorithm can be explained by the fact that a low-pass filter is used to detect local orientations. But even in the absence of noise, depending on fiber curvature and density, the average local 3D-orientation estimate can be about 9° more accurate compared to that alternative gradient-based method. Implementations of that novel orientation estimation method require repeated image filtering using anisotropic Gaussian convolution filters. These filter operations, which other authors have used for adaptive image smoothing, are computationally expensive when using standard implementations. Therefore, the third contribution of this thesis is a novel optimal non-orthogonal separation of the anisotropic Gaussian convolution kernel. This result generalizes a previous one reported elsewhere, and allows for efficient implementations of the corresponding convolution operation in any dimension. In 2D and 3D, these implementations achieve an average performance gain by factors of 3.8 and 3.5, respectively, compared to a fast Fourier transform-based implementation. The contributions made by this thesis represent improvements over state-of-the-art methods, especially in the 2D-analysis of cells in histological resections, and in the 2D and 3D-analysis of fibrous materials.