A prime motivation for using XML to directly represent pieces of information is its ability to support ad-hoc or 'schema-later' settings. In such scenarios, modeling data under loose data constraints is essential. Of course, the flexibility of XML comes at a price: the absence of a rigid, regular, and homogeneous structure makes many aspects of data management more challenging. Such malleable data formats can also lead to severe information quality problems, because the risk of storing inconsistent and incorrect data is greatly increased. A prominent example of such problems is the appearance of so-called fuzzy duplicates, i.e., multiple, non-identical representations of the same real-world entity. Similarity joins, which correlate similar XML document fragments, can be used as core operators to support the identification of fuzzy duplicates. However, similarity assessment is especially difficult on XML datasets because structure, in addition to textual information, may vary across document fragments that represent the same real-world entity. Moreover, similarity computation is substantially more expensive for tree-structured objects and is thus a serious performance concern. This thesis describes the design and implementation of an effective, flexible, and high-performance XML-based similarity join framework. As main contributions, we present novel structure-conscious similarity functions for XML trees (considering XML structure either in isolation or combined with textual information), mechanisms to support the selection of relevant information from XML trees and the organization of this information into a format suitable for similarity calculation, and efficient algorithms for large-scale identification of similar, set-represented objects.
Finally, we validate the applicability of our techniques by integrating our framework into a native XML database management system; in this context we address several issues around the integration of similarity operations into traditional database architectures.
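The idea of a similarity join over set-represented XML fragments can be illustrated with a minimal Python sketch. It is not the framework described in the thesis: the token scheme (root-to-node path paired with a lowercase word), the Jaccard measure, the naive pairwise comparison, and the function names `tree_tokens`, `jaccard`, and `similarity_join` are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def tree_tokens(xml_text):
    """Flatten an XML fragment into a set of (path, word) tokens.

    Assumed token scheme: each element contributes one structural token
    (path, "") and one token per word of its leading text. Tail text and
    attributes are ignored in this sketch.
    """
    root = ET.fromstring(xml_text)
    tokens = set()

    def walk(node, path):
        current = path + "/" + node.tag
        tokens.add((current, ""))  # structure-only token
        for word in (node.text or "").lower().split():
            tokens.add((current, word))  # structure combined with text
        for child in node:
            walk(child, current)

    walk(root, "")
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(fragments, threshold):
    """Return (i, j, score) for all fragment pairs with score >= threshold.

    Naive quadratic comparison; a large-scale join would need pruning
    techniques that are not shown here.
    """
    sets = [tree_tokens(f) for f in fragments]
    result = []
    for i, j in combinations(range(len(sets)), 2):
        score = jaccard(sets[i], sets[j])
        if score >= threshold:
            result.append((i, j, score))
    return result


if __name__ == "__main__":
    fragments = [
        "<person><name>John Smith</name><city>Berlin</city></person>",
        "<person><name>Jon Smith</name><city>Berlin</city></person>",
        "<book><title>XML Databases</title></book>",
    ]
    for i, j, score in similarity_join(fragments, 0.7):
        print(i, j, round(score, 3))
```

In this sketch the first two fragments differ only in one word of the name, so they share most of their tokens and are reported as a candidate fuzzy duplicate, while the unrelated third fragment shares none.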
In urban planning, sophisticated simulation models are key tools for estimating future population growth and for measuring the impact of planning decisions on urban development and the environment. Simulated population projections usually result in large, macro-scale, multivariate geospatial data sets. Millions of records have to be processed, stored, and visualized to help planners explore and analyze complex population patterns. We introduce a database-driven framework for visualizing geospatial multidimensional simulation data based on the output of UrbanSim, a software system for the analysis and planning of urban development. The framework is extensible and aims at integrating empirical-stochastic methods and urban simulation models with techniques developed for information visualization and cartography. First, we develop an empirical model for estimating residential building types based on demographic household characteristics. The predicted dwelling type information is important for the analysis of future material use, for carbon footprint calculations, and for simultaneously visualizing land use, density, and other significant parameters in 3D space. Our model uses multinomial logistic regression to derive building types at different scales. The estimated regression coefficients are applied to UrbanSim output in order to predict residential building types. The simulation results and the estimated building types are managed in an object-relational geodatabase. From the database, density, building types, and significant demographic variables are visually encoded as scalable, georeferenced 3D geometries and displayed on top of aerial photographs in a Google Earth visual synthesis. The geodatabase can be accessed, and the visualization parameters chosen, through a web-based user interface. The geometries are encoded in KML, Google's markup language, as ready-to-visualize data sets.
The goal is to enhance human cognition by displaying abstract representations of multidimensional data sets in a realistic context and thus to support decision making in planning processes.
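How estimated multinomial logistic regression coefficients might be applied to household records to predict a building type can be sketched as follows. The building type names, the two household features, the coefficient values, and the function name `predict_building_type` are hypothetical and serve only to illustrate the calculation; none of them are taken from the thesis or from UrbanSim.

```python
import math


def predict_building_type(features, coefficients, reference="single_family"):
    """Apply multinomial logit coefficients to one household record.

    features:     list of numeric household characteristics
    coefficients: dict mapping each non-reference building type to
                  (intercept, [one beta per feature])
    reference:    baseline category whose linear predictor is fixed at 0

    Returns the most probable type and the full probability distribution.
    """
    scores = {reference: 0.0}
    for building_type, (intercept, betas) in coefficients.items():
        scores[building_type] = intercept + sum(
            beta * x for beta, x in zip(betas, features)
        )

    # Softmax over the linear predictors; subtracting the maximum
    # avoids overflow in exp() without changing the result.
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(exps.values())
    probabilities = {k: v / total for k, v in exps.items()}

    best = max(probabilities, key=probabilities.get)
    return best, probabilities


if __name__ == "__main__":
    # Hypothetical coefficients for features [household_size, has_children].
    coefficients = {
        "apartment": (1.5, [-0.6, -1.0]),
        "townhouse": (0.2, [-0.1, 0.3]),
    }
    for household in ([1.0, 0.0], [4.0, 1.0]):
        label, probs = predict_building_type(household, coefficients)
        print(household, label, {k: round(p, 3) for k, p in probs.items()})
```

Fixing the linear predictor of one reference category at zero is the usual way to make a multinomial logit model identifiable; the predicted type is simply the category with the highest resulting probability.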