Incremental Recomputations in Materialized Data Integration

  • Data integration aims at providing uniform access to heterogeneous data, managed by distributed source systems. Data sources can range from legacy systems, databases, and enterprise applications to web-scale data management systems. The materialized approach to data integration, extracts data from the sources, transforms and consolidates the data, and loads it into an integration system, where it is persistently stored and can be queried and analyzed. To support materialized data integration, so called Extract-Transform-Load (ETL) systems have been built and are widely used to populate data warehouses today. While ETL is considered state-of-the-art in enterprise data warehousing, a new paradigm known as MapReduce has recently gained popularity for web-scale data transformations, such as web indexing or page rank computation. The input data of both, ETL and MapReduce programs keeps changing over time, while business transactions are processed or the web is crawled, for instance. Hence, the results of ETL and MapReduce programs get stale and need to be recomputed from time to time. Recurrent computations over changing input data can be performed in two ways. The result may either be recomputed from scratch or recomputed in an incremental fashion. The idea behind the latter approach is to update the existing result in response to incremental changes in the input data. This is typically more efficient than the full recomputation approach, because reprocessing unchanged portions of the input data can often be avoided. Incremental recomputation techniques have been studied by the database research community mainly in the context of the maintenance of materialized views and have been adopted by all major commercial database systems today. However, neither today's ETL tools nor MapReduce support incremental recomputation techniques. The situation of ETL and MapReduce programmers nowadays is thus much comparable to the situation of database programmers in the early 1990s. This thesis makes an effort to transfer incremental recomputation techniques into the ETL and MapReduce environments. This poses interesting research challenges, because these environments differ fundamentally from the relational world with regard to query and programming models, change data capture, transactional guarantees and consistency models. However, as this thesis will show, incremental recomputations are feasible in ETL and MapReduce and may lead to considerable efficiency improvements.

Download full text files

Export metadata

Metadaten
Author:Thomas Jörg
URN:urn:nbn:de:hbz:386-kluedo-34973
Advisor:Stefan Deßloch
Document Type:Doctoral Thesis
Language of publication:English
Date of Publication (online):2013/04/29
Year of first Publication:2013
Publishing Institution:Technische Universität Kaiserslautern
Granting Institution:Technische Universität Kaiserslautern
Acceptance Date of the Thesis:2013/01/18
Date of the Publication (Server):2013/04/30
Page Number:IX, 188
Faculties / Organisational entities:Kaiserslautern - Fachbereich Informatik
CCS-Classification (computer science):H. Information Systems / H.2 DATABASE MANAGEMENT (E.5) / H.2.5 Heterogeneous Databases
DDC-Cassification:0 Allgemeines, Informatik, Informationswissenschaft / 004 Informatik
Licence (German):Standard gemäß KLUEDO-Leitlinien vom 10.09.2012