On-Demand ETL for Real-Time Analytics

In recent years, business intelligence applications have become more real-time, and traditional data warehouse tables have become fresher as streaming ETL jobs refresh them continuously, within seconds. In addition, a new type of federated system has emerged that unifies domain-specific computation engines to address a wide range of complex analytical applications, and such systems need streaming ETL to migrate data across computation systems. From daily sales reports to up-to-the-second cross-/up-sell campaign activities, we observed varying latency and freshness requirements in these analytical applications. Streaming ETL jobs with regular batches are therefore not flexible enough for such a mixed workload: jobs with small batches overprovision resources for queries with low freshness needs, while jobs with large batches starve queries with high freshness needs. We therefore argue that ETL jobs should adapt themselves to varying SLA demands by setting appropriate batches as needed.

The major contributions are summarized as follows:

• We defined a consistency model for “On-Demand ETL” that specifies the correct batches for queries to see consistent states. Furthermore, we proposed an “Incremental ETL Pipeline” that reduces the performance impact of on-demand ETL processing.
• We introduced a distributed, incremental ETL pipeline (called HBelt) for distributed warehouse systems. HBelt aims to provide consistent, distributed snapshot maintenance for concurrent table scans across different analytics jobs.
• We addressed the elasticity of the incremental ETL pipeline to guarantee that ETL jobs with batches of varying sizes finish within strict deadlines. To this end, we proposed Elastic Queue Middleware and HBaqueue, which replace memory-based data exchange queues with a scalable distributed store, HBase.
• We also implemented lazy maintenance logic in the extraction and loading phases to make these two phases workload-aware.

In addition, we discuss how our “On-Demand ETL” thinking can be exploited in analytic flows running on heterogeneous execution engines.

Author: Weiping Qu
URN (permanent link): urn:nbn:de:hbz:386-kluedo-62522
Advisor: Stefan Dessloch
Document Type: Doctoral Thesis
Language of publication: English
Publication Date: 2021/02/03
Year of Publication: 2021
Publishing Institute: Technische Universität Kaiserslautern
Granting Institute: Technische Universität Kaiserslautern
Acceptance Date of the Thesis: 2021/02/03
Date of the Publication (Server): 2021/02/04
Number of pages: XIII, 161
Faculties / Organisational entities: Fachbereich Informatik
CCS-Classification (computer science): H. Information Systems / H.2 DATABASE MANAGEMENT (E.5) / H.2.5 Heterogeneous Databases
DDC-Classification: 0 General works, computer science, information science / 004 Computer science
Licence (German): Creative Commons 4.0 - Attribution (CC BY 4.0)