Data integration, cleaning, and deduplication: Research versus industrial projects

R Wrembel - International Conference on Information Integration and Web, 2022 - Springer
Abstract
In business applications, data integration is typically implemented as a data warehouse architecture. In this architecture, heterogeneous and distributed data sources are accessed and integrated by means of Extract-Transform-Load (ETL) processes. Designing these processes is challenging due to the heterogeneity of data models and formats, data errors and missing values, and multiple data pieces representing the same real-world objects. As a consequence, ETL processes are very complex, which results in high development and maintenance costs as well as long runtimes.
To ease the development of ETL processes, various research and technological solutions have been developed. They include, among others: (1) ETL design methods, (2) data cleaning pipelines, (3) data deduplication pipelines, and (4) performance optimization techniques. Although these solutions have been included in commercial (and some open license) ETL design environments and ETL engines, multiple open issues remain and the existing solutions still need to advance.
In this paper (and its accompanying talk), I will provoke a discussion on what problems one can encounter while implementing ETL pipelines in real business (industrial) projects. The presented findings are based on my experience from research and commercial data integration projects in the financial, healthcare, and software development sectors. In particular, I will focus on three issues: (1) performance optimization of ETL processes, (2) cleaning and deduplicating large row-like data sets, and (3) integrating medical data.