Can Deep Neural Networks Predict Data Correlations from Column Names?

Trummer, Immanuel

Computer Science > Databases

arXiv:2107.04553 (cs)

[Submitted on 9 Jul 2021 (v1), last revised 11 Sep 2023 (this version, v2)]

Abstract: Recent publications suggest using natural language analysis on database schema elements to guide tuning and profiling efforts. The underlying hypothesis is that state-of-the-art language processing methods, so-called language models, are able to extract information on data properties from schema text.
This paper examines that hypothesis in the context of data correlation analysis: is it possible to find column pairs with correlated data by analyzing their names via language models? First, the paper introduces a novel benchmark for data correlation analysis, created by analyzing thousands of Kaggle data sets (and available for download). Second, it uses that data to study the ability of language models to predict correlation, based on column names. The analysis covers different language models, various correlation metrics, and a multitude of accuracy metrics. It pinpoints factors that contribute to successful predictions, such as the length of column names as well as the ratio of words. Finally, the study analyzes the impact of column types on prediction performance. The results show that schema text can be a useful source of information and inform future research efforts targeted at NLP-enhanced database tuning and data profiling.

Subjects:	Databases (cs.DB); Computation and Language (cs.CL)
Cite as:	arXiv:2107.04553 [cs.DB]
	(or arXiv:2107.04553v2 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2107.04553

Submission history

From: Immanuel Trummer
[v1] Fri, 9 Jul 2021 17:11:54 UTC (839 KB)
[v2] Mon, 11 Sep 2023 15:37:33 UTC (1,900 KB)
