Authors: Marcel Parciak, Brecht Vandevoort, Frank Neven, Liesbet M. Peeters, Stijn Vansummeren
Abstract: Large Language Models (LLMs) have shown useful applications in a variety of
tasks, including data wrangling. In this paper, we investigate the use of an
off-the-shelf LLM for schema matching. Our objective is to identify semantic
correspondences between elements of two relational schemas using only names and
descriptions. Using a newly created benchmark from the health domain, we
propose several so-called task scopes: methods for prompting the LLM to
perform schema matching that differ in the amount of context information
contained in the prompt. Using these task scopes, we compare LLM-based schema
matching against a string similarity baseline, investigating matching quality,
verification effort, decisiveness, and complementarity of the approaches. We
find that matching quality suffers when the prompt contains too little context
information, but also when it contains too much. In general, using newer LLM
versions increases decisiveness. We identify task scopes that have acceptable
verification effort and succeed in identifying a significant number of true
semantic matches. Our study shows that LLMs have potential for bootstrapping the
schema matching process and can help data engineers speed up this task based
solely on schema element names and descriptions, without the need for data
instances.
Source: http://arxiv.org/abs/2407.11852v1
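
The abstract compares LLM-based matching against a string similarity baseline but does not say which similarity measure is used. As a rough illustration of what such a baseline might look like, the sketch below scores attribute-name pairs with Python's standard `difflib.SequenceMatcher`. The function names, the 0.6 threshold, and the attribute names are all assumptions made for this example and are not taken from the paper or its benchmark.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_by_name(source_attrs, target_attrs, threshold=0.6):
    """Return (source, target, score) candidates at or above the threshold, best first.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    candidates = []
    for s in source_attrs:
        for t in target_attrs:
            score = similarity(s, t)
            if score >= threshold:
                candidates.append((s, t, round(score, 3)))
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Hypothetical attribute names, invented for illustration only.
source = ["patient_id", "birth_date", "gender"]
target = ["PatientID", "DateOfBirth", "Sex"]
matches = match_by_name(source, target)

for s, t, score in matches:
    print(f"{s} <-> {t}: {score}")
```

With these made-up inputs, only the near-identical pair `patient_id` / `PatientID` clears the threshold. A pair such as `gender` / `Sex` shares almost no characters, so a name-only comparison scores it low even though the two attributes mean the same thing. This is the kind of semantic correspondence for which the paper investigates LLM prompting as a complementary approach.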