LLMDap: LLM-based Data Profiling and Sharing

Abstract

To boost the data economy and harness the rapid growth in available datasets, data must be described with rich, high-quality, interoperable metadata that facilitates discovery and integration across multiple sources. Traditional keyword-based data search is limited by mismatches between published data descriptions and the terms used in queries. In this paper, we explore the use of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to enable automatic metadata enrichment and improve dataset discoverability. We present LLMDap, an LLM-based pipeline for high-quality data annotation and semantic discovery. The pipeline automates the generation of structured, interoperable metadata from scientific publications, leveraging RAG and prior knowledge to improve output accuracy. For data profiling, LLMDap allows data providers to efficiently generate standardized, semantically enriched metadata for data publishing. When integrated with a data catalogue, LLMDap helps data consumers discover and explore datasets. The method is LLM-agnostic and domain-independent; we validated it in the biomedical domain. This work contributes to improving data discoverability, usability, and interoperability within a data sharing ecosystem.

Language

English

Author(s)

Shanshan Jiang
Sondre Sørbø
Phil Tinn
Shang Ferheng Karim
Titi Roman

Affiliation

SINTEF Digital / Sustainable Communication Technologies
SINTEF Digital / Software Engineering, Safety and Security
OsloMet - Oslo Metropolitan University

Year

2025

Publisher

VLDB Endowment

Book

Proceedings of Workshops at the 51st International Conference on Very Large Data Bases (VLDB 2025)

DOI

https://www.vldb.org/2025/workshops/vldb-workshops-2025/dec/dec25_5.pdf

