LLMDapCat: An LLM-based Data Catalogue System for Data Sharing and Exploration

Abstract

Good data catalogues are essential for effective data sharing and discovery, particularly given the rapid expansion of datasets and scientific literature available on the Web. In this paper, we present LLMDapCAT, an LLM-based metadata and data catalogue system that uses Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) for efficient data profiling, sharing, and exploration. We demonstrate how the system serves both data providers and consumers. Providers can automatically generate standardized and semantically accurate metadata from scientific papers using an LLM- and RAG-based pipeline, and publish that metadata in the catalogue system. Consumers can browse the available datasets and explore them in chat-like Q&A sessions backed by an external LLM service. The system can be applied to curate custom domain-specific scientific databases that facilitate search, understanding, and exploration of domain-specific datasets.

Language

English

Author(s)

Shang Ferheng Karim
Aisha Kelifa
Amanda Marie Holsæter Kjær
Shanshan Jiang
Sondre Sørbø
Titi Roman

Affiliation

SINTEF Digital / Sustainable Communication Technologies
SINTEF Digital / Software Engineering, Safety and Security
OsloMet - Oslo Metropolitan University

Year

2025

Published in

CEUR Workshop Proceedings

Volume

4085

Page(s)

483 - 488

URL

https://ceur-ws.org/vol-4085/paper80.pdf
