LLMDapCat: An LLM-based Data Catalogue System for Data Sharing and Exploration

Abstract

Good data catalogues are essential for effective data sharing and discovery, given the rapid expansion of datasets and scientific literature available on the Web. In this paper, we present LLMDapCAT, an LLM-based metadata and data catalogue system that exploits Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) for efficient data profiling, sharing, and exploration. We demonstrate how the system serves both data providers and consumers: on the one hand, it allows providers to automatically generate standardized and semantically accurate metadata from scientific papers using an LLM- and RAG-based pipeline, and to publish the metadata in the catalogue system; on the other hand, it enables consumers to browse available datasets and explore them in chat-like Q&A sessions using an external LLM service. The system can be applied to curate custom domain-specific scientific databases that facilitate search, understanding, and exploration of the datasets they contain.
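
The provider-side flow described in the abstract (retrieve relevant passages from a paper, then ask an LLM to fill in a metadata field) can be illustrated with a minimal sketch. This is not the LLMDapCAT implementation: the function names, the word-based chunking, the token-overlap retrieval (a stand-in for embedding similarity), and the `llm` callable are all assumptions made for illustration.

```python
import re
from collections import Counter


def chunk_text(text, size=40):
    """Split text into chunks of at most `size` words (hypothetical chunking rule)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def tokenize(s):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", s.lower())


def retrieve(chunks, query, k=2):
    """Return the k chunks sharing the most tokens with the query.

    Token overlap is an assumed stand-in for the vector search a real
    RAG pipeline would typically use.
    """
    q = Counter(tokenize(query))

    def score(chunk):
        c = Counter(tokenize(chunk))
        return sum(min(q[t], c[t]) for t in q)

    return sorted(chunks, key=score, reverse=True)[:k]


def build_prompt(field, context):
    """Assemble a prompt that restricts the model to the retrieved context."""
    joined = "\n---\n".join(context)
    return (
        f"Using only the context below, state the value of the "
        f"metadata field '{field}'.\n\nContext:\n{joined}\n\nAnswer:"
    )


def extract_metadata(paper_text, fields, llm, k=2):
    """Fill each metadata field by retrieving context and querying `llm`.

    `fields` maps a field name to a retrieval query; `llm` is any callable
    that takes a prompt string and returns a string (an assumption here).
    """
    chunks = chunk_text(paper_text)
    return {
        name: llm(build_prompt(name, retrieve(chunks, query, k)))
        for name, query in fields.items()
    }
```

In use, `llm` would wrap whatever model service is available, and `fields` would follow the metadata schema the catalogue expects; both are left open here because the abstract does not specify them.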
Category

Academic article

Language

English

Author(s)

Affiliation

  • SINTEF Digital / Sustainable Communication Technologies
  • SINTEF Digital / Software Engineering, Safety and Security
  • OsloMet - Oslo Metropolitan University

Year

2025

Published in

CEUR Workshop Proceedings

Volume

4085

Page(s)

483 - 488