Applying ChatGPT for Data Integration

The goal of the thesis is to explore the applicability of ChatGPT to enable a higher degree of automation for data integration in real national and European industrial/research projects.

Contact persons

Francisco Martin-Recuerda, Senior Research Scientist

Erik Johan Nystad, Master of Science

Master Project

Data integration (or data interoperability) remains a major problem in industry and creates considerable overhead in digitalization projects. Relevant examples of data integration problems include entity matching (i.e., linking records to entities) and schema alignment (i.e., aligning types and attributes from multiple sources).
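To make the two problems concrete, the following is a minimal sketch that uses plain string similarity from the Python standard library as a naive baseline. The record contents, attribute names, function names and the 0.8 threshold are all hypothetical and chosen only for illustration; this is not the approach the thesis will take, only a picture of what the inputs and outputs of each task look like.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_entities(left, right, threshold=0.8):
    """Entity matching: pair up records whose names look alike.

    The threshold is an illustrative assumption, not a tuned value.
    """
    pairs = []
    for l_rec in left:
        for r_rec in right:
            if similarity(l_rec["name"], r_rec["name"]) >= threshold:
                pairs.append((l_rec["id"], r_rec["id"]))
    return pairs


def align_schema(source_attrs, target_attrs):
    """Schema alignment: map each source attribute to its most similar target."""
    return {
        s: max(target_attrs, key=lambda t: similarity(s, t))
        for s in source_attrs
    }


# Hypothetical records from two systems describing overlapping companies.
crm = [
    {"id": "c1", "name": "Acme Energy AS"},
    {"id": "c2", "name": "Nordic Maritime"},
]
erp = [
    {"id": "e1", "name": "ACME Energy A.S."},
    {"id": "e2", "name": "Baltic Shipping"},
]

matches = match_entities(crm, erp)
mapping = align_schema(
    ["cust_name", "postcode"],
    ["customer_name", "postal_code", "country"],
)
print(matches)
print(mapping)
```

A baseline of this kind breaks down quickly on abbreviations, synonyms and multilingual data, which is part of the motivation for looking at language models instead.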

A common approach to data integration is to implement tailor-made solutions for particular applications, projects and organisations. These solutions are expensive to develop and maintain, and they do not generalize well to a larger range of problems and projects.

Large companies such as Amazon, Google, Apple and IBM are applying large language models (LLMs) such as ChatGPT to enable a higher degree of automation for data integration problems. However, many companies and public institutions are still unaware of these techniques or have yet to adopt them.

Research Topic focus

The goal of this thesis is to explore the applicability of ChatGPT to data integration problems in real national or European projects that aim to create digital twins for domains such as energy, manufacturing, maritime and biology.

Expected Results and Learning Outcome

After the thesis is successfully submitted and defended, the student should have a better understanding of, and practical experience with, ChatGPT and other language models such as Llama 2 (by Meta). This includes the ability to describe complex tasks using prompt engineering techniques and to adapt language models to particular problems using fine-tuning.
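As an illustration of what describing a task through prompt engineering can look like, the sketch below builds a few-shot prompt for entity matching and interprets a yes/no reply. All function names, the prompt wording and the example records are hypothetical assumptions. The step that sends the prompt to a model is deliberately left out, since it depends on the model and client library chosen in the project.

```python
def serialize(record: dict) -> str:
    """Flatten a record into 'key: value' pairs, sorted by key for stable prompts."""
    return "; ".join(f"{key}: {value}" for key, value in sorted(record.items()))


def build_matching_prompt(record_a: dict, record_b: dict, examples=()) -> str:
    """Build a few-shot prompt asking whether two records describe the same entity.

    `examples` is a sequence of (record_a, record_b, is_match) triples that
    are shown to the model as solved cases before the actual question.
    """
    lines = [
        "Decide whether the two records refer to the same real-world entity.",
        "Answer with 'yes' or 'no' only.",
        "",
    ]
    for ex_a, ex_b, is_match in examples:
        lines += [
            f"Record A: {serialize(ex_a)}",
            f"Record B: {serialize(ex_b)}",
            f"Answer: {'yes' if is_match else 'no'}",
            "",
        ]
    lines += [
        f"Record A: {serialize(record_a)}",
        f"Record B: {serialize(record_b)}",
        "Answer:",
    ]
    return "\n".join(lines)


def parse_answer(reply: str) -> bool:
    """Interpret a free-text model reply as a match decision."""
    return reply.strip().lower().startswith("yes")


# Hypothetical solved example and query records.
few_shot = [
    ({"name": "Acme Energy AS", "city": "Oslo"},
     {"name": "ACME Energy A.S.", "city": "Oslo"},
     True),
]
prompt = build_matching_prompt(
    {"name": "Nordic Maritime", "city": "Bergen"},
    {"name": "Nordic Maritime Group", "city": "Bergen"},
    examples=few_shot,
)
print(prompt)
```

Keeping prompt construction and answer parsing as separate, model-independent functions makes it easier to compare different models or prompt variants on the same data.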

Qualifications

Candidates should have a good understanding of deep learning techniques, data engineering, and semantic technologies. In addition, programming experience in Python is recommended, including libraries for data processing (e.g., Pandas, SQLAlchemy), data analytics (e.g., NumPy, Scikit-learn, TensorFlow, PyTorch) and data visualisation (e.g., Matplotlib, Seaborn).

Some relevant courses at UiO: TEK5040, IN3060, IN2090, IN5800 and IN3110.

References

Jaimovitch-López, G., et al., 2023. Can language models automate data wrangling? Machine Learning 112(6).
Li, Y., et al., 2020. Deep entity matching with pre-trained language models.
Li, Y., et al., 2021. Deep entity matching: Challenges and opportunities.
Tan, W.C., 2021. Deep data integration.
Weikum, G., et al., 2021. Machine knowledge: Creation and curation of comprehensive knowledge bases.

Contact persons/supervisors

Francisco Martin-Recuerda, Arne Jørgen Berre, and Erik J. Nystad.
