Data Pipelines Creation Using Multi-Agent LLMs

This thesis will investigate the use of multi-agent Large Language Models (LLMs) for automating the creation of data pipelines. The work will explore how specialized LLM agents—responsible for tasks such as data ingestion, transformation, validation, and orchestration—can collaboratively generate, configure, and optimize end-to-end pipelines.

Contact persons

Andres Felipe Ocampo Palacio

Research Scientist
Arda Goknil

Senior Research Scientist

The study will include the design of a multi-agent workflow, implementation using an existing LLM-agent framework, and evaluation against manually designed pipelines. Evaluation criteria will focus on correctness, efficiency, scalability, and energy consumption.
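As an illustration of what evaluation against a manually designed pipeline could look like, the sketch below compares a candidate pipeline with a hand-written baseline on output equality (correctness) and wall-clock time (efficiency). It is a minimal sketch under stated assumptions: `manual_pipeline`, `generated_pipeline`, `compare`, and the sample rows are hypothetical names and data invented for this example, and scalability and energy consumption are not measured here.

```python
import time
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]
Pipeline = Callable[[List[Row]], List[dict]]

# Hypothetical sample input; not taken from any real benchmark.
SAMPLE_ROWS: List[Row] = [
    {"id": "1", "amount": "10.5"},
    {"id": "2", "amount": None},
    {"id": "3", "amount": "4"},
]


def manual_pipeline(rows: List[Row]) -> List[dict]:
    """Hand-written baseline: drop rows without an amount, cast amount to float."""
    return [
        {"id": r["id"], "amount": float(r["amount"])}
        for r in rows
        if r["amount"] is not None
    ]


def generated_pipeline(rows: List[Row]) -> List[dict]:
    """Stand-in for a pipeline produced by the agents (assumption: same task)."""
    out = []
    for r in rows:
        if r["amount"] is None:
            continue
        out.append({"id": r["id"], "amount": float(r["amount"])})
    return out


def compare(candidate: Pipeline, baseline: Pipeline, rows: List[Row]) -> dict:
    """Run both pipelines on the same input and report correctness and runtime."""
    start = time.perf_counter()
    candidate_out = candidate(rows)
    candidate_seconds = time.perf_counter() - start

    start = time.perf_counter()
    baseline_out = baseline(rows)
    baseline_seconds = time.perf_counter() - start

    return {
        "correct": candidate_out == baseline_out,
        "candidate_seconds": candidate_seconds,
        "baseline_seconds": baseline_seconds,
    }


result = compare(generated_pipeline, manual_pipeline, SAMPLE_ROWS)
```

In a real study the single timing would be replaced by repeated runs on larger inputs, and energy usage would need a separate measurement tool.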

Research Topic Focus

The work to be done includes:

Literature Review: Study existing research on multi-agent LLM systems, pipeline orchestration frameworks (e.g., Airflow, Kubeflow, SIM-PIPE), and AI-assisted programming.
Problem Analysis: Identify challenges in current pipeline creation workflows (e.g., manual specification, error-prone configurations).
Design: Define agent roles (e.g., Data Ingestion Agent, Transformation Agent, Validation Agent, Orchestration Agent) and specify their interactions.
Implementation: Build a prototype using an existing multi-agent framework (e.g., AutoGen, LangChain Agents, CrewAI).
Evaluation Setup: Select benchmark tasks (e.g., ETL pipeline, ML training pipeline, big data workflow).
Experiments: Compare LLM-generated pipelines against manually engineered ones in terms of correctness, performance, scalability, and energy usage.
Analysis: Assess strengths, limitations, and potential improvements of multi-agent LLM-driven pipeline generation.
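The agent roles named in the Design step can be pictured as functions that each extend a shared pipeline specification in turn. The following is only a minimal sketch of that idea, not the design the thesis will arrive at: `PipelineSpec`, the four agent functions, and `run_workflow` are hypothetical names, each agent is a stub where a real prototype would call an LLM through a framework such as AutoGen or CrewAI, and the strictly sequential hand-off is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PipelineSpec:
    """Shared state that every agent reads and extends (hypothetical structure)."""
    source: str
    steps: List[str] = field(default_factory=list)
    checks: List[str] = field(default_factory=list)
    schedule: Optional[str] = None


# Each agent below is a stub; a real agent would prompt an LLM to decide
# what to add to the specification.
def ingestion_agent(spec: PipelineSpec) -> PipelineSpec:
    spec.steps.append(f"read:{spec.source}")
    return spec


def transformation_agent(spec: PipelineSpec) -> PipelineSpec:
    spec.steps.append("transform:drop_nulls")
    return spec


def validation_agent(spec: PipelineSpec) -> PipelineSpec:
    spec.checks.append("row_count>0")
    return spec


def orchestration_agent(spec: PipelineSpec) -> PipelineSpec:
    spec.schedule = "daily"
    return spec


# Assumed interaction pattern: a fixed sequential hand-off between agents.
AGENTS: List[Callable[[PipelineSpec], PipelineSpec]] = [
    ingestion_agent,
    transformation_agent,
    validation_agent,
    orchestration_agent,
]


def run_workflow(source: str) -> PipelineSpec:
    """Pass one specification through all agents in order."""
    spec = PipelineSpec(source=source)
    for agent in AGENTS:
        spec = agent(spec)
    return spec


spec = run_workflow("orders.csv")
```

Other interaction patterns, such as a validation agent sending the specification back for revision, would replace the simple loop in `run_workflow`.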

Expected Results and Learning Outcome

After the thesis is successfully submitted and defended, the student will have delivered a comprehensive study and gained valuable skills:

Deliverables:
- A prototype framework that uses multi-agent LLMs to automate data pipeline creation.
- A comprehensive evaluation report with empirical data comparing the automated approach to manual methods.
- A written thesis detailing the system architecture, experimental setup, results, and a critical analysis of the findings.
Learning Outcomes:
- Practical experience with state-of-the-art multi-agent LLM frameworks.
- A deep understanding of data engineering, pipeline design, and orchestration tools.
- Skills in designing and conducting empirical evaluations of complex AI systems.

Qualifications

Candidates should have a strong background in software engineering and a keen interest in applying generative AI to solve complex automation challenges.

Required: Strong programming experience in Python.
Knowledge: A solid understanding of software and data engineering principles (e.g., ETL/ELT processes, APIs). Familiarity with the fundamentals of LLMs is essential.
Experience: Familiarity with Python libraries for data processing (e.g., Pandas, Polars) and interacting with APIs.
Advantageous: Prior experience with containerization (Docker), workflow orchestration tools (Airflow, Prefect), or agent frameworks (AutoGen, LangChain) would be a significant plus.

References

Wu, Q., et al. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv preprint arXiv:2308.08155.
Kramer, K. M., et al. 2025. Towards Next Generation Data Engineering Pipelines. arXiv preprint arXiv:2507.13892.
