
Automated Compliance Checking of Data Pipelines Using Multi-Agent LLMs

Ensuring that complex data pipelines adhere to a growing web of legal, ethical, and organizational requirements (e.g., the GDPR, the EU AI Act) is a major challenge. This thesis will examine how multi-agent LLMs can automate compliance checking of data pipelines, exploring a novel application of AI for regulatory and policy assurance in data-intensive systems.


The study will include the design of a system where agents parse pipeline specifications, map them to relevant regulatory constraints, and collaboratively reason about compliance violations. The research will involve implementing a prototype that integrates compliance rule sources, pipeline metadata, and multi-agent reasoning strategies. The system will be evaluated on real-world or synthetic data pipelines, assessing the accuracy of compliance detection, scalability, and trustworthiness.  
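As a rough illustration of how such agents might be composed, the sketch below wires an analyzer, a checker, and a reporter into a simple sequence. All class names, fields, and the retention threshold are hypothetical, and the agents use hand-written logic where a real prototype would prompt an LLM:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PipelineStep:
    """Metadata about one step of a data pipeline (illustrative fields)."""
    name: str
    stores_personal_data: bool = False
    retention_days: Optional[int] = None

class PipelineAnalyzerAgent:
    """Extracts structured metadata from a pipeline specification."""
    def extract_metadata(self, spec: dict) -> List[PipelineStep]:
        return [PipelineStep(**step) for step in spec["steps"]]

class ComplianceCheckerAgent:
    """Matches pipeline behavior against rules (rule-based stand-in for an LLM)."""
    MAX_RETENTION_DAYS = 30  # assumed organizational policy, not a legal constant

    def check(self, steps: List[PipelineStep]) -> List[str]:
        violations = []
        for step in steps:
            if step.stores_personal_data and (
                step.retention_days is None
                or step.retention_days > self.MAX_RETENTION_DAYS
            ):
                violations.append(
                    f"{step.name}: personal data retained beyond policy limit"
                )
        return violations

class ReportingAgent:
    """Turns violation findings into a human-readable compliance report."""
    def report(self, violations: List[str]) -> str:
        if not violations:
            return "COMPLIANT"
        return "NON-COMPLIANT:\n" + "\n".join(f"- {v}" for v in violations)

spec = {"steps": [
    {"name": "ingest", "stores_personal_data": True, "retention_days": 365},
    {"name": "aggregate"},
]}
steps = PipelineAnalyzerAgent().extract_metadata(spec)
violations = ComplianceCheckerAgent().check(steps)
print(ReportingAgent().report(violations))
```

In a real prototype, the checker and analyzer would be backed by LLM calls and the message passing mediated by an orchestration framework; the fixed pipeline of three agents here only shows the division of responsibilities.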

Research Topic Focus

The work to be done includes:  

  1. Literature Review: Study existing research on compliance checking for data pipelines, legal AI, and regulatory reasoning with LLMs.
  2. Requirement Gathering: Identify regulatory frameworks and policies relevant for pipelines (e.g., GDPR, AI Act, data locality rules). 
  3. Knowledge Representation: Formalize a subset of compliance rules (e.g., data minimization, consent, logging) into machine-interpretable formats. 
  4. Agent Design: Define specialized agents such as:  
    • Rule Extraction Agent (maps text to structured constraints).  
    • Pipeline Analyzer Agent (extracts pipeline metadata and processes).  
    • Compliance Checker Agent (matches pipeline behavior with rules).  
    • Reporting Agent (generates compliance reports and recommendations).
  5. Prototype Development: Implement a working system integrating agents, pipeline metadata, and compliance rules.
  6. Case Studies: Apply the prototype to synthetic and real-world pipeline examples (e.g., bioinformatics workflows, IoT data pipelines). 
  7. Evaluation: Assess accuracy, false positives/negatives, scalability, and explainability of compliance assessments.
  8. Discussion: Analyze the results, limitations, and implications of the approach for automated governance.
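To make step 3 concrete, one possible machine-interpretable encoding of a compliance rule is a small declarative record that agents can evaluate against pipeline metadata. The rule format, field names, and the data-minimization condition below are illustrative assumptions, not drawn from any standard:

```python
import json

# A simplified, illustrative rule format; field names are assumptions.
RULES = json.loads("""
[
  {
    "id": "data-minimization",
    "description": "Steps may only read fields they declare a purpose for.",
    "applies_to": "step",
    "require": "set(step['fields_read']) <= set(step['fields_needed'])"
  }
]
""")

def evaluate(rule: dict, step: dict) -> bool:
    # Evaluate the rule's condition against one pipeline step.
    # eval() on a fixed, trusted rule string keeps this sketch short;
    # a real system would use a rule engine or a parsed DSL instead.
    return eval(rule["require"], {"set": set}, {"step": step})

step = {"name": "train_model",
        "fields_read": ["age", "zip", "name"],
        "fields_needed": ["age", "zip"]}

for rule in RULES:
    ok = evaluate(rule, step)
    print(f"{rule['id']}: {'PASS' if ok else 'FAIL'} for {step['name']}")
```

Storing rules as data rather than code is what lets a Rule Extraction Agent produce them from regulatory text and a Compliance Checker Agent consume them uniformly; richer formalisms (ontologies, RDF/SHACL, or a dedicated policy language) would be natural next steps.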

Expected Results and Learning Outcome

  • Expected Contribution:  
    • A functional prototype of a multi-agent system for automated compliance checking.  
    • A novel methodology and empirical insights into applying LLMs for regulatory assurance in data systems.
  • Learning Outcomes:  
    • Practical experience building and evaluating multi-agent LLM systems.  
    • An interdisciplinary understanding of the intersection between AI, data engineering, and regulatory compliance (Legal Tech).  
    • Skills in knowledge representation and applying AI for complex reasoning tasks.

Qualifications

  • Required: Strong programming experience in Python.  
  • Knowledge: A solid understanding of LLMs and fundamental data engineering concepts.
  • Interest: A keen interest in topics such as data privacy, AI ethics, and regulatory technology (RegTech) is essential for this thesis.  
  • Advantageous: Familiarity with knowledge representation (e.g., ontologies, RDF) or prior exposure to legal or policy documents would be a plus.
