Exploring LLMs for Synthetic Time-Series Data Generation and Quality Improvement

The goal of this thesis is to investigate and prototype a system where an LLM assists a data scientist with common time-series data challenges.

Many scientific and industrial domains rely on time-series data (e.g., sensor readings, lab measurements, clinical monitoring). However, datasets are often incomplete, noisy, or too sensitive to share because of privacy concerns. Generative AI offers new possibilities to improve data quality and generate realistic synthetic datasets for research, training, or collaboration.

Recent advances in Large Language Models (LLMs) suggest they could be used not only for text but also for understanding, cleaning, and generating structured data, when guided by contextual descriptions such as lab protocols or dataset documentation. This project explores the potential of LLMs to act as assistants in time-series data management and synthetic data creation.

Research Topic Focus

  1. Understand and Analyze Datasets: Use context data (e.g., lab descriptions, metadata) to interpret the characteristics of a given time-series dataset. The model should generate a summary report highlighting key features, statistical properties, and potential anomalies.
  2. Assess and Improve Data Quality: Detect issues such as missing values, outliers, or inconsistencies. The student will explore prompt-driven techniques to have the LLM propose or directly implement fixes (e.g., data imputation) to improve the dataset's integrity.
  3. Generate Synthetic Data: Produce realistic synthetic time-series datasets. The generated data should preserve the statistical and temporal properties of the original data while being sufficiently anonymized and safe to share.
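To make the first two focus areas concrete, the following is a minimal sketch of how a dataset profile and simple quality checks could be computed and then turned into prompt text. It assumes missing readings are stored as `None`, uses a MAD-based robust z-score for outlier flagging, and fills gaps by linear interpolation. The function names (`profile_series`, `interpolate_gaps`, `build_prompt`), the threshold, and the example readings are illustrative assumptions, not part of the project description. The sketch deliberately stops before any LLM call: the resulting string would be passed to whichever model client the prototype ends up using.

```python
import statistics
from typing import List, Optional, Sequence

Series = Sequence[Optional[float]]  # assumption: None marks a missing reading


def profile_series(values: Series, z_threshold: float = 3.5) -> dict:
    """Collect size, missing count, basic statistics, and outlier positions."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = statistics.median([abs(v - median) for v in present])
    outliers = []
    if mad > 0:
        for i, v in enumerate(values):
            # 0.6745 rescales the MAD so the score is comparable to a z-score.
            if v is not None and abs(0.6745 * (v - median) / mad) > z_threshold:
                outliers.append(i)
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.fmean(present),
        "stdev": statistics.stdev(present),
        "outlier_indices": outliers,
    }


def interpolate_gaps(values: Series) -> List[Optional[float]]:
    """Fill interior gaps linearly; leading or trailing gaps stay None."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def build_prompt(context: str, profile: dict) -> str:
    """Combine free-text dataset context with the computed profile."""
    return (
        "You are assisting with time-series data quality.\n"
        f"Dataset context: {context}\n"
        f"Points: {profile['n']}, missing: {profile['missing']}, "
        f"mean: {profile['mean']:.2f}, stdev: {profile['stdev']:.2f}, "
        f"suspected outliers at indices: {profile['outlier_indices']}\n"
        "Summarise the data and propose fixes for the issues listed."
    )


# Hypothetical example data: hourly temperature readings with gaps and a spike.
readings = [20.1, 20.3, None, 20.7, 35.0, 20.6, 20.4, None, None, 19.8]
profile = profile_series(readings)
filled = interpolate_gaps(readings)
print(build_prompt("Hourly lab temperature in degrees Celsius", profile))
```

Computing the profile deterministically and giving the LLM only the summary is one possible design choice. It keeps the numbers verifiable and leaves the interpretation and the choice of fix to the model. In practice the same logic would more likely be written with Pandas and NumPy.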

A crucial part of the thesis will be a critical reflection on the suitability and limitations of LLMs for these tasks compared to traditional methods (e.g., statistical models, GANs) and the ethical implications of generating synthetic data.
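As an illustration of the kind of traditional statistical baseline an LLM-based generator could be compared against, the sketch below fits a first-order autoregressive model, AR(1), to a series and simulates a synthetic series from the fitted parameters. The choice of AR(1) and the names `fit_ar1` and `simulate_ar1` are assumptions made for this example. The thesis may well use richer baselines such as ARIMA models or GANs.

```python
import random
import statistics
from typing import List, Optional, Sequence, Tuple


def fit_ar1(series: Sequence[float]) -> Tuple[float, float, float]:
    """Estimate the mean, AR coefficient phi, and noise scale of an AR(1) model."""
    mean = statistics.fmean(series)
    centered = [x - mean for x in series]
    # Least-squares estimate of phi from consecutive centered pairs.
    num = sum(prev * cur for prev, cur in zip(centered[:-1], centered[1:]))
    den = sum(prev * prev for prev in centered[:-1])
    phi = num / den
    residuals = [cur - phi * prev for prev, cur in zip(centered[:-1], centered[1:])]
    sigma = statistics.pstdev(residuals)
    return mean, phi, sigma


def simulate_ar1(mean: float, phi: float, sigma: float, n: int,
                 seed: Optional[int] = None) -> List[float]:
    """Draw n points from x_t = mean + phi * (x_{t-1} - mean) + Gaussian noise."""
    rng = random.Random(seed)
    deviation = 0.0  # the simulation starts at the mean
    out = []
    for _ in range(n):
        deviation = phi * deviation + rng.gauss(0.0, sigma)
        out.append(mean + deviation)
    return out


# Hypothetical usage: create a stand-in "real" series, fit it, and resample.
real = simulate_ar1(mean=10.0, phi=0.7, sigma=1.0, n=5000, seed=0)
mean_hat, phi_hat, sigma_hat = fit_ar1(real)
synthetic = simulate_ar1(mean_hat, phi_hat, sigma_hat, n=len(real), seed=1)
```

A baseline like this shares only three fitted parameters with the original data. That makes it a useful reference point for both fidelity and privacy when judging what an LLM-generated series adds.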

Expected Results and Learning Outcome

After the thesis is successfully submitted and defended, the student will have delivered a comprehensive study and gained valuable skills:

  • Deliverables:
    • A functional prototype demonstrating the use of an LLM for data analysis, quality improvement, and synthetic data generation.
    • An evaluation report that quantitatively and qualitatively assesses the quality of the LLM-generated data and data improvements.
    • A written thesis that includes a thorough literature review, a description of the methodology, and a critical analysis of the results, limitations, and ethical considerations.
  • Learning Outcomes:
    • Practical experience working with state-of-the-art LLMs (e.g., via APIs or open-source models) for structured data tasks.
    • Hands-on skills in prompt engineering to guide complex data manipulation and generation processes.
    • A deeper understanding of time-series analysis, data quality challenges, and synthetic data generation techniques.
    • The ability to critically evaluate the strengths and weaknesses of different AI models for a specific, non-textual problem.
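The quantitative part of the evaluation report listed under Deliverables could start from simple fidelity checks such as the ones sketched below. They compare a real and a synthetic series on mean, standard deviation, lag-1 autocorrelation, and the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions). The metric selection and the names `lag1_autocorr`, `ks_statistic`, and `fidelity_report` are assumptions for illustration. These checks say nothing about privacy, which would need separate measures.

```python
import statistics
from typing import Dict, Sequence


def lag1_autocorr(series: Sequence[float]) -> float:
    """Sample autocorrelation at lag 1, a basic measure of temporal dependence."""
    mean = statistics.fmean(series)
    num = sum((a - mean) * (b - mean) for a, b in zip(series[:-1], series[1:]))
    den = sum((x - mean) ** 2 for x in series)
    return num / den


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: maximum distance between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    i = j = 0
    distance = 0.0
    while i < len(a_sorted) and j < len(b_sorted):
        x = min(a_sorted[i], b_sorted[j])
        # Advance both samples past x so each CDF is evaluated at x.
        while i < len(a_sorted) and a_sorted[i] <= x:
            i += 1
        while j < len(b_sorted) and b_sorted[j] <= x:
            j += 1
        distance = max(distance, abs(i / len(a_sorted) - j / len(b_sorted)))
    return distance


def fidelity_report(real: Sequence[float], synthetic: Sequence[float]) -> Dict[str, float]:
    """Absolute differences in summary statistics, plus the KS statistic."""
    return {
        "mean_diff": abs(statistics.fmean(real) - statistics.fmean(synthetic)),
        "stdev_diff": abs(statistics.stdev(real) - statistics.stdev(synthetic)),
        "acf1_diff": abs(lag1_autocorr(real) - lag1_autocorr(synthetic)),
        "ks": ks_statistic(real, synthetic),
    }
```

Values close to zero indicate that the synthetic series matches the real one on these particular properties. A fuller evaluation would likely add longer-lag and cross-channel checks, as well as downstream task performance.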

Qualifications

Ideal candidates should have a strong interest in the practical applications of generative AI and a solid background in data science principles.

  • Required: Strong programming experience in Python.
  • Knowledge: A good understanding of machine learning, deep learning, and fundamental statistics.
  • Experience: Familiarity with Python libraries for data processing (e.g., Pandas, NumPy), machine learning (e.g., Scikit-learn, TensorFlow, or PyTorch), and data visualization (e.g., Matplotlib, Seaborn).
  • Advantageous: Prior experience with time-series analysis or with LLM APIs.

References

• Jin, M., et al., Time-LLM: Time Series Forecasting by Reprogramming Large Language Models, arXiv e-prints, Art. no. arXiv:2310.01728, 2023. doi:10.48550/arXiv.2310.01728

• Vishwas, B., Time Series Meets Generative AI. In: Time Series Forecasting Using Generative AI. Apress, Berkeley, CA. https://doi.org/10.1007/979-8-8688-1276-7_1